Tag Archives: Research Support

Text and data mining services: an update

Text and Data Mining (TDM) is the process of digitally querying large collections of machine-readable material, extracting specific information and, by analysis, discovering new information about a topic.
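As a purely illustrative sketch (the mini-corpus below is invented, and real TDM operates over far larger collections), the core idea of querying machine-readable text and extracting information can be shown with term-frequency counting:

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for a large machine-readable collection
documents = [
    "Gene expression patterns vary across tissue samples.",
    "Mining the literature reveals gene interaction networks.",
    "Text mining extracts structured facts from unstructured text.",
]

# Tokenise each document into lowercase words and tally term frequencies
term_counts = Counter()
for doc in documents:
    term_counts.update(re.findall(r"[a-z]+", doc.lower()))

# The most frequent terms hint at what the collection is "about"
print(term_counts.most_common(3))
```

Real projects layer statistical analysis or machine learning on top of counts like these, but the pipeline (acquire machine-readable text, extract features, analyse) is the same in miniature.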

In February 2017, a group of University of Cambridge staff met to discuss “Text and Data Mining Services: What can Cambridge libraries offer?”  It was agreed that a future library Text and Data Mining (TDM) support service could include:

  • Access to data from our own collections
  • Advice on legal issues, what publishers allow, what data sets and tools are available
  • Registers of data provided for mining and of TDM projects
  • Fostering agreements with publishers.

This blog post reports on some of the activities, events and initiatives involving libraries at the University of Cambridge that have taken place, or are in progress, since this meeting (also summarised in these slides).  Raising awareness, educating, and teasing out the issues around the low uptake of this research process have been the main drivers for these activities.

March 2017: RLUK 2017 Conference Workshop

The Office of Scholarly Communication (OSC) and Jisc ran a workshop at the Research Libraries UK 2017 conference to discuss research libraries and TDM.  Issues raised included licensing, copyright, data management, perceived lack of demand, where to go for advice within an institution or publisher, policy and procedural development for handling TDM-related requests (and scaling this up across an institution), the risk of lock-out from publishers’ content, and the time it can take for a TDM contract to be finalised between an institution and publisher.  The group concluded that it is important to build expected behaviours into TDM-specific licensing agreements between institutions and publishers.  For example, if a publisher’s website detects suspicious activity, it would be better to investigate before automatically blocking the originating institution from accessing content (although this may depend on the systems in place).  If lock-out does happen and the activity is legal, participants suggested that institutions should explore compensation for any significant loss of access time.

July 2017: University of Cambridge Text and Data Mining Libguide

Developed by the eResources Team, this LibGuide explains Text and Data Mining (TDM): what it is, what the legal issues are, what you can do and what you should not try to do. It also provides a list of online journals licensed for TDM at the University of Cambridge and a list of digital archives for text mining that can be supplied to University researchers on disc. Any questions our researchers may have about a TDM project that are not answered by the LibGuide can be submitted to the eResources Team via an enquiry form.

July 2017: TDM Symposium

The OSC hosted this symposium to provide as much information as possible to the attendees regarding TDM.  Internal and external speakers, experienced in the field, spoke about what TDM is and what the issues are; research projects in which TDM was used; TDM tools; how a particular publisher supports TDM; and how librarians can support TDM.

At the end of the day a whole-group discussion drew out issues around why more TDM is not happening in the UK and it was agreed that there was a need for more visibility on what TDM looks like (e.g. a need for some hands-on sessions) and increased stakeholder communication: i.e. between publishers, librarians and researchers.

November 2017: Stakeholder communication and the TDM Test Kitchen

This pilot project involves a publisher, librarians and researchers. It is providing practical insight into the issues arising for each of the stakeholders: e.g. training on TDM methods and analysis tools for researchers; managing content accessibility, and the funding for it, for library support staff; and content licensing and agreements for the publisher. We’ll take a more in-depth look at this pilot in an upcoming blog post on TDM – watch this space.

January 2018: Cambridge University Library Deputy Director visits Yale

The Yale University Library Digital Humanities Laboratory provides physical space, resources and a community within the Library for Yale researchers who are working with digital methods for humanities research and teaching. In January this year Dr Danny Kingsley visited the facility to discuss approaches to providing TDM services, to inform planning here. The Yale DH Lab staff help out with projects in a variety of ways, for example by helping researchers get to grips with digital tools and methods.  Researchers wanting to carry out TDM on particular collections can visit the lab to do so: off-line discs containing published material for mining can be used in situ. In 2018, the libraries at Cambridge began building up a collection of offline discs of specific collections for the same purpose.

June 2018: Text and Data Mining online course

The OSC collaborated with the EU OpenMinTeD project on this Foster online course: Introduction to Text and Data Mining.  The course helps learners understand the key concepts around TDM, explores how research support staff can help with TDM, and includes practical activities that allow even those without technical skills to try out some mining concepts for themselves.  By following these activities, you can find out a bit more about sentence segmentation, tokenization, stemming and other processing techniques.
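As a rough illustration of those processing steps (not taken from the course, and deliberately simplified), sentence segmentation, tokenization and stemming can be sketched in plain Python; production work would use a proper toolkit such as NLTK or spaCy:

```python
import re

text = "Researchers mine texts. Mining reveals new insights."

# Sentence segmentation: naive split on sentence-ending punctuation
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Tokenization: split each sentence into lowercase word tokens
tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

# Stemming: a toy suffix-stripper (a real stemmer, e.g. Porter's, is far subtler)
def stem(word):
    for suffix in ("ing", "ers", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [[stem(t) for t in sent] for sent in tokens]
print(stems)
```

The point of the exercise is that words like “mining” and “researchers” reduce to common stems, so a search or analysis can match them across surface variations.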

October 2018: Gale Digital Scholar Lab

The University of Cambridge has trial access to this platform until the end of December: it provides TDM tools as a front end to digital archives from Gale Cengage.  You can find out more about this trial in this ejournals@cambridge blog post.

In summary…

Following the initial meeting to discuss research support services for TDM, there have been sustained efforts to raise awareness of TDM and the possibilities it brings to the research process, and to explore the issues behind its low usage in the research community at large.  This is an ongoing task, with the goal of increased researcher engagement with TDM.

Published 23 October 2018
Written by Dr Debbie Hansen
Creative Commons License

Electronic lab notebooks – a report from a SLA meeting

In preparation for our event “Electronic Lab Notebooks: Solutions for Paperless Research”, we decided to re-blog this post* on the subject written by Niamh Tumelty, Head of STEM Libraries at the University of Cambridge.

Roundtable on Electronic Laboratory Notebooks

A significant part of my role involves research support, but so far I have not been involved with lab notebooks, electronic or otherwise. I registered for this session at the Special Libraries Association meeting in 2014 mainly out of curiosity, hoping to find out more about what products others are using, how they’re finding them and whether or not they would be of interest to my Department.

What is an ELN?

Simon Coles set the scene with an overview of the development of electronic lab notebooks (ELNs) to date and a frank assessment of their value in different contexts.  Simon has been working on developing ELNs since 1996 and has been with Amphora for 11 years.   Amphora identified three problems to solve: capturing information from busy scientists, preserving data in complex contexts, and being able to provide evidence in court, for example to prove the origin of an idea. They work with a wide range of customers, with some of the largest and smallest implementations of electronic lab notebooks.

There is no single definition of an ELN, so we need to look carefully at what we require. We should also be wary of what exactly other case studies meant by the term, since what they implemented may not be relevant or comparable at all.  Researchers naturally expect that lab notebooks will be tailored to their research workflows, and since workflows differ greatly across areas of science, it is unlikely that one solution will be appropriate for an entire organisation.

Another key point is that an ELN doesn’t have to be a complex purchased product.  MS Word and WordPress have been used successfully, and there is a real danger of finding yourself in ‘consulting heaven’ with some of the commercial products, with costly ongoing support turning out to be essential. If introducing an ELN, we need to consider a number of questions:

  • Do we want it to be about record-keeping and collaboration or is it about doing bits of science?
  • Does it need to enforce certain processes?
  • Is it something specific to a group of scientists or generic (bearing in mind that even the same scientist’s needs will change over time)?
  • How large and diverse is your user base?

The university view

Daureen Nesdill is Data Curation Librarian at the University of Utah and has been involved with the USTAR project. They conducted a study on campus to see what was already happening in terms of ELNs and found that they were already being used in some areas (including in the Humanities) and one person already had a standalone implementation of CambridgeSoft.  Daureen set up a Libguide on ELNs to share information about them.

A working group was set up to look more closely at the options but they haven’t implemented a solution campus-wide because no one tool will work for the whole campus.  Other barriers include the expense of acquiring an ELN (purchasing software, local hosting and support or cloud hosting), the question of who pays for this and the amount of time it takes to roll out (a few months for a lab of 50 people!)  There are also concerns about security, import-export loss and if using a cloud solution, awareness of where your data is being stored.

Daureen outlined a number of requirements for an ELN in a University:

  • Ability to control access at an individual level
  • Recording of provenance (everything needs to be documented in case there is any future question of who did the work), with this information included in data exports
  • Both cloud- and client-based, with syncing
  • Compatible with any platform
  • No chemistry-specific features as standard; instead, templates available for all subject areas – let researchers select tools for their research!
  • Education pricing for classroom use
  • Assistance with addressing complex research protocols
  • Integration with mouse colony management systems
  • Connectors, so that data from any equipment used flows easily into the ELN and out of it into the institutional repository
  • A messaging system to allow quick communication between collaborators
  • Reminders for PIs to check the work of their team
  • Integrated domain-specific metadata

Corporate perspective

Denise Callihan from PPG Industries provided the corporate perspective. Her company has looked at options for ELNs every five years since the 1980s because their researchers were finding paper lab notebooks were time-consuming and inconvenient. They needed to be able to provide research support for patent purposes to make sure researchers were following the procedures required.

A committee was formed to identify the requirements of three disparate groups: IT and records management, legal and IP, and researchers. A pilot started in 2005 with ten research scientists using Amphora PatentSafe, some in favour of the introduction of ELNs and some against. PPG Industries were early adopters of Amphora PatentSafe so the vendor was very responsive to issues that were arising. The roll-out was managed by researchers, department by department, with the library providing support and administration.  Adoption was initially voluntary, then encouraged and is now mandatory for all researchers.

The implementation has been successful and researchers have found that the ELN is easy to use and works with their existing workflows.  Amphora PatentSafe uses a virtual printer driver to create searchable notebook pages – anything that can be printed can be imported into the ELN.  Email alerting helps them keep track of when they need to sign or witness documents, speeding up this part of the process. The ELN simplifies information sharing and collaboration and eliminates size constraints on documents. It is set up for long-term storage and reduces risks associated with researchers managing this individually.  Data visualisation and reporting are built in so it’s easy to see how research is progressing and to check document submission rates when necessary.

PPG Industries found that researchers need to be looking for an ELN solution rather than feeling that one is being imposed on them. Strong support was required from leadership, along with a clear understanding of what drives the need for the ELN.  The product they’ve selected focuses on providing an electronic version of the print notebook, but the raw data still needs to be kept separately for future use.

Wrap up

Overall I found this session extremely useful and I now feel much better informed about electronic lab notebooks.  I really appreciated the fact that this session, like others at SLA, presented a balanced view of the issues around electronic lab notebooks, with speakers representing vendors, corporate librarians and academic librarians.  I now plan to investigate some of the ELN options further so that I am in a position to support possible future implementations of ELNs, but I will wait until the researchers express their need for one rather than suggesting that the Department considers rolling one out across the board.

*Originally published in 2014 at Sci-Tech News, 68(3), 26–28

Published on 12 January 2017
Written by Niamh Tumelty
Creative Commons License

Show me the money – the path to a sustainable Research Data Facility

Like many institutions in the UK, Cambridge University has responded to research funders’ requirements for data management and sharing with a concerted effort to support our research community in good data management and sharing practice through our Research Data Facility. We have written about these services a few times on this blog and presented on them elsewhere. This blog post describes the process we have undertaken to support these services in the long term.

Funders expect that researchers make the data underpinning their research available and provide a link to this data in the paper itself. The EPSRC started checking compliance with its data sharing requirement on 1 May 2015. When we first created the Research Data Facility we spoke to many researchers across the institution and two things became very clear. One was that there was considerable confusion about what actually counts as data; the second was that sharing data on publication cannot easily be done as an afterthought if the data was not properly managed in the first place.

We have approached these issues separately. To try to determine what is actually required by funders, beyond the written policies, we have invited representatives from our funders to discussions and forums with our researchers to work out the details. So far we have hosted Ben Ryan from the EPSRC, Michael Ball from the BBSRC and, most recently, David Carr and Jamie Enoch from the Wellcome Trust and CRUK respectively.

Dealing with the need for awareness of research data management has been more complex. To raise awareness of good practice in data management and sharing we embarked on an intense advocacy programme, and in the past 15 months have organised 71 information sessions about data sharing (speaking with over 1,700 researchers). But we also needed to ensure the research community was managing its data from the beginning of the research process. To assist this we have developed workshops on various aspects of data management (hosting 32 workshops in the past year), a comprehensive website, a service to support researchers in developing their research data management plans, and a data management consultancy service.

So far, so good. We have had a huge response to our work and, while we encourage researchers to use the data repository that best suits their material, we offer our institutional repository Apollo as an option. As of today, we are hosting 499 datasets in the repository. The message is clearly getting through.


The word sustainability (particularly in the scholarly communication world) is code for ‘money’, and money has become quite a sticking point in the area of data management. Cambridge started the Research Data Facility by employing a single person, Dr Marta Teperek, for one year, supported by the remnants of the RCUK Transition Fund. It quickly became obvious that we needed more staff to manage the workload; the Facility now employs half an Events and Outreach Coordinator, half a Repository Manager, and a Research Data Adviser who looks after the bulk of the uploading of datasets into the repository.

Clearly there was a need to work out longer-term support for staffing the Facility – a service for which there are no signs of demand slowing. Early last year we started scouting around for options.  In April 2013 the RCUK had released guidance saying it was permissible to recover costs from grants through direct charges or overheads – but noting that institutions could not charge twice. This guidance also mentioned that it was permissible for institutions to recover the costs of RDM facilities as other Small Research Facilities, “provided that such facilities are transparently charged to all projects that use them”.


On the basis of that advice we established the Research Data Facility as a Small Research Facility according to the Transparent Approach to Costing (TRAC) methodology. Our proposal was that the Facility’s costs would be recovered from grants as directly allocated costs. We chose this option rather than overheads because of the transparency it gives the funder about our activities. Charging grants this way meant a bigger advocacy and education role for the Facility, but the advantage is that it would make researchers aware that they need to take research data management seriously, that it involves both time and money, and that it is an integral part of a grant proposal.

Dr Danny Kingsley has argued before (for example in a paper ‘Paying for publication: issues and challenges for research support services‘) that by centralising payments for article processing charges, the researchers remain ignorant of the true economics of the open access system in the way that they are generally unaware of the amounts spent on subscriptions. If we charged the costs of the Facility into overheads, it becomes yet another hidden cost and another service that ‘magically’ happens behind the scenes from the researcher’s point of view.

In terms of the actual numbers, the direct costs of the Research Data Facility included salaries for 3.2 FTEs (a Research Data Facility Manager, a Research Data Adviser, 0.5 Outreach and Engagement Coordinator, 0.5 Repository Manager and 0.2 Senior Management time), hardware and hardware maintenance costs, software licences, the costs of organising events, and the costs of staff training and conference attendance. The total direct annual cost of our Facility was less than £200,000. These are the people costs of the Facility and are not to be confused with the repository costs (which we do charge directly).

Determining how much to charge

Throughout this process we have explored many options for graduating the charge in relation to the support required. Ideally, we would want the Facility costs to be accurately measured based on what the applicant indicated in their data management plan. However, not all funders require data management plans. Additionally, while data management plans provide some indication of the quantity of data (storage) to be generated, they do not allow a direct estimate of the amount of data management assistance required during the lifetime of the grant. Because we could not assess the level of support required for a particular research project from a data management plan, we looked at alternative charging strategies.

We investigated charging according to the number of people on a team, given that the training component of the Facility is measurable by workshop attendance. However, we were unable to easily extract that type of information about grants, and this also created a problem for charging on collaborative grants. We then looked at levying a small flat charge on every grant requiring the assistance of the Facility, and at charging proportionally to the size (a percentage of the value) of the grant. Since we did not have any compelling evidence that bigger grants require more Facility assistance, we proposed a model of a flat charge on all grants that require Facility assistance. This model was also the most cost-effective from an administrative point of view.
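As a hypothetical illustration of the two models considered (the grant values below are invented; only the approximate £200,000 annual cost comes from the figures above), the comparison is simple arithmetic:

```python
# Annual cost of the Facility to be recovered (approximate figure from the text)
annual_facility_cost = 200_000  # GBP

# Invented example portfolio: values of grants requiring Facility assistance
grant_values = [50_000, 150_000, 300_000, 500_000]  # GBP

# Flat model: every qualifying grant pays the same share, regardless of size
flat_charge = annual_facility_cost / len(grant_values)

# Proportional model: each grant pays in proportion to its value
total_value = sum(grant_values)
proportional_charges = [annual_facility_cost * v / total_value for v in grant_values]

print(flat_charge)           # each grant pays the same amount
print(proportional_charges)  # larger grants pay more
```

Under the flat model every grant here would pay £50,000; under the proportional model the smallest would pay £10,000 and the largest £100,000, which is why evidence that larger grants consume more support would have been needed to justify the proportional approach.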

As an indicator of the amount of work involved in the development of the Business Case, and the level of work and input that we have received relating to it, the document is now up to version 18 – each version representing a recalculation of the costings.

Collaborative process

A proposal such as ours – that we charge the costs of the Facility as a direct charge against grants – is reasonably radical. It was important to ensure the charges would be seen as fair and reasonable by the research community and the funders. To that end we spent the best part of a year in conversation with both communities.

Within the University we had useful feedback from the Open Access Project Board (OAPB) when we first discussed the option in July last year. We are also grateful to the members of our community who subsequently met with us in one-on-one meetings to discuss the merits of the Facility and the options for supporting it. At the November 2015 OAPB meeting we presented a mature Business Case. We have also had to clear the Business Case through meetings of the Resource Management Committee (RMC).

Clearly we needed to ensure that our funders were prepared to support our proposal. Once we were in a position to share a Business Case with the funders we started a series of meetings and conversations with them.

The Wellcome Trust was immediate in its response – they would not allow direct charging to grants as they consider this to be an overhead cost, which they do not pay. We met with Cancer Research UK (CRUK) in January 2016 and there was a positive response about our transparent approach to costing and the comprehensiveness of services that the Facility provides to researchers at Cambridge. These issues are now being discussed with senior management at CRUK and discussions with CRUK are still ongoing at the time of writing this report (May 2016). [Update 24 May: CRUK agreed to consider research data management costs as direct costs on grant applications on a case by case basis, if justified appropriately in the context of the proposed research].

We encourage open dialogue with the RCUK funders about data management. In May 2015 we invited Ben Ryan to come to the University to talk about the EPSRC expectations on data management and how Cambridge meets these requirements. In August 2015 Michael Ball from the BBSRC came to talk to our community. We had an indication from the RCUK that our proposal was reasonable in principle. Once we were in a position to show our Business Case to the RCUK we invited Mark Thorley to discuss the issue, and he has been in discussion with the individual councils for their input before giving us a final answer.

Administrative issues

Timing a decision like this is challenging because of the large number of systems within the institution that would be affected if a change were to occur. In anticipation of a positive response we started preparing our management and financial systems to handle the costing of grants, so that if a green light were given we would be ready.  To that end we held many discussions with the Research Office on the practicalities of building the costing into our systems, to make sure the charge is easy to add in our grant costing tool. We also had numerous discussions on how to embed these procedures in the Research Office’s workflows, to validate whether the Facility services are needed and decide what to do if researchers forget to add them. The development has now been done.

A second consideration is the need to ensure that all of the administrative staff involved in managing research grants (at Cambridge, a group of over 100 people) are aware of the change and know how to manage both the change to the grant management system and the questions from their research community. Simultaneously, we were involved in numerous discussions with our invaluable TRAC team at the University’s Finance Division, who helped us validate all the Facility costs (to ensure that none of the costs are charged twice) and establish cost centres and workflows for recovering money from grants.

Meanwhile we have had to keep our Facility staff on temporary contracts until we are in a position to advertise the roles. There is a huge opportunity cost in training people up in this area.


As it happened, the RCUK came back to us to say that we can charge this cost to grants, but as an overhead rather than a direct cost. Having this decision means we can advertise the positions and secure our staffing situation. But we won’t need the administrative amendments to the system, nor the advocacy programme.

It has been a long process given we began preparing the Business Case in March 2015. The consultation throughout the University and the engagement of our community (both research and funder) has given us an opportunity to discuss the issues of research data management more widely. It is a shame – from our perspective – that we will not be able to be transparent about the costs of managing data effectively.

The funders and the University are all working towards a shared goal – a culture change towards more open research, including the sharing of research data. To achieve this we need a research community that is more aware of and engaged with these matters.  There is much advocacy to do.

Published 8 May 2016
Written by Dr Danny Kingsley and Dr Marta Teperek
Creative Commons License