Text and data mining services: an update

Text and Data Mining (TDM) is the process of digitally querying large collections of machine-readable material, extracting specific information and, by analysis, discovering new information about a topic.

In February 2017, a group of University of Cambridge staff met to discuss “Text and Data Mining Services: What can Cambridge libraries offer?” It was agreed that a future library Text and Data Mining (TDM) support service could include:

  • Access to data from our own collections
  • Advice on legal issues, what publishers allow, what data sets and tools are available
  • Registers of data provided for mining and of TDM projects
  • Fostering agreements with publishers.

This post reports on some of the activities, events and initiatives involving libraries at the University of Cambridge that have taken place, or have been in progress, since that meeting (also summarised in these slides). Raising awareness, educating, and teasing out the reasons for the low uptake of TDM have been the main drivers for these activities.

March 2017: RLUK 2017 Conference Workshop

The Office of Scholarly Communication (OSC) and Jisc ran a workshop at the Research Libraries UK 2017 conference to discuss Research Libraries and TDM. Issues raised included licensing, copyright, data management, perceived lack of demand, where to go for advice within an institution or publisher, policy and procedural development for handling TDM-related requests (and scaling this up across an institution), the risk of lock-out from publishers’ content, and the time it can take for a TDM contract to be finalised between an institution and a publisher. The group concluded that TDM-specific licensing agreements between institutions and publishers should include mechanisms that set out the behaviour expected of each party. For example, if a publisher’s website detects suspicious activity, it would be better to investigate first rather than automatically block the originating institution from accessing content (although this may depend on the systems in place). Participants also suggested that, if lock-out happens and the activity is legal, institutions should explore compensation for the time that access is lost, where this is significant.

July 2017: University of Cambridge Text and Data Mining LibGuide

Developed by the eResources Team, this LibGuide explains Text and Data Mining (TDM): what it is, what the legal issues are, what you can do and what you should not try to do. It also provides a list of online journals licensed for TDM at the University of Cambridge and a list of digital archives for text mining that can be supplied to University researchers on disc. Any questions our researchers have about a TDM project that are not answered by the LibGuide can be submitted to the eResources Team via an enquiry form.

July 2017: TDM Symposium

The OSC hosted this symposium to give attendees as much information as possible about TDM. Internal and external speakers with experience in the field spoke about what TDM is and what the issues are; research projects in which TDM was used; TDM tools; how a particular publisher supports TDM; and how librarians can support TDM.

At the end of the day, a whole-group discussion drew out reasons why more TDM is not happening in the UK. It was agreed that there is a need for more visibility of what TDM looks like in practice (e.g. through hands-on sessions) and for better communication between stakeholders, i.e. publishers, librarians and researchers.

November 2017: Stakeholder communication and the TDM Test Kitchen

This pilot project involves a publisher, librarians and researchers. It is providing practical insight into the issues arising for each of the stakeholders: for researchers, providing training on TDM methods and analysis tools; for library support, managing content accessibility and the funding for this; and for the publisher, content licensing and agreements. We’ll take a more in-depth look at this pilot in an upcoming blog on TDM – watch this space.

January 2018: Cambridge University Library Deputy Director visits Yale

The Yale University Library Digital Humanities Laboratory provides physical space, resources and a community within the Library for Yale researchers who are working with digital methods for humanities research and teaching. In January this year Dr Danny Kingsley visited the facility to discuss approaches to providing TDM services, to help inform planning at Cambridge. The Yale DH Lab staff help out with projects in a variety of ways, one example being to help researchers get to grips with digital tools and methods. Researchers wanting to carry out TDM on particular collections can visit the lab to do so: offline discs containing published material for mining can be used in situ. In 2018, the libraries at Cambridge have begun building up a collection of offline discs of specific collections for the same purpose.

June 2018: Text and Data Mining online course

The OSC collaborated with the EU OpenMinTeD project on this Foster online course: Introduction to Text and Data Mining. The course helps learners understand the key concepts around TDM and explores how research support staff can help with TDM. It also includes some practical activities that allow even those without technical skills to try out some mining concepts for themselves. By following these activities, you can find out a bit more about sentence segmentation, tokenization, stemming and other processing techniques.
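To give a flavour of those three techniques, here is a minimal Python sketch. It is not taken from the course: the function names, the example sentence and the suffix-stripping rule are all illustrative assumptions, and the "stemmer" is a deliberately crude stand-in for an established algorithm such as the Porter stemmer.

```python
import re


def split_sentences(text):
    # Sentence segmentation (naive assumption): a sentence ends at
    # '.', '!' or '?' followed by whitespace. Abbreviations would break this.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Tokenization: lowercase the text and keep runs of letters,
    # digits and apostrophes as individual tokens.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def crude_stem(token):
    # Stemming (toy version): strip one common English suffix,
    # provided at least three characters remain.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "Researchers mined the archives. Mining reveals new patterns!"
for sentence in split_sentences(text):
    tokens = tokenize(sentence)
    stems = [crude_stem(t) for t in tokens]
    print(tokens, stems)
```

Note how "mined" and "mining" both reduce to the same stem, which is what lets a mining tool count them as one term. Real projects would normally rely on a tested library rather than hand-written rules like these.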

October 2018: Gale Digital Scholar Lab

The University of Cambridge has trial access to this platform until the end of December: it provides TDM tools as a front end to digital archives from Gale Cengage. You can find out more about this trial in this ejournals@cambridge blog.

In summary…

Since the initial meeting to discuss research support services for TDM, we have worked to raise awareness of TDM and the possibilities it can bring to the research process, and to explore the reasons for the low usage of TDM in the research community at large. This is an ongoing task, with the goal of increased researcher engagement with TDM.

Published 23 October 2018
Written by Dr Debbie Hansen
Creative Commons License

Cambridge Open Access spend 2013-2018

Since 2013, the Open Access Team has been helping Cambridge researchers, funded by Research Councils UK (RCUK) and the consortium of biomedical funders which make up the Charity Open Access Fund (COAF), to meet their Open Access obligations. Both RCUK (now part of UKRI) and COAF have Open Access policies which have a preference for ‘gold’, i.e. the published work should be Open Access immediately at the time of publication. Implementing these policies has come at a significant cost. In this time, Cambridge has been awarded just over £10 million from RCUK and COAF to implement their Open Access policies, and the Open Access Team has diligently used this funding to maximum effect.

Figure 1. Comparison of combined RCUK/COAF grant spend and available funds, April 2013 – March 2018.

Initially, expenditure was slow, which allowed the Open Access Team to maintain a healthy balance that could guarantee funding for almost any paper which met a few basic requirements. However, since January 2016 expenditure has gradually been catching up with the available funds, which has made funding decisions more difficult (particularly for Open Access deals tied to multi-year publisher subscriptions). In the first three months of 2018, average monthly expenditure on the RCUK block grant alone exceeded £160,000. We are quickly reaching the point where expenditure will outstrip the available grants.

One technical change which has particularly affected our management of the block grants was RCUK’s decision last year to move away from a direct cash award (which could be rolled over year to year) to a more tightly managed research grant. In the past, carrying over underspend has given us some flexibility in the management of the RCUK funds, whereas the more restrictive style of research grant will mean that any underspend will need to be returned at the end of the grant period, while any overspend cannot be deferred into the next grant period. As we are now dealing with a fixed budget, the Open Access Team will need to ensure that expenditure is kept within the limits of the grant. This is difficult when we have no control over where or when our researchers publish.

Funding from COAF (which is also managed as though it is a research grant) has generally matched our total annual spend quite closely, but the strict grant management rules have caused some problems, especially in the transition period between one grant and another. However, unlike RCUK, the Wellcome Trust will provide supplementary funding if the main COAF award is exhausted, and the other COAF partners have similar procedures in place to manage Open Access payments beyond the end of the grant.

Where does it all go?

Most of our expenditure (91%) goes on article processing charges (APCs), as perhaps one might expect, but the block grants are also used to support the staff of the Open Access Team (3%), helpdesk and repository systems (2%), page and colour charges (2%), and publisher memberships (1%) (where this results in a reduced APC). The majority of APCs we’ve paid go towards hybrid journals, which represent approximately 80% of total APC spend.

So let’s take a look at which publishers have received the most funds. We’ve tried to match as much of our raw financial information as we can to specific papers, although some of our data is incomplete and some payments can’t easily be linked back to a specific article, particularly for 2013-2015 when our processes were still developing. Nonetheless, the average APC paid over the last 5 years was £2,291 (inc. 20% VAT), and as can be seen from Table 1, average APCs have been rising year on year at a rate of about 7% p.a., significantly higher than inflation. Price increases at this rate are not sustainable in the long term – by 2022 we could be paying on average £3,000 per article.

Table 1. Average APC by publication year of article (where known).

Year of publication Average APC paid (£)
2013  £1,794
2014  £1,935
2015  £2,044
2017  £2,187
2018  £2,336
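
The arithmetic behind the growth rate and the £3,000 projection can be sketched in a few lines of Python. This is an illustration rather than the team's actual calculation: the function names are made up for the example, the figures are copied from the rows of Table 1 as printed, and the projection simply assumes that a flat 7% increase continues from the last row.

```python
# Average APC (£) for each row of Table 1, in the order printed.
table_1 = [1794, 1935, 2044, 2187, 2336]


def row_to_row_growth(values):
    # Fractional increase from each row of the table to the next.
    return [later / earlier - 1 for earlier, later in zip(values, values[1:])]


def project(start, rate, years):
    # Compound a starting price forward at a fixed annual rate.
    return start * (1 + rate) ** years


def years_to_reach(start, rate, target):
    # Count the compounding steps needed to reach the target.
    # Assumes rate > 0; otherwise the loop would never finish.
    years = 0
    value = start
    while value < target:
        value *= 1 + rate
        years += 1
    return years


for growth in row_to_row_growth(table_1):
    print(f"{growth:.1%}")

print(years_to_reach(table_1[-1], 0.07, 3000))  # 4 steps from the last row
print(round(project(table_1[-1], 0.07, 4)))
```

Under these assumptions the increases between rows range from roughly 6% to 8%, and four further increases of 7% take the last figure in the table past £3,000.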

Elsevier has been by far the largest recipient of block grant funds, receiving 29.4% of all APC expenditure from the RCUK and COAF awards (over £2.5 million), though only accounting for 25.5% of articles. Over the same period SpringerNature also received in excess of £1 million (which, as we’ll see below, has mostly been spent on two titles). With such a substantial set of data we can now begin to explore the relative value that each publisher offers. Take for example Taylor & Francis (£107,778 for 120 articles) compared to Wolters Kluwer (£119,551 for 35 articles). Both publishers operate mostly hybrid OA journals and yet the average price per article differs by almost a factor of four. What is so fundamentally different between publishers that such extreme examples as this should exist?

Table 2. Top 20 publishers by combined total RCUK/COAF APC spend 2013-2018.

Publisher                          APC spend   % of spend  APCs paid  % of APCs  Avg. APC
Elsevier                           £2,559,736  29.4%       971        25.5%      £2,636
SpringerNature                     £1,050,774  12.1%       402        10.6%      £2,614
Wiley                              £808,847    9.3%        279        7.3%       £2,899
American Chemical Society          £411,027    4.7%        251        6.6%       £1,638
Oxford University Press            £379,647    4.4%        169        4.4%       £2,246
PLOS                               £267,940    3.1%        168        4.4%       £1,595
BioMed Central                     £245,006    2.8%        153        4.0%       £1,601
Institute of Physics               £189,434    2.2%        98         2.6%       £1,933
Royal Society of Chemistry         £156,018    1.8%        106        2.8%       £1,472
BMJ Publishing                     £144,001    1.7%        68         1.8%       £2,118
Company of Biologists              £140,609    1.6%        50         1.3%       £2,812
Wolters Kluwer                     £119,551    1.4%        35         0.9%       £3,416
Taylor & Francis                   £107,778    1.2%        120        3.2%       £898
Frontiers                          £103,011    1.2%        61         1.6%       £1,689
Cambridge University Press         £77,139     0.9%        38         1.0%       £2,030
Royal Society                      £73,890     0.8%        52         1.4%       £1,421
Society for Neuroscience           £69,943     0.8%        26         0.7%       £2,690
American Society for Microbiology  £63,056     0.7%        36         0.9%       £1,752
American Heart Association         £53,696     0.6%        14         0.4%       £3,835
Optical Society of America         £39,463     0.5%        17         0.4%       £2,321
All other articles                 £1,654,228  19.0%       690        18.1%      £2,397
Grand Total                        £8,714,794  100.0%      3,804      100.0%     £2,291
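
For anyone wanting to reproduce the "relative value" comparison, the following Python sketch recomputes the average APC and share of spend for three publishers from Table 2. The dictionary layout and function names are assumptions made for this example; only the spend and article counts come from the table.

```python
# (total APC spend in £, number of APCs paid), copied from Table 2.
publishers = {
    "Elsevier": (2_559_736, 971),
    "Taylor & Francis": (107_778, 120),
    "Wolters Kluwer": (119_551, 35),
}
GRAND_TOTAL_SPEND = 8_714_794  # Grand Total row of Table 2


def average_apc(spend, articles):
    # Mean price paid per article for one publisher.
    return spend / articles


def share_of_spend(spend, total=GRAND_TOTAL_SPEND):
    # Fraction of all APC expenditure that went to one publisher.
    return spend / total


for name, (spend, articles) in publishers.items():
    print(f"{name}: £{average_apc(spend, articles):,.0f} per article, "
          f"{share_of_spend(spend):.1%} of spend")

# How many times more an average Wolters Kluwer article cost
# than an average Taylor & Francis article.
ratio = average_apc(*publishers["Wolters Kluwer"]) / average_apc(*publishers["Taylor & Francis"])
print(f"{ratio:.1f}x")
```

The same two functions can be applied to any other row of the table to check its average and percentage columns.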

Next, journal-level metrics. The most popular journal that we pay APCs for is Nature Communications, followed closely by Scientific Reports. Both of these are SpringerNature titles, and indeed these two titles make up the bulk of our total APC spend with SpringerNature. Yet these two journals represent significantly different approaches to Open Access. Nature Communications, along with Cell and Cell Reports, is among the most expensive routes to making research publications Open Access, whereas Scientific Reports and PLOS One sit at the lower end of the spectrum. It is interesting that we haven’t seen a particularly popular Open Access journal fill the niche between Nature Communications and Scientific Reports.

Figure 2. APC number and total spend by journal. In the last five years, nearly £450,000 has been spent on articles published in Nature Communications.


Managing the future

While the OA block grants have kept pace with overall expenditure so far, continuing monthly expenditure of £160,000 would risk overspending on the RCUK grant for 2018/19. To counter this possible outcome, the University has agreed a set of funding guidelines to manage the RCUK (from now on known as Research Councils) and COAF awards. For papers funded by the Research Councils, the new guidelines place an emphasis on fully Open Access journals and hybrid journals where the publisher is taking a sustainable approach to managing the transition to Open Access. We’ve spent a lot of money over the last five years, yet it’s not clear that the influx of cash from RCUK and COAF has had any meaningful impact on the overall publishing landscape. Many publishers continue to reap huge windfalls via hybrid APCs, yet they are not serious about their commitment to Open Access.

In the future, we’ll be demanding better deals from publishers before we support payments to hybrid journals so that we can effect a faster transition to a fully Open Access world.

Published 22 October 2018
Written by Dr Arthur Smith

So, what is it really like to work in research support and scholarly communication?

Followers of this blog will remember various posts over the last year about the OSC’s work as part of the Scholarly Communication Professional Development group. This group of interested parties from across the information sector aims to look at the training needs of those working in (or wanting to work in) scholarly communication and how these can best be met.

Although it’s an expanding area of academic librarianship with a rapidly growing number of roles, those hiring for these positions often report that it can be hard to recruit from within the library community. Too often those from a library background, both new and more established professionals, don’t know enough about these roles and the skills needed to be successful in their application (or even to apply at all!). These roles are actually a great fit for the library and information skill set but the terminology used in advertisements is not consistent and it can be hard to marry existing skills to what is being asked for.

One of the goals of our group is to demystify the range of jobs available in scholarly communication.  Obviously the exact nature of the role will vary between both countries and institutions but we hope that by using the combined power of the community we can cover most areas. This is where you come in…

Inspired by the 23 librarians programmes which took place across the UK a few years ago we would like to offer those who are working in scholarly communication and research support a chance to share their thoughts with us in the form of short vox pop video interviews. Designed to be no more than two to three minutes in length these videos would outline what your role is, the core tasks of your job and the skills you wish you had developed prior to working in this area. We will host these videos online and over time build up a useful resource for the community at large and those looking to work in this area. Videos can be filmed on phones, tablets or anything else you have to hand.

We are aiming to launch the first wave of vox pops during Open Access Week 2018 (22–28 October) but the contributions don’t need to be restricted to those working in Open Access roles. We welcome contributions from anyone who considers that they work in any aspect of scholarly communication or research support, whether this is in a library or a different environment. To help you we have put together both a sample video and a list of questions below.

 

  • What are the core tasks of your job?
  • What are the skills you need to do this job well?
  • How did you develop these skills?
  • What do you need to do this job that you didn’t learn at [library] school?
  • Tell us about one thing which surprised you about your current role.
  • What is the best part of your role?
  • What is the most challenging part of your role?
  • What advice do you have for others wanting to work in this area?

Don’t feel that you have to answer all of the questions – just choose two or three which you feel you would like to answer. Alternatively, if there is something else you would like to share about your role then this is your chance! The only thing we ask is that you give your name, job title and place of work at the beginning of your video. If you would prefer to write your answers rather than record a video we would also welcome your contribution.

You can contribute to the project here: http://bit.ly/ResearchSupportInterviews

We will share these interviews during Open Access Week so watch this blog for further details.

Published 02 October 2018
Written by Claire Sewell