
Informing the Elsevier negotiations: Dominic Dixon on the work of the Data Analysis Working Group

As part of our series of posts on the Elsevier negotiations, Dominic Dixon, Research Librarian at Cambridge University Libraries, explains the work of the library’s Data Analysis Working Group to access, understand and analyse the data relating to how researchers at Cambridge use Elsevier publications. These findings are also presented as a series of data visualisations on the recently launched Elsevier Data Dashboard [Cambridge University Raven account required].

Having a strong underpinning of data is critical to strengthening the University's, and the sector's, position in negotiations with Elsevier. This post outlines the Data Analysis Working Group's approach to gathering and presenting the data underpinning the negotiations, looks at some of the questions we have sought to answer, and shares some high-level findings from our analysis.

As with many data science projects, the majority of our time has been spent on data cleaning. This is partly due to the way the exports from the platforms we used are structured, but it also allowed us to carry out a more fine-grained analysis than would have been possible with the data in its default state. This work involved disambiguating publisher names, splitting and pivoting fields with multiple entries (e.g., funders, disciplines, and subjects), and enriching the records with metadata not included in the original files.
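To illustrate the kind of cleaning involved, the sketch below disambiguates publisher name variants and pivots a semicolon-delimited funder field into one row per article–funder pair. The column names and delimiter are hypothetical simplifications, not the actual export schema.

```python
import pandas as pd

# Toy records in the shape of a raw platform export: multiple funders
# packed into one delimited field (column names and delimiter hypothetical).
raw = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/b"],
    "publisher": ["Elsevier BV", "Elsevier"],
    "funders": ["Wellcome Trust; UKRI", "UKRI"],
})

# Disambiguate publisher name variants with a simple mapping.
raw["publisher"] = raw["publisher"].replace({"Elsevier BV": "Elsevier"})

# Split the multi-valued field and pivot to one row per (article, funder).
tidy = (
    raw.assign(funders=raw["funders"].str.split(";"))
       .explode("funders")
       .assign(funders=lambda d: d["funders"].str.strip())
)
print(tidy)
```

The same split-and-explode pattern works for the discipline and subject fields; only the mapping used for disambiguation needs to be built by hand.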

Publishing

To build a profile of research published by Cambridge researchers in Elsevier journals, we experimented with three platforms: Dimensions, Scopus, and Web of Science (WoS). Each of these platforms is commercial, and each varies in coverage and richness of metadata. A recent comparative analysis of WoS, Scopus and Dimensions found that Dimensions indexed 82.22% more journals than WoS and 48.17% more journals than Scopus. We compared the coverage in each of these platforms for articles published between 2015 and 2020 by a Cambridge-affiliated author. In this case, WoS (n=59,587) returned 1% more results than Dimensions (n=58,908) and 32% more than Scopus (n=40,385).* However, filtering to Elsevier gave a different picture: Dimensions (n=11,431) returned 16% more articles than WoS (n=9,504) and 44% more than Scopus (n=6,345). Given this, and considering that our primary focus was research published by Elsevier, we opted to use Dimensions.

Of the 58,908 records exported from Dimensions, we found that 19% were published in Elsevier journals, making Elsevier the single most chosen publishing venue for Cambridge authors. Filtering to articles with a Cambridge corresponding author, we again found that Elsevier was the most chosen publishing venue, with over 34% (n=4,564) of the articles published in Elsevier journals. Having looked at publishing levels more broadly, we then broke down the articles with a Cambridge corresponding author by Open Access category. We found that 22% (n=1,137) of the articles were categorised as closed and therefore behind a paywall, 35% (n=1,585) were paid for via routes including funder block grants administered by the University, 32% (n=1,467) were self-archived (Green OA), and 8% (n=375) were published in journals that do not charge APCs. The proportion of articles that are either behind a paywall or only openly available because an APC has been paid is therefore significantly higher than the proportion published open access without any associated fees.

Another aspect of publishing we decided to focus on is funding, asking specifically: who is funding the Cambridge research published with Elsevier? Given the inclusion of funder data in the Dimensions export, we were able to break down the articles by both funder and funder group. Looking at articles with a Cambridge-affiliated author, articles with a Cambridge corresponding author, and articles resulting from grants, we found that in each category over 70% were linked to at least one cOAlition S funder. The wider implication of this, specifically for the corresponding-author articles, is that in the absence of a read and publish agreement, many of these funders would not pay the APCs associated with publishing in Elsevier journals.

Reading

To provide a picture of the extent to which articles published in Elsevier subscription journals are read at Cambridge, we gathered COUNTER usage reports and data from the Alma library management system. This allowed us to consider reading over the six-year period between 2015 and 2020, both overall and at a disciplinary level. We found that reading of Elsevier journals was consistently higher in each year than for any other publisher. Reading of Elsevier in 2020 represented 20% of all reading and was at its highest level in the physical sciences and engineering. The highest total within any disciplinary sub-category was in biochemistry, genetics, and molecular biology within the life sciences, with over 400,000 article downloads in 2020 alone.

Another question we considered is how frequently articles published in Elsevier journals are cited by researchers at Cambridge. To answer this, we used the Dimensions API to gather a dataset of the publications cited by articles published with a Cambridge-affiliated author between 2015 and 2020. The resulting dataset consisted of over 1.2m bibliographic records and revealed that 22% (n=269,917) of the cited articles were published by Elsevier. Interestingly, this percentage closely matches the percentage of articles published in Elsevier journals by Cambridge-affiliated authors (19%), the percentage of articles read at Cambridge between 2015 and 2020 (21.78%), and the percentage of publishing with Elsevier at the national level (20%). Using the Dimensions API to enrich the citation data with the open access category, we were able to see that 66% (over 174,000 publications) of the cited Elsevier content is currently paywalled; Elsevier is both the most cited and the most paywalled publisher. This has wider implications for open research, given that many of these articles are inaccessible to anyone not affiliated with an institution that subscribes to the journals in which they appear.
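As an illustration of the aggregation step that follows enrichment, the sketch below computes the publisher and paywalled shares from citation records that have already been retrieved. The record structure and field names are hypothetical simplifications; the retrieval itself, which queries the Dimensions API, is covered in the notebook linked below.

```python
from collections import Counter

# Hypothetical simplified records: cited publications after enrichment
# with publisher and open access category.
cited = [
    {"doi": "10.1016/x1", "publisher": "Elsevier", "oa_category": "closed"},
    {"doi": "10.1016/x2", "publisher": "Elsevier", "oa_category": "closed"},
    {"doi": "10.1016/x3", "publisher": "Elsevier", "oa_category": "gold"},
    {"doi": "10.1007/y1", "publisher": "Springer Nature", "oa_category": "green"},
]

# Share of citations going to each publisher.
by_publisher = Counter(rec["publisher"] for rec in cited)
elsevier_share = by_publisher["Elsevier"] / len(cited)

# Within the Elsevier citations, what fraction is paywalled?
elsevier = [rec for rec in cited if rec["publisher"] == "Elsevier"]
paywalled_share = sum(rec["oa_category"] == "closed" for rec in elsevier) / len(elsevier)

print(f"Elsevier share of citations: {elsevier_share:.0%}")
print(f"Paywalled share of Elsevier citations: {paywalled_share:.0%}")
```

At the scale of 1.2m records the same counting logic applies unchanged; only the retrieval and enrichment steps need batching against the API.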

Paying

One of the main questions we considered when looking at expenditure data was how much we pay to publish in Elsevier journals. Our source for this data was OpenAPC – an initiative that aggregates data on open access expenditure and makes it openly available – combined with data from our internal compliance reports. Looking at the overall spend across all institutions that have contributed to the OpenAPC dataset, we can see that over €49,000,000 has been paid to Elsevier. This represents 19% of the total reported spend on article processing charges (APCs). Looking at the data on Cambridge expenditure, we found that between 2015 and 2020, 30% (over £3,000,000) of our total spend on APCs from block grants was paid to Elsevier (the highest spend on any single publisher), with a single payment averaging £3,302 and ranging between £450 and £7,320.
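A per-publisher breakdown of this kind can be reproduced from the OpenAPC dataset with a few lines of pandas. The toy rows below stand in for the real CSV; the `publisher` and `euro` column names follow our reading of the OpenAPC schema, but treat the details as an assumption and check the published data files.

```python
import pandas as pd

# Toy stand-in for a slice of the OpenAPC dataset (column names assumed).
apc = pd.DataFrame({
    "publisher": ["Elsevier BV", "Elsevier BV", "Springer Nature", "Wiley"],
    "euro": [2400.0, 3100.0, 2000.0, 1800.0],
})

# Total spend, payment statistics, and share of spend per publisher.
totals = apc.groupby("publisher")["euro"].agg(["sum", "mean", "min", "max"])
totals["share"] = totals["sum"] / apc["euro"].sum()
print(totals.sort_values("sum", ascending=False))
```

The `mean`, `min` and `max` columns give the average and range of single payments per publisher, mirroring the figures quoted above for Cambridge's Elsevier spend.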

Final notes

This post has covered just some of the questions we have been able to answer with the data. Overall, we have been able to demonstrate that Elsevier journals are among the most read and the most published in, but are also consistently the most paywalled and the most expensive to publish in of any publisher's journals. This highlights the importance of the ongoing negotiations and of considering other options, such as those explored in previous posts. Our complete findings are presented on a dashboard that is accessible to members of the University. Unfortunately, legal restrictions mean we are not able to share the dashboard or underlying datasets externally; however, we have made the Python code we used to gather the citation data available as a Jupyter notebook on Google Colab. This can be used to retrieve the dataset we used for the citation analysis and is easily modifiable for other purposes (see the notebook for full details). We refer the interested reader to the Dimensions API Lab and the ESAC guide to uncovering the publishing profile of your institution. The former was helpful for learning how to take advantage of the Dimensions API (as were the staff at Dimensions), and the latter was useful in formulating our approach to the whole project. We are also happy to answer questions about any aspect of our work.

* The original percentage quoted here was 18%. This was incorrect and has now been corrected to 32%.

Data sharing and reuse case study: the Mammographic Image Society database

The Mammographic Image Society (MIAS) database is a set of mammograms put together in 1992 by a consortium of UK academic institutions and archived on 8mm DAT tape, copies of which were made openly available and posted to applicants for a small administration fee. The mammograms themselves were curated from the UK National Breast Screening Programme, a major screening programme established in the late 1980s that offered routine screening every three years to women aged between 50 and 64.

The motivations for creating the database were to make a practical contribution to computer vision research – which sought to improve the ability of computers to interpret images – and to encourage the creation of more extensive datasets. In the peer-reviewed paper bundled with the dataset, the researchers note that “a common database is a positive step towards achieving consistency in performance comparison and testing of algorithms”.

Due to increased demand, the MIAS database was made available online via third parties, albeit at a lower resolution than the original. Despite no longer working in this area of research, the lead author, John Suckling – now Director of Research in the Department of Psychiatry, part of Cambridge Neuroscience – started receiving emails asking for access to the images at the original resolution. This led him to dig out the original 8mm DAT tapes with the intention of making the images openly available at a higher resolution. The tapes were sent to the University Information Services (UIS), who were able to read the original 8mm tape and extract higher-resolution versions of the images. The images were subsequently deposited in Apollo and made available under a CC BY licence, meaning researchers are permitted to reuse them for further research as long as appropriate credit is given. This is the most commonly used licence for open datasets and is recommended by the majority of research funding agencies.

Motivations for sharing the MIAS database openly

The MIAS database was created with open access in mind from the outset. When asked whether he had any reservations about sharing the database openly, the lead author John Suckling noted:

“There are two broad categories of data sharing; data acquired for an original purpose that is later shared for secondary use; data acquired primarily for sharing. This dataset is an example of the latter. Sharing data for secondary use is potentially more problematic especially in consortia where there are a number of continuing interests in using the data locally. However, most datasets are (or should be) superseded, and then value can only be extracted if they are combined to create something greater than the sum of the parts. Here, careful drafting of acknowledgement text can be helpful in ensuring proper credit is given to all contributors.”

This distinction – between data acquired for an original purpose that is later shared for secondary use, and data acquired primarily for sharing – is important and often overlooked. The true value of some data can only be fully realised if it is openly shared. In such cases, as Suckling notes, sufficient documentation can help ensure the original researchers are credited, as well as ensuring the data can be reused effectively. This is also made easier by depositing the data in an institutional repository such as Apollo, where it is given a DOI and its reuse is easier to track.

Impact of the MIAS database

As of August 2020, the MIAS database has received over 5,500 downloads across 27 different countries, including some developing countries where breast cancer survival rates are lower. Google Scholar currently reports over 1,500 citations for the accompanying article, as well as 23 citations for the dataset itself. A review of a sample of those citations revealed that many were examples of the data being reused rather than simply citations of the article. Additionally, a systematic review published in 2018 cited the MIAS database as one of the most widely used for applying breast cancer classification methods in computer-aided diagnosis using machine learning, and a benchmarking review of databases used in mammogram research identified it as the most easily accessible mammographic image database. The reasons cited included the quality of the images, the wide coverage of types of abnormalities, and the supporting data, which provides the specific locations of the abnormalities in each image.

The high impact of the MIAS database is something Suckling credits to the open, unrestricted access that has been offered since it was first created. When asked whether he has benefited from this personally, Suckling stated: “Direct benefits have only been the citations of the primary article (on which I am first author). However, considerable efforts were made by a large number of early-career researchers using complex technologies and digital infrastructure that was in its infancy, and it is extremely gratifying to know that this work has had such an impact for such a large number of scientists.” Given that the database continues to be widely cited and has been downloaded from Apollo 1,358 times since January 2020, it is clearly still having a wide impact.

The MIAS Database Reused

As mentioned above, the MIAS database has been widely reused by researchers working in the field of medical image analysis. While originally intended for use in computer vision research, one of the main ways in which the dataset has been used is in the area of computer aided diagnosis (CAD), for which researchers have used the mammographic images to experiment with and train deep learning algorithms. CAD aims to augment manual inspection of medical images by medical professionals in order to increase the probability of making an accurate diagnosis.

A 2019 review of recent developments in medical image analysis identified a lack of good quality data as one of the main barriers researchers in this area face. Not only is good quality data a necessity, it must also be well documented: the same review identified inappropriately annotated datasets as a core challenge in CAD. The MIAS database is accompanied by a peer-reviewed paper explaining its creation and content, as well as a readme PDF that explains the file naming convention used for the images and the annotations used to indicate the presence of any abnormalities and classify them by severity. This extensive documentation, combined with the database having been openly available from the outset, may explain why it continues to be so widely used.
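Documentation of this kind pays off directly in code: a well-specified annotation format can be parsed in a few lines. The sketch below parses annotation lines of the general shape described in the MIAS readme (image id, background tissue, abnormality class, then severity and the abnormality's centre coordinates and approximate radius for abnormal images); treat the exact field order and the example lines as assumptions and verify them against the readme bundled with the dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MiasRecord:
    image_id: str
    tissue: str                      # background tissue code, e.g. F/G/D
    abnormality: str                 # abnormality class, e.g. CIRC, NORM
    severity: Optional[str] = None   # B (benign) or M (malignant)
    x: Optional[int] = None          # lesion centre, pixels
    y: Optional[int] = None
    radius: Optional[int] = None     # approximate lesion radius, pixels

def parse_line(line: str) -> MiasRecord:
    """Parse one whitespace-separated annotation line (layout assumed)."""
    parts = line.split()
    rec = MiasRecord(parts[0], parts[1], parts[2])
    if len(parts) >= 7:  # abnormal image with a located lesion
        rec.severity = parts[3]
        rec.x, rec.y, rec.radius = (int(p) for p in parts[4:7])
    return rec

records = [parse_line(l) for l in [
    "mdb001 G CIRC B 535 425 197",   # hypothetical example lines
    "mdb003 D NORM",
]]
print(records[0])
```

Parsed this way, the annotations give a machine learning pipeline both its labels (severity) and its lesion locations for free, which is much of why the dataset is so convenient to reuse.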

Reuse example: Applying Deep Learning for the Detection of Abnormalities in Mammograms

This research, published in 2019 in Information Science and Applications, looked at improving some of the current methods used in CAD, attempting to address some inherent shortcomings of deep learning models and, in particular, to minimise false positives when applying CAD to mammographic imaging. The researchers used the MIAS database alongside another, larger dataset to evaluate the performance of two existing convolutional neural networks (CNNs), deep learning models used specifically for classifying images. Using these datasets, they were able to demonstrate that versions of two prominent CNNs could detect and classify the severity of abnormalities in the mammographic images with a high degree of accuracy.

While the researchers were able to make good use of the MIAS database to carry out their experiments, thanks to its appropriate documentation and labelling, they note that since it is a relatively small dataset it is not possible to rule out “overfitting”, where a deep learning model is highly accurate on the data used to train it but may not generalise well to other datasets. This highlights the importance of making such data openly available: the accuracy of CAD can only improve if sufficient data is available for researchers to carry out further experiments and refine their models.

Reuse example: Computer aided diagnosis system for automatic two stages classification of breast mass in digital mammogram images

This research, published in 2019 in Biomedical Engineering: Applications, Basis and Communications, used the MIAS database along with the Breast Cancer Digital Repository to test a CAD system based on a probabilistic neural network – a machine learning model that predicts the probability distribution of a given outcome – developed to automate the classification of breast masses in mammographic images. Unlike previously developed models, their model was able to segment and then carry out a two-stage classification of breast masses: rather than classifying masses as either benign or malignant, they developed a system that carried out a more fine-grained classification into seven different categories. Combining the two databases increased the confidence level in the results gained from their model, again underlining the importance of the open sharing of mammographic image datasets. After testing their model on images from these databases, they were able to demonstrate a significantly higher level of accuracy at detecting abnormalities than two similar models used for evaluation: on images from the MIAS database and the Breast Cancer Digital Repository, their model detected abnormalities with an accuracy of 99.8% and 97.08%, respectively. This was accompanied by increased sensitivity (the ability to correctly identify true positives) and specificity (the ability to correctly identify true negatives).
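For readers unfamiliar with these metrics, the sketch below computes accuracy, sensitivity and specificity from a binary confusion matrix. The counts are made up for illustration, not figures from the paper.

```python
# Made-up confusion-matrix counts for a binary benign/malignant classifier.
tp, fn = 90, 10   # malignant cases: correctly / incorrectly classified
tn, fp = 85, 15   # benign cases: correctly / incorrectly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
sensitivity = tp / (tp + fn)                 # true positive rate: malignant cases found
specificity = tn / (tn + fp)                 # true negative rate: benign cases cleared

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

High sensitivity matters most in screening (missing a malignancy is costly), while high specificity keeps the false-positive rate, and hence unnecessary follow-up, down.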

Conclusion

Many areas of research can only move forward if sufficient data is available and shared openly. As we have seen, this is particularly true in medical imaging, where despite datasets such as the MIAS database being openly available, a data deficiency still needs to be addressed in order to improve the accuracy of the models used in computer-aided diagnosis. The MIAS database is a clear example of a dataset that has enabled an important area of research to move forward by allowing researchers to carry out experiments and improve the accuracy of deep learning models developed for computer-aided diagnosis in medical imaging. The sharing and reuse of the MIAS database provides an excellent model for how and why future researchers should make their data openly available.

Published 20th August 2020
Written by Dominic Dixon
