Cambridge Data Week 2020 day 5: How do we peer review data? New sustainable and effective models

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 4 blog


Cambridge Data Week 2020 concluded on 27 November with a discussion between Dr Lauren Cadwallader (PLOS), Professor Stephen Eglen (University of Cambridge) and Kiera McNeice (Cambridge University Press) on models of data peer review. The peer review process around data is still emerging despite the increase in data sharing. This session explored how peer review of data could be approached from both a publishing and a research perspective. 

The discussion focused on three main questions and here are a few snippets of what was said. If you’d like to explore the speakers’ answers in full, see the recording and transcript below.  

Why is it important to peer review datasets?

Are we in a post-truth world where claims can be made without needing to back them up? What if data could replace articles as the main output of research? What key criteria should peer review adopt?

Figure 1: Word cloud created by the audience in response to “Why is it important to peer review datasets?” The four most prominent words are integrity, quality, trust and reproducibility.

How should data review be done?

Can we drive the spread of Open Data by initially setting an incredibly low bar, encouraging everyone to share data even in its messy state? Are we reviewing to ensure reusability, or do we want to go further and check quality and reproducibility? Is data review a one-off event, or a continuous process involving everyone who reuses the data?

Who should be doing the work?

Are journals exclusively responsible for data review, or should authors, repository managers and other organisations be involved? Where will the money come from? What’s in it for researchers who volunteer as data reviewers? How do we introduce the peer review of data in a fair and equitable way?

Watch the session 

The video recording of the webinar can be found below, and the transcript is available in Apollo, the University of Cambridge repository.

Bonus material 

After the end of the session, Lauren, Kiera and Stephen continued the discussion, prompted by a question from the audience about whether there should be some form of template or checklist for peer reviewing code. Here is what they said. 

Lauren Cadwallader  That’s an interesting idea, though of course code is written for different reasons: software, analysis, figures, and so on. Inevitably there will be different ways of reviewing it. Stephen, can you tell us more about your experience with CODECHECK?

Stephen Eglen At CODECHECK we have a process to help codecheckers run research code and award a “certificate of executable computation”, like this example of a report. If doing nothing else, then copying whatever files you’ve got onto some repository, dirty and unstructured as that might seem, is still gold dust to the next researcher that comes along. Initially we can set the standards low, and from there we can come up with a whole range of more advanced quality checks. One question is ‘what are researchers willing to accept?’ I know of a couple of pilots that tried requiring more work from researchers in preparing and checking their files and code, such as the Code Ocean pilot that Kiera mentioned. I think that we have a community that understands the importance of this and is willing to put in some effort.

Kiera McNeice There’s value in having checklists that are not extremely specialised, but tailored somewhat towards different subject areas. For instance, the American Journal of Political Science has two separate checklists, one for quantitative data and one for qualitative data. Certainly, some of our HSS editors have been saying that some policies developed for quantitative data do not work for their authors.  

Lauren Cadwallader  It might be easy to start with places where there are communities that are already engaged and have a framework for data sharing, so the peer review system would check that. What do you think? 

Kiera McNeice I guess there is a ‘chicken and egg’ issue: does this have to be driven from the top down, from publishers and funders, or does it come from the bottom up, with research communities initiating it? As journals, there is a concern that if we try to enforce very strict standards, then people will take their publications elsewhere. If there is no desire from the community for these changes, publisher enforcement can only go so far.  

Stephen Eglen Funders have an important role to play too. If they lead on this, researchers will follow because ultimately researchers are focused on their career. Unless there is recognition that doing this is a valuable part of one’s work, it will be hard to convince the majority of researchers to spend time on it.

Take a pilot I was involved in with Nature Neuroscience. Originally this was meant to be a mandatory peer review of code after acceptance in principle, but in the end fears about driving away authors meant it was only made optional. Throughout a six-month trial, I was only aware of two papers that went through code review. I can see the barriers for both journal and authors, but if researchers received credit for doing it, this sort of thing would come from the bottom up.

Lauren Cadwallader  In our biology-based model review pilot we ran a survey and found that many people opted in because they believe in open science, reproducibility, and so on, but two people opted in because they feared PLOS would think they had something to hide if they didn’t. That’s not at all what it was about. Although I suppose if it gets people sharing data… 


We were intrigued by many of the ideas put forward by the speakers, particularly the areas of tension that will need to be resolved. For instance, as we try to move from a world where most data remains in people’s laptops and drawers to a FAIR data world, even sharing simple, messy, unstructured data is ‘gold dust’. Yet ultimately, we want data to be shared with extensive metadata and in an easily accessible form. What should the initial standards be, and how should they be raised over time? And how about the idea of asking Early Career Researchers to take on reviewer roles? Certainly they (and their research communities) would benefit in many ways from such involvement, but will they be able to fit this in their packed schedules?  

The audience engaged in lively discussion throughout the session, especially around the use of repositories, the need for training, and disciplinary differences. At the end of the session, they surprised us all with their responses to our poll: “Which peer review model would work best for data?”. The most common response was “Incorporate it into the existing review of the article”, an option that had hardly been mentioned in the session. Perhaps we’ll need another webinar exploring this avenue next year!

Figure 2: Audience responses to the poll “Which peer review model would work best for data?”, held at the end of the event


Alexandra Freeman’s Octopus project aims to change the way we report research. Read the Octopus blog and an interview with Alex to find out more.  

Publish your computer code: it is good enough, a column by Nick Barnes in Nature in 2010 arguing that sharing code, whatever the quality, is more helpful than keeping it in a drawer.  

The Center for Reproducible Biomedical Modelling has been working with PLOS on a pilot about reviewing models.  

PLOS guidelines on peer-reviewing data were produced in collaboration with the Cambridge Data Champions.

CODECHECK, led by Stephen Eglen, runs code to offer a “certificate of reproducible computation” to document that core research outputs could be recreated outside of the authors’ lab. 

Code Ocean is a platform for computational research that creates web-based capsules to help enable reproducibility.  

Editorial on the pilot for peer reviewing biology-based models in PLOS Computational Biology.

Published on 25 January 2021

Written by Beatrice Gini

CCBY icon

Research Data at Cambridge – highlights of the year so far

By Dr Sacha Jones, Research Data Coordinator

This year we have continued, as always, to provide support and services for researchers to help with their research data management and open data practices. So far in 2020, we have approved more than 230 datasets into our institutional repository, Apollo. This includes Apollo’s 2000th dataset on the impact of health warning labels on snack selection, which represents a shining example of reproducible research, involving the full gamut: preregistration, and sharing of consent forms, code, protocols and data. There are other studies that have sparked media interest for which the data are also openly available in Apollo, such as the data supporting research that reports the development of a wireless device that can convert sunlight, carbon dioxide and water into a carbon-neutral fuel. Another is the data supporting a study that used computational modelling to explain why blues and greens are the brightest colours in nature. Also, in the year of COVID, a dataset was published in April on the ability of common fabrics to filter ultrafine particles, associated with an article in BMJ Open. Sharing data associated with publications is critical for the integrity of many disciplines and best practice in the majority of studies, but there is also an important responsibility of science communication in particular to bring research datasets to the forefront. This point was discussed eloquently this summer in a guest blog post in Unlocking Research by Itamar Shatz, a researcher and Cambridge Data Champion. Making datasets open permits their reuse, and if you have wondered how research data is reused, then read this comprehensive data sharing and reuse case study written by the Research Data team’s Dominic Dixon. This centres on the use and value of the Mammographic Image Analysis Society database, published in Apollo five years ago.

This year has seen the necessary move from our usual face-to-face Research Data Management (RDM) training to provision of training online. This has led us to produce an online training session in RDM, covering topics such as data organisation, storage, backup and sharing, as well as data management plans. This forms one component of a broader Research Skills Guide – an online course for Cambridge researchers on publishing, managing data, finding and disseminating research – developed by Dr Bea Gini, the OSC’s training coordinator. We have also contributed to a ‘Managing your study resources’ CamGuide for Master’s students, providing guidance on how to work reproducibly. Last month, in collaboration with several University stakeholders, we released new guidance on the use of electronic research notebooks (ERNs), providing information on the features of ERNs and guidance to help researchers select one that is suitable.

At the start of this year we invited members of the University to apply to become Data Champions, joining the pre-existing community of 72 Data Champions. The 2020 call was very successful, with us welcoming 56 new Data Champions to the programme. The community has expanded this year, not only in terms of numbers of volunteers but also in terms of disciplinary focus: there are now Data Champions in several areas of the arts, humanities and social sciences where there were none previously. During this year, we have held forums in person and then online, covering themes such as how to curate manual research records, ideas for RDM guidance materials, data management in the time of coronavirus, and data practices in the arts and humanities and how these can be best supported. We look forward to further supporting and advocating the fantastic work of the Cambridge Data Champions in the months and years to come.

Data sharing and reuse case study: the Mammographic Image Analysis Society database

The Mammographic Image Analysis Society (MIAS) database is a set of mammograms put together in 1992 by a consortium of UK academic institutions and archived on 8mm DAT tape, copies of which were made openly available and posted to applicants for a small administration fee. The mammograms themselves were curated from the UK National Breast Screening Programme, a major screening programme that was established in the late 1980s, offering routine screening every three years to women aged between 50 and 64.

The motivations for creating the database were to make a practical contribution to computer vision research – which sought to improve the ability of computers to interpret images – and to encourage the creation of more extensive datasets. In the peer-reviewed paper bundled with the dataset, the researchers note that “a common database is a positive step towards achieving consistency in performance comparison and testing of algorithms”.

Due to increased demand, the MIAS database was made available online via third parties, albeit in a lower resolution than the original. Despite no longer working in this area of research, the lead author, John Suckling – now Director of Research in the Department of Psychiatry, part of Cambridge Neuroscience –  started receiving emails asking for access to the images at the original resolution. This led him to dig out the original 8mm DAT tapes with the intention of making the images available openly in a higher resolution. The tapes were sent to the University Information Service (UIS), who were able to access the original 8mm tape and download higher resolution versions of the images. The images were subsequently deposited in Apollo and made available under a CC BY license, meaning researchers are permitted to reuse them for further research as long as appropriate credit is given. This is the most commonly used license for open datasets and is recommended by the majority of research funding agencies.

Motivations for sharing the MIAS database openly

The MIAS database was created with open access in mind from the outset. When asked whether he had any reservations about sharing the database openly, the lead author John Suckling noted:

“There are two broad categories of data sharing; data acquired for an original purpose that is later shared for secondary use; data acquired primarily for sharing. This dataset is an example of the latter. Sharing data for secondary use is potentially more problematic especially in consortia where there are a number of continuing interests in using the data locally. However, most datasets are (or should be) superseded, and then value can only be extracted if they are combined to create something greater than the sum of the parts. Here, careful drafting of acknowledgement text can be helpful in ensuring proper credit is given to all contributors.”

This distinction – between data acquired for an original purpose that is later shared for secondary use and data acquired primarily for sharing – is one that is important and often overlooked. The true value of some data can only be fully realised if openly shared. In such cases, as Suckling notes, sufficient documentation can help ensure the original researchers are given credit where it is due, as well as ensuring it can be reused effectively. This is also made possible by depositing the data on an institutional repository such as Apollo, where it will be given a DOI and its reuse will be easier to track.

Impact of the MIAS database

As of August 2020, the MIAS database has received over 5500 downloads across 27 different countries, including some developing countries where breast cancer survival rates are lower. Google Scholar currently reports over 1500 citations for the accompanying article as well as 23 citations for the dataset itself. A review of a sample of the 1500 citations revealed that many were examples of the data being reused rather than simply citations of the article. Additionally, a systematic review published in 2018 cited the MIAS database as one of the most widely used for applying breast cancer classification methods in computer aided diagnosis using machine learning, and a benchmarking review of databases used in mammogram research identified it as the most easily accessible mammographic image database. The reasons cited for this included the quality of the images, the wide coverage of types of abnormalities, and the supporting data which provides the specific locations of the abnormalities in each image.

The high impact of the MIAS database is something Suckling credits to the open, unrestricted access to the database, which has been the case since it was first created. When asked whether he has benefited from this personally, Suckling stated: “Direct benefits have only been the citations of the primary article (on which I am first author). However, considerable efforts were made by a large number of early-career researchers using complex technologies and digital infrastructure that was in its infancy, and it is extremely gratifying to know that this work has had such an impact for such a large number of scientists.” Given that the database continues to be widely cited and has been downloaded from Apollo 1358 times since January 2020, the MIAS database is clearly still having a wide impact.

The MIAS Database Reused

As mentioned above, the MIAS database has been widely reused by researchers working in the field of medical image analysis. While originally intended for use in computer vision research, one of the main ways in which the dataset has been used is in the area of computer aided diagnosis (CAD), for which researchers have used the mammographic images to experiment with and train deep learning algorithms. CAD aims to augment manual inspection of medical images by medical professionals in order to increase the probability of making an accurate diagnosis.

A 2019 review of recent developments in medical image analysis identified lack of good quality data as one of the main barriers researchers in this area face. Not only is good quality data a necessity, but it must also be well documented, as this review also identified inappropriately annotated datasets as a core challenge in CAD. The MIAS database is accompanied by a peer-reviewed paper explaining its creation and content, as well as a README PDF. The README explains the file naming convention used for the images and the annotations used to indicate the presence of any abnormalities and classify them based on their severity. The presence of this extensive documentation, combined with it having been openly available from the outset, could explain why the database continues to be so widely used.

Reuse example: Applying Deep Learning for the Detection of Abnormalities in Mammograms

This research, published in 2019 in Information Science and Applications, looked at improving some of the current methods used in CAD. It attempted to address some inherent shortcomings and increase the competency level of deep learning models when it comes to the minimisation of false positives when applying CAD to mammographic imaging. The researchers used the MIAS database alongside another larger dataset in order to evaluate the performance of two existing convolutional neural networks (CNNs), which are deep learning models used specifically for classifying images. Using these datasets, they were able to demonstrate that versions of two prominent CNNs were able to detect and classify the severity of abnormalities on the mammographic images with a high degree of accuracy.

While the researchers were able to make good use of the MIAS database to carry out their experiments, due to the inclusion of appropriate documentation and labelling, they do note that since it is a relatively small dataset it is not possible to rule out “overfitting”, where a deep learning model is highly accurate on the data used to train the model, but may not generalise well to other datasets. This highlights the importance of making such data openly available as it is only possible to improve the accuracy of CAD if sufficient data is available for researchers to carry out further experiments and improve the accuracy of their models.
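To make the idea of overfitting concrete, here is a minimal toy sketch in Python. It is not the method used in the paper: it swaps the deep learning model for a one-nearest-neighbour classifier, and every number in it is hypothetical. The model simply memorises a small training set (including two mislabelled points), so it scores perfectly on the data it was trained on but poorly on new data.

```python
def nearest_neighbour_predict(train, x):
    """Predict the label of x by copying the label of the closest training point."""
    closest = min(train, key=lambda point: abs(point[0] - x))
    return closest[1]


def accuracy(train, data):
    """Fraction of (feature, label) pairs in data that the model labels correctly."""
    correct = sum(1 for x, label in data if nearest_neighbour_predict(train, x) == label)
    return correct / len(data)


# Hypothetical data. The underlying rule is "label 1 when the feature exceeds 5",
# but the small training set contains two noisy labels (at 3.0 and 8.0).
train_set = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0),
             (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
# Unseen points labelled by the underlying rule.
test_set = [(2.9, 0), (3.2, 0), (4.5, 0), (5.5, 1), (7.8, 1), (8.2, 1)]

train_accuracy = accuracy(train_set, train_set)  # the model has memorised these points
test_accuracy = accuracy(train_set, test_set)    # new points expose the memorised noise

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on unseen data:   {test_accuracy:.2f}")
```

The gap between the two scores is the symptom to look for. With a larger and more varied pool of openly shared data, the noisy points would carry less weight and the gap could be measured more reliably.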

Reuse example: Computer aided diagnosis system for automatic two stages classification of breast mass in digital mammogram images

This research, published in 2019 in Biomedical Engineering: Applications, Basis and Communications, used the MIAS database along with the Breast Cancer Digital Repository to test a CAD system based on a probabilistic neural network – a machine learning model that predicts the probability distribution of a given outcome – developed to automate classification of breast masses on mammographic images. Unlike previously developed models, their model was able to segment and then carry out a two-stage classification of breast masses. This meant that rather than classifying masses into either benign or malignant, they were able to develop a system which carried out a more fine-grained classification consisting of seven different categories. Combining the two different databases allowed for an increased confidence level in the results gained from their model, again raising the importance of the open sharing of mammographic image datasets. After testing their model on images from these databases, they were able to demonstrate a significantly higher level of accuracy at detecting abnormalities than had been demonstrated by two similar models used for evaluation. On images from the MIAS Database and Breast Cancer Digital Repository their model was able to detect abnormalities with an accuracy of 99.8% and 97.08%, respectively. This was also accompanied by increased sensitivity (ability to correctly identify true positives) and specificity (ability to correctly identify true negatives).
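For readers unfamiliar with these measures, the following short Python sketch shows how accuracy, sensitivity and specificity are conventionally computed from the four confusion-matrix counts. The function name and the counts are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: abnormal cases correctly flagged     fn: abnormal cases missed
    tn: normal cases correctly cleared       fp: normal cases wrongly flagged
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one case is required")
    return {
        # Share of all cases that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of truly abnormal cases that the model flags.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of truly normal cases that the model clears.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical counts for 100 images: 55 abnormal and 45 normal.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all three figures matters because accuracy alone can look high on a dataset where most images are normal, even if many abnormalities are missed.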


Many areas of research can only move forward if sufficient data is available and if it is shared openly. This, as we have seen, is particularly true in medical imaging where despite datasets such as the MIAS database being openly available, there is a data deficiency which needs to be addressed in order to improve the accuracy of the models used in computer-aided diagnosis. The MIAS database is a clear example of a dataset that has enabled an important area of research to move forward by enabling researchers to carry out experiments and improve the accuracy of deep learning models developed for computer-aided diagnosis in medical imaging. The sharing and reuse of the MIAS database provides an excellent model for how and why future researchers should make their data openly available.

Published 20th August 2020
Written by Dominic Dixon

CCBY icon