This year has seen the necessary move from our usual face-to-face Research Data Management (RDM) training to provision of training online. This has led us to produce an online training session in RDM, covering topics such as data organisation, storage, backup and sharing, as well as data management plans. This forms one component of a broader Research Skills Guide – an online course for Cambridge researchers on publishing, managing data, finding and disseminating research – developed by Dr Bea Gini, the OSC’s training coordinator. We have also contributed to a ‘Managing your study resources’ CamGuide for Master’s students, providing guidance on how to work reproducibly. Last month, in collaboration with several University stakeholders, we released new guidance on the use of electronic research notebooks (ERNs), providing information on the features of ERNs and guidance to help researchers select one that is suitable.
At the start of this year we invited members of the University to apply to become Data Champions, joining the pre-existing community of 72 Data Champions. The 2020 call was very successful, and we welcomed 56 new Data Champions to the programme. The community has expanded this year, not only in terms of numbers of volunteers but also in terms of disciplinary focus: there are now Data Champions in several areas of the arts, humanities and social sciences where there were none previously. During this year, we have held forums in person and then online, covering themes such as how to curate manual research records, ideas for RDM guidance materials, data management in the time of coronavirus, and data practices in the arts and humanities and how these can be best supported. We look forward to further supporting and advocating the fantastic work of the Cambridge Data Champions in the months and years to come.
The Mammographic Image Analysis Society (MIAS) database is a set of mammograms put together in 1992 by a consortium of UK academic institutions and archived on 8mm DAT tape, copies of which were made openly available and posted to applicants for a small administration fee. The mammograms themselves were curated from the UK National Breast Screening Programme, a major screening programme that was established in the late 1980s, offering routine screening every three years to women aged 50 to 64.
The motivations for creating the database were to make a practical contribution to computer vision research – which sought to improve the ability of computers to interpret images – and to encourage the creation of more extensive datasets. In the peer-reviewed paper bundled with the dataset, the researchers note that “a common database is a positive step towards achieving consistency in performance comparison and testing of algorithms”.
Due to increased demand, the MIAS database was made available online via third parties, albeit in a lower resolution than the original. Despite no longer working in this area of research, the lead author, John Suckling – now Director of Research in the Department of Psychiatry, part of Cambridge Neuroscience – started receiving emails asking for access to the images at the original resolution. This led him to dig out the original 8mm DAT tapes with the intention of making the images available openly in a higher resolution. The tapes were sent to University Information Services (UIS), who were able to read the original 8mm tapes and retrieve higher resolution versions of the images. The images were subsequently deposited in Apollo and made available under a CC BY license, meaning researchers are permitted to reuse them for further research as long as appropriate credit is given. This is the most commonly used license for open datasets and is recommended by the majority of research funding agencies.
Motivations for sharing the MIAS database openly
The MIAS database was created with open access in mind from the outset. When asked whether he had any reservations about sharing the database openly, the lead author John Suckling noted:
“There are two broad categories of data sharing; data acquired for an original purpose that is later shared for secondary use; data acquired primarily for sharing. This dataset is an example of the latter. Sharing data for secondary use is potentially more problematic especially in consortia where there are a number of continuing interests in using the data locally. However, most datasets are (or should be) superseded, and then value can only be extracted if they are combined to create something greater than the sum of the parts. Here, careful drafting of acknowledgement text can be helpful in ensuring proper credit is given to all contributors.”
This distinction – between data acquired for an original purpose that is later shared for secondary use and data acquired primarily for sharing – is one that is important and often overlooked. The true value of some data can only be fully realised if openly shared. In such cases, as Suckling notes, sufficient documentation can help ensure the original researchers are given credit where it is due, as well as ensuring it can be reused effectively. This is also made possible by depositing the data on an institutional repository such as Apollo, where it will be given a DOI and its reuse will be easier to track.
Impact of the MIAS database
As of August 2020, the MIAS database has received over 5500 downloads across 27 different countries, including some developing countries where breast cancer survival rates are lower. Google Scholar currently reports over 1500 citations for the accompanying article as well as 23 citations for the dataset itself. A review of a sample of the 1500 citations revealed that many were examples of the data being reused rather than simply citations of the article. Additionally, a systematic review published in 2018 cited the MIAS database as one of the most widely used for applying breast cancer classification methods in computer aided diagnosis using machine learning, and a benchmarking review of databases used in mammogram research identified it as the most easily accessible mammographic image database. The reasons cited for this included the quality of the images, the wide coverage of types of abnormalities, and the supporting data which provides the specific locations of the abnormalities in each image.
The high impact of the MIAS database is something Suckling credits to the open, unrestricted access to the database, which has been the case since it was first created. When asked whether he has benefited from this personally, Suckling stated: “Direct benefits have only been the citations of the primary article (on which I am first author). However, considerable efforts were made by a large number of early-career researchers using complex technologies and digital infrastructure that was in its infancy, and it is extremely gratifying to know that this work has had such an impact for such a large number of scientists.” Given that the database continues to be widely cited and has been downloaded from Apollo 1358 times since January 2020, it is clear that the MIAS database is still having a wide impact.
The MIAS Database Reused
As mentioned above, the MIAS database has been widely reused by researchers working in the field of medical image analysis. While originally intended for use in computer vision research, one of the main ways in which the dataset has been used is in the area of computer aided diagnosis (CAD), for which researchers have used the mammographic images to experiment with and train deep learning algorithms. CAD aims to augment manual inspection of medical images by medical professionals in order to increase the probability of making an accurate diagnosis.
A 2019 review of recent developments in medical image analysis identified lack of good quality data as one of the main barriers researchers in this area face. Not only is good quality data a necessity but it must also be well documented as this review also identified inappropriately annotated datasets as a core challenge in CAD. The MIAS database is accompanied by a peer-reviewed paper explaining its creation and content as well as a read me PDF which explains the file naming convention used for the images as well as the annotations used to indicate the presence of any abnormalities and classify them based on their severity. The presence of this extensive documentation combined with it having been openly available from the outset could explain why the database continues to be so widely used.
Reuse example: Applying Deep Learning for the Detection of Abnormalities in Mammograms
This research, published in 2019 in Information Science and Applications, looked at improving some of the current methods used in CAD. It attempted to address some inherent shortcomings and to make deep learning models better at minimising false positives when CAD is applied to mammographic imaging. The researchers used the MIAS database alongside another larger dataset in order to evaluate the performance of two existing convolutional neural networks (CNNs), which are deep learning models commonly used for classifying images. Using these datasets, they were able to demonstrate that versions of two prominent CNNs were able to detect and classify the severity of abnormalities on the mammographic images with a high degree of accuracy.
While the researchers were able to make good use of the MIAS database to carry out their experiments, due to the inclusion of appropriate documentation and labelling, they do note that since it is a relatively small dataset it is not possible to rule out “overfitting”, where a deep learning model is highly accurate on the data used to train the model, but may not generalise well to other datasets. This highlights the importance of making such data openly available as it is only possible to improve the accuracy of CAD if sufficient data is available for researchers to carry out further experiments and improve the accuracy of their models.
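The overfitting effect described above can be illustrated with a deliberately simple sketch. This is not the researchers' CNN or their data: it assumes a hypothetical one-nearest-neighbour classifier and synthetic, noisily labelled points (all function names and parameters here are invented for illustration). Because the model memorises its small training set, it scores perfectly on that set, while its accuracy on fresh data from the same source is typically noticeably lower.

```python
import random


def make_points(rng, n, noise=0.3):
    """Synthetic data: x in [0, 1), label is x > 0.5, flipped with probability `noise`."""
    points = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label  # simulate labelling noise
        points.append((x, label))
    return points


def predict_1nn(train, x):
    """Return the label of the training point closest to x (memorisation)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of `data` whose label matches the 1-NN prediction."""
    correct = sum(predict_1nn(train, x) == label for x, label in data)
    return correct / len(data)


rng = random.Random(0)             # fixed seed so the run is repeatable
train = make_points(rng, 30)       # small training set, as with a small dataset
held_out = make_points(rng, 300)   # fresh data the model has never seen

train_acc = accuracy(train, train)
held_out_acc = accuracy(train, held_out)
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

The gap between the two figures is what a small dataset cannot rule out on its own, which is why evaluating on additional, independently collected datasets matters.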
Reuse example: Computer aided diagnosis system for automatic two stages classification of breast mass in digital mammogram images
This research, published in 2019 in Biomedical Engineering: Applications, Basis and Communications, used the MIAS database along with the Breast Cancer Digital Repository to test a CAD system based on a probabilistic neural network – a machine learning model that predicts the probability distribution of a given outcome – developed to automate classification of breast masses on mammographic images. Unlike previously developed models, their model was able to segment and then carry out a two-stage classification of breast masses. This meant that rather than classifying masses into either benign or malignant, they were able to develop a system which carried out a more fine-grained classification consisting of seven different categories. Combining the two different databases allowed for an increased confidence level in the results gained from their model, again underlining the importance of the open sharing of mammographic image datasets. After testing their model on images from these databases, they were able to demonstrate a significantly higher level of accuracy at detecting abnormalities than had been demonstrated by two similar models used for evaluation. On images from the MIAS Database and Breast Cancer Digital Repository their model was able to detect abnormalities with an accuracy of 99.8% and 97.08%, respectively. This was also accompanied by increased sensitivity (the ability to correctly identify true positives) and specificity (the ability to correctly identify true negatives).
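For readers unfamiliar with these evaluation measures, the following minimal sketch shows how accuracy, sensitivity and specificity are conventionally computed from the four outcome counts of a binary classifier. The counts used are hypothetical and are not taken from the paper; the class and function names are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary classifier (abnormal = positive)."""
    true_pos: int   # abnormal images correctly flagged
    false_pos: int  # normal images wrongly flagged
    true_neg: int   # normal images correctly passed
    false_neg: int  # abnormal images missed


def sensitivity(c: ConfusionCounts) -> float:
    """Share of actual positives that were detected."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Share of actual negatives that were correctly passed."""
    return c.true_neg / (c.true_neg + c.false_pos)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all images that were classified correctly."""
    total = c.true_pos + c.false_pos + c.true_neg + c.false_neg
    return (c.true_pos + c.true_neg) / total


# Hypothetical counts for illustration only (not results from the study).
counts = ConfusionCounts(true_pos=45, false_pos=5, true_neg=40, false_neg=10)
print(f"sensitivity: {sensitivity(counts):.3f}")
print(f"specificity: {specificity(counts):.3f}")
print(f"accuracy:    {accuracy(counts):.3f}")
```

Reporting all three together is useful because a high accuracy alone can hide a model that misses abnormal cases (low sensitivity) or raises many false alarms (low specificity).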
Many areas of research can only move forward if sufficient data is available and if it is shared openly. This, as we have seen, is particularly true in medical imaging where despite datasets such as the MIAS database being openly available, there is a data deficiency which needs to be addressed in order to improve the accuracy of the models used in computer-aided diagnosis. The MIAS database is a clear example of a dataset that has enabled an important area of research to move forward by enabling researchers to carry out experiments and improve the accuracy of deep learning models developed for computer-aided diagnosis in medical imaging. The sharing and reuse of the MIAS database provides an excellent model for how and why future researchers should make their data openly available.
Published 20th August 2020 Written by Dominic Dixon
The Cambridge Data Champions are an example of a community of volunteers engaged in promoting open research and good research data management (RDM). Currently entering its third year, the programme has attracted a total of 127 volunteers (86 current, 41 alumni) from diverse disciplinary backgrounds and positions. It continues to grow and has inspired similar initiatives at other universities within and outside the UK (Madsen, 2019). Dr Sacha Jones, Research Data Coordinator at the Office of Scholarly Communication, recently shared information about the programme at ‘FAIR Science: tricky problems and creative solutions’, an Open Science event held on 4th June 2019 at The Queen’s Medical Research Institute in Edinburgh, and organised by a previous Cambridge Data Champion – Dr Ralitsa Madsen. The aim of this event was to disseminate information about Open Science and promote the subsequent set-up of a network of Edinburgh Open Research Champions, with inspiration from the Cambridge Data Champion programme. Running a Data Champion programme, however, is not free of challenges. In this blog, Sacha highlights some of these alongside potential solutions in the hope that this information may be helpful to others. In this vein, Ralitsa adds her insights from ‘FAIR Science’ in Edinburgh and discusses how similar local events may spearhead the development of additional Open Science programmes/networks, thus broadening the local reach of this movement in the UK and beyond.
On 4 June 2019, the University of Edinburgh hosted ‘FAIR Science: tricky problems and creative solutions’ – a one-day event that brought together local life scientists and research support staff to discuss systemic flaws within current academic culture as well as potential solutions. Funded by the Institute for Academic Development and the UK Biochemical Society, the event was popular, with around 100 attendees including students, postdocs, principal investigators (PIs) and administrative staff. The programme featured talks by a range of speakers – Dr Ralitsa Madsen (postdoctoral fellow and event organiser), Dr William Cawthorn (junior PI), Prof Robert Semple (Dean of Postgraduate Research and senior PI), Prof Malcolm Macleod (senior PI and member of the UK Reproducibility Network steering group), Prof Andrew Millar (senior PI and Chief Scientific Advisor on Environment, Natural Resources and Agriculture, for Scottish Government), Aki MacFarlene (Wellcome Trust Open Research Programme Officer), Dr Naomi Penfold (Associate Director, ASAPbio), Dr Nigel Goddard and Rory Macneil (RSpace developers), Robin Rice (Research Data Service, University of Edinburgh) and Dr Sacha Jones (University of Cambridge). All slides have been made available via the Open Science Framework, and “live” tweets can be found via #FAIRScienceEDI.
Why is open science important? What is the extent of the reproducibility problem in science, and what are the responsibilities of individual stakeholders? Do all researchers need to engage with open research? Are the right metrics used when assessing researchers for appointment, promotion and funding? What are the barriers to widespread change, and can they be overcome through collective efforts? These were some of the ‘tricky’ problems that were addressed during the first half of the ‘FAIR Science’ event, with the second half focussing on ‘creative solutions’, including: abandoning the journal impact factor in favour of alternative and fairer assessment criteria such as those proposed in DORA; preprinting of scientific articles and pre-registration of individual studies; new incentives introduced by funders like the Wellcome Trust who seek to promote Open Science; and data management tools such as electronic lab notebooks. Finally, the event sought to inspire local efforts in Edinburgh to establish a volunteer-driven network of Open Research Champions by providing insight into the maturing Data Champion programme at the University of Cambridge. This was a popular ‘creative solution’, with more than 20 attendees providing their contact details to receive additional information about Open Science and the set-up of a local network.
Overall, community engagement was a recurring theme during the ‘FAIR Science’ event, recognised as a catalyst required for research culture to change direction toward open practices and better science. Robert Semple discussed this in the greatest detail, suggesting that early stage researchers – PhDs and post-docs – are the building blocks of such a community, supported also by senior academics who have a responsibility to use their positions (e.g. as group leaders, editors) to promote open science. “Open Science is a responsibility also of individual groups and scientists, and grass roots efforts will be key to culture shift” (Robert Semple’s presentation). On a larger scale, Aki MacFarlene aptly stated that a supportive research ecosystem is needed to support open research; for example, where institutions as well as funders recognise and reward open practices.
Insights from the Cambridge Data Champion programme
The Data Champions at the University of Cambridge are an example of a community and a source of support for others in the research ecosystem. Promoting good RDM and the FAIR principles are two fundamental goals that Data Champions commit to when they join the programme. For some, endorsing open research practices is a fortuitous by-product of being part of the programme, yet for others, this is a key motivation for joining.
Now that the Data Champion programme has been running for three years, what challenges does it face, and might disclosing these here – alongside ongoing efforts to solve them – help others to establish and maintain similar initiatives elsewhere?
Four main challenges that the programme either has experienced or continues to experience are outlined below. They are discussed in order of increasing difficulty to overcome.
(See also a recent article about the Data Champion programme by James Savage and Lauren Cadwallader.)
At a basic level, an initiative like the Data Champion programme needs both financial and institutional support. The Data Champions commit their time on a voluntary basis, yet the management of the programme, its regular events and occasional ad hoc projects all require funds. Currently, the programme is secure, but we continue to seek funding opportunities to support a community that is both expanding and deserving of reward (e.g. small grants awarded to Data Champions to support their ‘championing’ activities). Institutional support is already in place and hopefully this will continue to consolidate and grow now that the University has publicly committed to supporting open research.
Not all Data Champions who join will remain Data Champions. In fact, there is a growing community of alumni Data Champions, currently numbering 41. From the feedback provided by just over half of these, 68% left the programme because they left the University of Cambridge (as expected given that the majority of Data Champions are either post-docs or PhD students), and 32% left because of a lack of time to commit to the role. Of course, there might be other reasons that we are not aware of, and we cannot speculate here in the absence of data. Feedback from Data Champions is actively sought and is an essential part of sustaining and developing this type of community.
We are exploring various methods to enhance retention. To combat the pressures of individuals’ workloads, we are being transparent about the time that certain activities will involve – a task or process may be less overwhelming when a time estimate is provided (cf ‘this survey should take approximately ten minutes to complete’). We also initiated peer-mentoring amongst Data Champions this year, in part to encourage a stronger community. We are attempting to enhance networking within the community in other ways, during group discussion sessions in the bimonthly forums, and via a virtual space where Data Champions can view each other’s data-related specialisms – with mutual support and collaboration as intended by-products. These are just a few examples, and given that Data Champions are volunteers, retention is one of several aspects of the programme that requires frequent assessment.
Cambridge has six Schools – Arts and Humanities, Humanities and Social Sciences, Biological Sciences, Physical Sciences, Clinical Medicine, and Technology – with faculties, departments, centres, units and institutes nested within these. The ideal situation would be for each research community (e.g. a department) to be supported by at least one Data Champion. Currently this is not the case, and the distribution of Data Champions across the different disciplinary areas is patchy. Biological Sciences is relatively well-represented by Data Champions (there are 22 Data Champions to represent around 1742 researchers in the School, i.e. 1.3%) (see bar chart below). There is a clear bias towards STEM (science, technology, engineering and maths) disciplines, yet representation in the social sciences is fair. At the more extreme end is an absence of Data Champions in the Arts and Humanities. We are looking to resolve this via a more targeted approach, guided in part by insights gained into researcher needs via the OSC’s training programme for arts, humanities and social sciences researchers.
Determining how well the Data Champion programme is working is a sizeable challenge, as discussed previously. In those research communities represented by Data Champions, do we see improvements in data management, do we see a greater awareness of the FAIR principles, is there a change in research culture toward open research? These aspects are extremely difficult to measure and to assign to cause and effect, with multiple confounding factors to consider. We are working on how best to do this without overloading Data Champions and researchers with too many administrative tasks (e.g. surveys, questionnaires, etc.). Yet, the crux is for there to exist good communication and exchange of information between us (as a unit that is centrally managing the Data Champion programme) and the Data Champions, and between the Data Champions and the researchers who they are reaching out to and working with. We need to be the recipients of this information so that we can characterise the programme’s effectiveness and make improvements. As a start, the bimonthly Data Champion forums are used as an ideal venue to exchange and sound out ideas about best approaches, so that decisions on how to measure the programme’s impact lie also with the Data Champions.
A fifth challenge – recognition and reward
At the ‘FAIR Science’ event, two speakers (Naomi Penfold and Robert Semple) made a plea for those researchers who practise open science to be recognised for this – a change in reward culture is required. In a presentation centred on the misuse of metrics, Will Cawthorn referred to poor mental health in researchers as a result of the pressures of intrinsic but flawed methods of assessment. Understandably, DORA was mentioned multiple times at ‘FAIR Science’, and hopefully, with multiple universities including the University of Cambridge and University of Edinburgh as recent signatories of DORA, this marks the first steps toward a healthier and fairer researcher ecosystem. This may seem rather tangential to the Data Champions, but it is not: 66% of Data Champions, current and alumni, are or have been researchers (e.g. PhDs, post-docs, PIs). Despite the pressures of ‘publish or perish’, they have given precious time voluntarily to be a Data Champion and require recognition for this.
This raises a fifth challenge faced by the programme – how best to reward Data Champions for their contributions? Effectively addressing this may also help, via incentivisation, toward meeting three of the four challenges above – retention, coverage and measurement. While there is no official reward structure in place (see Higman et al. 2017), the benefits of being part of the programme are emphasised (networking opportunities, skills development, online presence as an expert, etc.), and we write to Heads of Departments so that Data Champions are recognised officially for their contributions. Is this enough? Perhaps not. We will address this issue via discussions at the September forum – how would those who are PhD students, post-docs, PIs, librarians, IT managers, data professionals (to name a few of the roles of Data Champions) like to be rewarded? In sharing these thoughts, we can then see what can be done.
Towards growing communities of volunteers
The Cambridge Data Champion programme is one among several UK- and Europe-wide initiatives that seek to promote good RDM and, more generally, Open Science. Their emergence speaks to a wider community interest and engagement in identifying solutions to some of the key issues haunting today’s academic culture (Madsen 2019). While the foundations of a network of Edinburgh Open Research Champions are still being laid, TU Delft in the Netherlands has already got their Data Champion programme up and running with inspiration from Cambridge. Independently, several Universities in the UK have also established their own Open Research groups, many of which are joined together through the recently established UK Reproducibility Network (UKRN) and the associated UK Network of Open Research Working Groups (UK-ORWG). Such integration fosters network crosstalk and is a step in the right direction, giving volunteers a stronger sense of ‘belonging’ while also actively working towards their formal recognition. Network crosstalk allows for beneficial resource sharing through centralised platforms such as the Open Science Framework or through direct knowledge exchange among neighbouring institutions. Following ‘FAIR Science’ in Edinburgh, for example, a meeting to discuss its outcome(s) involved members from Glasgow University’s Library Services (Valerie McCutcheon, Research Information Manager) and the UKRN’s local lead at Aberdeen University (Dr Jessica Butler, Research Fellow, Institute of Applied Health Science). Thus, similar to plans in Aberdeen, the ‘FAIR Science’ organisers are currently working with Edinburgh University’s Research Data Support team to adapt an Open Science survey developed and used at Cardiff University to guide the development of a specific Open Science strategy. 
This reflects the critical requirements for such strategies to be successful – active peer-to-peer engagement and community involvement to ensure that any initiatives match the needs of those who ought to benefit from them.
The long-term success of Open Science strategies – and any associated networks – will also hinge upon incorporation of formal recognition, as alluded to in the context of the Cambridge Data Champion programme. The importance of formal recognition of Open Science volunteers is also exemplified in SPARC Europe’s recent initiative – Europe’s Open Data Champions – which aims to showcase Open Data leaders who help ‘to change the hearts and minds of their peers towards more Openness’.
For formal recognition to gain traction, it will be critical to work towards bringing several prominent senior academics on board the Open Science wagon. By virtue of their academic status, such individuals will be able to put Open Science credentials high on the agenda of funding and academic institutions. Indeed, the establishment of the UKRN can be ascribed to a handful of senior researchers who have been able to secure financial support for this initiative, in addition to inspiring and nucleating local engagement across several UK universities. The ‘FAIR Science’ experience in Edinburgh supports this view. While difficult to prove, its impact would likely have been minimal without the involvement of prominent senior academics, including Professor Robert Semple (Dean of Postgraduate Research), Professor Malcolm Macleod (UKRN steering group member) and Professor Andrew Millar (Chief Scientific Advisor on Environment, Natural Resources and Agriculture, for Scottish Government). Thus, in addition to targeted and continuous communication by the ‘FAIR Science’ organisers before and after the event, ongoing efforts to establish a network of Edinburgh Open Research Champions have been dependent on these senior academics and their ability to mobilise essential forces throughout the University of Edinburgh.
Top-down or bottom-up?
Establishing and maintaining a champions initiative need not be conceived of as succeeding via either a top-down or bottom-up approach. Instead, a combination of the best of both of these approaches is optimal, as hopefully comes across here. The emphasis on such initiatives being community driven is essential, yet structure is also required so as to ensure their maintenance and longevity. Hierarchies have little place in such communities – there are enough of these already in the ‘researcher ecosystem’ – and the beauty of such initiatives is that they bring together people from various contexts (e.g. in terms of role, discipline, institution). In this sense, the Cambridge Data Champions community is especially robust because of its diversity, comprising individuals from highly varied roles and disciplinary backgrounds. Every champion brings their own individual strengths; collectively, this is a powerful resource in terms of knowledge and skills. Through acting on these strengths and acknowledging their responsibilities (e.g. to influence, teach, engage others), and by being part of a community like those described here, champions have the opportunity to make perhaps a wider contribution to research than ever anticipated, and certainly one that enhances its overall integrity.