Tag Archives: Open Research

Data sharing and reuse case study: the Mammographic Image Analysis Society database

The Mammographic Image Analysis Society (MIAS) database is a set of mammograms put together in 1992 by a consortium of UK academic institutions and archived on 8mm DAT tape, copies of which were made openly available and posted to applicants for a small administration fee. The mammograms themselves were curated from the UK National Breast Screening Programme, a major screening programme established in the late 1980s that offers routine screening every three years to women aged 50 to 64.

The motivations for creating the database were to make a practical contribution to computer vision research – which sought to improve the ability of computers to interpret images – and to encourage the creation of more extensive datasets. In the peer-reviewed paper bundled with the dataset, the researchers note that “a common database is a positive step towards achieving consistency in performance comparison and testing of algorithms”.

Due to increased demand, the MIAS database was made available online via third parties, albeit at a lower resolution than the original. Despite no longer working in this area of research, the lead author, John Suckling – now Director of Research in the Department of Psychiatry, part of Cambridge Neuroscience – started receiving emails asking for access to the images at the original resolution. This led him to dig out the original 8mm DAT tapes with the intention of making the images openly available at the higher resolution. The tapes were sent to the University Information Services (UIS), who were able to read the original 8mm tape and retrieve higher resolution versions of the images. The images were subsequently deposited in Apollo and made available under a CC BY license, meaning researchers are permitted to reuse them for further research as long as appropriate credit is given. This is the most commonly used license for open datasets and is recommended by the majority of research funding agencies.

Motivations for sharing the MIAS database openly

The MIAS database was created with open access in mind from the outset. When asked whether he had any reservations about sharing the database openly, the lead author John Suckling noted:

“There are two broad categories of data sharing; data acquired for an original purpose that is later shared for secondary use; data acquired primarily for sharing. This dataset is an example of the latter. Sharing data for secondary use is potentially more problematic especially in consortia where there are a number of continuing interests in using the data locally. However, most datasets are (or should be) superseded, and then value can only be extracted if they are combined to create something greater than the sum of the parts. Here, careful drafting of acknowledgement text can be helpful in ensuring proper credit is given to all contributors.”

This distinction – between data acquired for an original purpose that is later shared for secondary use and data acquired primarily for sharing – is an important one that is often overlooked. The true value of some data can only be fully realised if it is openly shared. In such cases, as Suckling notes, sufficient documentation can help ensure the original researchers are given credit where it is due, as well as ensuring the data can be reused effectively. This is also made possible by depositing the data in an institutional repository such as Apollo, where it will be given a DOI and its reuse will be easier to track.

Impact of the MIAS database

As of August 2020, the MIAS database has received over 5500 downloads across 27 different countries, including some developing countries where breast cancer survival rates are lower. Google Scholar currently reports over 1500 citations for the accompanying article as well as 23 citations for the dataset itself. A review of a sample of the 1500 citations revealed that many were examples of the data being reused rather than simply citations of the article. Additionally, a systematic review published in 2018 cited the MIAS database as one of the most widely used for applying breast cancer classification methods in computer aided diagnosis using machine learning, and a benchmarking review of databases used in mammogram research identified it as the most easily accessible mammographic image database. The reasons cited for this included the quality of the images, the wide coverage of types of abnormalities, and the supporting data which provides the specific locations of the abnormalities in each image.

The high impact of the MIAS database is something Suckling credits to the open, unrestricted access to the database, which has been the case since it was first created. When asked whether he has benefited from this personally, Suckling stated: “Direct benefits have only been the citations of the primary article (on which I am first author). However, considerable efforts were made by a large number of early-career researchers using complex technologies and digital infrastructure that was in its infancy, and it is extremely gratifying to know that this work has had such an impact for such a large number of scientists.” Given that the database continues to be widely cited and has been downloaded from Apollo 1358 times since January 2020, the MIAS database is clearly still having a wide impact.

The MIAS Database Reused

As mentioned above, the MIAS database has been widely reused by researchers working in the field of medical image analysis. While originally intended for use in computer vision research, one of the main ways in which the dataset has been used is in the area of computer aided diagnosis (CAD), for which researchers have used the mammographic images to experiment with and train deep learning algorithms. CAD aims to augment manual inspection of medical images by medical professionals in order to increase the probability of making an accurate diagnosis.

A 2019 review of recent developments in medical image analysis identified a lack of good quality data as one of the main barriers researchers in this area face. Not only is good quality data a necessity, it must also be well documented: the same review identified inappropriately annotated datasets as a core challenge in CAD. The MIAS database is accompanied by a peer-reviewed paper explaining its creation and content, as well as a readme PDF which explains the file naming convention used for the images and the annotations used to indicate the presence of any abnormalities and classify them based on their severity. The presence of this extensive documentation, combined with the database having been openly available from the outset, could explain why it continues to be so widely used.
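To make the annotation scheme more concrete, the short Python sketch below shows one way such annotation lines could be read programmatically. It is an illustration only: the column layout assumed here (image reference number, background tissue, abnormality class, severity, centre coordinates and approximate radius) and the sample line are assumptions made for the example, not a restatement of the official readme.

```python
# A minimal sketch (not the official readme specification) of how MIAS-style
# annotation lines might be parsed. The assumed column layout is: reference
# number, background tissue, abnormality class, severity, x/y centre of the
# abnormality, and approximate radius in pixels.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MammogramRecord:
    ref: str                         # e.g. "mdb001" -> image file name stem
    background: str                  # character code for background tissue type
    abnormality: str                 # class of abnormality, or "NORM" if none
    severity: Optional[str]          # "B" (benign) / "M" (malignant), if present
    centre: Optional[Tuple[int, int]]  # (x, y) location of the abnormality
    radius: Optional[int]            # approximate radius enclosing the abnormality

def parse_line(line: str) -> MammogramRecord:
    parts = line.split()
    ref, background, abnormality = parts[0], parts[1], parts[2]
    severity = parts[3] if len(parts) > 3 else None
    centre = (int(parts[4]), int(parts[5])) if len(parts) > 5 else None
    radius = int(parts[6]) if len(parts) > 6 else None
    return MammogramRecord(ref, background, abnormality, severity, centre, radius)

# Hypothetical annotation line in the assumed format:
print(parse_line("mdb001 G CIRC B 535 425 197"))
```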

Reuse example: Applying Deep Learning for the Detection of Abnormalities in Mammograms

This research, published in 2019 in Information Science and Applications, looked at improving some of the current methods used in CAD, attempting to address some inherent shortcomings and to improve the ability of deep learning models to minimise false positives when CAD is applied to mammographic imaging. The researchers used the MIAS database alongside another, larger dataset in order to evaluate the performance of two existing convolutional neural networks (CNNs), deep learning models used specifically for classifying images. Using these datasets, they were able to demonstrate that versions of two prominent CNNs could detect abnormalities on the mammographic images and classify their severity with a high degree of accuracy.
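For readers less familiar with CNNs, the sketch below (in Python, using the PyTorch library) shows the general shape of a small convolutional image classifier of the kind described above. It is purely illustrative: the layer sizes, input resolution and number of output classes are arbitrary assumptions, and it is not one of the networks evaluated in the paper, which used existing, much larger architectures.

```python
# Illustrative only: a small convolutional classifier of the general kind used
# for image classification in CAD research. Layer sizes, input resolution and
# the number of output classes are arbitrary assumptions.
import torch
import torch.nn as nn

class TinyMammogramCNN(nn.Module):
    def __init__(self, num_classes: int = 3):  # e.g. normal / benign / malignant
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # single-channel (greyscale) input
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 64),  # assumes 224x224 input images
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Forward pass on a dummy batch of four 224x224 greyscale images.
model = TinyMammogramCNN()
logits = model(torch.randn(4, 1, 224, 224))
print(logits.shape)  # torch.Size([4, 3])
```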

While the researchers were able to make good use of the MIAS database to carry out their experiments, thanks to its appropriate documentation and labelling, they note that because it is a relatively small dataset it is not possible to rule out “overfitting”, where a deep learning model is highly accurate on the data used to train it but may not generalise well to other datasets. This highlights the importance of making such data openly available: the accuracy of CAD can only be improved if sufficient data is available for researchers to carry out further experiments and refine their models.

Reuse example: Computer aided diagnosis system for automatic two stages classification of breast mass in digital mammogram images

This research, published in 2019 in Biomedical Engineering: Applications, Basis and Communications, used the MIAS database along with the Breast Cancer Digital Repository to test a CAD system based on a probabilistic neural network – a machine learning model that predicts the probability distribution of a given outcome – developed to automate the classification of breast masses on mammographic images. Unlike previously developed models, their model was able to segment and then carry out a two-stage classification of breast masses. Rather than classifying masses as either benign or malignant, they were able to develop a system that carried out a more fine-grained classification consisting of seven different categories. Combining the two different databases allowed for an increased confidence level in the results gained from their model, again underlining the importance of the open sharing of mammographic image datasets. After testing their model on images from these databases, they were able to demonstrate a significantly higher level of accuracy at detecting abnormalities than had been demonstrated by two similar models used for evaluation. On images from the MIAS database and the Breast Cancer Digital Repository, their model was able to detect abnormalities with an accuracy of 99.8% and 97.08%, respectively. This was also accompanied by increased sensitivity (the ability to correctly identify true positives) and specificity (the ability to correctly identify true negatives).
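For clarity, accuracy, sensitivity and specificity can all be computed from a binary confusion matrix. The short sketch below shows the standard formulas applied to made-up counts; the numbers are purely illustrative and are not results from the paper.

```python
# Standard definitions of the evaluation metrics mentioned above, computed
# from a binary confusion matrix. The counts are made up for illustration.
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp: int, fn: int) -> float:
    # True positive rate: proportion of actual positives correctly identified.
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    # True negative rate: proportion of actual negatives correctly identified.
    return tn / (tn + fp)

tp, tn, fp, fn = 95, 90, 10, 5  # hypothetical counts
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.3f}")
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```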

Conclusion

Many areas of research can only move forward if sufficient data is available and shared openly. This, as we have seen, is particularly true in medical imaging, where, despite datasets such as the MIAS database being openly available, there is a data deficiency that needs to be addressed in order to improve the accuracy of the models used in computer-aided diagnosis. The MIAS database is a clear example of a dataset that has moved an important area of research forward by enabling researchers to carry out experiments and improve the accuracy of deep learning models developed for computer-aided diagnosis in medical imaging. The sharing and reuse of the MIAS database provides an excellent model for how and why future researchers should make their data openly available.

Published 20th August 2020
Written by Dominic Dixon


The Role of Open Data in Science Communication

Itamar Shatz has written a guest blog post for the Office of Scholarly Communication about how public trust in the scientific community increases when researchers make their data openly available to all. He also emphasizes that science communicators (e.g. press offices, journalists, publishers) have a responsibility to point attention directly at the primary source of the data. Itamar is a PhD candidate in the Department of Theoretical and Applied Linguistics at the University of Cambridge. He is also a member of the Cambridge Data Champion programme, having joined at the start of this year. He writes about science and philosophy that have practical applications at Effectiviology.com.

It’s no secret that the public’s view of the scientific community is far from ideal.

For example, a global survey published by the Wellcome Trust in 2019 showed that, on average, only 18% of people report a high level of trust in scientists. Furthermore, the survey showed stark differences between people living in different parts of the world; for instance, this rate was more than twice as high in Northern Europe (33%) and Central Asia (32%) as in Eastern Europe (15%), South America (13%), and Central Africa (12%).

Things do appear to be improving, to some degree, especially in light of the recent pandemic. For example, a recent UK survey conducted by the Open Knowledge Foundation found that, following the COVID-19 pandemic, 64% of people are now “more likely to listen to expert advice from qualified scientists and researchers”. Similar increases in public confidence have been found in other countries, such as Germany and the USA. However, despite these recent increases, there is still much room for improvement.

Open data can help increase the public’s confidence in scientists

The public’s lack of confidence in scientists is a complex, multifaceted issue that is unlikely to be resolved by a single, neat solution. Nevertheless, one thing that can help alleviate this issue to some degree is open data, which is the practice of making data from scientific studies publicly accessible.

Research on the topic shows just how powerful this tool can be. For example, the recent survey by the Open Knowledge Foundation, conducted in the UK in response to the COVID-19 pandemic, found that 97% of those polled believed that it’s important for COVID-19 data to be openly available for people to check, and 67% believed that all COVID-19 related research and data should be openly available for anyone to use freely. Similarly, a 2019 US survey conducted before the pandemic found that 57% of Americans say that they trust the outcomes of scientific studies more if the data from the studies is openly available to the public.

Overall, such surveys strongly suggest that open data can help increase the public’s trust in scientists. However, it’s not enough for a study simply to have open data; if people don’t know about the open data, or if they don’t fully understand what it means, then it is unlikely to be as beneficial as it could be. As such, the following section offers some guidelines on how to properly incorporate open data into science communication, in order to use this tool as effectively as possible.

How to incorporate open data into science communication

To properly incorporate open data into science communication, there are several key things that people who engage in science communication—such as journalists and scientists—should generally do:

  • Say that the study has open data. That is, you should explicitly mention that the researchers have made the data from their research openly available; do not assume that people will go to the original study and discover there that the data is open.
  • Explain what open data is. That is, you should briefly explain what it means for the data to be openly available, and potentially also mention the benefits of making the data available, for example in terms of making research more transparent, and in terms of helping other researchers reproduce the results.
  • Describe what sort of data has been made openly available. For example, you can include descriptions of the type of data involved (surveys, clinical reports, brain scans, etc.), together with some concrete examples that help the audience understand the data.
  • Explain where the data can be found. For example, this can be in the article’s “supplementary information” section, though data should preferably be available in a repository where the dataset has its own persistent identifier, such as a DOI. This ensures that the audience can find and access the data, which may otherwise be hidden behind a paywall, and offers other benefits, such as allowing researchers to directly access and cite the dataset, without navigating through the article.

These practices can help people better understand the concept of open data, particularly as it pertains to the study in question, and can help increase their trust in the openness of the data, especially if it is placed somewhere that they can access themselves.

For one example of how open data might be communicated effectively in a press release, consider the following:

“The researchers have made all the data from this study openly available; this means that all the results from their experiments can be freely accessed by anyone through a repository available at: https://www.doi.org/10.xxxxx/xxxxxxx. This can help other scientists verify and reproduce their results, and will aid future research on the topic.”

Open data in different types of scientific communications

It’s important to note that there’s no single right way to incorporate open data into scientific communications. This can be attributed to various factors, such as:

  • Differences between fields (e.g. biology, economics, or psychology)
  • Differences between types of studies (e.g. computational or experimental)
  • Differences between media (e.g. press release or social media post).

Nevertheless, the guidelines outlined earlier can be beneficial as initial considerations to take into account when deciding how to incorporate open data into science communication. It is up to communicators to make the final modifications, in order to use open data as effectively as possible in their particular situation.

Summarizing what we’ve learned

Though the public’s trust in science is currently growing, there is much room for improvement. One powerful tool that can aid the academic community is open data—the practice of making data from research studies openly available. However, to benefit as much as possible from the presence of open data, it’s not sufficient for a study to merely make its data open. Rather, the accessibility of the data needs to be promoted and explained in scientific communication, and the dataset needs to be cited appropriately (see the Joint Declaration of Data Citation Principles for guidelines regarding this latter point).

What is currently being done

It is important to note that much work is already being done to promote the concept of open data. For example, organizations such as the Research Data Alliance promote discussion of the topic and publish relevant material, as in the case of their recent guidelines and recommendations regarding COVID-19 data.

In addition, at the University of Cambridge, in particular, we can already see a substantial push for open data practices, where appropriate, and from many angles as outlined in the University’s Open Research position statement. Many funding bodies mandate that data be made available, and the University facilitates the process of sharing the data via Apollo, the institutional repository. Furthermore, there are the various training courses and publications—including this very blog—led by bodies such as the Office of Scholarly Communication (OSC), which help to promote Open Research practices at the University. Most notably, there is the OSC’s Data Champion programme, which deals, among other things, with supporting researchers with open data practices.

Moving forward

Promoting the use of open data in scientific communication is something that different stakeholders can do in different ways.

For example, those engaging in science communication—such as journalists and universities’ communication offices—can mention and explain open data when covering studies. Similarly, scientists can ask relevant communicators to cite their open data, and can also mention this information themselves when they engage in science communication directly. In addition, consumers of scientific communication and other relevant stakeholders—such as the general public, politicians, regulators, and funding bodies—can ask, whenever they hear about new research findings, whether the data was made openly available, and if not, then why.

Overall, such actions will lead to increased and more effective use of open data over time, which will help increase the trust people have in scientists. Furthermore, this will help promote the adoption of open data practices in the scientific community, by making more scientists aware of the concept, and by increasing their incentives for engaging in it.

Published 19 June 2020

Written by Itamar Shatz


Open Research at the University of Cambridge: What have we done so far?

At the start of 2019 the University of Cambridge announced its Position Statement on Open Research. This blog looks at what has been happening since then and the current plans for making research at Cambridge more open.

Our Position

In February 2019, the University of Cambridge set out its position on open research to support and encourage open practices throughout the research lifecycle for all research outputs. The Position Statement made clear that both the University and researchers have responsibilities in this space and that there would be no one-size-fits-all approach to being open. As part of forming a position on open research, the University also created the Open Research Steering Committee to oversee the open research agenda of the University. This Committee is currently looking at three key areas – training, infrastructure and Plan S.

Training

In 2018, we ran a survey on open research [available to Cambridge University only] which highlighted our research community’s desire for more training on open research practices and tools. In order to delve into this further, a pilot was run with the Faculty of Education, which had submitted a disproportionately high number of responses to the survey, suggesting a strong interest in open research. The pilot, run earlier this year, encompassed six face-to-face training sessions on topics around open research, such as managing digital information, copyright, and publishing. These sessions were well received by both PhD students and postdocs.

In tandem with this, work is also being carried out to make the provision of open research training more strategic, sustainable and efficient. For example, some of the courses the Office of Scholarly Communication runs have already been embedded into existing PhD programmes, such as Doctoral Training Centres and the centrally run Researcher Development Programme, but we could still increase the opportunities to work more closely with other parts of the University. With so many other pressures on time, it is essential that we work together with all the stakeholders involved to get the balance of training right, so that we make the best use of the time of both the trainer and the student.

Finally, the question of sustainability for open research training is also being investigated. How can we ensure open research training reaches the 9,000 or so academics and postgraduate students we have at Cambridge? One answer to this question is online training. We are currently developing a digital course which will introduce the basics of open research, complementary to the soon-to-be-launched online research integrity training. However, we know that researchers value face-to-face sessions too, and intend to continue to develop our face-to-face offer, where we can provide deeper knowledge and discuss issues in more detail. Within the libraries at Cambridge we are also starting to work more closely with research support librarians and others in department libraries who can offer expertise and guidance that is tailored to the discipline.

Infrastructure

The University Position Statement on Open Research says “University support is important to make Open Research simple, effective and appropriate” and a key part of that support is in the form of infrastructure. This is a complicated area because it involves a number of service providers at the University who all have different priorities as well as the large body of researchers, who have a huge variety of needs and technical abilities. Finding common solutions or tools will always be difficult in a large, research intensive institution like Cambridge, which has Schools spread across the spectrum of arts, humanities, social sciences and STEMM subjects.

The Open Research Steering Committee is made up of representatives from across the University, both from the academic Schools and from University services. This is key to ensuring that the drive towards open research infrastructure is holistic and proportionate in the context of other University agendas. A landscape review of the services already provided has been carried out, as has a ‘wish list’ exercise to capture the IT infrastructure researchers would like. Whilst the ‘wish list’ was compiled in a context wider than open research, it is really heartening to see that many ‘wishes’ relate to systems that would improve open research practices.

There is also work underway to look at how research notebooks (or electronic lab notebooks, if you prefer) are being used across the University. A trial of notebooks run in 2017 resulted in the decision not to provide an institution-wide research notebook platform, but to offer guidance instead. The new work, under the auspices of the Open Research Steering Committee, aims to build on this by extending the guidance to include principles around data security, data export and procurement.

Plan S

Plan S looms large on our horizon and will present a challenge when it comes into force in 2021. Whilst we wait to see to what extent UKRI’s updated open access policy will reflect the Plan S principles, we are busy contributing to the Transparent Pricing Working Group. This group was convened by the Wellcome Trust in partnership with UKRI, on behalf of cOAlition S, to bring together publishers, funders and universities to develop a framework to guide publishers on how to communicate the price of their services in a practical and transparent manner. The University is also looking into how we can implement the principles of DORA, which are supported by cOAlition S. This work is being led by Professor Steve Russell, an academic advocate for open research, and will very much be done in consultation with our academic community.

Summary

Cambridge is showing its commitment to enabling open research by taking seriously its role in providing infrastructure, training and the right culture for our academics. These areas need to be tackled holistically, and the oversight of the Open Research Steering Committee should allow this to happen. It is important that we collaborate with our research community, and we hope we have got that balance right through the inclusion of academics in the main Committee and working groups. Ensuring open research is embedded in everyday practice at the University will, of course, take time, but we think we are making a good start.

Published 22nd October 2019

Written by Dr Lauren Cadwallader

The content of this blog is licensed under CC BY 4.0.