Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event:
Cambridge Data Week 2020 concluded on 27 November with a discussion between Dr Lauren Cadwallader (PLOS), Professor Stephen Eglen (University of Cambridge) and Kiera McNeice (Cambridge University Press) on models of data peer review. The peer review process around data is still emerging despite the increase in data sharing. This session explored how peer review of data could be approached from both a publishing and a research perspective.
The discussion focused on three main questions and here are a few snippets of what was said. If you’d like to explore the speakers’ answers in full, see the recording and transcript below.
Why is it important to peer review datasets?
Are we in a post-truth world where claims can be made without needing to back them up? What if data could replace articles as the main output of research? What key criteria should peer review adopt?
Figure 1: Word cloud created by the audience in response to “Why is it important to peer review datasets?”
How should data review be done?
Can we drive the spread of Open Data by initially setting an incredibly low bar, encouraging everyone to share data even in its messy state? Are we reviewing to ensure reusability, or do we want to go further and check quality and reproducibility? Is data review a one-off event, or a continuous process involving everyone who reuses the data?
Who should be doing the work?
Are journals exclusively responsible for data review, or should authors, repository managers and other organisations be involved? Where will the money come from? What’s in it for researchers who volunteer as data reviewers? How do we introduce the peer review of data in a fair and equitable way?
After the end of the session, Lauren, Kiera and Stephen continued the discussion, prompted by a question from the audience about whether there should be some form of template or checklist for peer reviewing code. Here is what they said.
Lauren Cadwallader That’s an interesting idea, though of course code is written for different reasons: software, analysis, figures, and so on. Inevitably there will be different ways of reviewing it. Stephen, can you tell us more about your experience with CODECHECK?
Stephen Eglen At CODECHECK we have a process to help codecheckers run research code and award a “certificate of executable computation”, like this example of a report. Even if you do nothing else, copying whatever files you’ve got onto some repository, dirty and unstructured as that might seem, is still gold dust to the next researcher who comes along. Initially we can set the standards low, and from there we can come up with a whole range of more advanced quality checks. One question is ‘what are researchers willing to accept?’ I know of a couple of pilots that tried requiring more work from researchers in preparing and checking their files and code, such as the Code Ocean pilot that Kiera mentioned. I think that we have a community that understands the importance of this and is willing to put in some effort.
Kiera McNeice There’s value in having checklists that are not extremely specialised, but tailored somewhat towards different subject areas. For instance, the American Journal of Political Science has two separate checklists, one for quantitative data and one for qualitative data. Certainly, some of our HSS editors have been saying that some policies developed for quantitative data do not work for their authors.
Lauren Cadwallader It might be easy to start with places where there are communities that are already engaged and have a framework for data sharing, so the peer review system would check that. What do you think?
Kiera McNeice I guess there is a ‘chicken and egg’ issue: does this have to be driven from the top down, from publishers and funders, or does it come from the bottom up, with research communities initiating it? As journals, there is a concern that if we try to enforce very strict standards, then people will take their publications elsewhere. If there is no desire from the community for these changes, publisher enforcement can only go so far.
Stephen Eglen Funders have an important role to play too. If they lead on this, researchers will follow because ultimately researchers are focused on their careers. Unless there is recognition that doing this is a valuable part of one’s work, it will be hard to convince the majority of researchers to spend time on it.
Take a pilot I was involved in with Nature Neuroscience. Originally this was meant to be a mandatory peer review of code after acceptance in principle, but in the end fears about driving away authors meant it was only made optional. Throughout a six-month trial, I was only aware of two papers that went through code review. I can see the barriers for both journal and authors, but if researchers received credit for doing it, this sort of thing would come from the bottom up.
Lauren Cadwallader In our biology-based model review pilot we ran a survey and found that many people opted in because they believe in open science, reproducibility, and so on, but two people opted in because they feared PLOS would think they had something to hide if they didn’t. That’s not at all what it was about. Although I suppose if it gets people sharing data…
Conclusion
We were intrigued by many of the ideas put forward by the speakers, particularly the areas of tension that will need to be resolved. For instance, as we try to move from a world where most data remains in people’s laptops and drawers to a FAIR data world, even sharing simple, messy, unstructured data is ‘gold dust’. Yet ultimately, we want data to be shared with extensive metadata and in an easily accessible form. What should the initial standards be, and how should they be raised over time? And how about the idea of asking Early Career Researchers to take on reviewer roles? Certainly they (and their research communities) would benefit in many ways from such involvement, but will they be able to fit this in their packed schedules?
The audience engaged in lively discussion throughout the session, especially around the use of repositories, the need for training, and disciplinary differences. At the end of the session, they surprised us all with their responses to our poll: “Which peer review model would work best for data?”. The most common response was ‘Incorporate it into the existing review of the article’, an option that had hardly been mentioned in the session. Perhaps we’ll need another webinar exploring this avenue next year!
Figure 2: Audience responses to a poll held at the end of the event
Resources
Alexandra Freeman’s Octopus project aims to change the way we report research. Read the Octopus blog and an interview with Alex to find out more.
Publish your computer code: it is good enough, a column by Nick Barnes in Nature in 2010 arguing that sharing code, whatever the quality, is more helpful than keeping it in a drawer.
CODECHECK, led by Stephen Eglen, runs code to offer a “certificate of executable computation” to document that core research outputs could be recreated outside of the authors’ lab.
Code Ocean is a platform for computational research that creates web-based capsules to help enable reproducibility.
The Open Access team are getting ready for the end of the Charity Open Access Fund (COAF), which is due to dissolve on 30th September 2020.
From 1st October 2020 onward, there are going to be changes to the block grants that we receive, and as a result, there will be a change in our policies on whether or not we can cover researchers’ article processing charges (APCs).
We have outlined how researchers should go about securing funding for APCs below:
Funder name
Are article processing charges covered by a block grant?
No – authors must include cost in their grant application
1. For payment, contact researchapplications@parkinsons.org.uk, 2. Upload your paper to ensure REF compliance.
Versus Arthritis
No – authors must request support direct from funder
1. Use funder’s Grant Tracker for OA support, 2. Upload your paper to ensure REF compliance.
Multiple funders acknowledged
If your paper includes funding from UKRI, Wellcome Trust, Cancer Research UK or British Heart Foundation then we may be able to help with the APC. Researchers should upload their paper to us for a funding decision.
There is no change in the funders’ open access policies for the rest of 2020. However, there are significant changes due in 2021, specifically for Wellcome Trust and Cancer Research UK.
We have outlined the policy changes in the table below:
1. Policy covers original research articles, 2. Policy applies to papers submitted for publication after 1/1/2021, 3. Papers must be made immediately open access (no embargo allowed) in Europe PMC, 4. Papers must be published with a CC BY licence, 5. Papers must be published in a journal that is indexed in DOAJ (Wellcome will no longer cover APCs for subscription journals), 6. The authors must retain their copyright.
1. Policy covers original research articles, 2. Policy applies to all papers after 1/1/2021, 3. Papers must be made immediately open access (no embargo allowed) in Europe PMC, 4. Papers must be published with a CC BY licence.
Multiple funders acknowledged
Any papers acknowledging Wellcome Trust or Cancer Research UK must be compliant in order to access funds.
To summarise:
From 1 October 2020, authors should continue to submit their papers to the Open Access Team as usual via our website. The Open Access Team will continue to advise on the best course of action to meet funder requirements, but we may not always be able to pay APCs.
The funders’ policies remain the same until 1st January 2021. We advise authors covered by Wellcome Trust and Cancer Research UK to familiarise themselves with the changes to their funder’s open access policies, which are outlined in COAF’s table.
The Mammographic Image Analysis Society (MIAS) database is a set of mammograms put together in 1992 by a consortium of UK academic institutions and archived on 8mm DAT tape, copies of which were made openly available and posted to applicants for a small administration fee. The mammograms themselves were curated from the UK National Breast Screening Programme, a major screening programme established in the late 1980s that offered routine screening every three years to women aged 50 to 64.
The motivations for creating the database were to make a practical contribution to computer vision research – which sought to improve the ability of computers to interpret images – and to encourage the creation of more extensive datasets. In the peer-reviewed paper bundled with the dataset, the researchers note that “a common database is a positive step towards achieving consistency in performance comparison and testing of algorithms”.
Due to increased demand, the MIAS database was made available online via third parties, albeit in a lower resolution than the original. Despite no longer working in this area of research, the lead author, John Suckling – now Director of Research in the Department of Psychiatry, part of Cambridge Neuroscience – started receiving emails asking for access to the images at the original resolution. This led him to dig out the original 8mm DAT tapes with the intention of making the images available openly in a higher resolution. The tapes were sent to the University Information Service (UIS), who were able to access the original 8mm tape and download higher resolution versions of the images. The images were subsequently deposited in Apollo and made available under a CC BY license, meaning researchers are permitted to reuse them for further research as long as appropriate credit is given. This is the most commonly used license for open datasets and is recommended by the majority of research funding agencies.
Motivations for sharing the MIAS database openly
The MIAS database was created with open access in mind from the outset. When asked whether he had any reservations about sharing the database openly, the lead author John Suckling noted:
“There are two broad categories of data sharing; data acquired for an original purpose that is later shared for secondary use; data acquired primarily for sharing. This dataset is an example of the latter. Sharing data for secondary use is potentially more problematic especially in consortia where there are a number of continuing interests in using the data locally. However, most datasets are (or should be) superseded, and then value can only be extracted if they are combined to create something greater than the sum of the parts. Here, careful drafting of acknowledgement text can be helpful in ensuring proper credit is given to all contributors.”
This distinction – between data acquired for an original purpose that is later shared for secondary use and data acquired primarily for sharing – is one that is important and often overlooked. The true value of some data can only be fully realised if openly shared. In such cases, as Suckling notes, sufficient documentation can help ensure the original researchers are given credit where it is due, as well as ensuring it can be reused effectively. This is also made possible by depositing the data on an institutional repository such as Apollo, where it will be given a DOI and its reuse will be easier to track.
Impact of the MIAS database
As of August 2020, the MIAS database has received over 5500 downloads across 27 different countries, including some developing countries where breast cancer survival rates are lower. Google Scholar currently reports over 1500 citations for the accompanying article as well as 23 citations for the dataset itself. A review of a sample of the 1500 citations revealed that many were examples of the data being reused rather than simply citations of the article. Additionally, a systematic review published in 2018 cited the MIAS database as one of the most widely used for applying breast cancer classification methods in computer aided diagnosis using machine learning, and a benchmarking review of databases used in mammogram research identified it as the most easily accessible mammographic image database. The reasons cited for this included the quality of the images, the wide coverage of types of abnormalities, and the supporting data which provides the specific locations of the abnormalities in each image.
The high impact of the MIAS database is something Suckling credits to the open, unrestricted access to the database, which has been the case since it was first created. When asked whether he has benefited from this personally, Suckling stated: “Direct benefits have only been the citations of the primary article (on which I am first author). However, considerable efforts were made by a large number of early-career researchers using complex technologies and digital infrastructure that was in its infancy, and it is extremely gratifying to know that this work has had such an impact for such a large number of scientists.” Given that the database continues to be widely cited and has been downloaded from Apollo 1358 times since January 2020, it is clear that the MIAS database is still having a wide impact.
The MIAS Database Reused
As mentioned above, the MIAS database has been widely reused by researchers working in the field of medical image analysis. While originally intended for use in computer vision research, one of the main ways in which the dataset has been used is in the area of computer aided diagnosis (CAD), for which researchers have used the mammographic images to experiment with and train deep learning algorithms. CAD aims to augment manual inspection of medical images by medical professionals in order to increase the probability of making an accurate diagnosis.
A 2019 review of recent developments in medical image analysis identified lack of good quality data as one of the main barriers researchers in this area face. Not only is good quality data a necessity, but it must also be well documented: the same review identified inappropriately annotated datasets as a core challenge in CAD. The MIAS database is accompanied by a peer-reviewed paper explaining its creation and content, as well as a readme PDF which explains the file naming convention used for the images and the annotations used to indicate the presence of any abnormalities and classify them by severity. This extensive documentation, combined with the database having been openly available from the outset, could explain why it continues to be so widely used.
Reuse example: Applying Deep Learning for the Detection of Abnormalities in Mammograms
This research, published in 2019 in Information Science and Applications, looked at improving some of the current methods used in CAD. It attempted to address some inherent shortcomings of deep learning models and to improve their ability to minimise false positives when CAD is applied to mammographic imaging. The researchers used the MIAS database alongside another, larger dataset to evaluate the performance of two existing convolutional neural networks (CNNs), which are deep learning models commonly used for classifying images. Using these datasets, they demonstrated that versions of two prominent CNNs could detect abnormalities on the mammographic images and classify their severity with a high degree of accuracy.
While the researchers were able to make good use of the MIAS database to carry out their experiments, due to the inclusion of appropriate documentation and labelling, they do note that since it is a relatively small dataset it is not possible to rule out “overfitting”, where a deep learning model is highly accurate on the data used to train the model, but may not generalise well to other datasets. This highlights the importance of making such data openly available as it is only possible to improve the accuracy of CAD if sufficient data is available for researchers to carry out further experiments and improve the accuracy of their models.
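To make the idea of overfitting more concrete, here is a minimal Python sketch. It does not reproduce the study's method: it uses a toy 1-nearest-neighbour classifier as a stand-in for a deep learning model, and the synthetic data, noise level and function names are all assumptions made for illustration. Because the classifier simply memorises a small, noisy training set, it scores perfectly on that set but noticeably worse on fresh data drawn in the same way.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable


def make_dataset(n, noise=0.3):
    """Synthetic points in [0, 1), labelled x > 0.5, with some labels flipped at random."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < noise:
            label = not label  # simulate noisy or ambiguous annotations
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the training point closest to x (pure memorisation)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the memorised model predicts correctly."""
    correct = sum(predict_1nn(train, x) == label for x, label in data)
    return correct / len(data)


train = make_dataset(30)    # a deliberately small training set
test = make_dataset(1000)   # unseen data from the same process

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {test_acc:.2f}")
```

The gap between the two figures is the symptom the researchers describe: with little data it is hard to tell whether a model has learned something general or has only memorised its training examples, which is why larger openly available datasets matter.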
Reuse example: Computer aided diagnosis system for automatic two stages classification of breast mass in digital mammogram images
This research, published in 2019 in Biomedical Engineering: Applications, Basis and Communications, used the MIAS database along with the Breast Cancer Digital Repository to test a CAD system based on a probabilistic neural network – a machine learning model that predicts the probability distribution of a given outcome – developed to automate classification of breast masses on mammographic images. Unlike previously developed models, their model was able to segment and then carry out a two-stage classification of breast masses. This meant that rather than classifying masses into either benign or malignant, they were able to develop a system which carried out a more fine-grained classification consisting of seven different categories. Combining the two different databases allowed for an increased confidence level in the results gained from their model, again raising the importance of the open sharing of mammographic image datasets. After testing their model on images from these databases, they were able to demonstrate a significantly higher level of accuracy at detecting abnormalities than had been demonstrated by two similar models used for evaluation. On images from the MIAS Database and Breast Cancer Digital Repository their model was able to detect abnormalities with an accuracy of 99.8% and 97.08%, respectively. This was also accompanied by increased sensitivity (the ability to correctly identify cases with an abnormality, i.e. true positives) and specificity (the ability to correctly identify cases without one, i.e. true negatives).
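As a rough illustration of how accuracy, sensitivity and specificity relate to one another, the following Python sketch computes all three from the four cells of a confusion matrix. The counts are invented for illustration and are not taken from the study, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: abnormal images correctly flagged (true positives)
    fp: normal images wrongly flagged (false positives)
    tn: normal images correctly passed (true negatives)
    fn: abnormal images wrongly passed (false negatives)
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all images classified correctly
        "sensitivity": tp / (tp + fn),   # share of abnormal images that were caught
        "specificity": tn / (tn + fp),   # share of normal images that were cleared
    }


# Hypothetical counts for a test set of 200 images (not real results).
metrics = classification_metrics(tp=45, fp=5, tn=140, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all three matters because a model can reach high accuracy on a dataset dominated by normal images while still missing many abnormal ones; sensitivity and specificity expose that imbalance.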
Conclusion
Many areas of research can only move forward if sufficient data is available and if it is shared openly. This, as we have seen, is particularly true in medical imaging where despite datasets such as the MIAS database being openly available, there is a data deficiency which needs to be addressed in order to improve the accuracy of the models used in computer-aided diagnosis. The MIAS database is a clear example of a dataset that has enabled an important area of research to move forward by enabling researchers to carry out experiments and improve the accuracy of deep learning models developed for computer-aided diagnosis in medical imaging. The sharing and reuse of the MIAS database provides an excellent model for how and why future researchers should make their data openly available.
Published 20th August 2020 Written by Dominic Dixon
Itamar Shatz has written a guest blog post for the Office of Scholarly Communication about how public trust in the scientific community increases when researchers make their data openly available to all. He also emphasizes that science communicators (e.g. press offices, journalists, publishers) have a responsibility to point attention directly at the primary source of the data. Itamar is a PhD candidate in the Department of Theoretical and Applied Linguistics at the University of Cambridge. He is also a member of the Cambridge Data Champion programme, having joined at the start of this year. He writes about science and philosophy that have practical applications at Effectiviology.com.
It’s no secret that the public’s view of the scientific community is far from ideal.
For example, a global survey published by the Wellcome Trust in 2019 showed that, on average, only 18% of people indicate that they have a high level of trust in scientists. Furthermore, the survey showed that there are stark differences between people living in different areas of the world; for instance, this rate was more than twice as high in Northern Europe (33%) and Central Asia (32%) as in Eastern Europe (15%), South America (13%), and Central Africa (12%).
Things do appear to be improving, to some degree, especially in light of the recent pandemic. For example, a recent survey in the UK, conducted by the Open Knowledge Foundation, has found that, following the COVID-19 pandemic, 64% of people are now “more likely to listen to expert advice from qualified scientists and researchers”. Similar increases in public confidence have been found in other countries, such as Germany and the USA. However, despite these recent increases, there is still much room for improvement.
Open data can help increase the public’s confidence in scientists
The public’s lack of confidence in scientists is a complex, multifaceted issue that is unlikely to be resolved by a single, neat solution. Nevertheless, one thing that can help alleviate this issue to some degree is open data, which is the practice of making data from scientific studies publicly accessible.
Research on the topic shows just how powerful this tool can be. For example, the recent survey by the Open Knowledge Foundation, conducted in the UK in response to the COVID-19 pandemic, found that 97% of those polled believed that it’s important for COVID-19 data to be openly available for people to check, and 67% believed that all COVID-19 related research and data should be openly available for anyone to use freely. Similarly, a 2019 US survey conducted before the pandemic found that 57% of Americans say that they trust the outcomes of scientific studies more if the data from the studies is openly available to the public.
Overall, such surveys strongly suggest that open data can help increase the public’s trust in scientists. However, it’s not enough for studies to just have open data for it to increase the public’s trust; if people don’t know about the open data, or if they don’t fully understand what it means, then open data is unlikely to be as beneficial as it could be. As such, in the following section we will see some guidelines on how to properly incorporate open data into science communication, in order to utilize this tool as effectively as possible.
How to incorporate open data into science communication
To properly incorporate open data into science communication, there are several key things that people who engage in science communication, such as journalists and scientists, should generally do:
Say that the study has open data. That is, you should explicitly mention that the researchers have made the data from their research openly available. Do not assume that people will go to the original study and then learn there about the data being open.
Explain what open data is. That is, you should briefly explain what it means for the data to be openly available, and potentially also mention the benefits of making the data available, for example in terms of making research more transparent, and in terms of helping other researchers reproduce the results.
Describe what sort of data has been made openly available. For example, you can include descriptions of the type of data involved (surveys, clinical reports, brain scans, etc.), together with some concrete examples that help the audience understand the data.
Explain where the data can be found. For example, this can be in the article’s “supplementary information” section, though data should preferably be available in a repository where the dataset has its own persistent identifier, such as a DOI. This ensures that the audience can find and access the data, which may otherwise be hidden behind a paywall, and offers other benefits, such as allowing researchers to directly access and cite the dataset, without navigating through the article.
These practices can help people better understand the concept of open data, particularly as it pertains to the study in question, and can help increase their trust in the openness of the data, especially if it is placed somewhere that they can access themselves.
For one example of how open data might be communicated effectively in a press release, consider the following:
“The researchers have made all the data from this study openly available; this means that all the results from their experiments can be freely accessed by anyone through a repository available at: https://www.doi.org/10.xxxxx/xxxxxxx. This can help other scientists verify and reproduce their results, and will aid future research on the topic.”
Open data in different types of scientific communications
It’s important to note that there’s no single right way to incorporate open data into scientific communications. This can be attributed to various factors, such as:
Differences between fields (e.g. biology, economics, or psychology)
Differences between types of studies (e.g. computational or experimental)
Differences between media (e.g. press release or social media post).
Nevertheless, the guidelines outlined earlier can be beneficial as initial considerations to take into account when deciding how to incorporate open data into science communication. It is up to communicators to make the final modifications, in order to use open data as effectively as possible in their particular situation.
Summarizing what we’ve learned
Though the public’s trust in science is currently growing, there is much room for improvement. One powerful tool that can aid the academic community is open data—the practice of making data from research studies openly available. However, to benefit as much as possible from the presence of open data, it’s not sufficient for a study to merely make its data open. Rather, the accessibility of the data needs to be promoted and explained in scientific communication, and the dataset needs to be cited appropriately (see the Joint Declaration of Data Citation Principles for guidelines regarding this latter point).
What is currently being done
It is important to note that much work is already being done to promote the concept of open data. For example, organizations such as the Research Data Alliance promote discussion of the topic and publish relevant material, as in the case of their recent guidelines and recommendations regarding COVID-19 data.
In addition, at the University of Cambridge, in particular, we can already see a substantial push for open data practices, where appropriate, and from many angles as outlined in the University’s Open Research position statement. Many funding bodies mandate that data be made available, and the University facilitates the process of sharing the data via Apollo, the institutional repository. Furthermore, there are the various training courses and publications—including this very blog—led by bodies such as the Office of Scholarly Communication (OSC), which help to promote Open Research practices at the University. Most notably, there is the OSC’s Data Champion programme, which deals, among other things, with supporting researchers with open data practices.
Moving forward
Promoting the use of open data in scientific communication is something that different stakeholders can do in different ways.
For example, those engaging in science communication, such as journalists and universities’ communication offices, can mention and explain open data when covering studies. Similarly, scientists can ask relevant communicators to cite their open data, and can also mention this information themselves when they engage in science communication directly. In addition, consumers of scientific communication and other relevant stakeholders (such as the general public, politicians, regulators, and funding bodies) can ask, whenever they hear about new research findings, whether the data was made openly available, and if not, then why.
Overall, such actions will lead to increased and more effective use of open data over time, which will help increase the trust people have in scientists. Furthermore, this will help promote the adoption of open data practices in the scientific community, by making more scientists aware of the concept, and by increasing their incentives for engaging in it.
At the start of 2019 the University of Cambridge announced its Position Statement on Open Research. This blog looks at what has been happening since then and the current plans for making research at Cambridge more open.
Our Position
In February 2019, the University of Cambridge set out its position on open research to support and encourage open practices throughout the research lifecycle for all research outputs. The Position Statement made clear that both the University and researchers have responsibility in this space and that there would be no one-size-fits-all approach to how to be open. As part of forming a position on open research, the University also created the Open Research Steering Committee to oversee the open research agenda of the University. This Committee is currently looking at three key areas: training, infrastructure and Plan S.
Training
In 2018, we ran a survey on open research [available to Cambridge University only] which highlighted our research community’s desire for more training on open research practices and tools. In order to delve into this further, a pilot was run with the Faculty of Education, which submitted a disproportionately high number of responses to the survey, suggesting a strong interest in open research. The pilot, run earlier this year, encompassed six face-to-face training sessions on topics around open research, such as managing digital information, copyright, and publishing. These sessions were well received by both PhD students and postdocs.
In tandem with this, work is also being
carried out to make the provision of open research related training more
strategic, sustainable and efficient. For example, some of the courses the Office of Scholarly Communication run have
already been embedded into existing PhD programmes, such as Doctoral Training Centres or
the centrally run Researcher Development
Programme, but we could
still increase the opportunities to work more closely with other parts of the
University. With so many other pressures on time, it is essential
we work together with all stakeholders involved to ensure we get the balance of
training offered correct, so that the benefits of training justify the time it costs both
the trainer and the student.
Finally, the question of sustainability for open research training is also being investigated. How can we ensure open research training reaches the 9,000 or so academics and postgraduate students we have at Cambridge? One answer to this question is online training. We are currently developing a digital course which will introduce the basics of open research, complementary to the soon-to-be-launched online research integrity training. However, we know that researchers value face-to-face sessions too, and intend to continue to develop our face-to-face offer, where we can provide deeper knowledge and discuss issues in more detail. Within the libraries at Cambridge we are also starting to work more closely with research support librarians and others in department libraries who can offer expertise and guidance that is tailored to the discipline.
Infrastructure
The University
Position Statement on Open Research says “University support
is important to make Open Research simple, effective and appropriate” and a key
part of that support is in the form of infrastructure. This is a complicated
area because it involves a number of service providers at the University who
all have different priorities as well as the large body of researchers, who
have a huge variety of needs and technical abilities. Finding common solutions
or tools will always be difficult in a large, research-intensive institution
like Cambridge, which has Schools spread across the spectrum of arts, humanities,
social sciences and STEMM subjects.
The Open Research
Steering Committee is made up of representatives from across the
University both from different academic Schools and University services. This
is key to ensure that the drive towards open research infrastructure is
holistic and proportional in the context of other University agendas. A
landscape review of the services already provided has been carried out, and a
‘wish list’ of IT infrastructure that researchers would like has been compiled. Whilst the ‘wish
list’ was compiled in a context wider than open research, it is really
heartening to see that many ‘wishes’ relate to systems that would improve open
research practices.
There is also work underway to look at how
research notebooks (or electronic lab notebooks if you prefer) are being used
across the University. A trial
of notebooks run in 2017 resulted in the decision to provide guidance
rather than an institution-wide research notebook platform. This new work under the auspices of the Open Research Steering
Committee aims to build on that outcome by extending the guidelines to include
principles around data security, data export and procurement.
Plan S
Plan
S looms large on our horizon and will present a challenge when it comes
into force in 2021. Whilst we are waiting to see to what extent UKRI’s updated
open access policy will reflect Plan S principles, we are busy contributing to
the Transparent
Pricing Working Group. This group was convened by the Wellcome Trust in partnership with UKRI and on behalf of cOAlition S to bring together
publishers, funders and universities to develop a framework to guide publishers
on how to communicate about the price of the services in a practical and
transparent manner. The University is also looking into how we can implement
the principles of DORA, which are supported
by cOAlition S. This work is being led by Professor Steve Russell, an academic
advocate for open research, and the work will very much be done in consultation
with our academic community.
Summary
Cambridge is showing its commitment to enabling open research by taking seriously its role in providing infrastructure, training and the right culture for our academics. These areas need to be tackled holistically and the oversight of the Open Research Steering Committee should allow this to happen. It is important that we are collaborative with our research community and we hope that we have got that balance right with the inclusion of academics in the main Committee and working groups. Ensuring open research is embedded in everyday practice at the University will, of course, take time but we think we are making a good start.
The Cambridge Data Champions are an example of a community of volunteers engaged in promoting open research and good research data management (RDM). Currently entering its third year, the programme has attracted a total of 127 volunteers (86 current, 41 alumni) from diverse disciplinary backgrounds and positions. It continues to grow and has inspired similar initiatives at other universities within and outside the UK (Madsen, 2019). Dr Sacha Jones, Research Data Coordinator at the Office of Scholarly Communication, recently shared information about the programme at ‘FAIR Science: tricky problems and creative solutions’, an Open Science event held on 4th June 2019 at The Queen’s Medical Research Institute in Edinburgh, and organised by a previous Cambridge Data Champion – Dr Ralitsa Madsen. The aim of this event was to disseminate information about Open Science and promote the subsequent set-up of a network of Edinburgh Open Research Champions, with inspiration from the Cambridge Data Champion programme. Running a Data Champion programme, however, is not free of challenges. In this blog, Sacha highlights some of these alongside potential solutions in the hope that this information may be helpful to others. In this vein, Ralitsa adds her insights from ‘FAIR Science’ in Edinburgh and discusses how similar local events may spearhead the development of additional Open Science programmes/networks, thus broadening the local reach of this movement in the UK and beyond.
#FAIRscienceEDI
On 4 June 2019, the University of Edinburgh hosted ‘FAIR Science: tricky problems and creative solutions’ – a one-day event that brought together local life scientists and research support staff to discuss systemic flaws within current academic culture as well as potential solutions. Funded by the Institute for Academic Development and the UK Biochemical Society, the event was popular, with around 100 attendees including students, postdocs, principal investigators (PIs) and administrative staff. The programme featured talks by a range of speakers – Dr Ralitsa Madsen (postdoctoral fellow and event organiser), Dr William Cawthorn (junior PI), Prof Robert Semple (Dean of Postgraduate Research and senior PI), Prof Malcolm Macleod (senior PI and member of the UK Reproducibility Network steering group), Prof Andrew Millar (senior PI and Chief Scientific Advisor on Environment, Natural Resources and Agriculture, for Scottish Government), Aki MacFarlene (Wellcome Trust Open Research Programme Officer), Dr Naomi Penfold (Associate Director, ASAPbio), Dr Nigel Goddard and Rory Macneil (RSpace developers), Robin Rice (Research Data Service, University of Edinburgh), and Dr Sacha Jones (University of Cambridge). All slides have been made available via the Open Science Framework, and “live” tweets can be found via #FAIRScienceEDI.
Shifting the balance of research culture for the better. Image source: Presentation by Ralitsa Madsen, ‘Why FAIR Science and why now?’
Why is open science important? What is the extent of the reproducibility problem in science, and what are the responsibilities of individual stakeholders? Do all researchers need to engage with open research? Are the right metrics used when assessing researchers for appointment, promotion and funding? What are the barriers to widespread change, and can they be overcome through collective efforts? These were some of the ‘tricky’ problems that were addressed during the first half of the ‘FAIR Science’ event, with the second half focussing on ‘creative solutions’, including: abandoning the journal impact factor in favour of alternative and fairer assessment criteria such as those proposed in DORA; preprinting of scientific articles and pre-registration of individual studies; new incentives introduced by funders like the Wellcome Trust who seek to promote Open Science; and data management tools such as electronic lab notebooks. Finally, the event sought to inspire local efforts in Edinburgh to establish a volunteer-driven network of Open Research Champions by providing insight into the maturing Data Champion programme at the University of Cambridge. This was a popular ‘creative solution’, with more than 20 attendees providing their contact details to receive additional information about Open Science and the set-up of a local network.
Overall, community engagement was a recurring theme during the ‘FAIR Science’ event, recognised as a catalyst required for research culture to change direction toward open practices and better science. Robert Semple discussed this in the greatest detail, suggesting that early stage researchers – PhDs and post-docs – are the building blocks of such a community, supported also by senior academics who have a responsibility to use their positions (e.g. as group leaders, editors) to promote open science. “Open Science is a responsibility also of individual groups and scientists, and grass roots efforts will be key to culture shift” (Robert Semple’s presentation). On a larger scale, Aki MacFarlene aptly stated that open research needs a supportive research ecosystem; for example, one where institutions as well as funders recognise and reward open practices.
Insights from the Cambridge Data Champion programme
The Data Champions at the University of Cambridge are an example of a community and a source of support for others in the research ecosystem. Promoting good RDM and the FAIR principles are two fundamental goals that Data Champions commit to when they join the programme. For some, endorsing open research practices is a fortuitous by-product of being part of the programme, yet for others, this is a key motivation for joining.
This word cloud depicts the reasons why the Cambridge Data Champions applied to become a Data Champion (the larger the text size, the more common the response). It is based on data from 105 applicants responding to the following: “What is your main motivation for becoming a Data Champion?”
Now that the Data Champion programme has been running for three years, what challenges does it face, and might disclosing these here – alongside ongoing efforts to solve them – help others to establish and maintain similar initiatives elsewhere?
Four main challenges that the programme either has experienced or continues to experience are outlined below. They are discussed in order of increasing difficulty to overcome.
Support
Retention
Disciplinary coverage
Measuring effectiveness
(See also a recent article about the Data Champion programme by James Savage and Lauren Cadwallader.)
What challenges does the Cambridge Data Champion programme face and how may these be overcome? (image: CC0)
Support
At a basic level, an initiative like the Data Champion programme needs both financial and institutional support. The Data Champions commit their time on a voluntary basis, yet the management of the programme, its regular events and occasional ad hoc projects all require funds. Currently, the programme is secure, but we continue to seek funding opportunities to support a community that is both expanding and deserving of reward (e.g. small grants awarded to Data Champions to support their ‘championing’ activities). Institutional support is already in place and hopefully this will continue to consolidate and grow now that the University has publicly committed to supporting open research.
Retention
Not all Data Champions who join will remain Data Champions. In fact, there is a growing community of alumni Data Champions. There are currently 41 alumni Data Champions. From the feedback provided by just over half of these, 68% left the programme because they left the University of Cambridge (as expected given that the majority of Data Champions are either post-docs or PhD students), and 32% left because of a lack of time to commit to the role. Of course, there might be other reasons that we are not aware of, and we cannot speculate here in the absence of data. Feedback from Data Champions is actively sought and is an essential part of sustaining and developing this type of community.
We are exploring various methods to enhance retention. To combat the pressures of individuals’ workloads, we are being transparent about the time that certain activities will involve – a task or process may be less overwhelming when a time estimate is provided (cf ‘this survey should take approximately ten minutes to complete’). We also initiated peer-mentoring amongst Data Champions this year, in part to encourage a stronger community. We are attempting to enhance networking within the community in other ways, during group discussion sessions in the bimonthly forums, and via a virtual space where Data Champions can view each other’s data-related specialisms – with mutual support and collaboration as intended by-products. These are just a few examples, and given that Data Champions are volunteers, retention is one of several aspects of the programme that requires frequent assessment.
Disciplinary coverage
Cambridge has six Schools – Arts and Humanities, Humanities and Social Sciences, Biological Sciences, Physical Sciences, Clinical Medicine, and Technology – with faculties, departments, centres, units and institutes nested within these. The ideal situation would be for each research community (e.g. a department) to be supported by at least one Data Champion. Currently this is not the case, and the distribution of Data Champions across the different disciplinary areas is patchy. Biological Sciences is relatively well represented (there are 22 Data Champions to represent around 1742 researchers in the School, i.e. 1.3%; see bar chart below). There is a clear bias towards STEM (science, technology, engineering and maths) disciplines, yet representation in the social sciences is fair. At the more extreme end is an absence of Data Champions in the Arts and Humanities. We are looking to resolve this via a more targeted approach, guided in part by insights gained into researcher needs via the OSC’s training programme for arts, humanities and social sciences researchers.
The bars depict the number of Data Champions within each School. Percentage values give the number of Data Champions as a proportion of the total number of researchers within each School. For example, within the School of Clinical Medicine, the ratio of Data Champions to researchers is around 1:100 (researchers include contract and established researchers, and PhD students).
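As a rough check of how the percentage values are derived, the short Python sketch below recomputes the Biological Sciences figure quoted above (22 Data Champions, around 1742 researchers). The function name is purely illustrative and is not part of any tooling used by the programme.

```python
def champion_coverage(champions: int, researchers: int) -> float:
    """Return Data Champions as a percentage of researchers, to one decimal place."""
    if researchers <= 0:
        raise ValueError("researchers must be a positive number")
    return round(100 * champions / researchers, 1)


# Figures quoted in the text for the School of Biological Sciences.
biological_sciences = champion_coverage(22, 1742)
print(f"Biological Sciences: {biological_sciences}%")
```

A ratio of around 1:100, as given for Clinical Medicine, corresponds to roughly 1% on the same scale.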
Measuring effectiveness
Determining how well the Data Champion programme is working is a sizeable challenge, as discussed previously. In those research communities represented by Data Champions, do we see improvements in data management, do we see a greater awareness of the FAIR principles, is there a change in research culture toward open research? These aspects are extremely difficult to measure and to assign to cause and effect, with multiple confounding factors to consider. We are working on how best to do this without overloading Data Champions and researchers with too many administrative tasks (e.g. surveys, questionnaires, etc.). The crux, however, is good communication and exchange of information between us (as a unit that is centrally managing the Data Champion programme) and the Data Champions, and between the Data Champions and the researchers they are reaching out to and working with. We need to be the recipients of this information so that we can characterise the programme’s effectiveness and make improvements. As a start, the bimonthly Data Champion forums are used as a venue to exchange and sound out ideas about best approaches, so that decisions on how to measure the programme’s impact lie also with the Data Champions.
A fifth challenge – recognition and reward
At the ‘FAIR Science’ event, two speakers (Naomi Penfold and Robert Semple) made a plea for those researchers who practise open science to be recognised for this – a change in reward culture is required. In a presentation centred on the misuse of metrics, Will Cawthorn referred to poor mental health in researchers as a result of the pressures of intrinsic but flawed methods of assessment. Understandably, DORA was mentioned multiple times at ‘FAIR Science’, and hopefully, with multiple universities including the University of Cambridge and University of Edinburgh as recent signatories of DORA, this marks the first steps toward a healthier and fairer researcher ecosystem. This may seem rather tangential to the Data Champions, but it is not: 66% of Data Champions, current and alumni, are or have been researchers (e.g. PhDs, post-docs, PIs). Despite the pressures of ‘publish or perish’, they have given precious time voluntarily to be a Data Champion and require recognition for this.
This raises a fifth challenge faced by the programme – how best to reward Data Champions for their contributions? Effectively addressing this may also help, via incentivisation, toward meeting three of the four challenges above – retention, coverage and measurement. While there is no official reward structure in place (see Higman et al. 2017), the benefits of being part of the programme are emphasised (networking opportunities, skills development, online presence as an expert, etc.), and we write to Heads of Departments so that Data Champions are recognised officially for their contributions. Is this enough? Perhaps not. We will address this issue via discussions at the September forum – how would those who are PhD students, post-docs, PIs, librarians, IT managers, data professionals (to name a few of the roles of Data Champions) like to be rewarded? In sharing these thoughts, we can then see what can be done.
Towards growing communities of volunteers
The Cambridge Data Champion programme is one among several UK- and Europe-wide initiatives that seek to promote good RDM and, more generally, Open Science. Their emergence speaks to a wider community interest and engagement in identifying solutions to some of the key issues haunting today’s academic culture (Madsen 2019). While the foundations of a network of Edinburgh Open Research Champions are still being laid, TU Delft in the Netherlands has already got their Data Champion programme up and running with inspiration from Cambridge. Independently, several Universities in the UK have also established their own Open Research groups, many of which are joined together through the recently established UK Reproducibility Network (UKRN) and the associated UK Network of Open Research Working Groups (UK-ORWG). Such integration fosters network crosstalk and is a step in the right direction, giving volunteers a stronger sense of ‘belonging’ while also actively working towards their formal recognition. Network crosstalk allows for beneficial resource sharing through centralised platforms such as the Open Science Framework or through direct knowledge exchange among neighbouring institutions. Following ‘FAIR Science’ in Edinburgh, for example, a meeting to discuss its outcome(s) involved members from Glasgow University’s Library Services (Valerie McCutcheon, Research Information Manager) and the UKRN’s local lead at Aberdeen University (Dr Jessica Butler, Research Fellow, Institute of Applied Health Science). Thus, similar to plans in Aberdeen, the ‘FAIR Science’ organisers are currently working with Edinburgh University’s Research Data Support team to adapt an Open Science survey developed and used at Cardiff University to guide the development of a specific Open Science strategy. 
This reflects the critical requirements for such strategies to be successful – active peer-to-peer engagement and community involvement to ensure that any initiatives match the needs of those who ought to benefit from them.
The long-term success of Open Science strategies – and any associated networks – will also hinge upon incorporation of formal recognition, as alluded to in the context of the Cambridge Data Champion programme. The importance of formal recognition of Open Science volunteers is also exemplified in SPARC Europe’s recent initiative – Europe’s Open Data Champions – which aims to showcase Open Data leaders who help ‘to change the hearts and minds of their peers towards more Openness’.
For formal recognition to gain traction, it will be critical to recruit several prominent senior academics onto the Open Science wagon. By virtue of their academic status, such individuals will be able to put Open Science credentials high on the agenda of funding and academic institutions. Indeed, the establishment of the UKRN can be ascribed to a handful of senior researchers who have been able to secure financial support for this initiative, in addition to inspiring and nucleating local engagement across several UK universities. The ‘FAIR Science’ experience in Edinburgh supports this view. While difficult to prove, its impact would likely have been minimal without the involvement of prominent senior academics, including Professor Robert Semple (Dean of Postgraduate Research), Professor Malcolm Macleod (UKRN steering group member) and Professor Andrew Millar (Chief Scientific Advisor on Environment, Natural Resources and Agriculture, for Scottish Government). Thus, in addition to targeted and continuous communication by the ‘FAIR Science’ organisers before and after the event, ongoing efforts to establish a network of Edinburgh Open Research Champions have been dependent on these senior academics and their ability to mobilise essential forces throughout the University of Edinburgh.
Amongst several other factors, community engagement is central to making improvements toward reproducibility, Open Science and Open Research in general. There are multiple stakeholders involved with their own responsibilities, and senior academics are a notable part of this. Image source: Robert Semple’s presentation at #FAIRscienceEdi, ‘The “Reproducibility Crisis”: lessons learnt on the job’.
Top-down or bottom-up?
Establishing and maintaining a champions initiative need not be conceived of as succeeding via either a top-down or bottom-up approach. Instead, a combination of the best of both of these approaches is optimal, as hopefully comes across here. The emphasis on such initiatives being community driven is essential, yet structure is also required so as to ensure their maintenance and longevity. Hierarchies have little place in such communities – there are enough of these already in the ‘researcher ecosystem’ – and the beauty of such initiatives is that they bring together people from various contexts (e.g. in terms of role, discipline, institution). In this sense, the Cambridge Data Champions community is especially robust because of its diversity, comprising individuals who come from highly varied roles and disciplinary backgrounds. Every champion brings their own individual strengths; collectively, this is a powerful resource in terms of knowledge and skills. Through acting on these strengths and acknowledging their responsibilities (e.g. to influence, teach, engage others), and by being part of a community like those described here, champions have the opportunity to make perhaps a wider contribution to research than ever anticipated, and certainly one that enhances its overall integrity.
References
Higman, R., Teperek, M. & Kingsley, D. (2017). Creating a community of Data Champions. International Journal of Digital Curation 12 (2): 96–106. DOI: https://doi.org/10.2218/ijdc.v12i2.562
Savage, J. & Cadwallader, L. (2019). Establishing, Developing, and Sustaining a Community of Data Champions. Data Science Journal 18 (23): 1–8. DOI: https://doi.org/10.5334/dsj-2019-023
The Cambridge Data Champions (DCs) advocate good Research Data Management (RDM) and Open Data practices to researchers locally in their departments, within Cambridge University in general, and sometimes further afield. They network with one another, exchange good methods of RDM, share ideas and, as a collective, reflect on current issues surrounding RDM, Open Data and researcher engagement, where a major shared goal is to establish best practices when it comes to research data. By attending bi-monthly forums facilitated by the Research Data Team, the DCs convene as a community, hear speakers presenting on relevant topics, and engage in workshops that will help them in their ‘championing’ activities. Following up on our latest blog, which summarised how a workshop led to the creation of cartoon postcards as a new tool to add to the DCs’ resource kit for RDM advocacy, we are now reflecting on initiatives that sprang from workshops during the past year and are considering the challenges and opportunities that this programme brings as it approaches the end of its third year.
Growing
The programme started in Autumn 2016, comprising researchers who volunteered to become local community experts and advocate on research data management and sharing. Our first call welcomed 43 DCs (September 2016), our second call 20 DCs (March 2018) and the third call 40 DCs (January 2019). For simplicity, this year we also added to our statistics the “affiliate” DCs, who are colleagues who contribute to the DC community in other ways (as interested members of Cambridge’s RDM Project Group) and not necessarily through channelling their RDM efforts for the benefit of a specific department.
We are now a community of 87 active DCs.
Total number of Data Champions who joined in each year (orange column indicates Champions who are still active; blue column indicates Champions who are now alumni).
Communities within a community
Over the last year we caught ourselves using words such as the ‘old DCs’ and the ‘new DCs’ and what we really meant was ‘established DCs’ and ‘new DCs’, with the latter group being those joining the programme each year. In September we celebrate the programme’s third birthday and it is reasonable to expect that there will be more experienced DCs who have already built their networks and have, more or less, a stable offering of RDM support and an enhanced understanding of the needs of their department. On the other hand, there are those who are being welcomed into the group who seek, to differing degrees, initial support from both the RDM team and their fellow colleagues in order to become successful DCs. It is easy to imagine that different layers are being developed with different needs, both in terms of support and engagement.
Through various activities and feedback from DCs, we now have a good quantity of raw data to analyse their needs for being, as we called it, ‘a good Data Champion’. We have brainstormed ideas which we are putting into action to respond to the challenges of an ever-growing Data Champions group.
Planning
DC Welcome Pack
Every year we circulate the Data Champions Welcome Pack to coincide with the inductions we organise to welcome new DCs into the group. This year we included in the pack what is expected of a DC when they join the programme, so that expectations are clearly communicated from the beginning and are the same for everybody.
Page from the Cambridge Data Champions Welcome Pack
Bi-monthly forums
Lightning talks have been introduced as a standard item in each forum. These have provided DCs with the opportunity to discuss aspects of RDM they are working on (e.g. new tools and techniques), or to feed back to the group on DC activities undertaken in their departments and data-related events they have attended so that the whole group can benefit. Importantly, the lightning talks have been used by DCs to problem solve, where the collective knowledge and experience of DCs attending a forum has been harnessed to address particular challenges faced by individual DCs. This is where the community aspect of the programme truly shines.
It is always a priority for us to invite speakers to forums who are external to the programme, reflecting the needs of both the new and established DCs. For example, Hannah Clements from Cambridge University’s Researcher Development Programme (RDP) spoke to the DCs at the January forum about mentoring, providing guidance on how support can be best delivered within the DC community. In the May forum, we had talks and discussions from a panel of experts working on different aspects of data archiving. The panellists came from across the University bringing a diversity of experience, grounded in clinical governance, computing, and more traditional archiving. These examples are just a couple of the themes that we have covered so far in the forums, which have been derived predominantly from information provided by (and the needs of) the DCs themselves. Additional topics that we plan to cover in future forums include issues surrounding reproducibility, IP and commercialisation, publishing and the impact of research data.
Key aims of these forums are to not only facilitate networking between DCs but to also act as an arena for the transfer of knowledge along the ‘researcher pipeline’, from forum to DCs and from DCs to researchers in their departments.
DC specialisation group
As a community, we need to be able to map expertise internally and understand the make-up of such an organic group at any given moment. This makes it easier to support each other and create collaborations, but also improves how we promote the programme externally.
Areas of expertise amongst our Data Champions
This led to the formation of the DC specialisation group, consisting of one of us and six of the DCs, which determined how to categorise expertise within the group. As a result, a spreadsheet was created where all DCs can chart their specialist areas and update or amend when necessary (and at least annually). We have top level categories for simple statistical analysis and second level categories that offer more specific details for the benefit of the DC community.
The next stage is to include the wider research community and improve how various stakeholders can reach the appropriate Data Champions for initial advice and support in RDM issues. One way to do this is by presenting more coherent and consistent specialisations on the Data Champions’ website, using the categories which we have already created for internal use within the group. This stage is due to begin this month and we hope to report on our efforts next year.
Branding group
A growing community is inevitably going to bring to the forefront various identity discussions. With this in mind, we formed a branding group to examine if a DC logo should be created to enhance the Data Champions’ visibility and raise their profile amongst their peers when advocating for RDM. A logo has been created and is going through various stages of approval before it will be released later this year.
Pilot programme – Mentoring
In February 2019, we initiated a pilot mentoring project as part of the induction process for the new DCs. The mentors are established DCs who have volunteered to support those new DCs wishing to take part in this pilot exercise. This followed on from our January forum where the benefits of mentoring for both mentees and mentors were outlined by Hannah Clements of RDP. At this forum, which preceded the University-wide call for new DCs, we also held a workshop where DCs were divided into three groups and asked three questions: what do you wish you knew when you first became a DC that you know now; what could you offer as mentors to the new DCs; how do you think the mentor-mentee system could work? The responses from DCs in the three groups informed the implementation, structure and aims of the mentoring pilot.
Our aim is to learn from this project in close consultation with both mentors and mentees. We want to see if this process helps new DCs to establish themselves within their departments/institutes. Will it be effective? The findings will inform our steps for the following year. Watch this space!
Fostering clusters within departments
We have excellent examples of departments that promote their DCs within their institutions. A good example is the Chemistry department, which has a cluster of five DCs who work together in their advocacy. During this year’s call for new DCs, and with help from the Department Librarian, we took a targeted approach to advertising the DC Programme within the Department of Engineering. This was highly successful, resulting in ten new Data Champions from Engineering, drawn from various roles and Academic Divisions. They represent a hub with the local knowledge, experience and skills to assess their department’s needs and explore best approaches to support good RDM practices and Open Research, ones that are tailored to the discipline.
Alumni community
Heading toward the programme’s third birthday means that we are growing bigger but also that we are developing an alumni community. This is a different kettle of fish, but it is on our radar to investigate how we can foster this distinct group and build a network that is not only Cambridge based but has a more national and even international outlook.
Funding
Let’s not forget that the DC programme consists of volunteers. We are in the process of seeking more funds to support this ever increasing community, to run expanding bimonthly forums, and to be able to offer grants to assist DCs in their endeavours. As an example, we supported one of the DCs, James Savage, to bring the programme to the international stage in November at the SCIDataCon 2018 in Botswana. He talked about the programme as well as his experience of being a DC. This resulted in James writing a paper together with Lauren Cadwallader, to be published soon in Data Science Journal (the accepted manuscript and associated data are available now in Apollo, the Cambridge University institutional repository).
An exciting year so far!
During this third year of the DC programme the number of active DCs across the University of Cambridge has doubled. We can only anticipate it growing further each year, yet balanced by an expanding community of alumni DCs as, for example, DCs leave Cambridge. The DC community is inherently dynamic, as is the programme. Because of this, we always seek to respond and adapt to changing conditions in novel and beneficial ways while maintaining the programme’s core structure to provide strong foundations. This has been a period of reflection, organisation and anticipation, all required to drive the Data Champion programme forward and tackle current challenges effectively, as well as those that lie ahead – more on this to come soon!
‘More cash, more clarity and don’t make this compulsory’ is the take home message from a recent workshop held with Cambridge researchers on the question of Open Research.
The recent session, called “An Open Future? How Cambridge is Responding to Challenges in the Open Landscape” was with a group of new Cambridge lecturers at a seminar organized by Pathways in Higher Education Practice. This event offered us an opportunity to go beyond the usual information we provide in our training workshops*.
This session provided a unique opportunity to speak with researchers from various disciplines further along in their career who already had a basic knowledge of Open Access and Research Data sharing requirements. This meant we were able to have more of an informed discussion rather than a lecture and we wanted to hear what they thought about Open Research.
(* The OSC is often asked to provide training on all things Open Research. Generally our training is focused on PhD students and early career researchers. We create PowerPoint slides that explain the benefits of Open Access, the necessity of a good Data Management Plan or how to promote your research through social media (all of which are freely available here). We try to make these sessions as interactive as possible.)
Quiz Time
The session started by laying out how the current academic publishing model works. Basically, researchers submit their latest findings to a journal for FREE, peer reviewers review the paper for FREE, editors oversee the journal for FREE and the publishers format the article then turn around and charge libraries exorbitant subscription fees (yep, that about sums it up). This got a good laugh from the audience.
So our first activity was a short quiz. We were interested to know if researchers knew how much things cost. We asked them a set of questions:
How much do you think we pay in subscription costs every year?
What’s the average APC?
How many papers were made gold OA and had at least one Cambridge author on it in 2016?
There was a lot of debate among the groups. Some of the answers were wildly overestimated (one researcher suggested £50 million for subscriptions per year), others were quite low.
What are people sharing?
For our next activity, we wanted to know what they were already sharing and what tools they were using to share. We presented each table with a Venn diagram and a bunch of post-its:
Unsurprisingly, the ‘Publication’ circle had the most post-its. Answers included tools such as arXiv, ResearchGate, and Academia.edu as well as personal websites and Facebook. There were also mentions of Cambridge Open Access and the Departmental Libraries. Interestingly, a few noted that they made their work available to researchers through personal contact such as email requests.
There were a few post-its in the ‘Data’ circle describing what tools they used to deposit, such as university repositories and Zenodo.
The ‘Other’ category mostly covered sharing code and software through GitHub, although one lecturer noted free workshops they offered. There was only one post-it that made it into the centre and that was for “webpage”. For the future, it may be interesting to know which discipline the researchers were from when they were posting because this theme came up quite a few times during the discussions.
When are people prepared to share?
The next activity involved lots of sticky dots and large pieces of paper. The participants were asked if they were comfortable sharing different aspects of their research at different stages in the research lifecycle. Each sheet was laid out in a grid as follows:
All of the researchers were asked to stick dots in the grid. The results were interesting. Most researchers were happy to share the published version of their paper, but a large number were uncomfortable sharing their pre-print or submitted version. There were only two dots in the “yes” square to share pre-prints. During the discussion it was apparent that this was probably down to the culture of the discipline: one physics researcher said sharing pre-prints was part of the process, whereas one of the lecturers from English disliked having more than one version of her paper available to read. The Book Chapter had similar results.
Data and Data Management Plans were all over the place. There were quite a few dots in the ‘Not sure’ squares. Most were happy to share data at the time of publication or at the end of the project. For the Data Management Plans it was evenly split between ‘yes’ to sharing at the end of the project versus ‘not sure’. No one wanted to share their DMP at the start of the project. There was some confusion among researchers (mostly from the humanities) who felt they didn’t have any data and therefore there was nothing to share.
The majority of the researchers were unenthusiastic about sharing their Grant Applications or Grey literature at any stage. For Grant Applications the overall feeling was that if the grant was successful then researchers didn’t want to share their methodology. If the grant was unsuccessful, they were reluctant to share their failures or they planned to submit to another granting agency. Most lecturers in the room agreed that they were fine sharing an abstract of their grant awards (which many funders post on their website).
As for Grey Literature, which we defined as working papers or opinion papers, no one wanted to share anything that could be considered unfinished or not well thought out. One member of the law faculty said that if they had produced any grey literature worth sharing, then they would publish it in a journal. Moreover, it could be detrimental to their career if they shared anything that wasn’t well-researched and presented.
More money please
To finish up the session, we asked researchers what more the University could be doing to promote Open Research. Not surprisingly, most people were resistant to any University mandate telling them what to do. In addition, they were strongly against any Open Research requirements being tied in with HR practices like promotions. The researchers supported discipline specific requirements for Open Research.
Researchers also wanted clearer instructions from the University and from funders on what is required of them. Having a myriad of policies is quite confusing and burdensome for researchers who already feel pressured to publish. In the end, most said that if the University would pay, then they would be happy to share their published work.
This is the second in a series of three blog posts which set out the perspectives of researchers, funders and universities on support for open resources. The first was Open Resources, who should pay? In this post, David Carr from the Open Research team at the Wellcome Trust provides the view of a research funder on the challenges of developing and sustaining the key infrastructures needed to enable open research.
As a global research foundation, Wellcome is dedicated to ensuring that the outputs of the research we fund – including articles, data, software and materials – can be accessed and used in ways that maximise the benefits to health and society. For many years, we have been a passionate advocate of open access to publications and data sharing.
I am part of a new team at Wellcome which is seeking to build upon the leadership role we have taken in enabling access to research outputs. Our key priorities include:
developing novel platforms and tools to support researchers in sharing their research – such as the Wellcome Open Research publishing platform which we launched last year;
supporting pioneering projects, tools and experiments in open research, building on the Open Science Prize, a collaboration with the NIH and Howard Hughes Medical Institute;
developing our policies and practices as a funder to support and incentivise open research.
We are delighted to be working with the Office of Scholarly Communication on the Open Research Pilot Project, where we will work with four Wellcome-funded research groups at Cambridge to support them in making their research outputs open. The pilot will explore the opportunities and challenges, and how platforms such as Wellcome Open Research can facilitate output sharing.
Realising the long-term value of research outputs will depend critically upon developing the infrastructures to preserve, access, combine and re-use outputs for as long as their value persists. At present, many disciplines lack recognised community repositories and, where they do exist, many cannot rely on stable long-term funding. How are we as a funder thinking about this issue?
Meeting the costs of outputs sharing
In July 2017, Wellcome published a new policy on managing and sharing data, software and materials. This replaced our long-standing policy on data management and sharing – extending our requirements for research data to also cover original software and materials (such as antibodies, cell lines and reagents). Rather than ask for a data management plan, applicants are now asked to provide an outputs management plan setting out how they will maximise the value of their research outputs more broadly.
Wellcome commits to meet the costs of these plans as an integral part of the grant, and provides guidance on the costs that funding applicants should consider. We recognise, however, that many research outputs will continue to have value long after the funding period comes to an end. Further, while it is not appropriate to make all research data open indefinitely, researchers are expected to retain data underlying publications for at least ten years (a requirement which was recently formalised in the UK Concordat on Open Research Data). We must accept that preserving and making these outputs available into the future carries an ongoing cost.
Some disciplines have existing subject-area repositories which store, curate and provide access to data and other outputs on behalf of the communities they serve. Our expectation, made more explicit in our new policy, is that researchers should deposit their outputs in these repositories wherever they exist. If no recognised subject-area repository is available, we encourage researchers to consider using generalist repositories – such as Dryad, FigShare and Zenodo – or if not, to use institutional repositories. Looking ahead, we may consider developing an orphan repository to house Wellcome-funded research data which has no other obvious home.
Recognising the key importance of this infrastructure, Wellcome provides significant grant funding to repositories, databases and other community resources. As of July 2016, Wellcome had active grants totalling £80 million to support major data resources. We have also invested many millions more in major cohort and longitudinal studies, such as UK Biobank and ALSPAC. We provide such support through our Biomedical Resource and Technology Development scheme, and have provided additional major awards over the years to support key resources, such as PDB-Europe, Ensembl and the Open Microscopy Environment.
While our funding for these resources is not open-ended and is subject to review, we have been conscious for some time that the reliance of key community resources on grant funding (typically of three to five years’ duration) can create significant challenges, hindering their ability to plan for the long-term and retain staff. As we develop our work on Open Research, we are keen to explore ways in which we can adapt our approach to help put key infrastructures on a more sustainable footing, but this is a far from straightforward challenge.
Gaining the perspectives of resource providers
In order to better understand the issues, we did some initial work earlier this year to canvass the views of those we support. We conducted semi-structured interviews with leaders of 10 resources in receipt of Wellcome funding – six database and software resources, three cohort resources and one materials stock centre – to explore their current funding, long-term sustainability plans and thoughts on the wider funding and policy landscape.
We gathered a wealth of insights through these conversations, and several key themes emerged:
All of the resources were clear that they would continue to be dependent on support from Wellcome and/or other funders for the long-term.
While cohort studies (which provide managed access to data) can operate cost recovery models to transfer some of the cost of accessing data onto users, such models were not appropriate for data and software resources that commit to open and unrestricted access.
Several resources had additional revenue-generation routes – including collaborations with commercial entities – and these had delivered benefits in enhancing their resources. However, the level of income was usually relatively modest in terms of the total cost of sustaining the resource. Commitments to openness could also limit the extent to which such arrangements were feasible.
Diversification of funding sources can give greater assurance and reduce reliance on single funders, but can bring an additional burden. There was felt to be a need for better coordination between funders where they co-fund resources. Europe PMC, which has 27 partner funders but is managed through a single grant, is a model which could be considered.
Several of the resources were actively engaged in collaborations with other resources internationally that house related data – it was felt that funders could help further facilitate such partnerships.
We are considering how Wellcome might develop its funding approaches in light of these findings. As an initial outcome, we plan to develop guidance for our funded researchers on key issues to consider in relation to sustainability. We are already working actively with other funders to facilitate co-funding and make decisions as streamlined as possible, and wish to explore how we can join forces in the future in developing our broader approaches for funding open resources.
Coordinating our efforts
There is growing recognition of the crucial need for funders and the wider research community to work together to develop and sustain research data infrastructure. As the first blog in this series highlighted, the scientific enterprise is global and this is an issue which must be addressed at an international level.
In the life sciences, the ELIXIR and US BD2K initiatives have sought to develop coordinated approaches for supporting key resources and, more recently, the European Open Science Cloud initiative has developed a bold vision for a cloud-based infrastructure to store, share and re-use data across borders and disciplines.
Building on this momentum, the Human Frontiers Science Programme convened an international workshop last November to bring together data resources and major funders in the life sciences. This resulted in a call for action (reported in Nature) to coordinate efforts to ensure long-term sustainability of key resources, whilst supporting resources in providing access at no charge to users. The group proposed an international mechanism to prioritise core data resources of global importance, building on the work undertaken by ELIXIR to define criteria for such resources. It was proposed that national funders could then contribute a set proportion of their overall funding (with initial proposals suggesting around 1.5 to 2 per cent) to support these core data resources.
Grasping the nettle
Public and charitable funders are acutely aware that many of the core repositories and resources needed to make research outputs discoverable and useable will continue to rely on our long-term funding support. There is clear realisation that a reliance on traditional competitive grant funding is not the ideal route through which to support these key resources in a sustainable manner.
But no one yet has a perfect solution and no funder will take on this burden alone. Aligning global funders and developing joint funding models of the type described above will be far from straightforward, but hopefully we can work towards a more coordinated international approach. If we are to realise the incredible potential of open research, it’s a challenge we must address.
Published 26 July 2017 Written by David Carr, Wellcome Trust (d.carr@wellcome.ac.uk)
This blog is the first in a series of three which considers the perspectives of researchers, funders and universities in relation to the support for open resources, coordinated and written by Dr Lauren Cadwallader. This post asks the question: What is the responsibility of national funders to research resources that are internationally important?
In January 2017 the Office of Scholarly Communication and Wellcome Trust started an Open Research Pilot Project to try to understand how we could help our researchers work more openly and what barriers they faced with making their work open. A common theme among the groups that we are working with is the sustainability of open resources.
The Virtual Fly Brain Example
Let’s take the Connectomics group I am working with for example. They investigate the connections of neurons in fly brains (Drosophila). They produce a lot of data and are committed to sharing this openly. They share their data via the Virtual Fly Brain platform (VFB).
This platform was set up in 2009 by a group of researchers in Cambridge and Edinburgh; some of the VFB team are now also involved in the Connectomics group so there is a close relationship between these projects. The platform was created as a domain-specific location to curate existing data, taken from the literature, on Drosophila neurons and for curating and sharing new data produced by researchers working in this area.
Initially it was set up thanks to a grant from the Biotechnology and Biological Sciences Research Council (BBSRC). After an initial three year grant, the BBSRC declined to fund the database further. One likely reason for this is that the BBSRC resources scheme explicitly favours resources with a large number of UK users. The number of UK researchers who use Drosophila brain image data is relatively small (<10 labs), whereas the number of international researchers who use this data is relatively large, with an estimated 200 labs working on this type of data in other parts of the world.
Subsequently, the Wellcome Trust stepped in with funding for a further three years, due to end in September 2017. Currently it is uncertain whether or not they will fund it in the future. By now, almost eight years after its creation, VFB has become the go-to source for openly available data on Drosophila brain information and images integrated into a queryable platform. No other resource like it exists and no other research group is making moves to curate Drosophila neurobiology data openly. The VFB case raises interesting and important questions about how resources are funded and the future of domain specific open infrastructures.
The status quo
On the one hand funders like the Wellcome Trust, Research Councils UK and National Institutes of Health (NIH) are encouraging researchers to use domain specific repositories for data sharing. Yet on the other, they are acknowledging that the current approaches for these resources are not necessarily sustainable.
A recent review on building and sustaining data infrastructures commissioned by the Wellcome Trust acknowledges that in light of the FAIR principles “it is clear that data is best made available through repositories where aggregation can add most value”, which is arguably in a domain-specific repository. Use of domain-specific repositories allows data to be aggregated with similar data recorded using the same metadata fields.
It is also clear that publishers can influence where data is deposited, with publishers such as Nature Publishing Group, PLOS and F1000 all recommending subject-specific repositories as the first choice place for deposition. If no subject-specific repository is available then unstructured repositories, such as Dryad or figshare are often recommended instead, which complicates infrastructure needs and therefore provisions.
The economic model for supporting data infrastructures is something the Wellcome Trust are considering, with reports recently published by other funding agencies (here, here and here). The Wellcome Trust’s commissioned review noted that project-based funding for data infrastructures is not sustainable in the long term.
However, historically funders have encouraged, and still encourage, the use of domain specific resources, which have been born from project-based funding because of a lack of provision elsewhere. This has created a complex situation – researchers created domain specific data infrastructures using their project funding; these have become the subject norm; funders encourage their use, but now don’t have the mechanisms to be able to pledge sustained long-term funding.
National interests?
What is the responsibility of national funders to research resources that are internationally important? Academic research is collaborative. It crosses borders and utilises shared knowledge regardless of where it was generated and this is acknowledged by funders who see the benefits of collaboration. Yet, the strategic goals of funders, such as the BBSRC, are often focused on the national level when it comes to relevance and importance.
On the one hand it is understandable that funders concentrate on national interests – taxpayers’ money goes into the funder’s coffers and therefore they have a responsibility to those taxpayers to ensure that the money is spent on research that benefits the nation.
But, one could argue that international collaboration is in the national interest. The US-based NIH funds resources that are of international importance, including most of the model organism databases and genomic resources, such as the Gene Expression Omnibus. These are highly used by US researchers so one could argue that NIH are acting in the national interest but they are open to researchers all over the world and therefore constitute a resource of international importance.
Wellcome Trust do have a global outlook when it comes to funding, with 21% of their total spend (2015-6) going to projects outside of the UK. Yet, the VFB resource is still vulnerable despite being an internationally important resource.
One of the motivations for the Connectomics group to participate in the Open Research Pilot is to open a dialogue with the Wellcome Trust about these issues. The Wellcome Trust are committed to strategically investing in Open Research and encourage the use of domain-specific resources. The Connectomics group are interested in how this strategic investment will translate into actual funding decisions now and into the future.
Issues on which researchers would like clarification
All the researchers who are part of the Open Research Pilot have had the opportunity to contribute to questions on open resources sustainability. Posts on the funder’s and University’s perspective will be published as parts 2 and 3 of this blog.
What do you think is the responsibility of national funders towards research resources that are of more international benefit than national?
How do you think the funding landscape will react to the move towards open research in terms of supporting the sustainability of resources used for curating and sharing data?
Researchers are asked to share their data in domain specific resources if they are available. There are 1598 discipline specific repositories listed on re3data.org and each one needs to be supported. How big does a research community need to be to expect support?
What percentage of financial support should be focussed on resources versus primary research?
If funders are reluctant to pay for domain specific resources, is there a need to move to a researcher pays model for data sharing rather than centrally funding resources in some circumstances? Why? How do they envisage this being paid for?
How can we harmonise the approach to sustainable open resources across a global research community? Should we move to centralised infrastructures like the European Open Science Cloud?
More generally, how can funders and employers help to incentivise open research (carrot or stick)?
Wellcome often tries to act in a way to bring about change (e.g. open access publishing): Do they envisage that the long term funding of open research (10-20 years from now) will be very different from the situation over e.g. the next 5 years?
Published 23 June 2017 Written by Dr Lauren Cadwallader