
Cambridge Data Week 2020 day 4: Supporting researchers on data management – do we need a fairy godmother?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 5 blog

Introduction 

How should researchers’ data management activities and skills be supported? What are the data management responsibilities of the funder, the institution, the research group and the individual researcher? Should we focus on training researchers so they can carry out good data management themselves, or should we be funding specialist teams who can work with research groups, allowing the researchers to concentrate on research instead of data management? These were the questions addressed on day 4 of Cambridge Data Week 2020. This session benefitted from the perspectives of three speakers representing three different components of the research ecosystem: national funder, institutional research support, and department/institute. Respectively, these perspectives were provided by Tao-Tao Chang (Arts and Humanities Research Council [AHRC]), Marta Teperek (TU Delft) and Alastair Downie (The Gurdon Institute, Cambridge). 

From a funder’s perspective, and following UKRI community consultation, Tao-Tao explained that digital research infrastructure is recognised as an area for urgent investment, particularly in the arts and humanities, where the loss of both software and data is acute. Going forward, AHRC’s key priorities will be to prevent further data loss, invest in skills, build capability, and work with the community to effect a sustained change in research culture. At an institutional level, Marta argued that it is unfair for researchers to be left unsupported in managing their data. The TU Delft model addresses this in three ways: central data support, disciplinary support from data stewards employed as permanent faculty staff, and hands-on support for research groups from data managers and research software engineers. Regarding the latter, an important take-home message for all researchers, regardless of institutional affiliation, is to build data management costs into grant proposals. Alastair took up the discussion at the level of the department, the research group and even the individual researcher, highlighting how researchers are locked into infrastructure silos and into an unhelpful, competitive culture in which altruism is a risky proposition and the career benefits of sharing seem intangible or insufficient. He argued that the climate is right and the community is ready for change, and went on to discuss some positive developments afoot in the School of Biological Sciences to counteract these problems.  

Audience composition  

We had 291 registrations for the webinar, with just over 70% originating from the Higher Education sector. Researchers and PhD students accounted for 30% of the registrations whilst research support staff from various organisations accounted for an impressive 46%. On the day, we were thrilled to see that 136 people attended the webinar, participating from a wide range of countries. 

Recording, transcript and presentations 

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material 

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers: 

Talking about the technical side, have you yet come across anyone using a machine-implementable DMP? Setting up a data management infrastructure for a large project, it has become apparent that checking compliance with a DMP is a huge job, and of course there is minimal resource for doing this.

Marta Teperek Work is being done in this area by the Research Data Alliance, where several groups are working on machine-actionable DMPs. Basically, the idea is that instead of asking researchers to write long essays about how they are planning to manage their data, they are asked to provide structured answers. These can be multiple-choice options, for example, where the researcher specifies that they will be depositing large amounts of data in a repository, and the repository is then notified of the data coming its way. In other words, actions are triggered depending on what the researcher says they will do. The University of Queensland is already doing a lot on this [see link to blog post here and in Resources further below].
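
To make the idea concrete, here is a minimal, hypothetical sketch of how structured DMP answers could drive automatic follow-up actions. The field names and threshold are illustrative only (loosely inspired by the RDA work on a common standard for machine-actionable DMPs) and do not represent any real schema or repository integration.

# A minimal, illustrative sketch of a machine-actionable DMP (maDMP).
# All field names and the threshold below are hypothetical; this is not a
# real schema or a real repository service.

dmp = {
    "project": "Hypothetical imaging study",
    "datasets": [
        {
            "title": "Raw microscopy images",
            "expected_size_gb": 800,  # a structured answer, not a free-text essay
            "preferred_repository": "example-institutional-repository",
            "licence": "CC-BY-4.0",
            "contains_personal_data": False,
        }
    ],
}

def derive_actions(dmp, large_deposit_threshold_gb=500):
    """Turn structured DMP answers into follow-up actions automatically."""
    actions = []
    for dataset in dmp["datasets"]:
        if dataset["expected_size_gb"] >= large_deposit_threshold_gb:
            # e.g. warn the repository that a large deposit is on its way
            actions.append(
                f"Notify {dataset['preferred_repository']} of an expected "
                f"~{dataset['expected_size_gb']} GB deposit: {dataset['title']}"
            )
        if dataset["contains_personal_data"]:
            actions.append(f"Flag '{dataset['title']}' for ethics and GDPR review")
    return actions

print(derive_actions(dmp))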

What are the best cross-platform, mobile and desktop tools for data management?

Alastair Downie RDM encompasses far too broad a range of activities – it’s a concept rather than a single activity that you can build into a neat little app. In the context of electronic lab notebooks, for example, there are hundreds of apps that serve that function and some of them cross over into lab management as well. Those products that try to do too much become very bloated and complex, which makes them unattractive, and so we don’t see uptake of those kinds of products. I think a suite approach is better than a single solution.

Institutions audit spending on research grants; they should do the same for research data, and this should be a requirement of holding a grant.

Alastair Downie The Wellcome Trust is now challenging researchers to demonstrate that they have complied with their DMPs. It’s not particularly empirical, but the fact that they are demonstrating their determination to make sure that everyone is doing things properly is very helpful. 

Are there any specific infrastructure projects that the AHRC is sponsoring? I’m curious about what infrastructure/services would be useful for Arts and Humanities researchers.

Tao-Tao Chang Not at this juncture. But we are hoping that this will change. AHRC recognises the importance of good data management practice and the need to support it. We also recognise that there is a skills gap and that all researchers at every level need support.

Is there a 2020 edition of the State of Open Data report?

Yes, this was published five days after this webinar! See the Digital Science website and further below under ‘Resources’.

Conclusion 

There are two outcomes of the webinar to draw upon here. The first raises again the question: do researchers need, or even want, a fairy godmother to support their research data management?  We held a poll at the end of the webinar, asking participants to choose which one of the following statements they believe most strongly: (1) ‘Individual researchers should learn how to manage their own data well’ or (2) ‘Researchers’ data should be managed by funded RDM specialists so that researchers can focus on research’. Of the 78 respondents, 67% chose the first option and 33% chose the second. There was deliberately no intermediate option incorporating both, because we wanted to force a choice towards whichever statement participants believed more strongly. 

The results of the poll and the discussions during the webinar (between the speakers and within the chat) indicate that while individual researchers are responsible for managing their research data, support does need to be made available and promoted actively (we provide in the ‘Resources’ section some links to University of Cambridge research data management support). A second outcome reveals that support needs to be provided in several different guises. On the one hand, there is support that comes via the provision of funding, research data services and individually tailored expertise. On the other hand, there is support that will derive, albeit in a less tangible sense, from positive changes in research culture, specifically in terms of how the research of individual researchers is assessed and rewarded.  

Resources  

Some links to University of Cambridge research data management support include: the Research Data Management Policy Framework that outlines, for example, the data management responsibilities of research students and staff; our data management guide; a list of Cambridge Data Champions, searchable by areas of expertise. 

A recent Postdoc Academy podcast on ‘How can we improve the research culture at Cambridge?’ 

A description of different data management support roles at TU Delft, by Alastair Dunning and Marta Teperek: data steward, data manager, research software engineer, data scientist and data champion.  

A Gurdon Computing blog post by Alastair Downie on ‘Research data management as a national service’; in other words, providing shared national infrastructure and services rather than duplicating them across the research landscape. 

An article by Florian Markowetz, discussed in the webinar, on ‘Five selfish reasons to work reproducibly’ (in Genome Biology).

TU Delft Open Working blog post by Marta Teperek on machine actionable Data Management Plans (DMPs) at the University of Queensland. For more information, see this article by Miksa and colleagues on the ‘Ten principles for machine-actionable data management plans’ (in PLOS Computational Biology).  

The State of Open Data 2020 report, published on 1 December 2020. 

Published on 25 January 2021

Written by Dr Sacha Jones with contributions from Tao-Tao Chang, Dr Marta Teperek, Alastair Downie and Maria Angelaki. 


Cambridge Data Week 2020 day 5: How do we peer review data? New sustainable and effective models

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 4 blog

Introduction  

Cambridge Data Week 2020 concluded on 27 November with a discussion between Dr Lauren Cadwallader (PLOS), Professor Stephen Eglen (University of Cambridge) and Kiera McNeice (Cambridge University Press) on models of data peer review. The peer review process around data is still emerging despite the increase in data sharing. This session explored how peer review of data could be approached from both a publishing and a research perspective. 

The discussion focused on three main questions and here are a few snippets of what was said. If you’d like to explore the speakers’ answers in full, see the recording and transcript below.  

Why is it important to peer review datasets?

Are we in a post-truth world where claims can be made without needing to back them up? What if data could replace articles as the main output of research? What key criteria should peer review adopt?

Figure 1: Word cloud created by the audience in response to “Why is it important to peer review datasets?” The four most prominent words were integrity, quality, trust and reproducibility.

How should data review be done?

Can we drive the spread of Open Data by initially setting an incredibly low bar, encouraging everyone to share data even in its messy state? Are we reviewing to ensure reusability, or do we want to go further and check quality and reproducibility? Is data review a one-off event, or a continuous process involving everyone who reuses the data?


Who should be doing the work?

Are journals exclusively responsible for data review, or should authors, repository managers and other organisations be involved? Where will the money come from? What’s in it for researchers who volunteer as data reviewers? How do we introduce the peer review of data in a fair and equitable way?

Watch the session 

The video recording of the webinar can be found below and the transcript is available in Apollo, the University of Cambridge repository.

Bonus material 

After the end of the session, Lauren, Kiera and Stephen continued the discussion, prompted by a question from the audience about whether there should be some form of template or checklist for peer reviewing code. Here is what they said. 

Lauren Cadwallader  That’s an interesting idea, though of course code is written for different purposes: software, analysis, figures, and so on. Inevitably there will be different ways of reviewing it. Stephen, can you tell us more about your experience with CODECHECK? 

Stephen Eglen At CODECHECK we have a process to help codecheckers run research code and award a “certificate of executable computation”, like this example of a report. If doing nothing else, then copying whatever files you’ve got onto some repository, dirty and unstructured as that might seem, is still gold dust to the next researcher who comes along. Initially we can set the standards low, and from there we can come up with a whole range of more advanced quality checks. One question is ‘what are researchers willing to accept?’ I know of a couple of pilots that tried requiring more work from researchers in preparing and checking their files and code, such as the Code Ocean pilot that Kiera mentioned. I think that we have a community that understands the importance of this and is willing to put in some effort.  

Kiera McNeice There’s value in having checklists that are not extremely specialised, but tailored somewhat towards different subject areas. For instance, the American Journal of Political Science has two separate checklists, one for quantitative data and one for qualitative data. Certainly, some of our HSS editors have been saying that some policies developed for quantitative data do not work for their authors.  

Lauren Cadwallader  It might be easy to start with places where there are communities that are already engaged and have a framework for data sharing, so the peer review system would check that. What do you think? 

Kiera McNeice I guess there is a ‘chicken and egg’ issue: does this have to be driven from the top down, from publishers and funders, or does it come from the bottom up, with research communities initiating it? As journals, there is a concern that if we try to enforce very strict standards, then people will take their publications elsewhere. If there is no desire from the community for these changes, publisher enforcement can only go so far.  

Stephen Eglen Funders have an important role to play too. If they lead on this, researchers will follow, because ultimately researchers are focused on their careers. Unless there is recognition that doing this is a valuable part of one’s work, it will be hard to convince the majority of researchers to spend time on it.  

Take a pilot I was involved in with Nature Neuroscience. Originally this was meant to be a mandatory peer review of code after acceptance in principle, but in the end fears about driving away authors meant it was only made optional. Throughout a six-month trial, I was only aware of two papers that went through code review. I can see the barriers for both journals and authors, but if researchers received credit for doing it, this sort of thing would come from the bottom up. 

Lauren Cadwallader  In our biology-based model review pilot we ran a survey and found that many people opted in because they believe in open science, reproducibility, and so on, but two people opted in because they feared PLOS would think they had something to hide if they didn’t. That’s not at all what it was about. Although I suppose if it gets people sharing data… 

Conclusion 

We were intrigued by many of the ideas put forward by the speakers, particularly the areas of tension that will need to be resolved. For instance, as we try to move from a world where most data remains in people’s laptops and drawers to a FAIR data world, even sharing simple, messy, unstructured data is ‘gold dust’. Yet ultimately, we want data to be shared with extensive metadata and in an easily accessible form. What should the initial standards be, and how should they be raised over time? And how about the idea of asking Early Career Researchers to take on reviewer roles? Certainly they (and their research communities) would benefit in many ways from such involvement, but will they be able to fit this in their packed schedules?  

The audience engaged in lively discussion throughout the session, especially around the use of repositories, the need for training, and disciplinary differences. At the end of the session, they surprised us all with their responses to our poll: “Which peer review model would work best for data?” The most common response was “Incorporate it into the existing review of the article”, an option that had hardly been mentioned in the session. Perhaps we’ll need another webinar exploring this avenue next year! 

Figure 2: Audience responses to a poll held at the end of the event, “Which peer review model would work best for data?”

Resources 

Alexandra Freeman’s Octopus project aims to change the way we report research. Read the Octopus blog and an interview with Alex to find out more.  

‘Publish your computer code: it is good enough’, a column by Nick Barnes in Nature in 2010 arguing that sharing code, whatever the quality, is more helpful than keeping it in a drawer.  

The Center for Reproducible Biomedical Modelling has been working with PLOS on a pilot about reviewing models.  

PLOS guidelines on peer-reviewing data were produced in collaboration with the Cambridge Data Champions.

CODECHECK, led by Stephen Eglen, runs code to offer a “certificate of reproducible computation” to document that core research outputs could be recreated outside of the authors’ lab. 

Code Ocean is a platform for computational research that creates web-based capsules to help enable reproducibility.  

An editorial on the pilot for peer reviewing biology-based models in PLOS Computational Biology.

Published on 25 January 2021

Written by Beatrice Gini


The Role of Open Data in Science Communication

Itamar Shatz has written a guest blog post for the Office of Scholarly Communication about how public trust in the scientific community increases when researchers make their data openly available to all. He also emphasizes that science communicators (e.g. press offices, journalists, publishers) have a responsibility to point attention directly at the primary source of the data. Itamar is a PhD candidate in the Department of Theoretical and Applied Linguistics at the University of Cambridge. He is also a member of the Cambridge Data Champion programme, having joined at the start of this year. He writes about science and philosophy with practical applications at Effectiviology.com.

It’s no secret that the public’s view of the scientific community is far from ideal.

For example, a global survey published by the Wellcome Trust in 2019 showed that, on average, only 18% of people indicate that they have a high level of trust in scientists. Furthermore, the survey showed that there are stark differences between people living in different areas of the world; for instance, this rate was more than twice as high in Northern Europe (33%) and Central Asia (32%) as in Eastern Europe (15%), South America (13%), and Central Africa (12%).

Things do appear to be improving, to some degree, especially in light of the recent pandemic. For example, a recent survey in the UK, conducted by the Open Knowledge Foundation, has found that, following the COVID-19 pandemic, 64% of people are now “more likely to listen to expert advice from qualified scientists and researchers”. Similar increases in public confidence have been found in other countries, such as Germany and the USA. However, despite these recent increases, there is still much room for improvement.

Open data can help increase the public’s confidence in scientists

The public’s lack of confidence in scientists is a complex, multifaceted issue that is unlikely to be resolved by a single, neat solution. Nevertheless, one thing that can help alleviate this issue to some degree is open data, which is the practice of making data from scientific studies publicly accessible.

Research on the topic shows just how powerful this tool can be. For example, the recent survey by the Open Knowledge Foundation, conducted in the UK in response to the COVID-19 pandemic, found that 97% of those polled believed that it’s important for COVID-19 data to be openly available for people to check, and 67% believed that all COVID-19 related research and data should be openly available for anyone to use freely. Similarly, a 2019 US survey conducted before the pandemic found that 57% of Americans say that they trust the outcomes of scientific studies more if the data from the studies is openly available to the public.

Overall, such surveys strongly suggest that open data can help increase the public’s trust in scientists. However, it’s not enough for a study simply to have open data; if people don’t know about the open data, or if they don’t fully understand what it means, then open data is unlikely to be as beneficial as it could be. As such, in the following section we will see some guidelines on how to properly incorporate open data into science communication, in order to utilize this tool as effectively as possible.

How to incorporate open data into science communication

To properly incorporate open data into science communication, there are several key things that people who engage in science communication—such as journalists and scientists—should generally do:

  • Say that the study has open data. That is, you should explicitly mention that the researchers have made the data from their research openly available. Do not assume that people will go to the original study and then learn there about the data being open.
  • Explain what open data is. That is, you should briefly explain what it means for the data to be openly available, and potentially also mention the benefits of making the data available, for example in terms of making research more transparent, and in terms of helping other researchers reproduce the results.
  • Describe what sort of data has been made openly available. For example, you can include descriptions of the type of data involved (surveys, clinical reports, brain scans, etc.), together with some concrete examples that help the audience understand the data.
  • Explain where the data can be found. For example, this can be in the article’s “supplementary information” section, though data should preferably be available in a repository where the dataset has its own persistent identifier, such as a DOI. This ensures that the audience can find and access the data, which may otherwise be hidden behind a paywall, and offers other benefits, such as allowing researchers to directly access and cite the dataset, without navigating through the article.

These practices can help people better understand the concept of open data, particularly as it pertains to the study in question, and can help increase their trust in the openness of the data, especially if it is placed somewhere that they can access themselves.

For one example of how open data might be communicated effectively in a press release, consider the following:

“The researchers have made all the data from this study openly available; this means that all the results from their experiments can be freely accessed by anyone through a repository available at: https://www.doi.org/10.xxxxx/xxxxxxx. This can help other scientists verify and reproduce their results, and will aid future research on the topic.”
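
As a technical aside, one practical benefit of pointing readers at a dataset DOI rather than only at the article is that the identifier can also be resolved programmatically. The snippet below is a minimal sketch, reusing the placeholder DOI from the example above, of how standard DOI content negotiation (supported by DataCite and Crossref) can return a ready-made citation for a dataset; the exact output depends on the registration agency.

import requests

# Placeholder DOI from the press-release example above; replace with a real dataset DOI.
doi = "10.xxxxx/xxxxxxx"

# DOI content negotiation: asking https://doi.org/ for a formatted citation
# instead of the dataset landing page.
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "text/x-bibliography; style=apa"},
    timeout=30,
)
response.raise_for_status()

print(response.text)  # e.g. an APA-style reference for the dataset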

Open data in different types of scientific communications

It’s important to note that there’s no single right way to incorporate open data into scientific communications. This can be attributed to various factors, such as:

  • Differences between fields (e.g. biology, economics, or psychology)
  • Differences between types of studies (e.g. computational or experimental)
  • Differences between media (e.g. press release or social media post).

Nevertheless, the guidelines outlined earlier can be beneficial as initial considerations to take into account when deciding how to incorporate open data into science communication. It is up to communicators to make the final modifications, in order to use open data as effectively as possible in their particular situation.

Summarizing what we’ve learned

Though the public’s trust in science is currently growing, there is much room for improvement. One powerful tool that can aid the academic community is open data—the practice of making data from research studies openly available. However, to benefit as much as possible from the presence of open data, it’s not sufficient for a study to merely make its data open. Rather, the accessibility of the data needs to be promoted and explained in scientific communication, and the dataset needs to be cited appropriately (see the Joint Declaration of Data Citation Principles for guidelines regarding this latter point).

What is currently being done

It is important to note that much work is already being done to promote the concept of open data. For example, organizations such as the Research Data Alliance promote discussion of the topic and publish relevant material, as in the case of their recent guidelines and recommendations regarding COVID-19 data.

In addition, at the University of Cambridge, in particular, we can already see a substantial push for open data practices, where appropriate, and from many angles as outlined in the University’s Open Research position statement. Many funding bodies mandate that data be made available, and the University facilitates the process of sharing the data via Apollo, the institutional repository. Furthermore, there are the various training courses and publications—including this very blog—led by bodies such as the Office of Scholarly Communication (OSC), which help to promote Open Research practices at the University. Most notably, there is the OSC’s Data Champion programme, which deals, among other things, with supporting researchers with open data practices.

Moving forward

Promoting the use of open data in scientific communication is something that different stakeholders can do in different ways.

For example, those engaging in science communication—such as journalists and universities’ communication offices—can mention and explain open data when covering studies. Similarly, scientists can ask relevant communicators to cite their open data, and can also mention this information themselves when they engage in science communication directly. In addition, consumers of scientific communication and other relevant stakeholders—such as the general public, politicians, regulators, and funding bodies—can ask, whenever they hear about new research findings, whether the data was made openly available, and if not, then why.

Overall, such actions will lead to increased and more effective use of open data over time, which will help increase the trust people have in scientists. Furthermore, this will help promote the adoption of open data practices in the scientific community, by making more scientists aware of the concept, and by increasing their incentives for engaging in it.

Published 19 June 2020

Written by Itamar Shatz
