Tag Archives: research data management

Cambridge Data Week 2020 day 4: Supporting researchers on data management – do we need a fairy godmother?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event: 

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 5 blog

Introduction 

How should researchers’ data management activities and skills be supported? What are the data management responsibilities of the funder, the institution, the research group and the individual researcher? Should we focus on training researchers so they can carry out good data management themselves or should we be funding specialist teams who can work with research groups, allowing the researchers to concentrate on research instead of data management? These were the questions addressed on day 4 of Cambridge Data Week 2020. This session benefitted from the perspectives of three speakers drawn from three different components of the research ecosystem: national funder, institutional research support and department/institute. Respectively, these were provided by Tao-Tao Chang (Arts and Humanities Research Council [AHRC]), Marta Teperek (TU Delft) and Alastair Downie (The Gurdon Institute, Cambridge). 

From a funder’s perspective, and following UKRI community consultation, Tao-Tao explains that digital research infrastructure is recognised as an area for urgent investment, particularly in the arts and humanities, where the loss of both software and data is acute. Going forwards, AHRC’s key priorities will be to prevent further data loss, invest in skills, build capability, and work with the community to effect a sustained change in research culture. At an institutional level, Marta argues that it is unfair to leave researchers unsupported in managing their data. The TU Delft model addresses this in three ways: central data support, disciplinary support by data stewards employed as permanent faculty staff, and hands-on support for research groups from data managers and research software engineers. Regarding the latter, an important take-home message for all researchers, regardless of institutional affiliation, is to build data management costs into grant proposals. Alastair takes up the discussion at the level of the department, research group and even the individual, highlighting how researchers are locked into infrastructure silos and into an unhelpfully competitive culture where altruism is a risky proposition and the career benefits of sharing seem intangible or insufficient. Alastair proposes that the climate is right and the community is ready for change, and goes on to discuss some positive changes afoot in the School of Biological Sciences to counteract these problems.

Audience composition  

We had 291 registrations for the webinar, with just over 70% originating from the Higher Education sector. Researchers and PhD students accounted for 30% of the registrations whilst research support staff from various organisations accounted for an impressive 46%. On the day, we were thrilled to see that 136 people attended the webinar, participating from a wide range of countries. 

Recording, transcript and presentations 

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material 

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers: 

Talking about the technical side, have you yet come across anyone using a machine-implementable DMP? Setting up data management infrastructure for a large project, it has become apparent that checking compliance with a DMP is a huge job, and of course there is minimal resource for doing this.

Marta Teperek Work is being done in this area by the Research Data Alliance, where several groups are working on machine-actionable DMPs. Basically, the idea is that instead of asking researchers to write long essays about how they are planning to manage their data, they are asked to provide structured answers. These can be multiple-choice options, for example, where the researcher specifies that they will be depositing large amounts of data in a repository, and the repository is notified of data coming its way. In other words, actions are taken depending on what the researcher says they will do. The University of Queensland is doing a lot on this already [see link to blog post here and in Resources further below].
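The idea Marta describes can be sketched in a few lines of code. This is a deliberately minimal illustration, not a real maDMP implementation: the field names below are made up for the example (the RDA DMP Common Standard defines the actual schema), but it shows how structured answers, unlike free-text essays, can be read by software and turned into actions such as notifying a repository of a large incoming deposit.

```python
# Hypothetical structured DMP answers (field names are illustrative only;
# a real machine-actionable DMP would follow the RDA DMP Common Standard).
dmp = {
    "title": "Example project DMP",
    "datasets": [
        {
            "title": "Survey responses",
            "estimated_size_gb": 250,
            "repository": "institutional-repo",
        },
    ],
}

def plan_actions(dmp, large_threshold_gb=100):
    """Derive follow-up actions from the structured answers."""
    actions = []
    for ds in dmp["datasets"]:
        if ds["estimated_size_gb"] >= large_threshold_gb:
            # e.g. warn the repository that a large deposit is on its way
            actions.append(
                f"notify {ds['repository']}: expect ~{ds['estimated_size_gb']} GB "
                f"for '{ds['title']}'"
            )
    return actions

print(plan_actions(dmp))
```

Because the answers are data rather than prose, the same plan can also be checked for compliance automatically, which speaks directly to the questioner's concern about the cost of manual compliance checking.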

What are the best cross-platform, mobile and desktop tools for data management?

Alastair Downie RDM encompasses far too broad a range of activities – it’s a concept rather than a single activity that you can build into a neat little app. In the context of electronic lab notebooks, for example, there are hundreds of apps that serve that function, and some of them cross over into lab management as well. Products that try to do too much become very bloated and complex, which makes them unattractive, and so we don’t see uptake of those kinds of products. I think a suite approach is better than a single solution.

Institutions audit spending on research grants; they should do the same for research data, and this should be a requirement of holding a grant.

Alastair Downie Wellcome Trust are now challenging researchers to demonstrate that they have complied with their DMPs. It’s not particularly empirical but the fact that they are demonstrating their determination to make sure that everyone’s doing things properly is very helpful. 

Are there any specific infrastructure projects that the AHRC is sponsoring? I’m curious about what infrastructure/services would be useful for Arts and Humanities researchers

Tao-Tao Chang Not at this juncture. But we are hoping that this will change. AHRC recognises the importance of good data management practice and the need to support it. We also recognise that there is a skills gap and that all researchers at every level need support.

Is there a 2020 edition of the State of Open Data report?

Yes, this was published five days after this webinar! See the Digital Science website and further below under ‘Resources’.

Conclusion 

There are two outcomes of the webinar to draw upon here. The first raises again the question: do researchers need, or even want, a fairy godmother to support their research data management?  We held a poll at the end of the webinar, asking participants to choose which one of the following statements they believe most strongly: (1) ‘Individual researchers should learn how to manage their own data well’ or (2) ‘Researchers’ data should be managed by funded RDM specialists so that researchers can focus on research’. Of the 78 respondents, 67% chose the first option and 33% chose the second. There was not an intermediate option to incorporate both, simply because we wanted to force a choice in the direction of strongest belief when the two options are considered relative to one another. 

The results of the poll and the discussions during the webinar (between the speakers and within the chat) indicate that while individual researchers are responsible for managing their research data, support does need to be made available and promoted actively (we provide in the ‘Resources’ section some links to University of Cambridge research data management support). A second outcome reveals that support needs to be provided under several different guises. On the one hand, there is support that comes via the provision of funding, research data services and individually tailored expertise. Yet, on the other hand, there is support that will derive, albeit in a less tangible sense, from positive changes in research culture, specifically in terms of how the research of individual researchers is assessed and rewarded.  

Resources  

Some links to University of Cambridge research data management support include: the Research Data Management Policy Framework that outlines, for example, the data management responsibilities of research students and staff; our data management guide; a list of Cambridge Data Champions, searchable by areas of expertise. 

A recent Postdoc Academy podcast on ‘How can we improve the research culture at Cambridge?’ 

A description of the different data management support roles at TU Delft, by Alastair Dunning and Marta Teperek: data steward, data manager, research software engineer, data scientist and data champion.

A Gurdon Computing blog post by Alastair Downie on ‘Research data management as a national service’; in other words, providing RDM as a shared national service rather than duplicating infrastructure and services across the research landscape.

An article by Florian Markowetz, discussed in the webinar, on ‘Five selfish reasons to work reproducibly’ (in Genome Biology).

A TU Delft Open Working blog post by Marta Teperek on machine-actionable Data Management Plans (DMPs) at the University of Queensland. For more information, see this article by Miksa and colleagues on the ‘Ten principles for machine-actionable data management plans’ (in PLOS Computational Biology).

The State of Open Data 2020 report, published on 1 December 2020. 

Published on 25 January 2021

Written by Dr Sacha Jones with contributions from Tao-Tao Chang, Dr Marta Teperek, Alastair Downie and Maria Angelaki. 

CCBY icon

Cambridge Data Week 2020 day 5: How do we peer review data? New sustainable and effective models

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event:   

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 4 blog

Introduction  

Cambridge Data Week 2020 concluded on 27 November with a discussion between Dr Lauren Cadwallader (PLOS), Professor Stephen Eglen (University of Cambridge) and Kiera McNeice (Cambridge University Press) on models of data peer review. The peer review process around data is still emerging despite the increase in data sharing. This session explored how peer review of data could be approached from both a publishing and a research perspective. 

The discussion focused on three main questions and here are a few snippets of what was said. If you’d like to explore the speakers’ answers in full, see the recording and transcript below.  

Why is it important to peer review datasets?

Are we in a post-truth world where claims can be made without needing to back them up? What if data could replace articles as the main output of research? What key criteria should peer review adopt?

Figure 1: Word cloud created by the audience in response to “Why is it important to peer review datasets?” The four most prominent words are integrity, quality, trust and reproducibility.

How should data review be done?

Can we drive the spread of Open Data by initially setting an incredibly low bar, encouraging everyone to share data even in its messy state? Are we reviewing to ensure reusability, or do we want to go further and check quality and reproducibility? Is data review a one-off event, or a continuous process involving everyone who reuses the data?

Who should be doing the work?

Are journals exclusively responsible for data review, or should authors, repository managers and other organisations be involved? Where will the money come from? What’s in it for researchers who volunteer as data reviewers? How do we introduce the peer review of data in a fair and equitable way?

Watch the session 

The video recording of the webinar can be found below, and the transcript is available in Apollo, the University of Cambridge repository.

Bonus material 

After the end of the session, Lauren, Kiera and Stephen continued the discussion, prompted by a question from the audience about whether there should be some form of template or checklist for peer reviewing code. Here is what they said. 

Lauren Cadwallader That’s an interesting idea, though of course code is written for different purposes – software, analysis, figures, and so on – so inevitably there will be different ways of reviewing it. Stephen, can you tell us more about your experience with CODECHECK?

Stephen Eglen At CODECHECK we have a process to help codecheckers run research code and award a “certificate of executable computation”, like this example of a report. If you do nothing else, then copying whatever files you’ve got onto some repository, dirty and unstructured as that might seem, is still gold dust to the next researcher who comes along. Initially we can set the standards low, and from there we can come up with a whole range of more advanced quality checks. One question is ‘what are researchers willing to accept?’ I know of a couple of pilots that tried requiring more work from researchers in preparing and checking their files and code, such as the Code Ocean pilot that Kiera mentioned. I think that we have a community that understands the importance of this and is willing to put in some effort.

Kiera McNeice There’s value in having checklists that are not extremely specialised, but tailored somewhat towards different subject areas. For instance, the American Journal of Political Science has two separate checklists, one for quantitative data and one for qualitative data. Certainly, some of our HSS editors have been saying that some policies developed for quantitative data do not work for their authors.  

Lauren Cadwallader  It might be easy to start with places where there are communities that are already engaged and have a framework for data sharing, so the peer review system would check that. What do you think? 

Kiera McNeice I guess there is a ‘chicken and egg’ issue: does this have to be driven from the top down, from publishers and funders, or does it come from the bottom up, with research communities initiating it? As journals, there is a concern that if we try to enforce very strict standards, then people will take their publications elsewhere. If there is no desire from the community for these changes, publisher enforcement can only go so far.  

Stephen Eglen Funders have an important role to play too. If they lead on this, researchers will follow, because ultimately researchers are focused on their careers. Unless there is recognition that doing this is a valuable part of one’s work, it will be hard to convince the majority of researchers to spend time on it.

Take a pilot I was involved in with Nature Neuroscience. Originally this was meant to be a mandatory peer review of code after acceptance in principle, but in the end fears about driving away authors meant it was only made optional. Throughout a six-month trial, I was only aware of two papers that went through code review. I can see the barriers for both journal and authors, but if researchers received credit for doing it, this sort of thing will come from the bottom up. 

Lauren Cadwallader  In our biology-based model review pilot we ran a survey and found that many people opted in because they believe in open science, reproducibility, and so on, but two people opted in because they feared PLOS would think they had something to hide if they didn’t. That’s not at all what it was about. Although I suppose if it gets people sharing data… 
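The core step a codechecker performs, re-running the authors’ code and confirming it regenerates the declared outputs, can be sketched in miniature. This is not CODECHECK’s actual tooling; the manifest format and file names below are purely illustrative assumptions, but they capture the “low bar first” idea discussed above: even a simple checksum comparison of regenerated outputs is a meaningful check.

```python
# A deliberately minimal sketch of one step in a code check: after
# re-running the authors' analysis, compare the regenerated output files
# against a manifest of expected checksums. Real codechecks also involve
# environments, human judgement and a written report.

import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum used to compare a regenerated output with the manifest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_outputs(manifest: dict[str, str], outdir: Path) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for name, expected in manifest.items():
        f = outdir / name
        if not f.exists():
            problems.append(f"{name}: missing")
        elif sha256(f) != expected:
            problems.append(f"{name}: checksum mismatch")
    return problems
```

A checklist of this kind sets exactly the kind of low initial bar Stephen describes: it verifies that the outputs exist and match, without yet judging code quality or style.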

Conclusion 

We were intrigued by many of the ideas put forward by the speakers, particularly the areas of tension that will need to be resolved. For instance, as we try to move from a world where most data remains in people’s laptops and drawers to a FAIR data world, even sharing simple, messy, unstructured data is ‘gold dust’. Yet ultimately, we want data to be shared with extensive metadata and in an easily accessible form. What should the initial standards be, and how should they be raised over time? And how about the idea of asking Early Career Researchers to take on reviewer roles? Certainly they (and their research communities) would benefit in many ways from such involvement, but will they be able to fit this into their packed schedules?

The audience engaged in lively discussion throughout the session, especially around the use of repositories, the need for training, and disciplinary differences. At the end of the session, they surprised us all with their responses to our poll: “Which peer review model would work best for data?” The most common response was ‘Incorporate it into the existing review of the article’, an option that had hardly been mentioned in the session. Perhaps we’ll need another webinar exploring this avenue next year!

Figure 2: Audience responses to a poll held at the end of the event, “Which peer review model would work best for data?”

Resources 

Alexandra Freeman’s Octopus project aims to change the way we report research. Read the Octopus blog and an interview with Alex to find out more.  

Publish your computer code: it is good enough, a column by Nick Barnes in Nature in 2010 arguing that sharing code, whatever the quality, is more helpful than keeping it in a drawer.  

The Center for Reproducible Biomedical Modelling has been working with PLOS on a pilot about reviewing models.  

PLOS guidelines on peer-reviewing data were produced in collaboration with the Cambridge Data Champions.

CODECHECK, led by Stephen Eglen, runs code to offer a “certificate of reproducible computation” to document that core research outputs could be recreated outside of the authors’ lab. 

Code Ocean is a platform for computational research that creates web-based capsules to help enable reproducibility.  

An editorial on the pilot for peer-reviewing biology-based models in PLOS Computational Biology.

Published on 25 January 2021

Written by Beatrice Gini

CCBY icon

Research Data at Cambridge – highlights of the year so far

By Dr Sacha Jones, Research Data Coordinator

This year we have continued, as always, to provide support and services for researchers to help with their research data management and open data practices. So far in 2020, we have approved more than 230 datasets into our institutional repository, Apollo. This includes Apollo’s 2000th dataset on the impact of health warning labels on snack selection, which represents a shining example of reproducible research, involving the full gamut: preregistration and the sharing of consent forms, code, protocols and data. There are other studies that have sparked media interest for which the data are also openly available in Apollo, such as the data supporting research that reports the development of a wireless device that can convert sunlight, carbon dioxide and water into a carbon-neutral fuel, or the data supporting a study that used computational modelling to explain why blues and greens are the brightest colours in nature. Also, in the year of COVID, a dataset was published in April on the ability of common fabrics to filter ultrafine particles, associated with an article in BMJ Open. Sharing data associated with publications is critical for the integrity of many disciplines and best practice in the majority of studies, but science communication in particular also has an important responsibility to bring research datasets to the forefront. This point was discussed eloquently this summer in a guest blog post in Unlocking Research by Itamar Shatz, a researcher and Cambridge Data Champion. Making datasets open permits their reuse; if you have ever wondered how research data is reused, read this comprehensive data sharing and reuse case study written by the Research Data team’s Dominic Dixon. It centres on the use and value of the Mammographic Image Society database, published in Apollo five years ago. 

This year has seen the necessary move from our usual face-to-face Research Data Management (RDM) training to provision of training online. This has led us to produce an online training session in RDM, covering topics such as data organisation, storage, back-up and sharing, as well as data management plans. This forms one component of a broader Research Skills Guide – an online course for Cambridge researchers on publishing, managing data, and finding and disseminating research – developed by Dr Bea Gini, the OSC’s training coordinator. We have also contributed to a ‘Managing your study resources’ CamGuide for Master’s students, providing guidance on how to work reproducibly. Last month, in collaboration with several University stakeholders, we released new guidance on the use of electronic research notebooks (ERNs), providing information on the features of ERNs and guidance to help researchers select one that is suitable. 

At the start of this year we invited members of the University to apply to become Data Champions, joining the pre-existing community of 72 Data Champions. The 2020 call was very successful, with us welcoming 56 new Data Champions to the programme. The community has expanded this year, not only in terms of numbers of volunteers but also in terms of disciplinary focus, where there are now Data Champions in several areas of the arts, humanities and social sciences in particular where there were none previously. During this year, we have held forums in person and then online, covering themes such as how to curate manual research records, ideas for RDM guidance materials, data management in the time of coronavirus, and data practices in the arts and humanities and how these can be best supported. We look forward to further supporting and advocating the fantastic work of the Cambridge Data Champions in the months and years to come.