Category Archives: Uncategorized

Cambridge Data Week 2020 day 4: Supporting researchers on data management – do we need a fairy godmother?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event: 

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 5 blog

Introduction 

How should researchers' data management activities and skills be supported? What are the data management responsibilities of the funder, the institution, the research group and the individual researcher? Should we focus on training researchers so they can carry out good data management themselves, or should we be funding specialist teams who can work with research groups, allowing the researchers to concentrate on research instead of data management? These were the questions addressed on day 4 of Cambridge Data Week 2020. This session benefitted from the perspectives of three speakers representing three different parts of the research ecosystem: national funder, institutional research support and department/institute. Respectively, these were provided by Tao-Tao Chang (Arts and Humanities Research Council [AHRC]), Marta Teperek (TU Delft) and Alastair Downie (The Gurdon Institute, Cambridge). 

From a funder's perspective, and following UKRI community consultation, Tao-Tao specifies that digital research infrastructure is recognised as an area for urgent investment, particularly in the arts and humanities, where the loss of both software and data is acute. Going forwards, AHRC's key priorities will be to prevent further data loss, invest in skills, build capability, and work with the community to effect a sustained change in research culture. At an institutional level, Marta argues that it is unfair for researchers to be left unsupported in managing their data. The TU Delft model addresses this via three routes: central data support, disciplinary support from data stewards employed as permanent faculty staff, and hands-on support for research groups from data managers and research software engineers. Regarding the latter, an important take-home message for all researchers, regardless of institutional affiliation, is to build data management costs into grant proposals. Alastair takes up the discussion at the level of the department, research group and even the individual researcher, highlighting how researchers are locked into infrastructure silos and into an unhelpful, competitive culture where altruism is a risky proposition and the career benefits of sharing seem intangible or insufficient. Alastair proposes that the climate is right and the community is ready for change, and goes on to discuss some positive changes afoot in the School of Biological Sciences to counteract these problems.  

Audience composition  

We had 291 registrations for the webinar, with just over 70% originating from the Higher Education sector. Researchers and PhD students accounted for 30% of the registrations whilst research support staff from various organisations accounted for an impressive 46%. On the day, we were thrilled to see that 136 people attended the webinar, participating from a wide range of countries. 

Recording, transcript and presentations 

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material 

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers: 

Talking about the technical side, have you yet come across anyone using a machine-implementable DMP? Setting up a data management infrastructure for a large project, it has become apparent that checking compliance with a DMP is a huge job, and of course there is minimal resource for doing this.

Marta Teperek Work is being done in this area by the Research Data Alliance, where there are several groups working on machine-actionable DMPs. Basically, the idea is that instead of asking researchers to write long essays about how they are planning to manage their data, they are asked to provide structured answers. These can be multiple-choice options, for example, where the researcher specifies that they will be depositing large amounts of data in a repository, and the repository is then notified of the data coming its way. In other words, actions are triggered depending on what the researcher says they will do. The University of Queensland is already doing a lot on this [see link to blog post here and in Resources further below].
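
To make this concrete, below is a minimal, hypothetical sketch (in Python) of how structured DMP answers could drive automatic actions. The field names, the 100 GB threshold and the notify_repository() helper are illustrative assumptions only; they are not part of any RDA specification or of the University of Queensland's implementation.

```python
# Hypothetical sketch: a machine-actionable DMP as structured data,
# with actions derived from the researcher's answers rather than free text.
from dataclasses import dataclass


@dataclass
class DMPDataset:
    title: str
    expected_size_gb: float
    repository: str       # where the researcher plans to deposit the data
    personal_data: bool   # True if the dataset contains personal data


def notify_repository(dataset: DMPDataset) -> str:
    """Illustrative stub: alert a repository that a large deposit is expected."""
    return (f"Notify {dataset.repository}: expect ~{dataset.expected_size_gb} GB "
            f"for '{dataset.title}'")


def actions_for(plan: list[DMPDataset], large_threshold_gb: float = 100) -> list[str]:
    """Derive follow-up actions from the structured answers."""
    actions = []
    for ds in plan:
        if ds.expected_size_gb >= large_threshold_gb:
            actions.append(notify_repository(ds))
        if ds.personal_data:
            actions.append(f"Flag '{ds.title}' for data protection review")
    return actions


if __name__ == "__main__":
    plan = [
        DMPDataset("Imaging dataset", 500, "Apollo", personal_data=False),
        DMPDataset("Interview transcripts", 0.2, "Apollo", personal_data=True),
    ]
    for action in actions_for(plan):
        print(action)
```

The point of the sketch is simply that a structured plan can be read by software, so that, for example, a repository can be warned of a large incoming deposit the moment the DMP is submitted.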

What are the best cross-platform, mobile and desktop tools for data management?

Alastair Downie RDM encompasses far too broad a range of activities – it's a concept rather than a single activity that you can build into a neat little app. In the context of electronic lab notebooks, for example, there are hundreds of apps that serve that function, and some of them cross over into lab management as well. Products that try to do too much become very bloated and complex, which makes them unattractive, and so we don't see uptake of those kinds of products. I think a suite approach is better than a single solution.

Institutions audit spending on research grants; they should do the same for research data, and this should be a requirement of holding a grant.

Alastair Downie Wellcome Trust are now challenging researchers to demonstrate that they have complied with their DMPs. It’s not particularly empirical but the fact that they are demonstrating their determination to make sure that everyone’s doing things properly is very helpful. 

Are there any specific infrastructure projects that the AHRC is sponsoring? I'm curious about what infrastructure/services would be useful for Arts and Humanities researchers.

Tao-Tao Chang Not at this juncture. But we are hoping that this will change. AHRC recognises the importance of good data management practice and the need to support it. We also recognise that there is a skills gap and that all researchers at every level need support.

Is there a 2020 edition of the State of Open Data report?

Yes, this was published five days after this webinar! See the Digital Science website and further below under ‘Resources’.

Conclusion 

There are two outcomes of the webinar to draw upon here. The first raises again the question: do researchers need, or even want, a fairy godmother to support their research data management?  We held a poll at the end of the webinar, asking participants to choose which one of the following statements they believe most strongly: (1) ‘Individual researchers should learn how to manage their own data well’ or (2) ‘Researchers’ data should be managed by funded RDM specialists so that researchers can focus on research’. Of the 78 respondents, 67% chose the first option and 33% chose the second. There was not an intermediate option to incorporate both, simply because we wanted to force a choice in the direction of strongest belief when the two options are considered relative to one another. 

The results of the poll and the discussions during the webinar (between the speakers and within the chat) indicate that while individual researchers are responsible for managing their research data, support does need to be made available and promoted actively (we provide in the ‘Resources’ section some links to University of Cambridge research data management support). A second outcome reveals that support needs to be provided under several different guises. On the one hand, there is support that comes via the provision of funding, research data services and individually tailored expertise. Yet, on the other hand, there is support that will derive, albeit in a less tangible sense, from positive changes in research culture, specifically in terms of how the research of individual researchers is assessed and rewarded.  

Resources  

Some links to University of Cambridge research data management support include: the Research Data Management Policy Framework that outlines, for example, the data management responsibilities of research students and staff; our data management guide; a list of Cambridge Data Champions, searchable by areas of expertise. 

A recent Postdoc Academy podcast on ‘How can we improve the research culture at Cambridge?’ 

A description of the different data management support roles at TU Delft, by Alastair Dunning and Marta Teperek: data steward, data manager, research software engineer, data scientist and data champion.  

A Gurdon Computing blog post by Alastair Downie on 'Research data management as a national service', arguing for shared national provision rather than duplicating infrastructure and services across the research landscape. 

An article by Florian Markowetz, discussed in the webinar, on ‘Five selfish reasons to work reproducibly’ (in Genome Biology)

TU Delft Open Working blog post by Marta Teperek on machine actionable Data Management Plans (DMPs) at the University of Queensland. For more information, see this article by Miksa and colleagues on the ‘Ten principles for machine-actionable data management plans’ (in PLOS Computational Biology).  

The State of Open Data 2020 report, published on 1 December 2020. 

Published on 25 January 2021

Written by Dr Sacha Jones with contributions from Tao-Tao Chang, Dr Marta Teperek, Alastair Downie and Maria Angelaki. 

This blog post is licensed under CC BY 4.0.

Cambridge Data Week 2020 day 5: How do we peer review data? New sustainable and effective models

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event:   

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 4 blog

Introduction  

Cambridge Data Week 2020 concluded on 27 November with a discussion between Dr Lauren Cadwallader (PLOS), Professor Stephen Eglen (University of Cambridge) and Kiera McNeice (Cambridge University Press) on models of data peer review. The peer review process around data is still emerging despite the increase in data sharing. This session explored how peer review of data could be approached from both a publishing and a research perspective. 

The discussion focused on three main questions and here are a few snippets of what was said. If you’d like to explore the speakers’ answers in full, see the recording and transcript below.  

Why is it important to peer review datasets?

Are we in a post-truth world where claims can be made without needing to back them up? What if data could replace articles as the main output of research? What key criteria should peer review adopt?

Figure 1: Word cloud created by the audience in response to “Why is it important to peer review datasets?” The four most prominent words are integrity, quality, trust and reproducibility.

How should data review be done?

Can we drive the spread of Open Data by initially setting an incredibly low bar, encouraging everyone to share data even in its messy state? Are we reviewing to ensure reusability, or do we want to go further and check quality and reproducibility? Is data review a one-off event, or a continuous process involving everyone who reuses the data?

Who should be doing the work?

Are journals exclusively responsible for data review, or should authors, repository managers and other organisations be involved? Where will the money come from? What’s in it for researchers who volunteer as data reviewers? How do we introduce the peer review of data in a fair and equitable way?

Watch the session 

The video recording of the webinar can be found below, and the transcript is available in Apollo, the University of Cambridge repository.

Bonus material 

After the end of the session, Lauren, Kiera and Stephen continued the discussion, prompted by a question from the audience about whether there should be some form of template or checklist for peer reviewing code. Here is what they said. 

Lauren Cadwallader  That's an interesting idea, though of course code is written for different purposes: software, analysis, figures, and so on. Inevitably there will be different ways of reviewing it. Stephen, can you tell us more about your experience with CODECHECK? 

Stephen Eglen At CODECHECK we have a process to help codecheckers run research code and award a “certificate of executable computation”, like this example of a report. If you do nothing else, copying whatever files you've got onto some repository, dirty and unstructured as that might seem, is still gold dust to the next researcher who comes along. Initially we can set the standards low, and from there we can come up with a whole range of more advanced quality checks. One question is ‘what are researchers willing to accept?’ I know of a couple of pilots that tried requiring more work from researchers in preparing and checking their files and code, such as the Code Ocean pilot that Kiera mentioned. I think that we have a community that understands the importance of this and is willing to put in some effort.  
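
As a rough illustration of the kind of check a codechecker could automate, here is a hypothetical Python sketch that re-runs an analysis entry point and verifies that the declared output files are regenerated. The manifest format, file names and placeholder digests are assumptions for illustration; this is not the actual CODECHECK workflow, which relies on a human codechecker re-running the authors' code and signing a certificate.

```python
# Hypothetical sketch of an automated output check, not CODECHECK's own tooling.
import hashlib
import subprocess
from pathlib import Path

# Output files the authors claim their code regenerates, with recorded digests.
MANIFEST = {
    "results/figure1.png": "9b2c...",  # placeholder SHA-256 digest
    "results/table1.csv": "f41a...",   # placeholder SHA-256 digest
}


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_check(entry_point: str = "scripts/run_analysis.py") -> bool:
    """Re-run the authors' entry point, then compare outputs to the manifest."""
    subprocess.run(["python", entry_point], check=True)
    all_present = True
    for name, expected in MANIFEST.items():
        path = Path(name)
        if not path.exists():
            print(f"MISSING: {name}")
            all_present = False
        elif sha256(path) != expected:
            # Differences are reported but may be acceptable (e.g. embedded timestamps).
            print(f"DIFFERS: {name}")
        else:
            print(f"OK: {name}")
    return all_present


if __name__ == "__main__":
    run_check()
```

Even a check this simple documents whether the shared files actually reproduce the declared outputs, which fits the ‘set the standards low first’ approach described above.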

Kiera McNeice There’s value in having checklists that are not extremely specialised, but tailored somewhat towards different subject areas. For instance, the American Journal of Political Science has two separate checklists, one for quantitative data and one for qualitative data. Certainly, some of our HSS editors have been saying that some policies developed for quantitative data do not work for their authors.  

Lauren Cadwallader  It might be easy to start with places where there are communities that are already engaged and have a framework for data sharing, so the peer review system would check that. What do you think? 

Kiera McNeice I guess there is a ‘chicken and egg’ issue: does this have to be driven from the top down, from publishers and funders, or does it come from the bottom up, with research communities initiating it? As journals, there is a concern that if we try to enforce very strict standards, then people will take their publications elsewhere. If there is no desire from the community for these changes, publisher enforcement can only go so far.  

Stephen Eglen Funders have an important role to play too. If they lead on this, researchers will follow, because ultimately researchers are focused on their careers. Unless there is recognition that doing this is a valuable part of one's work, it will be hard to convince the majority of researchers to spend time on it.  

Take a pilot I was involved in with Nature Neuroscience. Originally this was meant to be a mandatory peer review of code after acceptance in principle, but in the end fears about driving away authors meant it was only made optional. Throughout a six-month trial, I was only aware of two papers that went through code review. I can see the barriers for both journal and authors, but if researchers received credit for doing it, this sort of thing will come from the bottom up. 

Lauren Cadwallader  In our biology-based model review pilot we ran a survey and found that many people opted in because they believe in open science, reproducibility, and so on, but two people opted in because they feared PLOS would think they had something to hide if they didn’t. That’s not at all what it was about. Although I suppose if it gets people sharing data… 

Conclusion 

We were intrigued by many of the ideas put forward by the speakers, particularly the areas of tension that will need to be resolved. For instance, as we try to move from a world where most data remains on people's laptops and in their drawers to a FAIR data world, even sharing simple, messy, unstructured data is ‘gold dust’. Yet ultimately, we want data to be shared with extensive metadata and in an easily accessible form. What should the initial standards be, and how should they be raised over time? And how about the idea of asking Early Career Researchers to take on reviewer roles? Certainly they (and their research communities) would benefit in many ways from such involvement, but will they be able to fit this into their packed schedules?  

The audience engaged in lively discussion throughout the session, especially around the use of repositories, the need for training, and disciplinary differences. At the end of the session, they surprised us all with their responses to our poll: “Which peer review model would work best for data?” The most common response was “Incorporate it into the existing review of the article”, an option that had hardly been mentioned in the session. Perhaps we'll need another webinar exploring this avenue next year! 

Figure 2: Audience responses to a poll held at the end of the event, “Which peer review model would work best for data?”

Resources 

Alexandra Freeman’s Octopus project aims to change the way we report research. Read the Octopus blog and an interview with Alex to find out more.  

‘Publish your computer code: it is good enough’, a column by Nick Barnes in Nature (2010) arguing that sharing code, whatever the quality, is more helpful than keeping it in a drawer.  

The Center for Reproducible Biomedical Modelling has been working with PLOS on a pilot about reviewing models.  

PLOS guidelines on peer-reviewing data were produced in collaboration with the Cambridge Data Champions.

CODECHECK, led by Stephen Eglen, runs code to offer a “certificate of reproducible computation” to document that core research outputs could be recreated outside of the authors’ lab. 

Code Ocean is a platform for computational research that creates web-based capsules to help enable reproducibility.  

An editorial on the pilot for peer reviewing biology-based models in PLOS Computational Biology.

Published on 25 January 2021

Written by Beatrice Gini

This blog post is licensed under CC BY 4.0.

Open access: fringe or mainstream?

When I was just settling into the world of open access and scholarly communication, I wrote about the need for open access to stop being a fringe activity and enter the mainstream of researcher behaviour:

“Open access needs to stop being a ‘fringe’ activity and become part of the mainstream. It shouldn’t be an afterthought to the publication process. Whether the solution to academic inaction is better systems or, as I believe, greater engagement and reward, I feel that the scholarly communications and repository community can look forward to many interesting developments over the coming months and years.”

While much has changed in the five years since I (somewhat naïvely) wrote those concluding thoughts, there are still significant barriers to the complete opening of scholarly discourse. However, should open access be an afterthought for researchers? I've changed my mind. Open access should be something researchers don't even need to think about, and I think that future is already here, though I fear it will ultimately sideline institutional repositories.

According to the 2020 Leiden Ranking, the median rate at which UK institutions make their research outputs open access is over 80%, far higher than in any other nation (Figure 1). Indeed, the UK is the only country that has ‘levelled up’ over the last five years, while the rest of the world's institutions have plodded along, making slow but steady progress.

Figure 1. The median institutional open access percentage for each country according to the Leiden Ranking. Note, these figures are medians of all institutions within a country. This does not mean that 80% of the UK’s publications are open access, but that the median rate of open access at UK institutions is 80%.

The main driver of this increase in open access content in the UK is green open access (Figure 2), due in large part to the REF 2021 open access policy (announced in 2014 and effective from 2016). This is a dramatic demonstration of the influence that policy can have on researcher behaviour, and it has made open access a mainstream activity in the UK.

Figure 2. The median institutional green open access percentage for each country according to the Leiden Ranking.

Like the rest of the UK, Cambridge has seen similar trends across all forms of open access (Figure 3), with rising use of green open access and steadily increasing adoption of gold and hybrid. Yet despite all the money poured into gold and (more controversially) hybrid open access, the net effect of all this other activity is a measly 3 percentage points of additional open access content (82% vs 79%). This raises the question: was it worth it? If open access can be so successfully achieved through green routes, what is the inherent benefit of gold/hybrid open access?

Figure 3. Open access trends in Cambridge according to the Leiden Ranking. In the 2020 ranking, 82% of outputs were open access, of which 79 percentage points were delivered through green open access. This means that despite all the work to facilitate other forms of open access, that activity contributed only an additional 3 percentage points to the total.

Of course, Plan S has now emerged as the most significant attempt to coordinate a clear and coherent international strategy for open access. While it is not without its detractors, I am nonetheless supportive of cOAlition S's overall aims. However, as the UK scholarly communication community has experienced, policy implementation is messy and can lead to unintended consequences. While Plan S provides options for complying through green open access routes, the discussions that institutions and publishers (traditional and fully open access alike) have engaged in are almost entirely focussed on gold open access through transformative deals. This is not because we, as institutions, want to spend more on publishing; rather, it is the pragmatic approach to create open access content at the source and provide authors with easy and palatable routes to open access. It is also a recognition that flipping journals requires give and take from institutions and publishers alike.

We are now very close to reaching a point where open access can be an afterthought for researchers, particularly in the UK. In large part, it will be done for them through direct agreements between institutions and publishers. Cambridge already has open access publishing arrangements with over 5000 journals, and this figure will continue to grow as we sign more transformative agreements. However, this will ultimately be to the detriment of green open access. Instead of being the only open access source for a journal article, institutional repositories will instead become secondary storehouses of already gold open access content. The heyday of institutional repositories, if one ever existed, is now over.

For me, that is a sad thought. We have poured enormous resource and effort into maintaining Apollo, but we must recognise the burden that green open access places on researchers. They have better things to do. I expect that the next five years will see a dramatic increase in gold and hybrid open access content produced in the UK. Green open access won’t go away, but we will have entered a time where open access is no longer fringe, nor indeed mainstream, but rather de facto for all research.

Published 23 October 2020

Written by Dr Arthur Smith

This blog post is licensed under CC BY 4.0.