Tag Archives: reproducibility

Cambridge Data Week 2020 day 3: Is data management just a footnote to reproducibility?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 4 blog
Cambridge Data Week day 5 blog

Introduction

The third day of Cambridge Data Week consisted of a panel discussion about the relationship between reproducibility and Research Data Management (RDM), looking for ways to advocate effectively to reach positive outcomes in both areas. Alexia Cardona (University of Cambridge), Lennart Stoy (European University Association), Florian Markowetz (University of Cambridge & UK Reproducibility Network), and René Schneider (Geneva School of Business Administration) offered their perspectives on whether RDM really is just a ‘footnote’ to the more popular concept of reproducibility.

The speakers agreed that we are still in need of cultural change towards better data management and reproducibility. The word ‘reproducibility’ is more likely to excite researchers and it is important to craft messages that work for each group, hence the emphasis on this term. In contrast to the Cambridge Data Week event on data peer review, the discussion here focused on engaging senior researchers, from PIs to Heads of Institutions, motivating them to be not just good data managers, but great data leaders.

Among the key elements needed to drive best practice in this area, two stood out. The first is communities. Whether these are reproducibility circles of peers, or networks like the Cambridge Data Champions, communities are key to creating and implementing guidelines for data management. The second element is a solid technological infrastructure. For instance, blockchains could be used to enable reproducible citations in the humanities, or Persistent Identifiers, used at a very granular level, could lead to better data reuse.

Recording, transcript and presentations

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers:

What are good practices regarding data deletion?

Florian Markowetz It very much depends on what kind of data you have, so it’s hard to give general directions. However, drives and other hardware are becoming cheaper and cheaper, so I would say ‘save everything’.

René Schneider I would agree. I have spoken to researchers who keep all their data, because it would create too much work to sort what to keep and what to delete.

Alexia Cardona We tend to talk more about data archiving than data deletion. I often hear about data deletion where it has created problems, for example when an account was deleted in bulk after a researcher left an institution, and unpublished data and scripts were lost due to lack of communication. There are also cases on the internet of PhD students losing their entire thesis when a laptop crashed, so this issue goes hand in hand with data storage and backup. Let’s focus on good practices and archiving of data; deletion is the very last thing to worry about.

Lennart Stoy It’s worth mentioning that there is often a compulsory period for which data should be kept, perhaps 3 or 5 years according to funders’ mandates, so data should be stored for some time. I suppose the expense could become an issue in the coming years; some universities are already concerned about the cost of having to buy large amounts of cloud storage space. There are also discussions in the Open Science Cloud teams about what to preserve in the long term. We want to make sure we preserve the higher-value datasets, but of course it’s hard to define which ones those are.

Couldn’t scholarly communities of practice or learned societies create guidelines for reproducibility and good data management?

Lennart Stoy Absolutely, they must be involved as they are the ones with the specific knowledge. This is the idea behind the Research Data Alliance (RDA) and the National Research Data Infrastructure (NFDI) in Germany. In those cases, you have to prove a link to the community in that field to establish a consortium. It is great when communities structure their areas of infrastructure from the bottom up.

What roles could Early Career Researchers (ECRs) have? Could they act as code-checkers to assist reproducibility, or are we asking too much of them given their busy schedules? Would they receive credit for this?

Florian Markowetz Senior academics have no excuses for not getting more involved in this once they have stable positions. It’s easy for people in my position to point to students, or to funders, saying they are not doing enough, but we should not be pointing away from ourselves, we should do the work. It could be coupled to pay rises: if you hold any role above grade 12 it’s your job now to sort this all out.

René Schneider I have been thinking about the role of data custodians or similar. If we ask researchers to spend a lot of time just checking data, like ‘warehouse workers’, we could be undervaluing their role. I don’t think it’s necessarily the researchers who should do this work, especially not ECRs; there should be other roles dedicated to it.

Alexia Cardona I second that: researchers are supposed to focus on the research, not necessarily on data checking and curation. But the unfortunate truth is that, with short contracts and a lack of resources, the work is left to them. Another problem is the lack of rewards. For instance, in my area, training, there is no reward for people who take the time to make their training FAIR. We should embrace more openness and fairness, including rewarding those who do the work.

Lennart Stoy This is something we’ve been working on, but it’s a challenging system to change because there are so many elements to disentangle: intense competition for jobs, the culture in different disciplines, and the pressure to publish in certain journals. Some universities are very serious about implementing DORA, and I hope that in a few years they will be able to show high levels of satisfaction among PhD students and ECRs. A lot depends on leadership at the institutional level to initiate change; for instance, the rector of Ghent University in Belgium has been driving DORA-inspired reward mechanisms, and the Netherlands is also moving ahead and away from journal-based factors. The University of Bath is an example in the UK that I’ve heard mentioned a lot. We’re following progress in all these examples and will write up DORA good practice case studies to inspire other organisations. But it is a hard problem; ECRs have a lot on the line, and it’s important not to jeopardise their careers.

Conclusion

This compelling discussion left us feeling that it does not matter too much which words we emphasise: reproducibility, data management, data leadership, or something else entirely. What matters is that we spark interest and commitment in the right groups of researchers to drive progress. Creating a culture where great research practices are routine will take effective advocacy, but also rewards that align with our aims and the right technical infrastructure to underpin them.

Resources

The UK Data Service is a data repository funded by the Economic and Social Research Council (ESRC), which also provides extensive resources on data practices.

The journal PLOS Computational Biology introduced a pilot in 2019 where all papers are checked for the reproducibility of models.

Is there a reproducibility crisis? Baker’s 2016 paper in Nature reporting the results of a survey that exposed the extent of the reproducibility crisis.

San Francisco Declaration on Research Assessment (DORA), a set of recommendations for institutions, funders, publishers, metrics companies and researchers, aiming for a fairer and more varied system of research quality assessment.

Published on 25 January 2021

Written by Beatrice Gini


Beyond compliance – dialogue on barriers to data sharing

Welcome to International Data Week. The Office of Scholarly Communication is celebrating with a series of blog posts about data, starting with a summary of an event we held in July.

On 29 July 2016 the Cambridge Research Data Team joined forces with the Science and Engineering South Consortium to organise a one-day conference at Murray Edwards College, gathering researchers and practitioners to discuss the existing barriers to data sharing. The whole aim of the event was to move beyond compliance with funders’ policies. We hoped that the community was ready to shift the focus of data sharing discussions from whether it is worth sharing at all towards more mature discussions about the benefits and limitations of data sharing.

What are the barriers?

So what are the barriers to effective sharing of research data? Three main barriers were identified, all somewhat related to each other: poorly described data, insufficient data discoverability and difficulties with sharing personal/sensitive data. All of these problems arise from the fact that research data is not always shared in accordance with the FAIR principles: that data should be Findable, Accessible, Interoperable and Re-usable.

Poorly described data

The event started with an inspiring keynote talk from Dr Nicole Janz from the Department of Sociology at the University of Cambridge: “Transparency in Social Science Research & Teaching”. Nicole regularly runs replication workshops at Cambridge, where students select published research papers and work for several weeks to reproduce the published findings. The purpose of these workshops is to let students learn by experience what matters in making their own work transparent and reproducible to others.

Very often students fail to reproduce the results. Frequent reasons for failure are insufficiently described methodology, or simply the fact that key datasets were not made available. Students learn that in order to make research reproducible, one not only needs to make the raw data files available, but also to share the source code used to transform the data and a written account of the methodology, ideally in a README file. While doing replication studies, students also learn about the five selfish benefits of good data management and sharing: data disasters are avoided, it is easier to write up papers from well-managed data, a transparent approach to sharing makes the work more convincing to reviewers, continuity of research becomes possible, and researchers can build a reputation for being transparent. As a tip for researchers, Nicole suggested always asking a colleague to try to reproduce the findings before submitting a paper for peer review.

The problem of insufficient data description/availability was also discussed during the first case study talk by Dr Kai Ruggeri from the Department of Psychology, University of Cambridge. Kai reflected on his work on the assessment of happiness and wellbeing across many European countries, which was part of the ESRC Secondary Data Analysis Initiative. Kai reiterated that missing data makes analysis complicated and can sometimes prevent effective policy recommendations. He also stressed that the choice of baseline for data analysis can frequently affect the final results; proper description of the methodology and approaches taken is therefore key to making research reproducible.

Insufficient data discoverability

We also heard several speakers describe problems with data discoverability. Fiona Nielsen founded Repositive, a platform for finding human genomic data, out of frustration that genomic data was so difficult to find and access. The proliferation of data repositories has made it very hard for researchers to actually find what they need.

Fiona started by polling the audience: how do researchers look for data? It turned out that most researchers find data by doing a literature search or by googling for it. This is not surprising: there is no search engine that can look for information simultaneously across the multiple repositories where data is available. To make matters more complicated, Fiona reported that in 2015 80 PB of human genomic data was generated. Unfortunately, only 0.5 PB of human genomic data was made available in a data repository.

So how can researchers find the other datasets, which are not made available in public repositories? Repositive is a platform harvesting metadata from several repositories hosting human genomic data and providing a search engine that lets researchers look for datasets across all of them simultaneously. Additionally, researchers who cannot share their research data via a public repository (for example, due to lack of participants’ consent for sharing) can at least create a metadata record about the data, to let others know that the data exists and to provide information on the data access procedure.
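To make this concrete, here is a minimal sketch of what such a metadata-only record might contain. The field names and values are entirely illustrative assumptions, not Repositive’s actual schema:

```python
# A hypothetical metadata-only record for a dataset that cannot itself be
# shared openly. Field names and values are illustrative assumptions, not
# Repositive's actual schema.
metadata_record = {
    "title": "Exome sequencing of a rare-disease cohort",
    "description": "150 human exomes; consent does not permit open deposit.",
    "data_types": ["FASTQ", "VCF"],
    "access": "managed",  # rather than "open": requests reviewed individually
    "access_procedure": "Apply in writing to the data access committee.",
    "contact": "data-access@example.ac.uk",  # hypothetical contact address
}
```

Even this much lets a search engine surface the dataset’s existence while the data itself stays behind a managed access procedure.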

The problem of data discoverability is, however, not only about people’s awareness that datasets exist. Sometimes, especially in the case of complex biological data with a vast number of variables, it can be difficult to find the right information inside the dataset. In an excellent lightning talk, Julie Sullivan from the University of Cambridge described InterMine, a platform for making biological data easily searchable (‘mineable’). Anyone can simply upload their data onto the platform to make it searchable and discoverable. One example of the platform’s use is FlyMine, a database where researchers looking for the results of experiments conducted on the fruit fly can easily find and share information.
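To give a flavour of what ‘mineable’ means in practice, here is a minimal sketch using InterMine’s Python client to query FlyMine. The gene symbol is just an example, and the service URL should be verified against the current FlyMine documentation:

```python
# Minimal sketch: querying FlyMine through the InterMine Python client
# (pip install intermine). Verify the service URL against the current
# FlyMine documentation before relying on it.
from intermine.webservice import Service

service = Service("https://www.flymine.org/flymine/service")

# Build a query over Gene records: return each gene's symbol and organism,
# constrained here to the example symbol "eve" (even-skipped).
query = service.new_query("Gene")
query.add_view("Gene.symbol", "Gene.organism.name")
query.add_constraint("Gene.symbol", "=", "eve")

for row in query.rows():
    print(row["Gene.symbol"], row["Gene.organism.name"])
```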

Difficulties with sharing personal/sensitive data

The last barrier to sharing that we discussed related to personal/sensitive research data. This barrier is perhaps the most difficult one to overcome, but here again the conference participants came up with some excellent solutions. The first came in the keynote speech by Louise Corti, a talk with a very uplifting title: “Personal not painful: Practical and Motivating Experiences in Data Sharing”.

Louise based her talk on the UK Data Service’s long experience of providing managed access to data containing some form of confidential/restricted information. Apart from hosting datasets that can be made openly available, the UKDS provides two other types of access: safeguarded access, where data requestors need to register before downloading the data, and controlled access, where requests for data are considered on a case-by-case basis.

At the outset of a research project, researchers discuss their research proposals with the UKDS, including any potential limitations to data sharing. It is at this early stage that the decision is made about the type of access that will be required for the data to be successfully shared. All processes of project management and data handling, such as data anonymisation and the collection of informed consent forms from study participants, are then carried out in adherence to that decision. The UKDS also offers protocols clarifying what will happen to research data once deposited with the repository. The use of standard licences for sharing makes the governance of data access much more transparent and easier to understand, both for data depositors and for data re-users.

Louise stressed that transparency and a willingness to discuss problems are key to mutual respect and understanding between data producers, data re-users and data curators. Unnecessary misunderstandings sometimes make data sharing difficult when it does not need to be. Louise mentioned that researchers often confuse ‘sensitive topic’ with ‘sensitive data’ and referred to a successful case study where, by working directly with researchers, the UKDS managed to share a dataset about sedation at the end of life. The subject of the study was sensitive, but because the data was collected and managed with a view to sharing at the end of the project, the dataset itself was not sensitive and was suitable for sharing.

As Louise said, “data sharing relies on trust that data curators will treat it ethically and with respect”, and open communication is key to building and maintaining this trust.

So did it work?

The purpose of this event was to engage the community in discussions about the existing limitations to data sharing. Did we succeed? Did we manage to engage the community? Judging by the fact that we received twenty high-quality abstract applications from researchers across various disciplines for only five available case study speaking slots (it was so difficult to shortlist just five!), and that the venue was full, with around eighty attendees from Cambridge and other institutions, I think the objective was pretty well met.

Additionally, the panel discussion was led by researchers and involved fifty-eight active users posing questions to panellists on the Sli.do platform, with further questions asked outside of Sli.do. So overall I feel that the event was a great success, and it was truly fantastic to be part of it and to see the degree of participant involvement in data sharing.

Another observation is the great progress of the research community in Cambridge in the area of sharing: we have successfully moved on from discussing whether research data is worth sharing at all to how to make data sharing more FAIR.

It seems that our intense advocacy, and the effort of speaking with over 1,800 academics from across the campus since January 2015, has paid off: we have indeed managed to build an engaged research data management community.


Published 12 September 2016
Written by Dr Marta Teperek

Could Open Research benefit Cambridge University researchers?

This blog is part of the recent series about Open Research and reports on a discussion with Cambridge researchers held on 8 June 2016 in the Department of Engineering. Extended notes from the meeting and slides are available at the Cambridge University Research Repository. This report was written by Lauren Cadwallader, Joanna Jasiewicz and Marta Teperek (listed alphabetically by surname).

At the Office of Scholarly Communication we have been thinking for a while about Open Research and about moving beyond mere compliance with funders’ policies on Open Access and research data sharing. We thought that the time had come to ask our researchers what they think about opening up the research process and sharing more: not only publications and research data, but also protocols, methods, source code, theses and all the other elements of research. Would they consider this beneficial?

Working together with researchers – a democratic approach to problem-solving

To get an initial idea of the expectations of the research community in Cambridge, we organised an open discussion hosted at the Department of Engineering. Anyone registering was asked three questions:

  • What frustrates you about the research process as it is?
  • Could you propose a solution that could solve that problem?
  • Would you be willing to speak about your ideas publicly?

Interestingly, around fifty people registered to take part in the discussion, and almost all of them contributed very thought-provoking problems and appealing solutions. To our surprise, half expressed a willingness to speak publicly about their ideas. This shaped our discussion on the day.

So what do researchers think about Open Research? Not surprisingly, we started with an animated discussion about unfair reward systems in academia.

Flawed metrics

A well-worn complaint: the only thing that counts in academia is publication in a high-impact journal. As a result, early career researchers have no motivation to share their data or to publish their work in open access journals, which can sometimes have lower impact factors. Additionally, metrics based on a whole journal do not reflect the importance of the research described: what is needed are article-level impact measurements. But it is difficult to solve this systemic problem, because any new journal that wishes to introduce a new metrics system starts without a journal-level impact factor, and therefore researchers do not want to publish in it.

Reproducibility crisis: where quantity, not quality, matters

Researchers also complained that the volume of research produced is ever higher, and that science seems to have entered an ‘era of quantity’. They raised the concern that quantity matters more than quality: only fast and loud research gets rewarded (because it is published in high-impact-factor journals), while slow and careful work seems to be valued less. Additionally, researchers are under pressure to publish, and they often report what they want to see rather than what the data really shows. This approach has led to the reproducibility crisis and to a lack of trust among researchers.

Funders should promote and reward reproducible research

The participants had some good ideas for how to solve these problems. One of the most compelling suggestions was that funding should go not only to novel research (as seems to be the case at the moment), but also to people who want to reproduce existing research. Additionally, reproducible research itself should be rewarded: funders could offer grant renewal schemes for researchers whose research is reproducible.

Institutions should hire academics committed to open research

Another suggestion was to incentivise reward systems other than journal impact factor metrics. Someone proposed that institutions should not only teach the next generation of researchers how to do reproducible research, but also embed reproducibility of research as an employment criterion. Commitment to Open Research could be an essential requirement in job descriptions, and applicants could be asked at the recruitment stage how they achieve the goals of Open Research. LMU Munich recently included such a statement in a job description for a professorship in social psychology (see the original job description here and a commentary here).

Academia feeding money to exploitative publishers

Researchers were also frustrated by exploitative publishers. The big four publishers (Elsevier, Wiley, Springer and Informa) have a typical annual profit margin of 37%. Articles are donated to the publishers for free by academics and reviewed by other academics, also free of charge. Additionally, noted one of the participants, academics also act as journal editors, which they also do for free.*

[*A comment about this statement was made on 15 August 2017 noting that some editors do get paid. While the participant’s comment stands as a record of what was said, we acknowledge that this is not an entirely accurate statement.]

In addition to this, publishers take copyright away from the authors. As a possible solution to the latter, someone suggested that universities should adopt institutional licences on scholarly publishing (similar to the Harvard licence), which could protect the rights of their authors.

Pre-print services – the future of publishing?

Could Open Research help address the publishing crisis? Novel and more open ways of publishing can certainly add value to the process. The researchers discussed the benefits of sharing pre-print papers on platforms like arXiv and bioRxiv. These services allow people to share manuscripts before publication (or acceptance by a journal). In physics, maths and the computational sciences it is common to upload a manuscript even before submitting it to a journal, in order to get feedback from the community and have the chance to improve it.

bioRxiv, the life sciences equivalent of arXiv, started relatively recently. One of our researchers mentioned that he was initially worried that uploading manuscripts to bioRxiv might jeopardise his career as a young researcher. However, he then saw a pre-print describing research similar to his published on bioRxiv, and was struck by how much the community helped to change and improve that manuscript. He has since shared many of his own manuscripts on bioRxiv and, as his colleague pointed out, this has ‘never hurt him’. On the contrary, he suggested that using pre-print services promotes one’s research: it gets the work into the community early and attracts feedback, and since peers will always value good-quality research, the recognition eventually comes back to the author.

Additionally, someone from the audience suggested that publishing work on pre-print services provides a time-stamp for researchers and helps to ensure that ideas will not be scooped: researchers are free to share their research whenever, and as fast as, they wish.

Publishers should invest money in improving science – wishful thinking?

It was also proposed that, instead of exploiting academics, publishers could play an important role in improving the research process. One participant proposed several simple mechanisms that publishers could implement to improve the quality of shared research data:

  • Employ in-house data experts, such as bioinformaticians or data scientists, who could judge whether supporting data is of good enough quality
  • Ensure that there is at least one bioinformatician/data scientist on the reviewing panel for a paper
  • Ask for the data to be deposited in a public, discipline-specific repository, which would ensure quality control of the data and adherence to data standards
  • Ask for the source code and detailed methods to be made available as well

Quick win: minimum requirements for making shared data useful

A requirement that, as a minimum, three key elements should be made available with publications – the raw data, the source code and the methods – seems a quick-win solution for making research data more re-usable. Raw data allows users to check whether the data is of good quality overall; published code makes it possible to re-run the analysis; and methods need to be detailed enough for other researchers to understand every step of the data processing. An excellent case study comes from Daniel MacArthur, who has described how to reproduce all the figures in his paper and has shared the supporting code as well.
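As an illustration of how these three elements fit together, here is a minimal sketch of a ‘reproduce the figure’ script of the kind that could be shipped alongside a paper. The file names and the analysis steps are hypothetical placeholders, not MacArthur’s actual code:

```python
# Minimal sketch of a figure-reproduction script shipped with a paper.
# File names and processing steps are hypothetical placeholders; the point
# is the shape: raw data in, documented processing, published figure out.
import pandas as pd
import matplotlib.pyplot as plt

# Raw data exactly as deposited (never edited by hand; provenance in README).
raw = pd.read_csv("data/raw_measurements.csv")

# Processing steps mirroring the written methods, one per line.
clean = raw.dropna(subset=["value"])                  # step 1: drop missing values
summary = clean.groupby("condition")["value"].mean()  # step 2: mean per condition

# Regenerate the published figure from the processed data.
summary.plot(kind="bar")
plt.ylabel("Mean value")
plt.savefig("figures/figure1.png", dpi=300)
```

Anyone holding the raw data, this script and the README can then check every step from deposited files to published figure.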

It was also suggested that the Office of Scholarly Communication could implement some simple quality control measures to ensure that research data supporting publications is shared. As a minimum, the Office could check the following (a sketch of how the last check might be automated follows the list):

  • Is there a data statement in the publication?
  • If there is a statement – is there a link to the data?
  • Does the link work?
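The third check, in particular, lends itself to automation. Here is a minimal sketch in Python using the requests library; the links are hypothetical examples, not real deposits:

```python
# Minimal sketch: check whether the data links found in publications resolve.
# The URLs below are hypothetical examples, not real deposits.
import requests

data_links = [
    "https://doi.org/10.17863/CAM.00000",  # hypothetical Apollo DOI
    "https://example.org/dataset/123",     # hypothetical repository link
]

for url in data_links:
    try:
        # Follow redirects (DOIs resolve via redirect) and inspect the status.
        response = requests.head(url, allow_redirects=True, timeout=10)
        status = "OK" if response.status_code < 400 else f"broken ({response.status_code})"
    except requests.RequestException as exc:
        status = f"unreachable ({type(exc).__name__})"
    print(f"{url}: {status}")
```

In practice some servers reject HEAD requests, so a fallback GET would be needed, but the principle is the same.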

This is definitely a very useful suggestion from our research community, and in fact we have already taken this feedback on board and started checking for data citations in Cambridge publications.

Shortage of skills: effective data sharing is not easy

The discussion about the importance of data sharing led to reflections that effective data sharing is not always easy. A bioinformatician complained that datasets she had tried to re-use satisfied neither the criteria of reproducibility nor those of re-usability. Most of the time there was not enough metadata available to successfully use the data. Some data is shared and the publication is there, but the description is insufficient to understand the whole research process: the miracle, or the big discovery, happens somewhere in the middle.

Open Research in practice: training required

Attendees agreed that it takes effort and skill to make research open, re-usable and discoverable by others. More training is needed to equip researchers with the skills to disseminate their research properly on the internet and to manage their research data effectively. It is clear that Cambridge researchers want discipline-specific training and guidance on how to manage research data effectively and how to practise open research.

Nudging researchers towards better data management practice

Many researchers have heard, or experienced first-hand, horror stories of having to follow up on somebody else’s project where it was impossible to make any sense of the research data due to a lack of documentation and processes. This wastes a lot of time in every research group. Research data needs to be properly documented and maintained to ensure research integrity and continuity. One easy way to nudge researchers towards better research data management practice could be formalised data management requirements; as a minimum, perhaps, every researcher should keep a lab book to document research procedures.

The time is now: stop hypocrisy

Finally, there was a suggestion that everyone should take the lead in encouraging Open Research. The simplest way to start is to stop being what one participant described as a hypocrite and to submit articles to journals which are fully Open Access. This should be accompanied by making one’s reviews openly available whenever possible. All publications should be accompanied by supporting research data, and researchers should ensure that they evaluate individual research papers on their merits, without their judgement being biased by the impact factor of the journal.

Need for greater awareness and interest in publishing

One of the Open Access advocates present at the meeting stated that most researchers are completely unaware of who the exploitative and the ethical publishers are, and of the differences between them. Researchers typically do not pay the exploitative publishers directly and are therefore not interested in the bigger picture of the sustainability of scholarly publishing. This is clearly an area where more training and advocacy can help, and the Office of Scholarly Communication is actively involved in raising awareness of Open Access. However, while it is nice to preach to a room of converts, how do we get other researchers involved in Open Access? How should we reach out to those who can’t be bothered to come to a discussion like the one we had? This is an area where anyone who understands the benefits of Open Access has a job to do.

Next steps

We are extremely grateful to everyone who came to the event and shared their frustrations and ideas on how to solve some of these problems. We noted all the ideas on post-it notes, and the number of notes at the end of the discussion was impressive: an indication of how creative the participants were in just 90 minutes. It was a very productive meeting and we wish to thank all the participants for their time and effort.


We think that by acting collaboratively and supporting good ideas we can achieve a lot. As an inspiration, McGill University’s Montreal Neurological Institute and Hospital (the Neuro) in Canada has recently adopted a policy on Open Research: over the next five years all results, publications and data will be free for everyone to access.

Follow up

If you would like to host similar discussions directly in your departments/institutes, please get in touch with us at info@osc.cam.ac.uk – we would be delighted to come over and hear from researchers in your discipline.

In the meantime, if you have any additional ideas that you wish to contribute, please send them to us. Everyone interested in being kept informed about progress is encouraged to sign up for the mailing list here.

Extended notes from the meeting and slides are available at the Cambridge University Research Repository. We are particularly grateful to Avazeh Ghanbarian, Corina Logan, Ralitsa Madsen, Jenny Molloy, Ross Mounce and Alasdair Russell (listed alphabetically by surname) for agreeing to publicly speak at the event.

Published 3 August 2016
Written by Lauren Cadwallader, Joanna Jasiewicz and Marta Teperek