
Cambridge Data Week 2020 day 2: Who is reusing data? Successes and future trends?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.  

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 4 blog
Cambridge Data Week day 5 blog

Introduction

Reuse of data is the final element of the FAIR principles and has long been argued to be a central benefit of data sharing, allowing others access to a wealth of research and making research funding more efficient by removing the need to duplicate work. Yet we are still in the process of evaluating success in this area. This webinar brought together speakers to discuss what we know about the current state of play around data reuse, what researchers can do to increase the reuse potential of their data, and possible future developments in data reuse.

Our speakers – Louise Corti (UK Data Archive) and Tiberius Ignat (Scientific Knowledge Services) – looked at data reuse from two different perspectives. Louise focused on the reuse of UK Data Service collections, sharing some examples of their most widely used data sets, discussing what makes them popular and sharing some principles that can be used both to make data more reusable and to promote it for reuse. Tiberius discussed the prevalence of data reuse by machines and the possibility of granting machines data reuse rights.

Louise’s presentation gave an overview of the portfolio of data sets hosted by the UK Data Service, looked at their top 20 most downloaded datasets and discussed the underlying principles that have led to them being widely reused. As well as demonstrating some commonalities between these datasets, Louise also outlined the principles used by the UK Data Service to promote their collections for reuse.

Tiberius’ presentation looked at data reuse from a different perspective, serving as a call to action to share research data responsibly and protect it against reuse by machines designed to persuade humans. One of Tiberius’ main arguments was that no research data from public projects should be made available to feed and develop persuasive algorithms.

The presentations motivated an interesting discussion covering a broad range of topics. These included the reuse of qualitative data, how we can implement ethical safeguards for data reuse, the idea of data ethics as a continuum, whether we can accept positive cases of algorithmic persuasion such as to promote equality and diversity, and the possibility of creating specific licences prohibiting data reuse by persuasive algorithms. See below for a video and transcript of the session.

Audience composition

We had 341 registrations with just over 65% originating from the Higher Education sector. Researchers and PhD students accounted for nearly 37% of the registrations whilst research support staff accounted for an additional 33%. We also had registrations from at least 30 countries outside of the UK including significant attendance from Denmark, Holland, Germany and Canada. We were thrilled to see that on the actual day 187 people attended the webinar.

We held five online webinars during Cambridge Data Week and were pleased to see that nearly 25% of the participants attended more than one webinar. A total of 1364 people registered and more than 700 attended altogether, with the rest possibly watching the recordings at a later date. Most of all we were pleased to welcome participants from all over the world and see how important research data management topics are globally.

Where data was available, we identified the following countries apart from the UK:  Australia, Austria, Bangladesh, Brazil, Canada, Colombia, Croatia, Czech Republic, Denmark, France, Germany, Greece, Holland, Hungary, Iran, Luxembourg, Moldova, Norway, Poland, Romania, Singapore, Spain, Sweden, Switzerland, Turkey, Ukraine and the USA.

Recording, transcript and presentations

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material

After the session ended, we continued the discussion with Louise and Tiberius looking in particular at one question posed by an audience member:

AI can always be used either for good or bad. Instead of locking-in, how can we enhance technology through data and regulation? 

Tiberius Ignat I think at this point we need regulation. I’m not a big fan of using regulations, to be honest. I think it’s much better to motivate people but, in this case, it’s quite a bit of control that has been lost, so I think we should have a regulation on how research data can be reused by others. This is how the internet has been made profitable during the last decade — through non-human persuasion. All these companies that are giving so much away for free are making billions of dollars when you look at the stock market. We were not clear how they were making this profit until recently when we realised that they are doing it by changing our behaviour and I think the rest of society – including research organisations – are behind them, so we need some regulation.

A good example is with GDPR. It has been introduced to protect our data, our digital footprint. On ResearchGate or Eurosport, or any other website, we used to be asked to agree to cookies or not. Recently, a new option called “Legitimate interest” has been slipped in and our digital data is again collected – less noticeably – by invoking questionable legitimate rights. The organisations whose model is based on persuading need cookie data, so they have moved the discussion away from remaining GDPR compliant to defending their legitimate interests. They are fighting to take data away from us. We can tackle this faster with regulation, but in the long term we need to educate people to be more aware. We do have licences such as Creative Commons but I’m not sure we have the right ones to protect us.

Louise Corti There are a variety of licences, but they are abused and it’s very hard to track along the way what has gone wrong. I quite like the UK Government’s approach with some of their statistical data that has to go through a legal gateway. Some data can be made available for research, but it has to be done for the public good. We also have the Ethics Self-Assessment Tool, which is a grid you go through provided by the Statistics Authority and it asks you to think along lots of different dimensions of ethics. This helps researchers get a better sense of what they are trying to do, but whether the people we are talking about would care about it is a very different matter. Having been in research ethics for a very long time, I think that is by far the best tool I’ve seen and I recommend everyone uses it. The UK Data Archive uses it to evaluate some of the projects we deal with, because you often find that university ethics approvals are not good enough for the Statistics Authority: often they don’t understand quantitative secondary analysis, so the ethics scrutiny is not good enough. Self-assessment involves much more nuanced thinking about the different dimensions of ethics and it helps researchers to be a bit more reflective about what’s good and what’s not.

Conclusion

Overall, the session provided a compelling blend of both the practical and conceptual elements of data reuse, each raising questions which could have easily been entire sessions in themselves. Louise’s presentation gave an excellent overview of the UK Data Service’s approach to making their datasets more reusable and promoting them to maximise their chances of being reused. Tiberius’ session raised some interesting questions surrounding data reuse and the ethics of using algorithms to persuade humans, as well as looking at some practical options for protecting research data from reuse for nefarious ends. At the end of the session, the audience were asked to participate in a poll on “What future developments are needed to increase the prevalence of data reuse?”.

Audience responses to poll held at the end of the event

The results were unsurprising to either speaker, with each touching on the idea that a change in research culture is necessary to ensure data reuse projects are seen as equal to data-generating projects. The need for cultural change is a theme that ran throughout each of the sessions in Data Week and is perhaps one of the current major challenges in scholarly communication.

Resources

Data Access and Research Transparency (DA-RT): A Joint Statement by Political Science Journal Editors

Robots appear more persuasive when pretending to be human

Behavioural evidence for a transparency–efficiency tradeoff in human–machine cooperation

The next-generation bots interfering with the US election

IBM’s AI Machine Makes A Convincing Case That It’s Mastering The Human Art Of Persuasion

AI Learns the Art of Debate

CSI-COP

Published on 25 January 2021

Written by Dominic Dixon

Licence: CC BY

Cambridge Data Week 2020 day 3: Is data management just a footnote to reproducibility?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 4 blog
Cambridge Data Week day 5 blog

Introduction

The third day of Cambridge Data Week consisted of a panel discussion about the relationship between reproducibility and Research Data Management (RDM), looking for ways to advocate effectively to reach positive outcomes in both areas. Alexia Cardona (University of Cambridge), Lennart Stoy (European University Association), Florian Markowetz (University of Cambridge & UK Reproducibility Network), and René Schneider (Geneva School of Business Administration) offered their perspectives on whether RDM really is just a ‘footnote’ to the more popular concept of reproducibility.

The speakers agreed that we are still in need of cultural change towards better data management and reproducibility. The word ‘reproducibility’ is more likely to excite researchers and it is important to craft messages that work for each group, hence the emphasis on this term. In contrast to the Cambridge Data Week event on data peer review, the discussion here focused on engaging senior researchers, from PIs to Heads of Institutions, motivating them to be not just good data managers, but great data leaders.

Among the key elements needed to drive best practice in this area, two stood out. The first is communities. Whether these are reproducibility circles of peers, or networks like the Cambridge Data Champions, communities are key to creating and implementing guidelines for data management. The second element is a solid technological infrastructure. For instance, blockchains could be used to enable reproducibility in citations in the humanities, or Persistent Identifiers, used at a very granular level, could lead to better data reuse.

Recording, transcript and presentations

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers:

What are good practices regarding data deletion?

Florian Markowetz It very much depends on what kind of data you have; it’s hard to give general directions. However, drives and other hardware are becoming cheaper and cheaper, so I would say ‘save everything’.

René Schneider I would agree. I have spoken to researchers who keep all their data, because it would create too much work to sort what to keep and what to delete.

Alexia Cardona We tend to talk more about data archiving than data deletion. I often hear about data deletion where it has created problems, for example an account has been deleted in bulk when a researcher left an institution, so unpublished data and scripts are lost due to lack of communication. There are also cases on the internet of PhD students losing their entire thesis when the laptop crashed, so this issue goes hand in hand with data storage and backup. Let’s focus on good practices and archiving of data; deletion is the very last thing to worry about.

Lennart Stoy It’s worth mentioning that there is often a compulsory period that data should be kept for, perhaps 3 years or 5 years according to funders’ mandates, so data should be stored for some time. I suppose the expense could become an issue in the coming years; some Universities are already concerned about the cost of having to buy large amounts of cloud storage space. There are also discussions in the Open Science Cloud teams about what to preserve in the long term. We want to make sure we preserve the higher value datasets, but of course it’s hard to define which ones those are.

Couldn’t scholarly communities of practice or learned societies create guidelines for reproducibility and good data management?

Lennart Stoy Absolutely, they must be involved as they are the ones with the specific knowledge. This is the idea behind the Research Data Alliance (RDA) and the National Research Data Infrastructure (NFDI) in Germany. In those cases, you have to prove a link to the community in that field to establish a consortium. It is great when communities structure their areas of infrastructure from the bottom up.

What roles could Early Career Researchers (ECRs) have? Could they act as code-checkers to assist reproducibility, or are we asking too much of them given their busy schedules? Would they receive credit for this?

Florian Markowetz Senior academics have no excuses for not getting more involved in this once they have stable positions. It’s easy for people in my position to point to students, or to funders, saying they are not doing enough, but we should not be pointing away from ourselves, we should do the work. It could be coupled to pay rises: if you hold any role above grade 12 it’s your job now to sort this all out.

René Schneider I have been thinking about the role of data custodians or similar. If we ask researchers to spend a lot of time just checking data, like ‘warehouse workers’, we could be undervaluing their role. I don’t think it’s necessarily the researchers who should do the work, especially not ECRs, there should be other roles dedicated to this.

Alexia Cardona I second that, researchers are supposed to focus on the research, not necessarily the data checking and curation. But the unfortunate truth is that with short contracts and lack of resources the work is left to them. Another problem is the lack of rewards. For instance in my area, training, there’s no reward for people who take the time to make their training FAIR. We should embrace more openness and fairness, including rewarding those who do the work.

Lennart Stoy This is something we’ve been working on but it’s a challenging system to change because there are so many elements to disentangle. It relates to intense competition for jobs, the culture in different disciplines, and the pressure to publish in certain journals. Some Universities are very serious about implementing DORA and I hope that in a few years these will be able to show high levels of satisfaction among PhD students and ECRs. A lot depends on the leadership at the institutional level to initiate change, for instance the rector at Ghent University in Belgium has been driving DORA-inspired reward mechanisms and the Netherlands is also moving ahead and moving away from journal-based factors. The University of Bath is an example in the UK that I’ve heard mentioned a lot. We’re following progress in all these examples and will write up DORA good practice case studies to inspire other organisations. But it is a hard problem, ECRs have a lot on the line, it’s important not to jeopardise their careers.

Conclusion

This compelling discussion left us feeling that it does not matter too much which words we emphasise: reproducibility, data management, data leadership, or something else entirely. What matters is that we spark interest and commitment in the right groups of researchers to drive progress. Creating a culture where great research practices are routine will take effective advocacy, but also rewards that align with our aims and the right technical infrastructure to underpin them.

Resources

The UK Data Service is a data repository, funded by the Economic and Social Research Council (ESRC), that also provides extensive resources on data practices.

The journal PLOS Computational Biology introduced a pilot in 2019 where all papers are checked for the reproducibility of models.

Is there a reproducibility crisis? Baker’s 2016 paper in Nature reporting the results of a survey that exposed the extent of the reproducibility crisis.

San Francisco Declaration on Research Assessment (DORA), a set of recommendations for institutions, funders, publishers, metrics companies and researchers, aiming for a fairer and more varied system of research quality assessment.

Published on 25 January 2021

Written by Beatrice Gini

Licence: CC BY

Cambridge Data Week 2020 day 4: Supporting researchers on data management – do we need a fairy godmother?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 5 blog

Introduction 

How should researchers’ data management activities and skills be supported? What are the data management responsibilities of the funder, the institution, the research group and the individual researcher? Should we focus on training researchers so they can carry out good data management themselves or should we be funding specialist teams who can work with research groups, allowing the researchers to concentrate on research instead of data management? These were the questions addressed on day 4 of Cambridge Data Week 2020. This session benefitted from the perspectives of three speakers deriving from three different components of the research ecosystem: national funder, institutional research support and department/institute. Respectively, these were provided by Tao-Tao Chang (Arts and Humanities Research Council [AHRC]), Marta Teperek (TU Delft) and Alastair Downie (The Gurdon Institute, Cambridge). 

From a funder’s perspective, and following UKRI community consultation, Tao-Tao specifies that digital research infrastructure is recognised as an area for urgent investment, particularly in the arts and humanities, where both software and data loss are acute. Going forwards, AHRC’s key priorities will be to prevent further data loss, invest in skills, build capability, and work with the community to effect a sustained change in research culture. At an institutional level, Marta argues that it is unfair for researchers to be left unsupported to manage their data. The TU Delft model addresses this via three methods: central data support, disciplinary support by data stewards as permanent faculty staff, and hands-on support for research groups via data managers and research software engineers. Regarding the latter, an important take-home message for all researchers, regardless of institutional affiliation, is to build data management costs into grant proposals. Alastair takes up the discussion at the level of the department, research group and even individual, highlighting how researchers are locked into infrastructure silos, and locked into an unhelpful, competitive culture where altruism is a risky proposition and the career benefits of sharing seem intangible or insufficient. Alastair proposes that the climate is right and the community is ready for change, and goes on to discuss some positive changes afoot in the School of Biological Sciences to counteract these.  

Audience composition  

We had 291 registrations for the webinar, with just over 70% originating from the Higher Education sector. Researchers and PhD students accounted for 30% of the registrations whilst research support staff from various organisations accounted for an impressive 46%. On the day, we were thrilled to see that 136 people attended the webinar, participating from a wide range of countries. 

Recording, transcript and presentations 

The video recording of the webinar can be found below, and the recording, transcript and presentations are available in Apollo, the University of Cambridge repository.

Bonus material 

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers: 

Talking about the technical side, have you yet come across anyone using a machine-implementable DMP? In setting up a data management infrastructure for a large project, it has become apparent that checking compliance with a DMP is a huge job and, of course, there is minimal resource for doing this.

Marta Teperek Work is being done in this area by the Research Data Alliance, where there are several groups working on machine-actionable DMPs. Basically, the idea is that instead of asking researchers to write long essays about how they are planning to manage their data, they are asked to provide answers that are structured. These can be multiple choice options, for example, where the researcher specifies that they will be depositing large amounts of data in the repository and the repository will be notified of data coming their way. In other words, actions are triggered depending on what the researcher says they will do. The University of Queensland is doing a lot on this already [see the blog post listed under Resources further below].
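To make the idea concrete, here is a minimal sketch in Python of how structured answers might trigger an action. It is an illustration only, not the Research Data Alliance's or the University of Queensland's implementation: the field names, the 100 GB threshold and the function `planned_actions` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold above which a repository wants advance notice.
LARGE_DEPOSIT_GB = 100


@dataclass
class StructuredDMP:
    """A toy machine-actionable DMP: structured answers instead of free text."""
    project: str
    repository: str          # multiple-choice answer: where data will be deposited
    expected_volume_gb: int  # structured answer: how much data is expected


def planned_actions(dmp: StructuredDMP) -> list[str]:
    """Derive follow-up actions from the researcher's structured answers."""
    actions = []
    if dmp.expected_volume_gb >= LARGE_DEPOSIT_GB:
        # A large planned deposit means the repository should be told in advance.
        actions.append(
            f"notify {dmp.repository}: expect about {dmp.expected_volume_gb} GB "
            f"from project '{dmp.project}'"
        )
    return actions


dmp = StructuredDMP(project="Example project", repository="Example Repository",
                    expected_volume_gb=500)
for action in planned_actions(dmp):
    print(action)
```

Because the answers are data rather than prose, the same record could in principle also be checked automatically for compliance later, which is the concern raised in the question above.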

What are the best cross-platform, mobile and desktop tools for data management?

Alastair Downie RDM encompasses far too broad a range of activities – it’s a concept rather than a single activity that you can build into a neat little app. In the context of electronic lab notebooks, for example, there are hundreds of apps that serve that function and some of them cross over into lab management as well. Those products that try to do too much become very bloated and complex, which makes them unattractive and so we don’t see uptake of those kinds of products. I think a suite approach is better than a single solution.

Institutions audit spending on research grants; they should do the same for research data, and this should be a requirement of holding a grant.

Alastair Downie Wellcome Trust are now challenging researchers to demonstrate that they have complied with their DMPs. It’s not particularly empirical but the fact that they are demonstrating their determination to make sure that everyone’s doing things properly is very helpful. 

Are there any specific infrastructure projects that the AHRC is sponsoring? I’m curious about what infrastructure/services would be useful for Arts and Humanities researchers.

Tao-Tao Chang Not at this juncture. But we are hoping that this will change. AHRC recognises the importance of good data management practice and the need to support it. We also recognise that there is a skills gap and that all researchers at every level need support.

Is there a 2020 edition of the State of Open Data report?

Yes, this was published five days after this webinar! See the Digital Science website and further below under ‘Resources’.

Conclusion 

There are two outcomes of the webinar to draw upon here. The first raises again the question: do researchers need, or even want, a fairy godmother to support their research data management?  We held a poll at the end of the webinar, asking participants to choose which one of the following statements they believe most strongly: (1) ‘Individual researchers should learn how to manage their own data well’ or (2) ‘Researchers’ data should be managed by funded RDM specialists so that researchers can focus on research’. Of the 78 respondents, 67% chose the first option and 33% chose the second. There was not an intermediate option to incorporate both, simply because we wanted to force a choice in the direction of strongest belief when the two options are considered relative to one another. 

The results of the poll and the discussions during the webinar (between the speakers and within the chat) indicate that while individual researchers are responsible for managing their research data, support does need to be made available and promoted actively (we provide in the ‘Resources’ section some links to University of Cambridge research data management support). A second outcome reveals that support needs to be provided under several different guises. On the one hand, there is support that comes via the provision of funding, research data services and individually tailored expertise. Yet, on the other hand, there is support that will derive, albeit in a less tangible sense, from positive changes in research culture, specifically in terms of how the research of individual researchers is assessed and rewarded.  

Resources  

Some links to University of Cambridge research data management support include: the Research Data Management Policy Framework that outlines, for example, the data management responsibilities of research students and staff; our data management guide; a list of Cambridge Data Champions, searchable by areas of expertise. 

A recent Postdoc Academy podcast on ‘How can we improve the research culture at Cambridge?’ 

A description of different data management support roles at TU Delft, by Alastair Dunning and Marta Teperek: data steward, data manager, research software engineer, data scientist and data champion.

A Gurdon Computing blog post by Alastair Downie on ‘Research data management as a national service’; in other words, as an alternative to duplicating infrastructure and services across the research landscape.

An article by Florian Markowetz, discussed in the webinar, on ‘Five selfish reasons to work reproducibly’ (in Genome Biology).

TU Delft Open Working blog post by Marta Teperek on machine actionable Data Management Plans (DMPs) at the University of Queensland. For more information, see this article by Miksa and colleagues on the ‘Ten principles for machine-actionable data management plans’ (in PLOS Computational Biology).  

The State of Open Data 2020 report, published on 1 December 2020. 

Published on 25 January 2021

Written by Dr Sacha Jones with contributions from Tao-Tao Chang, Dr Marta Teperek, Alastair Downie and Maria Angelaki. 

Licence: CC BY