Open data sharing and reuse

The Open Research at Cambridge conference took place from 22 to 26 November 2021. In a series of talks, panel discussions and interactive Q&A sessions, researchers, publishers, and other stakeholders explored how Cambridge can make the most of the opportunities offered by open research. This blog post is part of a series summarising each event.

The session described here was on ‘Open data sharing and reuse’ and is summarised by the session chairs, Dominic Dixon (Research Librarian) and Dr Sacha Jones (Research Data Manager) at the Office of Scholarly Communication, Cambridge University Libraries.

The recording of the event can be found here:

Have you wondered how research data is used after it has been shared publicly as open data? What are some of the impacts of sharing data and of its subsequent reuse by others? Are there ethical factors to consider? Does the researcher or research group who shared their data openly benefit in any way from its reuse? What are the essential properties of a reusable dataset? This session on ‘Open data sharing and reuse’ explored these questions and more via presentations delivered by a panel of University of Cambridge researchers from various fields. They included: Professor Richard (Rik) Henson, Deputy Director of the MRC Cognition and Brain Sciences Unit, Professor of Cognitive Neuroscience at the Department of Psychiatry and President of the British Neuroscience Association; Professor John Suckling, Director of Research in Psychiatric Neuroimaging in the Department of Psychiatry and chair of the University of Cambridge Research Ethics Committee; Dr Mihály Fazekas, Assistant Professor at the Department of Public Policy, Central European University, and scientific director of an innovative think-tank at the Government Transparency Institute; and Professor Simon Deakin, Professor of Law in the Faculty of Law and Director of the Centre for Business Research.

All speakers discussed challenges and concerns around data sharing, including how and when to share. Rik asks, “Why wait until publication?” before sharing research data, and suggests considering a data paper, where a dataset is celebrated in its own right, without the narrative of a traditional article. Researchers are often concerned about being scooped, but there is little evidence that this happens and it may be a “paper tiger”. There is an additional fear that data sharing will expose errors in work but, as Rik noted, “I think we just need to get over our egos and accept that everyone makes errors”. One particular challenge can be to control what people (or bots) do with your data, but researchers have a choice over where to share (e.g., which repository to choose) and how to license their data. Something that was implicit in all talks, and stated explicitly by Simon, is that the benefits of sharing data openly vastly outweigh the costs.

Sharing data deriving from research involving human participants is understandably complex due to data protection regulations (e.g., GDPR), obtaining informed consent, and the challenge of anonymising datasets, particularly those containing qualitative data. Participants need to be informed about how their data will be used, so the message is that data sharing needs to be planned far in advance, even at the gestation of the project idea. It is important to be aware of the repository options; for example, if managed/controlled access to data is required then hear about the set-up at MRC CBU discussed by Rik, or the UK Data Service for sensitive qualitative data, as highlighted by Simon. John discusses the import and export of datasets from an ethical perspective, giving two examples from the biomedical and social sciences with a focus on secondary data use. He says that these examples illustrate just how far in the future you might need to think when considering how your data might be reused by others: it is “a lot better to ask for permission from all the stakeholders in these studies than it is to ask for their forgiveness”.

Data must be shared well for both researchers and society to reap the benefits. To do this, select an appropriate repository, adhere to any ethical/legal requirements, follow discipline-specific standards and make your data FAIR (Findable, Accessible, Interoperable, Reusable). A key element of the latter is data documentation, an issue raised repeatedly during this session. Sharing the data alongside any associated code and detailed information about the data will enable it to be reused effectively and help guard against misuse. Mihály discusses sharing the Digiwhist project data, which has been reused by academia, policy, civil society and the media, and emphasises this: “Every time I put out bits and pieces of my data and code that was not clear, I just kept on receiving the same question over and over again. So actually, it’s in your own best interests to document your work fully because then it is a lot more efficient for you”. Providing data about data is part of being completely transparent about the research process and results, enabling others to understand exactly what was done and to build on it. In some fields, this is an essential part of research reproducibility and replicability. As another example, Simon describes sharing the CBR Leximetric datasets – currently the 2nd most downloaded dataset in Apollo and 8th of all UK institutional repositories – where not only the data were shared but also the methodology and an extensive codebook.

In both examples, being transparent in this way has led to wider reuse of these data and many citations of the data and associated publications. The benefits of FAIR data sharing and data reuse certainly do not rest solely in the number of resulting citations. Ethical and transparent research leads to credible research and researchers, enhancing reputations and quality of outputs. These are elements that all speakers highlighted in their talks. To end on a quote from Simon about the outcome of sharing data and of its subsequent reuse: “It’s been a very very positive experience for us”.  

We’re always happy to receive any questions or comments you may have about data sharing and reuse. You can contact us at info@data.cam.ac.uk and see our Research Data website for more information.

Additional resources

University of Cambridge School of Clinical Medicine guidance on secondary data use and related ethical considerations, discussed by Professor John Suckling.

The Digiwhist project website discussed by Dr Mihály Fazekas. The Digiwhist project is also one of the University’s research projects highlighted on the University of Cambridge global impact map.

Video of a previous talk by Professor Simon Deakin for OpenConCam 2016 on ‘Open Access and Knowledge Production – “Leximetric” Data Coding’.

The FAIR principles are outlined by Wilkinson et al. (2016) in Scientific Data – “The FAIR Guiding Principles for scientific data management and stewardship”. There is also a useful guide for researchers on how to make your data FAIR.

Visit the University of Cambridge Research Data website for information on research data management, data sharing and guidance on depositing data into Apollo, the institutional repository. The site also hosts the University of Cambridge Research Data Management Policy framework, which is relevant to all research staff and students.

Open access success stories: interview with Dr. Jacqui Stanford

#ProtestingSewell at the Conservative Party Conference October 2021. All Rights Reserved.

For this post, Katherine Burchell talks to Dr Jacqui Stanford about the success of her open access doctoral thesis: Identities in Transition: theorising race and multicultural success in school contexts in Britain. The thesis is available to download from Apollo here: https://doi.org/10.17863/CAM.58378

Thanks for agreeing to talk about your research. Could you please describe what your PhD thesis is about?

I am interested in how we build a harmonious society given the challenges of our past of chattel slavery and colonialism. As a Black educator of Caribbean heritage, I also have a particular professional interest in schools that are successful for Black children. However, I’ve never been interested in a ten-point list of ‘how to build successful schools’; there was a lot of that in the late 1990s when I turned to graduate studies. And that project seemed to be inconsistent with necessary, actual and sustainable outcomes in schools. Furthermore, that project did not access or address the complexities introduced by race. Yet, as a microcosm of society, schools necessarily had developed strategies that addressed race, especially if they were harmonious spaces, where everyone could be successful.

For my PhD, I was very much interested in writing process, i.e., zeroing in on and exploring the complexities of transitioning from one thing to another. For example, I was very interested in white teachers and how they imagined – perhaps actually practised – but particularly how they articulated – transition from a past beset by the challenges of race to a context that was considered successful for Black and minority ethnic young people – as well as themselves. I also wanted to see what Black and white identities looked like in these contexts.

Methodological and analytical approaches were paramount in my research enterprise. I was not content with taking up established ideas. I focused on building theories at every stage of the project: a theory for research, i.e., for doing race research, and for doing race research in Britain; a theory for analysing data on race; a theory for teachers’ explanations that transformed their personal stories into universal ideas. In other words, I was focused on developing and delineating generic ideas for negotiating race and racial identities as well as philosophical ideas about actually doing race research, as much as I was on delineating ideas about success in the school contexts I researched.

Has open access helped promote your work?

To begin, I remember feeling such discontent back then, and actually writing it in my thesis, that all the tremendous effort would just end up on a shelf in the UL. For me, the award of a PhD was a bonus on top of the actual experience of learning, growing and fashioning another world through research. The PhD was not just the work needed for the award of a qualification; I was intentionally seeking to articulate the aforementioned theories and understanding of the world. I am glad I did. For following my PhD, lecturing and working internationally in policymaking, activism and community development took precedence over writing. Now, in the wake of George Floyd and Black Lives Matter, as we begin to seek ways to understand and address the racial past, I find myself instinctively returning to my PhD, revisiting theories on whiteness and blackness, how to negotiate challenges posed by race and create harmonious multicultural spaces.

I also note the irrepressible urgency, and opportunity, currently growing in society to grapple with the past. This encouraged me to make my thesis open access in Summer 2020, and I am pleased to see that it has attracted strong interest, from a range of countries, and especially from old colonial countries such as the USA, France, Belgium as well as the UK. And I am definitely interested to note that it has attracted attention from China.

Do you feel open access has helped you reach marginalised groups that your work discusses?

It would seem that my work is reaching marginalised groups, in addition to groups that, though arguably not marginalised, have urgent need for research insight. Here, for example, I would highlight white teachers faced with the responsibility of teaching Britain’s past in the wake of Black Lives Matter etc. The thesis has interest for various groups positioned in various ways in the present historical moment, both here in the UK, and in other countries with whom we share a colonial past, and with others simply seeking information on it.

How has the work been taken up in policy? (Do you feel open access has been helpful in policy?)

I have had the opportunity to contribute to government policy here in the UK and internationally over the years, and it has been gratifying to see viewings of my thesis from countries where I have worked. While there has been good indication of interest from the UK, it is noticeable that viewings from the US in particular are very strong. The ideas in my thesis underpinned international travel and work in the US, including lecturing and activism, as well as contributing to President Obama’s Race to the Top education policy as a reviewer. So, I am very pleased that open access has made my thesis accessible there, especially as it is also being accessed.

Here in the UK, my thesis was used to evidence arguments made in a submission to the recent government consultation for the Sewell Report, aka the Race Report, published in March 2021. Controversially, like other submissions, it was not listed in the acknowledgment; nevertheless, my PhD was offered and reviewed as an example of how we create successful multicultural schools and society. There was a significant uptick in viewings of my thesis following submission, dramatically so after publication of the report. Perhaps there is a link between the events.

It certainly feels as if the time has come to share my exploration of the UK’s history of race and schooling in relation to government policymaking, and specifically my thoroughgoing examination of Tony Sewell’s seminal text which anticipated the Commission on Race and Ethnic Disparities’ report. It is certainly the case that my PhD informs my current #ProtestingSewell campaign, which denounces the report as a source of legislation and policymaking on race in the UK today.

Additional Notes

On the strength of my PhD, I was the first person in my Cambridge Education department to be a Post-Doctoral Research Fellow having been the first person to be awarded the ESRC (Economic and Social Research Council) Post-Doctoral Research Fellowship. This followed the ESRC award for my PhD as well as other awards including the three-year Isaac Newton Award for young researchers, and the Barbadian High Commission award for Outstanding Students of Caribbean Heritage studying in Britain.

Bios

Jacqui Stanford: Having suffered life-changing injuries while working as a professor in the US, I am presently in rehabilitation, focusing on opportunities to make ideas and theories generated in my thesis widely available.

Katherine Burchell is Scholarly Communication Support at the Office of Scholarly Communication at Cambridge University Libraries.

Informing the Elsevier negotiations: Dominic Dixon on the work of the Data Analysis Working Group

As part of our series of posts on the Elsevier negotiations, Dominic Dixon, Research Librarian at Cambridge University Libraries, explains the work of the library’s Data Analysis Working Group to access, understand and analyse the data relating to how researchers at Cambridge use Elsevier publications. These findings are also presented as a series of data visualisations on the recently launched Elsevier Data Dashboard [Cambridge University Raven account required].

Having a strong underpinning of data is critical to strengthening the University and sector position in negotiations with Elsevier. This post outlines our approach in the data analysis working group to gathering and presenting the data underpinning the negotiations, looks at some of the questions we have sought to answer, and shares some high-level findings from our analysis.

As with many data science projects, a large majority of the time has been spent on data cleaning. This is in part due to the way the exports from the platforms we used are structured but also to allow us to carry out a more fine-grained analysis than would have been possible with the data in its default state. Some of this work involved disambiguating publisher names, splitting and pivoting fields with multiple entries (e.g., funders, disciplines, and subjects), and enriching the records with metadata not included in the original files.
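As a rough illustration of the kind of cleaning described above, the following minimal Python sketch normalises publisher name variants and splits a multi-valued funder field into one row per funder. It is not the working group's code: the field names, the ";" delimiter, the example DOIs and the alias table are all hypothetical and chosen only for illustration.

```python
# Hypothetical export rows: field names, the ";" delimiter, the DOIs and the
# publisher name variants are illustrative assumptions, not the real export.
RAW_ROWS = [
    {"doi": "10.1000/a1", "publisher": "Elsevier BV", "funders": "Funder A; Funder B"},
    {"doi": "10.1000/a2", "publisher": "ELSEVIER", "funders": "Funder B"},
    {"doi": "10.1000/a3", "publisher": "Elsevier Ltd.", "funders": ""},
]

# Hand-built lookup mapping known name variants (lower-cased) to one label.
PUBLISHER_ALIASES = {
    "elsevier": "Elsevier",
    "elsevier bv": "Elsevier",
    "elsevier ltd.": "Elsevier",
}


def normalise_publisher(name):
    """Return one canonical label for known publisher name variants."""
    key = name.strip().lower()
    # Unknown names are passed through unchanged (apart from trimming).
    return PUBLISHER_ALIASES.get(key, name.strip())


def split_multi_value(rows, field, delimiter=";"):
    """Yield one row per entry in a delimited field (long format)."""
    for row in rows:
        values = [v.strip() for v in row[field].split(delimiter) if v.strip()]
        # Keep records with an empty field so they are not silently dropped.
        for value in values or [None]:
            yield {**row, field: value}


cleaned = [{**row, "publisher": normalise_publisher(row["publisher"])} for row in RAW_ROWS]
long_rows = list(split_multi_value(cleaned, "funders"))

for row in long_rows:
    print(row["doi"], row["publisher"], row["funders"])
```

Keeping one funder per row makes later counts by funder straightforward, at the cost of repeating the article record; any per-article totals then need to count distinct DOIs rather than rows.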

Publishing

To build a profile of research published by Cambridge researchers in Elsevier journals, we experimented with three platforms: Dimensions, Scopus, and Web of Science (WoS). Each of these platforms is commercial and each has varying levels of coverage and richness of metadata. A recent comparative analysis between WoS, Scopus and Dimensions found that Dimensions indexed 82.22% more journals than WoS and 48.17% more journals than Scopus. We decided to compare the coverage in each of these platforms for articles published between 2015 and 2020 by a Cambridge affiliated author. In this case, WoS (n=59,587) returned 1% more results than Dimensions (n=58,908), while Scopus (n=40,385) returned 32% fewer results than WoS.* However, filtering to Elsevier gave a different picture. We found that WoS (n=9,504) returned 16% fewer articles than Dimensions (n=11,431) and Scopus (n=6,345) returned 44% fewer. Given this and considering that our primary focus was research published by Elsevier, we opted to use Dimensions.
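The size of a coverage gap depends on which count is used as the baseline. The short Python sketch below uses the counts quoted above and computes the gap both ways: relative to the smaller count ("X% more") and relative to the larger count ("X% fewer"). The post does not state which convention was used; the quoted figures of 32%, 16% and 44% appear to correspond to the gap expressed relative to the larger count. The function names are our own.

```python
def pct_more(larger, smaller):
    """Gap as a percentage of the smaller count ("X% more than")."""
    return 100 * (larger - smaller) / smaller


def pct_fewer(larger, smaller):
    """Gap as a percentage of the larger count ("X% fewer than")."""
    return 100 * (larger - smaller) / larger


# Article counts quoted in the post (2015 to 2020, Cambridge affiliated author).
ALL_ARTICLES = {"WoS": 59587, "Dimensions": 58908, "Scopus": 40385}
ELSEVIER_ONLY = {"Dimensions": 11431, "WoS": 9504, "Scopus": 6345}


def compare_to_largest(counts):
    """Compare every platform with the one returning the most records."""
    top_name, top = max(counts.items(), key=lambda item: item[1])
    return {
        name: (pct_more(top, n), pct_fewer(top, n))
        for name, n in counts.items()
        if name != top_name
    }


for label, counts in (("All articles", ALL_ARTICLES), ("Elsevier only", ELSEVIER_ONLY)):
    print(label)
    for name, (more, fewer) in compare_to_largest(counts).items():
        print(f"  vs {name}: {more:.1f}% more / {fewer:.1f}% fewer")
```

The two conventions diverge as the gap widens, so it is worth stating the baseline whenever platform coverage is compared.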

Of the 58,908 records exported from Dimensions, we found that 19% were published in Elsevier journals, making Elsevier the single most chosen publishing venue for Cambridge authors. Filtering to only articles with a Cambridge corresponding author, we again found that Elsevier was the most chosen publishing venue, with over 34% (n=4,564) of the articles published in Elsevier journals. Having looked at publishing levels more broadly, we then broke down the articles published with a Cambridge corresponding author by Open Access category. We found that 22% (n=1,137) of the articles were categorised as closed and therefore behind a paywall, 35% (n=1,585) were paid for via different routes including funder block grants administered by the University, 32% (n=1,467) were self-archived (Green OA), and 8% (n=375) were published in journals that do not charge APCs. Thus, the percentage of articles that are either behind a paywall, or are only available openly because an APC has been paid, is significantly higher than the percentage published open access without any associated fees.

Another aspect of publishing we decided to focus on is funding, asking specifically “Who is funding the Cambridge research published with Elsevier?”. Given the inclusion of funder data in the Dimensions export, we were able to break down the articles by both funder and funder group. Looking at articles with a Cambridge affiliated author, articles with a Cambridge corresponding author, and articles resulting from grants, we found that in each category over 70% were linked to at least one cOAlition S funder. The wider implication of this – specifically for the corresponding author articles – is that in the absence of a read and publish agreement, many of the funders would not pay the APCs associated with publishing in Elsevier journals.
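The "linked to at least one funder in a group" measure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the working group's code: the article records, the funder names and the contents of FUNDER_GROUP are placeholders, and a real analysis would take the group membership from the funder organisation's own published list.

```python
# Placeholder funder group: stands in for a real membership list.
FUNDER_GROUP = {"Funder A", "Funder C"}

# Placeholder article records with one list of funders per article.
ARTICLES = [
    {"id": "art-1", "funders": ["Funder A", "Funder B"]},
    {"id": "art-2", "funders": ["Funder B"]},
    {"id": "art-3", "funders": ["Funder C"]},
    {"id": "art-4", "funders": []},
]


def linked_share(articles, funder_group):
    """Percentage of articles with at least one funder in the group."""
    if not articles:
        return 0.0
    # An article counts once, however many group funders it lists.
    linked = sum(1 for a in articles if funder_group.intersection(a["funders"]))
    return 100 * linked / len(articles)


share = linked_share(ARTICLES, FUNDER_GROUP)
print(f"{share:.0f}% of articles are linked to at least one funder in the group")
```

Counting each article once avoids inflating the share when an article acknowledges several funders from the same group.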

Reading

To provide a picture of the extent to which articles published in Elsevier subscription journals are read at Cambridge, we gathered usage data from COUNTER and the Alma library management system. This allowed us to consider reading over the six-year period between 2015 and 2020 both overall and at a disciplinary level. We found that reading of Elsevier journals was consistently higher in each year than for any other publisher. Reading of Elsevier in 2020 represented 20% of all reading and was at its highest level in physical sciences and engineering. The single highest total of reading in the sub-categories within each discipline was in biochemistry, genetics, and molecular biology within the life sciences, with over 400,000 article downloads in 2020 alone.

Another question we considered is how frequently articles published in Elsevier journals are cited by researchers at Cambridge. To answer this, we took advantage of the Dimensions API to gather a dataset of the cited publications from articles published with a Cambridge affiliated author between 2015 and 2020. The resulting dataset consisted of over 1.2m bibliographic records and revealed that 22% (n=269,917) of the cited articles were published by Elsevier. Interestingly, this percentage closely matches the percentage of articles published in Elsevier journals by Cambridge affiliated authors (19%), the percentage of articles read at Cambridge between 2015 and 2020 (21.78%), and the percentage of publishing with Elsevier at the national level (20%). Using the Dimensions API to enrich the citation data with the open access category, we were able to see that 66% (over 174,000 publications) of the cited Elsevier content is currently paywalled. Elsevier is both the most cited and most paywalled publisher. This observation has wider implications for open research given that many of these articles would be inaccessible to those who are not affiliated with an institution that subscribes to the journals in which the articles appear.

Paying

One of the main questions we considered when looking at data relating to expenditure on Elsevier was how much we pay to publish in Elsevier journals. Our source for this data was OpenAPC – an initiative that aggregates data on open access expenditure and makes it openly available – combined with data from our internal compliance reports. Looking at the overall spend across all institutions that have contributed to the OpenAPC dataset, we can see that over €49,000,000 has been paid to Elsevier. This represents 19% of the total reported spend on article processing charges (APCs). Looking at the data on Cambridge expenditure, we found that between 2015 and 2020, 30% (over £3,000,000) of our total spend on APCs from block grants was paid to Elsevier (the highest spend on any single publisher), with a single payment averaging £3,302 and ranging between £450 and £7,320.

Final notes

This post has covered just some of the questions we have been able to answer with the data. Overall, we think we have been able to demonstrate that Elsevier journals are among the most read and most published in, but are also consistently the most paywalled and the most expensive to publish in of any publisher. This serves to highlight the importance of the ongoing negotiations and of considering other options such as those explored in previous posts. Our complete findings are presented on a dashboard that is accessible to members of the University. Unfortunately, legal restrictions mean we are not able to share the dashboard or underlying datasets externally; however, we have made the Python code we used to gather the citation data available as a Jupyter notebook on Google Colab. This can be used to retrieve the dataset we used to carry out the citation analysis and is easily modifiable for other purposes (see the notebook for full details). We refer the interested reader to the Dimensions API Lab, and the ESAC guide to uncovering the publishing profile of your institution. The former was helpful for learning how to take advantage of the Dimensions API (as were the staff at Dimensions), and the latter has been useful in formulating our approach to the whole project. We are also happy to answer questions about any aspect of our work.

* The original percentage quoted here was 18%. This was incorrect and has now been corrected to 32%.