The Role of Open Data in Science Communication

Itamar Shatz has written a guest blog post for the Office of Scholarly Communication about how public trust in the scientific community increases when researchers make their data openly available to all. He also emphasizes that science communicators (e.g. press offices, journalists, publishers) have a responsibility to point attention directly at the primary source of the data. Itamar is a PhD candidate in the Department of Theoretical and Applied Linguistics at the University of Cambridge. He is also a member of the Cambridge Data Champion programme, having joined at the start of this year. He writes about science and philosophy that have practical applications at Effectiviology.com.

It’s no secret that the public’s view of the scientific community is far from ideal.

For example, a global survey published by the Wellcome Trust in 2019 showed that, on average, only 18% of people indicate that they have a high level of trust in scientists. Furthermore, the survey showed that there are stark differences between people living in different areas of the world; for instance, this rate was more than twice as high in Northern Europe (33%) and Central Asia (32%) as in Eastern Europe (15%), South America (13%), and Central Africa (12%).

Things do appear to be improving, to some degree, especially in light of the recent pandemic. For example, a recent survey in the UK, conducted by the Open Knowledge Foundation, has found that, following the COVID-19 pandemic, 64% of people are now “more likely to listen to expert advice from qualified scientists and researchers”. Similar increases in public confidence have been found in other countries, such as Germany and the USA. However, despite these recent increases, there is still much room for improvement.

Open data can help increase the public’s confidence in scientists

The public’s lack of confidence in scientists is a complex, multifaceted issue that is unlikely to be resolved by a single, neat solution. Nevertheless, one thing that can help alleviate this issue to some degree is open data, which is the practice of making data from scientific studies publicly accessible.

Research on the topic shows just how powerful this tool can be. For example, the recent survey by the Open Knowledge Foundation, conducted in the UK in response to the COVID-19 pandemic, found that 97% of those polled believed that it’s important for COVID-19 data to be openly available for people to check, and 67% believed that all COVID-19 related research and data should be openly available for anyone to use freely. Similarly, a 2019 US survey conducted before the pandemic found that 57% of Americans say that they trust the outcomes of scientific studies more if the data from the studies is openly available to the public.

Overall, such surveys strongly suggest that open data can help increase the public’s trust in scientists. However, it’s not enough for studies to just have open data for it to increase the public’s trust; if people don’t know about the open data, or if they don’t fully understand what it means, then open data is unlikely to be as beneficial as it could be. As such, in the following section we will see some guidelines on how to properly incorporate open data into science communication, in order to utilize this tool as effectively as possible.

How to incorporate open data into science communication

To properly incorporate open data into science communication, there are several key things that people who engage in science communication—such as journalists and scientists—should generally do:

  • Say that the study has open data. That is, you should explicitly mention that the researchers have made the data from their research openly available. Do not assume that people will go to the original study and then learn there about the data being open.
  • Explain what open data is. That is, you should briefly explain what it means for the data to be openly available, and potentially also mention the benefits of making the data available, for example in terms of making research more transparent, and in terms of helping other researchers reproduce the results.
  • Describe what sort of data has been made openly available. For example, you can include descriptions of the type of data involved (surveys, clinical reports, brain scans, etc.), together with some concrete examples that help the audience understand the data.
  • Explain where the data can be found. For example, this can be in the article’s “supplementary information” section, though data should preferably be available in a repository where the dataset has its own persistent identifier, such as a DOI. This ensures that the audience can find and access the data, which may otherwise be hidden behind a paywall, and offers other benefits, such as allowing researchers to directly access and cite the dataset, without navigating through the article.

These practices can help people better understand the concept of open data, particularly as it pertains to the study in question, and can help increase their trust in the openness of the data, especially if it is placed somewhere that they can access themselves.

For one example of how open data might be communicated effectively in a press release, consider the following:

“The researchers have made all the data from this study openly available; this means that all the results from their experiments can be freely accessed by anyone through a repository available at: https://www.doi.org/10.xxxxx/xxxxxxx. This can help other scientists verify and reproduce their results, and will aid future research on the topic.”

Open data in different types of scientific communications

It’s important to note that there’s no single right way to incorporate open data into scientific communications. This can be attributed to various factors, such as:

  • Differences between fields (e.g. biology, economics, or psychology)
  • Differences between types of studies (e.g. computational or experimental)
  • Differences between media (e.g. press release or social media post).

Nevertheless, the guidelines outlined earlier can be beneficial as initial considerations to take into account when deciding how to incorporate open data into science communication. It is up to communicators to make the final modifications, in order to use open data as effectively as possible in their particular situation.

Summarizing what we’ve learned

Though the public’s trust in science is currently growing, there is much room for improvement. One powerful tool that can aid the academic community is open data—the practice of making data from research studies openly available. However, to benefit as much as possible from the presence of open data, it’s not sufficient for a study to merely make its data open. Rather, the accessibility of the data needs to be promoted and explained in scientific communication, and the dataset needs to be cited appropriately (see the Joint Declaration of Data Citation Principles for guidelines regarding this latter point).

What is currently being done

It is important to note that much work is already being done to promote the concept of open data. For example, organizations such as the Research Data Alliance promote discussion of the topic and publish relevant material, as in the case of their recent guidelines and recommendations regarding COVID-19 data.

In addition, at the University of Cambridge, in particular, we can already see a substantial push for open data practices, where appropriate, and from many angles as outlined in the University’s Open Research position statement. Many funding bodies mandate that data be made available, and the University facilitates the process of sharing the data via Apollo, the institutional repository. Furthermore, there are the various training courses and publications—including this very blog—led by bodies such as the Office of Scholarly Communication (OSC), which help to promote Open Research practices at the University. Most notably, there is the OSC’s Data Champion programme, which deals, among other things, with supporting researchers with open data practices.

Moving forward

Promoting the use of open data in scientific communication is something that different stakeholders can do in different ways.

For example, those engaging in science communication—such as journalists and universities’ communication offices—can mention and explain open data when covering studies. Similarly, scientists can ask relevant communicators to cite their open data, and can also mention this information themselves when they engage in science communication directly. In addition, consumers of scientific communication and other relevant stakeholders—such as the general public, politicians, regulators, and funding bodies—can ask, whenever they hear about new research findings, whether the data was made openly available, and if not, then why.

Overall, such actions will lead to increased and more effective use of open data over time, which will help increase the trust people have in scientists. Furthermore, this will help promote the adoption of open data practices in the scientific community, by making more scientists aware of the concept, and by increasing their incentives for engaging in it.

Published 19 June 2020

Written by Itamar Shatz

CCBY icon

Clearing the final hurdle – automating embargo setting

One of the biggest issues facing the Open Access Team has been keeping up with the constant stream of accepted manuscripts that need to be processed. In many cases we receive notification of an accepted manuscript well before formal publication. This has presented a significant challenge over the last five years because although we know there is a publication forthcoming (or at least we trust that there is), we have no idea when an article may actually be published.

This means that we have many thousands of publication records in Apollo which have ‘placeholder’ embargoes because we simply did not know the publication date at the point of archiving and therefore could not set an accurate embargo. After archiving, many of the records in Apollo may have been supplemented with a publication date thanks to metadata supplied via Symplectic Elements, but we still need to set an accurate embargo.

In other cases we might be waiting for an article to be published gold open access so that we can update Apollo with the published version of record.

While we are now very adept at archiving manuscripts in Apollo (thanks in large part to Fast Track and Orpheus) it remains a challenge to properly and accurately update Apollo records with either correct embargoes for accepted manuscripts, or the open access version of record. It is a futile task to be constantly checking whether a manuscript has been published. While the Open Access Team keeps a list of every publication that requires updating, this is a thankless job that should be highly automatable.

To that end, we have recently leveraged Orpheus to do a lot of the heavy lifting for us. By interrogating every journal article in Apollo and comparing its metadata against Orpheus we can now quickly determine which items can be updated and take the necessary next steps, changing embargoes where appropriate or identifying opportunities to archive the published version of record.

To do this we created a DSpace curation task to check every “Article” type in Apollo that had at least one file that was currently under embargo. We then compared the publication metadata against the information held in Orpheus to determine what steps needed to be taken. In total we found 9,164 items in need of some attention. The results are displayed below in a Tableau Public visual and summarised in Table 1.
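As a rough illustration of the kind of comparison described above, the following Python sketch classifies a single record. It is not the DSpace curation task itself: the `Policy` and `Item` classes, the `check_embargo` function, the outcome labels and the order in which the checks are made are all assumptions introduced here for illustration, and the real Apollo and Orpheus data models may differ.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Policy:
    """Hypothetical stand-in for a journal's policy record in Orpheus."""
    aam_embargo_months: Optional[int]  # accepted manuscript embargo; None if unknown
    vor_embargo_months: Optional[int]  # version of record embargo; None if unknown


@dataclass
class Item:
    """Hypothetical stand-in for an Apollo article with an embargoed file."""
    journal: str
    publication_date: Optional[date]
    embargo_end: date
    has_open_vor: bool = False


def add_months(start: date, months: int) -> date:
    """Return the date `months` after `start`, clamped to the month's last day."""
    index = start.month - 1 + months
    year, month = start.year + index // 12, index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def check_embargo(item: Item, policies: dict) -> tuple:
    """Classify one item and return (outcome, embargo end date to store)."""
    # Assumed order of checks: records that already hold an open VoR are skipped.
    if item.has_open_vor:
        return "open VoR already archived", item.embargo_end
    if item.publication_date is None:
        return "no publication date", item.embargo_end
    policy = policies.get(item.journal)
    if policy is not None and policy.vor_embargo_months == 0:
        return "VoR could be archived", item.embargo_end
    if policy is None or policy.aam_embargo_months is None:
        return "no AAM embargo information", item.embargo_end
    # Assumed rule: the embargo runs from the publication date.
    correct_end = add_months(item.publication_date, policy.aam_embargo_months)
    if correct_end < item.embargo_end:
        return "embargo shortened", correct_end
    if correct_end > item.embargo_end:
        return "embargo lengthened", correct_end
    return "embargo unchanged", item.embargo_end


# Example: a placeholder embargo replaced by one based on the publication date.
policies = {"Journal A": Policy(aam_embargo_months=6, vor_embargo_months=None)}
item = Item("Journal A", date(2019, 8, 31), embargo_end=date(2021, 1, 1))
outcome, new_end = check_embargo(item, policies)
print(outcome, new_end)  # embargo shortened 2020-02-29
```

In this sketch a record is only ever changed when both a publication date and an accepted manuscript embargo period are known; every other case is reported and left for follow-up.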

Of these items, 3,864 had a published open access version archived alongside the embargoed manuscript, so we skipped any further updating of these records. This is actually a very good sign, and indicates that the Open Access Team has been going back to records and supplementing them with the open access version of record.

Amongst the remaining items, 2,794 were successfully matched against Orpheus and had their embargoes verified: 1,862 records were updated with shorter embargoes and 412 had longer embargoes applied, leaving 520 items which were unchanged because they already had the correct embargo period.

The final 2,506 items were primarily composed of records with no publication date (1,132 items), publications that could potentially be supplemented by the open access version of record (537 items) or had no embargo information in Orpheus (434 items).

Table 1. Summary of outcomes after comparing Apollo records against Orpheus.

Date archived in Apollo               2014  2015  2016  2017  2018  2019  2020  Total
The item has an open VoR version         7   105  1200  1019  1300   226     7   3864
Accepted version – embargo updated       -     2   145    76   132  2305   134   2794
No publication date available            -     -    10   159    32   714   217   1132
Orpheus VoR embargo: 0                   1     4    51    18     5   451     7    537
No AAM embargo information available     3     6    64    39    33   264    25    434
Other outcome                            8    37   114    47    23   162    12    403
Total                                   19   154  1584  1358  1525  4122   402   9164
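The figures in Table 1 can be cross-checked against the totals quoted in the text. The short Python sketch below does this, assuming the cells of Table 1 are read as listed here and that empty cells count as zero; the variable names are purely illustrative.

```python
years = [2014, 2015, 2016, 2017, 2018, 2019, 2020]

# Each row: (items per year archived in Apollo, row total as given in Table 1).
table = {
    "The item has an open VoR version": ([7, 105, 1200, 1019, 1300, 226, 7], 3864),
    "Accepted version - embargo updated": ([0, 2, 145, 76, 132, 2305, 134], 2794),
    "No publication date available": ([0, 0, 10, 159, 32, 714, 217], 1132),
    "Orpheus VoR embargo: 0": ([1, 4, 51, 18, 5, 451, 7], 537),
    "No AAM embargo information available": ([3, 6, 64, 39, 33, 264, 25], 434),
    "Other outcome": ([8, 37, 114, 47, 23, 162, 12], 403),
}
column_totals = [19, 154, 1584, 1358, 1525, 4122, 402]
grand_total = 9164

# Every row should add up to its stated total.
for label, (cells, total) in table.items():
    assert sum(cells) == total, label

# Every year column should add up to the "Total" row.
for i, year in enumerate(years):
    assert sum(cells[i] for cells, _ in table.values()) == column_totals[i], year

# The row totals should add up to the 9,164 items found by the curation task.
assert sum(total for _, total in table.values()) == grand_total

# Breakdown of the 2,794 verified embargoes: shortened, lengthened, unchanged.
assert 1862 + 412 + 520 == 2794

print("Table 1 is internally consistent")
```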

We plan to run this curation task on a regular basis and periodically check the outcomes. Any items that continually fail to update will be processed manually by the Open Access Team, but our intention and desire is to move away from manual processing wherever possible.

Published 3 April 2020

Written by Dr Arthur Smith

Image showing that this blog post is under CC-BY licence.

2019 That Was The Year That Was 

This is our traditional yearly blog about what we have been doing at the OSC in Cambridge. We are publishing it a little later than intended, but this is an indication of how busy the beginning of 2020 has been here in the Office of Scholarly Communication.

2019 saw us more in a ‘business as usual’ phase as we knuckled down and got on with supporting researchers in Cambridge. That aside, we still had some major developments in Open Research and this work will continue into 2020 and beyond.  

Policy changes 

2019 saw a number of happenings in the policy space at Cambridge. Most excitingly, the University’s Position Statement on Open Research was announced in February, making it one of the first UK universities to have such a statement. This demonstrates the University’s commitment to making open research a reality at Cambridge. 

Following on from this, in July 2019, the University together with Cambridge University Press  announced that they have signed up to the San Francisco Declaration on Research Assessment (DORA). The newly created Open Research Steering Committee, headed by the University’s Pro-Vice Chancellor for Research, will have oversight over the open research direction and the implementation of DORA. The Steering Committee and their working groups are currently looking into open research training, open research infrastructure (such as electronic research notebooks), Plan S and DORA. 

In December, an updated version of the Research Data Management Policy Framework was released. This update brings the policy framework in alignment with funder requirements and acknowledges the important roles that Principal Investigators, research staff and students, and University support staff play in good data management practices. It sits beneath the Position Statement on Open Research, with the documents being closely aligned. 

Open access news 

The Open Access Service made great strides towards automating many of its processes this year, headlined by the introduction of Orpheus and Fast Track. Orpheus is a custom database of publisher open access policies, and when combined with Fast Track for manuscript processing, it allows the Open Access Service to reduce the number of steps required to archive a manuscript in Apollo. In 2019, 8,325 manuscript submissions were processed through Fast Track. In total, the Open Access Service responded to 13,609 submissions or enquiries in 2019, equal to about 37 requests per day. 

Our Request a Copy service received 7,626 requests in 2019. One of the most requested items was “HIV-1 remission following CCR5Δ32/Δ32 haematopoietic stem-cell transplantation” (DOI: 10.1038/s41586-019-1027-4), which received 77 requests. The authors of the paper responded to and fulfilled each request, giving readers free access to the publication well before the end of Nature’s six-month embargo. Now that the accepted manuscript is out of embargo, it has received a further 326 downloads to date in Apollo. The success of the Request a Copy service once again demonstrates the need for access to scholarly research at the earliest opportunity. Embargoes, even ‘short’ six-month embargoes, are a needless barrier to the University’s research outputs. 

Data news 

Aside from the update to the Research Data Management Policy Framework (see above), the most significant development from 2019 has been the continued evolution of the Data Champion Programme.

We welcomed 40 new Data Champions (DCs) from across several Schools increasing the size of our network to 86. With such a large cohort of Champions a new idea of creating departmental hubs was initiated to increase collaboration and the sharing of practices by Data Champions from the same areas. This has proved really successful in both Chemistry and Engineering, with a more coordinated approach having the effect of greater productivity from the Champions in those areas in engaging others with data management. 

In 2019, the Data Champions also tried out a mentoring scheme for the first time whereby established Champions support new Champions in finding their feet and give them ideas about how to provide support to their own community. This has proved to be a great success and the scheme is being run for a second year for the new cohort of Champions joining in early 2020. 

Finally, a new paper on the Data Champion community was published, Establishing, Developing and Sustaining a Community of Data Champions, by DC alumnus James Savage and our colleague Lauren Cadwallader in Data Science Journal. 

Thesis news 

The requirement to deposit an electronic copy of a PhD thesis in order to graduate has become normal business now. In 2019, 1,197 theses were deposited, with 47% being made fully open access. In addition, around 100 requests to digitise historical theses were received from their authors and 1,015 requests for scans of historical theses were received from requesters. 

Training 

In 2019 we took a broad perspective and examined how training was contributing to promoting and supporting Open Research at Cambridge. The Task Group on Open Research Training, comprised of representatives of several libraries and colleagues from other areas of the University, conducted a number of projects to understand where we are at the moment and plan a strategy for the future. The details of that work will be presented at the RLUK 2020 conference in March but, as a ‘sneak peek’, here are some of the conclusions we drew: 

  • We’re stronger together: researchers will benefit if we build stronger communication between training providers. 
  • Open Research training should not be seen in isolation from the rest of research; rather, it should be a key component of the way students learn to do research. 
  • Postdocs and senior researchers want to learn independently; we can support them with better-presented information online and by facilitating events and dialogue. 
  • We want to be able to constantly improve our training and demonstrate impact by exploring ways to evaluate ourselves, while also being aware of the lurking danger of irresponsible metrics in our own evaluation.  

Alongside the strategy work, we continued to expand the training we offer on Open Access, Research Data Management, publishing, copyright and more. A growing number of departments have requested sessions and we have partnered with PLOS and the Office for Postdoctoral Affairs to deliver a regular session on peer review. We delivered 56 sessions, reaching over 800 researchers and librarians. In addition, we have offered a session about complying with the REF Open Access requirements to departments; the Open Access team outdid themselves by delivering 20 sessions to individual departments in just over three months. 

Outreach activities 

In 2019 we hosted several events, from workshops to a one-day symposium dealing with open access monographs, FAIR data, preprints, reproducibility in social sciences, Plan S developments in the USA and open research in STEMM.  

Of notable interest is the Symposium on Open Monographs held in October at St Catharine’s College. This one-day event brought together researchers, funders, publishers and learned societies to discuss the benefits and challenges of an open landscape for academic books. The recordings are featured on the OSC YouTube channel and most of the presentations are available in our institutional repository, Apollo. A summary of the key themes that emerged from this symposium was later presented in Unlocking Research. 

October would not have been complete without celebrating Open Access Week. During the week we shared various blogs and online resources and we were delighted to announce the launch of our popular Research Support Ambassador Programme as an open educational resource designed to give learners either an introduction or refresher on key elements of research support. 

Systems 

Apollo has participated in a joint pilot study with Jisc, Symplectic and Sheffield Hallam University to look at the best approaches to integrating the Jisc Publications Router and the research information system Symplectic Elements, via institutional repositories. This pilot has involved working together to look at how well Elements could capture details of articles that Router had sent to our repositories. Router currently works with EPrints and DSpace repositories, the platforms used by Sheffield Hallam and Cambridge respectively. 

Symplectic’s Repository Tools 2 (RT2) integration module was used to harvest Apollo records and de-duplicate them against any existing Elements records. We tested how well this worked for repository records deposited automatically by Router, looking in particular at the volume of duplicate publications and how soon after acceptance notifications were received from Router. The study demonstrated that Router and Elements are technically compatible when used in this way. As a result of this pilot, Jisc and Symplectic are now happy to offer this solution to institutions more widely. 

Some excellent work behind the scenes has resulted in Jisc publishing a series of blogs last November. Their third blog showcases the ORCID IDs in Research Data Management workflows at the University of Cambridge and how a workflow has been implemented in order to create seamless links between researchers and their works using identifiers and different services. Such solutions improve visibility and discoverability across systems, reduce duplication of effort in entering information and avoid identification errors.

This work was made possible by Agustina Martínez García of the Office of Scholarly Communication, Owen Roberson of the Research Office, and Dean Johnson of University Information Services (UIS) who were amongst the winners of the professional services recognition scheme two years ago for their effective collaborative work on the integration of Symplectic Elements and Apollo. 

According to the blog, as of September 2019, 25,550 articles, 1,329 conference proceedings and 1,100 datasets in Apollo have ORCID IDs. 

Saying a big thank you 

2019 saw the departure of the University’s first Head of Scholarly Communication, Dr Danny Kingsley. Many of the achievements of 2019 were due to hard work Danny put in before her departure and for this we’d like to thank her for all she contributed. 

Published 26 February 2020

Compiled by: Maria Angelaki 

Image showing that this blog post is under CC-BY licence.

Contributions from Agustina Martínez-García, Bea Gini, Maria Angelaki, Lauren Cadwallader, Sacha Jones and Arthur Smith.