What questions reveal about researchers’ attitudes to Open Access

By Dr Bea Gini, Training Coordinator

‘Right, that concludes this part of the training session, are there any questions?’ 

I’ve asked this scores of times in the last academic year, and it’s always fascinating to hear what questions emerge. Some have come up often enough that they have earned themselves a new slide in the training session. Others can be really niche, or reveal something about a specific field that is different to all other disciplines. Sometimes a question beautifully cuts through all the frills to challenge a key aspect of what has been discussed. In all cases, they have shown thoughtfulness and a real wish to engage with Open Research. 

Over the last academic year, we trained over 300 researchers on Open Research. In this post, I teased out a few of the most interesting or common questions they have asked about Open Access (OA) to explore what they may reveal about how they relate to the idea of OA. This is not an FAQ page, nor is it a comprehensive resource about OA at Cambridge. I will resist the urge to answer any of the questions, but rather focus on the themes they raise. 

Incentives 

Naturally, many of the questions reflect the incentives in research careers. When speaking to Arts & Humanities groups, the aim to turn a PhD thesis into a monograph is common, so questions are raised over publishers’ attitudes to OA theses and possible access levels for theses in Apollo. With ‘publish or perish’ still a common mantra, we have carefully considered how PhD graduates can deposit their theses in the repository without compromising future publishing deals. Many publishers now realise that an OA thesis is not necessarily a problem, but this is still a debated issue and more conversations between publishers, students, supervisors and libraries are needed.  

With STEMM groups, Registered Reports often come up, prompting discussions of their benefits in securing a publication avenue early and improving reporting practices. And yet the bias against negative results is profoundly embedded and hard to shake. More than once, I was asked ‘but if I do the experiment and get negative results, can I still go back and change the method to see if I can get positive ones?’. The first time I was a little baffled, worrying that I had not properly explained the problems with under-reporting negative results. Yet with further discussion it became clear that the researchers agreed with the principle, but felt that publishing positive results was more likely to earn them citations and prestige. In such a competitive environment, who can blame them for trying to give themselves the best chance?  

At other times, it’s heartening to see that incentives are better aligned between researchers, the academic community, and the public at large. I’ve received growing numbers of questions about how to disseminate findings to colleagues, the general public, and the research subjects themselves. In a few cases, researchers were grappling with dissemination strategies in rural areas of the developing world, where the usual solutions like blogs and podcasts would not work. It prompted me to think more broadly about dissemination strategies, making sure that biases for particular parts of the world or audience types do not come to dominate our suggestions.  

Barriers to Open Access 

By far the most common question I hear is ‘where can I find the money?’, usually asked with some frustration at the gap between what seems to be a great idea (Open Access) and the seemingly insurmountable barrier of Article or Book Processing Charges. This frustration is more common in the Arts, Humanities and Social Sciences (AHSS), whereas in Science, Technology, Engineering and Maths grants often cover publication costs, as long as the applicant remembers to factor those in. Exorbitant costs, as well as concerns about the type of license and dealing with privacy and qualitative data, can contribute to disillusionment with the OA movement, which I fear is growing among AHSS researchers. There is no easy solution to this, especially for researchers who are not funded through Research Councils, and for monographs that can cost close to, or even over, £10,000. But some progress has been made: Read and Publish deals may bridge that gap in some cases, and some alternative business models for monographs are emerging.

Another common question when I speak to enthusiastic PhD students is ‘how can I convince my supervisor to publish OA?’. First of all, it’s great that these discussions are happening between students and supervisors, a great example of where supervision can be a high-value exchange of ideas. The deeper question concerns the decision-making dynamics within the student-supervisor relationship. I have seen extreme cases where supervisors delegated virtually all decisions to the student, trusting in their judgement and the pedagogic value of making mistakes; as well as the opposite, where the students were expected to follow instructions to the letter in almost every aspect of their research. As is usually the case, the optimum must rest somewhere between those extremes. When it comes to OA, are reluctant supervisors helpfully schooling their students in the strategising needed for a successful research career, or are they stifling innovation in a new generation of researchers?  

The last barrier to mention is lack of knowledge. A variety of questions arise on issues of copyright, Green and Gold OA, identifying manuscript versions, funders’ policies, and more. The OA landscape is still developing as we continue to experiment with business models, agreements, workflows, and policies. This means that currently there is a high level of complexity and things change year on year. Researchers, especially those in their early career, have to juggle a large and diverse portfolio of skills, so they could be forgiven for shrugging OA away with an ‘I don’t need to know’. Yet their natural curiosity and belief in the power of free information lead many of them to ask probing questions to understand this landscape. Luckily, these questions are the easiest to answer. We constantly produce and revise training materials to boost researchers’ knowledge, and we have helpdesks and webpages where the answer can be at their fingertips.

All in all 

Taken together, these questions tell us two things. First, researchers are engaging with us: they want to understand how OA works and to have the confidence to embrace it. Second, there are common barriers relating to career incentives, costs and knowledge. By listening carefully and expanding the dialogue with all disciplines, we can work together to reduce or overcome those barriers.

Research Data at Cambridge – highlights of the year so far

By Dr Sacha Jones, Research Data Coordinator

This year we have continued, as always, to provide support and services for researchers to help with their research data management and open data practices. So far in 2020, we have approved more than 230 datasets into our institutional repository, Apollo. This includes Apollo’s 2000th dataset on the impact of health warning labels on snack selection, which represents a shining example of reproducible research, involving the full gamut: preregistration, and sharing of consent forms, code, protocols and data. There are other studies that have sparked media interest for which the data are also openly available in Apollo, such as the data supporting research that reports the development of a wireless device that can convert sunlight, carbon dioxide and water into a carbon-neutral fuel, or the data supporting a study that has used computational modelling to explain why blues and greens are the brightest colours in nature. Also, in the year of COVID, a dataset was published in April on the ability of common fabrics to filter ultrafine particles, associated with an article in BMJ Open. Sharing data associated with publications is critical for the integrity of many disciplines and best practice in the majority of studies, but there is also an important responsibility of science communication in particular to bring research datasets to the forefront. This point was discussed eloquently this summer in a guest blog post in Unlocking Research by Itamar Shatz, a researcher and Cambridge Data Champion. Making datasets open permits their reuse, and if you have wondered how research data is reused, then read this comprehensive data sharing and reuse case study written by the Research Data team’s Dominic Dixon. This centres on the use and value of the Mammographic Image Analysis Society (MIAS) database, published in Apollo five years ago.

This year has seen the necessary move from our usual face-to-face Research Data Management (RDM) training to provision of training online. This has led us to produce an online training session in RDM, covering topics such as data organisation, storage, back up and sharing, as well as data management plans. This forms one component of a broader Research Skills Guide – an online course for Cambridge researchers on publishing, managing data, finding and disseminating research  – developed by Dr Bea Gini, the OSC’s training coordinator. We have also contributed to a ‘Managing your study resources’ CamGuide for Master’s students, providing guidance on how to work reproducibly. In collaboration with several University stakeholders we released last month new guidance on the use of electronic research notebooks (ERNs), providing information on the features of ERNs and guidance to help researchers select one that is suitable. 

At the start of this year we invited members of the University to apply to become Data Champions, joining the pre-existing community of 72 Data Champions. The 2020 call was very successful, with us welcoming 56 new Data Champions to the programme. The community has expanded this year, not only in terms of numbers of volunteers but also in terms of disciplinary focus, where there are now Data Champions in several areas of the arts, humanities and social sciences in particular where there were none previously. During this year, we have held forums in person and then online, covering themes such as how to curate manual research records, ideas for RDM guidance materials, data management in the time of coronavirus, and data practices in the arts and humanities and how these can be best supported. We look forward to further supporting and advocating the fantastic work of the Cambridge Data Champions in the months and years to come.  

Open Access and REF 2021: “Is This Article Non-Compliant?”

By Dr Debbie Hansen, Senior Open Access Adviser, Office of Scholarly Communication

Through much of this REF period, there has been a focus on encouraging Cambridge authors to deposit their accepted manuscripts into our institutional repository.  The Open Access Team has tackled the sometimes tricky tasks of making sure the right version has been deposited with the correct embargo, advising on funders’ open access requirements and managing the payments for gold open access from the UKRI and COAF block grants. 

With the REF period ending, the University is now finalising lists of research outputs to be submitted to REF2021. Alongside this activity, some members of the Open Access Team have been focussing on compliance indicators for the REF open access policy. In Symplectic Elements, the University’s research information management system, all journal or conference articles which fall within the period of the REF open access policy are labelled as either Compliant or Non-compliant.   

Unfortunately, from an administrative point of view, this is not as straightforward as it may seem (but it is fortunate for compliance).  This compliance indicator is set automatically from calculations using the acceptance, first publication and deposit dates as well as the repository embargo lift date.  It is, if you like, a first-pass indicator.  ‘Non-compliant’ articles may end up as being compliant or REF eligible as they may, for example: 

  1. have gaps in their metadata such as missing acceptance or publication dates; 
  2. have incorrect publication dates in the external metadata records (one article can have around 10 separate metadata records (e.g. from Scopus, Crossref, Europe PMC, etc.) and Elements takes the earliest publication date from all the metadata records associated with an article.  01/01/YYYY is a common red herring where only the year of publication (YYYY) has been recorded and the month and day fields have been automatically filled with a default value);  
  3. have embargo lift dates greater than 12 months from first publication (Panel C and D articles can have embargo lengths of up to 24 months but the system does not recognise this); 
  4. be compliantly deposited in a different non-commercial open access repository; 
  5. be eligible for one of the REF exceptions to the policy; 
  6. be published gold open access and so do not need to be deposited in a repository to fulfil the REF open access criteria. 
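The first-pass calculation described above can be sketched in Python. This is a minimal illustration and not the logic Symplectic Elements actually runs: the function names, the status strings and the three-month deposit window measured from acceptance are assumptions made for the example, while the 12- and 24-month embargo limits and the 01/01/YYYY red herring come from the list above.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole calendar months, clamping the day."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    return date(year, month, min(d.day, calendar.monthrange(year, month)[1]))


def first_pass_status(
    accepted: Optional[date],
    published: Optional[date],
    deposited: Optional[date],
    embargo_lift: Optional[date],
    panel: str = "A",
    deposit_window_months: int = 3,  # assumption: not stated in this post
) -> str:
    """Hypothetical first-pass indicator; anything ambiguous is flagged."""
    if accepted is None or published is None:
        return "check: missing acceptance or publication date"
    if published.month == 1 and published.day == 1:
        # Year-only records are often padded to 1 January.
        return "check: publication date may be a year-only default"
    if deposited is None:
        return "check: no local deposit (other repository, gold OA, exception?)"
    if deposited > add_months(accepted, deposit_window_months):
        return "non-compliant: late deposit"
    # Panels C and D allow embargoes of up to 24 months, others 12.
    max_embargo = 24 if panel in ("C", "D") else 12
    if embargo_lift is not None and embargo_lift > add_months(published, max_embargo):
        return "non-compliant: embargo too long"
    return "compliant"
```

Passing the panel explicitly is the design choice that matters here: the same 18-month embargo is flagged for a Panel A article but accepted for a Panel C one, which is the distinction the automatic indicator is described as missing.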

If an article is showing as non-compliant, it generally requires individual investigation by a team member.  However, as has been raised in previous blogs, we try to develop processes to balance staff resources against the sheer numbers of articles.  For this problem, I will mention two tools we have been using to address in bulk three common article scenarios: missing acceptance or publication dates, deposited in another repository and published gold open access. 

Missing acceptance or publication dates 

Acceptance dates are not always openly provided by a journal or conference and some publication dates can be hard to find (e.g. for some humanities, arts and social science journals) or have been missed for some other reason.  In these instances, the author may be able to help.  For example, they may be able to check past correspondence with the journal or with co-authors.  

Our colleagues in the University Research Information Office, Agustina Martínez-García and Owen Roberson, developed an internal, simple-to-use tool, aptly named LastMinute.CAM1.  This tool uses an article’s Elements identifier to create an article-customised form that can be sent to an author to request missing information.  The form is pre-populated with the article title and other information already held about an article (e.g. its digital object identifier (DOI)) and the author can fill in missing acceptance or publication dates.  Once the form is submitted, the data populates a new record for the article in Symplectic Elements and is used, alongside all the other data for that article, in the compliance calculations.  We have tried to use LastMinute.CAM for this purpose on a considered basis (we do not wish to contact authors unnecessarily) and have attempted to resolve the issue of missing dates, and links to articles in other repositories (next section), in this way for hundreds of papers via mail merge lists.

Article deposited in another repository 

Some authors have been contacted with the LastMinute.CAM form because their article was deposited late in Apollo, or there is no deposit at all, but their article may be compliant in another repository (e.g. deposited by a co-author at another university).  LastMinute.CAM is integrated with Unpaywall: the application searches Unpaywall data via its Application Programming Interface (API) and records in the form the link to the preferred open access location, together with the article version if available.  A recipient of the form can accept this, or remove it (they may know it was not compliantly deposited) or amend the repository link and version already populated in the form with an alternative.   

Having a link to an article in another repository is of course a first step.  A team member will need to check the link (we have found URLs to non-repository web pages) and investigate whether the article is compliantly deposited in the other repository.  However, when we do find a compliant deposit, this source is already recorded for us, removing some of the legwork we would otherwise need to do to complete our records. 
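As an illustration of the kind of lookup described in this section, the sketch below queries the public Unpaywall REST API for one DOI and pulls out the preferred open access location. It is not the LastMinute.CAM code, which is internal; the function names and the shape of the returned dictionary are assumptions for the example. The endpoint and the `best_oa_location`, `url`, `version` and `host_type` fields are part of the published Unpaywall data format.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

UNPAYWALL_URL = "https://api.unpaywall.org/v2/{doi}?email={email}"


def fetch_unpaywall_record(doi: str, email: str) -> dict:
    """Fetch the Unpaywall record for one DOI (makes a network call)."""
    url = UNPAYWALL_URL.format(
        doi=urllib.parse.quote(doi), email=urllib.parse.quote(email)
    )
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def preferred_oa_location(record: dict) -> Optional[dict]:
    """Summarise the best OA location, or return None if there is none."""
    best = record.get("best_oa_location")
    if not best:
        return None
    return {
        "url": best.get("url"),
        # e.g. "acceptedVersion" or "publishedVersion"; may be missing
        "version": best.get("version"),
        # True only when Unpaywall classes the host as a repository
        "in_repository": best.get("host_type") == "repository",
    }
```

A link found this way is only a candidate: as noted above, it still needs a person to confirm that it points to a repository and that the deposit there is compliant.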

Article published gold open access 

Unpaywall has also been a great tool for identifying articles that have been published open access through the gold route.  The Unpaywall Simple Query Tool accepts a list of up to 1000 DOIs and returns a report of the open access status of the article associated with each DOI.  We do need to analyse the results carefully and discard, for example, those made open through the accepted manuscript and the green route, published versions without an open licence (bronze open access) and those published with an open licence but only after a defined time delay.  Once we are happy with the cleaned list this can be used as input to an Elements API script (also developed by Agustina Martínez-García) to bulk annotate articles that have been identified as being published as gold open access.  To date we have identified over 1000 articles in this way. 
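The cleaning step can be approximated with a simple filter over Unpaywall-style records. This is a rough sketch under stated assumptions, not the team’s procedure: it treats a record as a gold candidate only when Unpaywall reports an `oa_status` of `gold` or `hybrid` and the best location is the publisher-hosted published version with a Creative Commons licence. The helper names are hypothetical, and articles opened only after a delay cannot be detected from these fields, so they would still need checking by hand.

```python
from typing import Iterable, List


def looks_gold(record: dict) -> bool:
    """Heuristic: published version, on the publisher site, with a CC licence."""
    best = record.get("best_oa_location") or {}
    licence = best.get("license") or ""
    return (
        record.get("oa_status") in {"gold", "hybrid"}  # drops green and bronze
        and best.get("host_type") == "publisher"
        and best.get("version") == "publishedVersion"
        and licence.startswith("cc-")  # assumption: CC licences only
    )


def shortlist_gold(records: Iterable[dict]) -> List[str]:
    """Return the DOIs of records that pass the heuristic."""
    return [r["doi"] for r in records if looks_gold(r)]
```

The resulting list of DOIs is a shortlist for review, which matches the caution in the text that results need careful analysis before any bulk annotation.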

Summary 

Henceforth we plan to run the gold OA bulk ‘exception’ process monthly and have in the background the option to use LastMinute.CAM further to gather missing information via targeted mail shots to authors.  We will also be addressing in an automated way those articles that were compliantly deposited and with the correct embargo applied but not recognised as compliant by the system due to a ‘perceived’ too-long embargo.  These activities will leave a far more manageable set of articles, showing as non-compliant, for which more detailed investigations into why articles are being labelled non-compliant can be made and action taken (such as the application of eligible REF exceptions) as appropriate. 

One final comment: once the submission to REF has been made there will be a period of reflection. Effective tools, like those mentioned here, that help make our processes more efficient will feature in this review, which will help to define our future activities in this space.

1 This tool is only available internally to University of Cambridge researchers, and is not indexed in Google or any search engine 

Open Access for Librarians: Putting Together the Puzzle

Claire Sewell, Research Support Librarian, Betty & Gordon Moore Library

This Open Access week I’ve been reflecting back on my time training library staff in research support. As anyone working in this area will know, an understanding of the principles of open access is key to getting to grips with many of the issues covered by the scholarly communications remit so it’s important that librarians get a good grasp of the basics. Open access is a topic rich in terminology and interconnected concepts which can make teaching it a little bit like putting together a jigsaw puzzle with no finished image to guide you. Many introductory sessions begin with an overview of what open access actually means – the process of making the outputs of funded research available online for anyone to read. So far, so simple but even this assumes some knowledge of the current academic publishing system. I often need to spend longer talking about this than I had planned before we can move onto the rest of the session and the pauses don’t stop there. Outlining the importance of open access involves explaining the REF, describing the practicalities means defining what we mean by a repository and describing the different types of OA can be hard when your audience don’t understand the concept of an embargo. 

No two audiences are ever the same as everyone has a different view of the finished picture and I need to be able to provide them with the pieces they need to complete their own OA puzzle. As a result, every session has to be adaptable to the needs of the people in the room. Whilst I still have an overall plan for any open access session, I find it’s a good idea to have some small pre-prepared slides or activities which embed key concepts that I can include if needed. I’ve also come to the realisation that it doesn’t matter which order you place your slides in as you will have to shuffle through them at random as your audience asks questions! This is not always a bad thing as it keeps me on my toes and improves my practice.  

The most common questions I get are detailed below: 

  1. Definitions of various terms – audiences need to know what things such as embargoes, repositories, author accepted manuscript and APC are, but it can be hard to explain one without an understanding of some element of another. Having some type of primer on hand can really help people to understand the language you’re using. 
  2. Manuscript versions – something a lot of people struggle with is which version of a manuscript is which and how this impacts sharing via OA. I find that a visual representation offers the best explanation and often rely on this graphic from our OA FAQs – something I’ve been told makes all the difference. 
  3. Practicalities of OA – this will vary between institutions but a common question is how you actually go through the process of making outputs open. If you can, building in time for a demonstration and/or some hands-on experience can really help learners to understand the process and find all sorts of tricky problems for you to explain! 

So, the message is – no matter who your audience is, it pays to be flexible. Much like the rest of the open access landscape one size definitely does not fit all! 

Preparing for the end of COAF

The Open Access team are getting ready for the end of the Charity Open Access Fund (COAF), which is due to dissolve on 30th September 2020.

From 1st October 2020 onward, there are going to be changes to the block grants that we receive, and as a result, there will be a change in our policies on whether or not we can cover researchers’ article processing charges (APCs).  

We have outlined how researchers should go about securing funding for their APCs below:

| Funder name | Are article processing charges covered by a block grant? | How do I pay for my article processing charge? |
| --- | --- | --- |
| UKRI | Yes | No change: researchers should continue to upload their paper to us for a funding decision |
| Wellcome Trust | Yes | No change: researchers should continue to upload their paper to us for a funding decision |
| Cancer Research UK | Yes | No change: researchers should continue to upload their paper to us for a funding decision |
| British Heart Foundation | Yes | No change: researchers should continue to upload their paper to us for a funding decision |
| Blood Cancer UK | No: authors must include cost in their grant application | 1. For payment, contact research@bloodcancer.org.uk; 2. Upload your paper to ensure REF compliance. |
| Parkinson’s UK | No: authors must include cost in their grant application | 1. For payment, contact researchapplications@parkinsons.org.uk; 2. Upload your paper to ensure REF compliance. |
| Versus Arthritis | No: authors must request support direct from funder | 1. Use funder’s Grant Tracker for OA support; 2. Upload your paper to ensure REF compliance. |
| Multiple funders acknowledged | | If your paper includes funding from UKRI, Wellcome Trust, Cancer Research UK or British Heart Foundation then we may be able to help with the APC. Researchers should upload their paper to us for a funding decision |

There is no change in the funders’ open access policies for the rest of 2020. However, there are significant changes due in 2021, specifically to the Wellcome Trust and Cancer Research UK policies.

We have outlined the policy changes in the table below: 

| Funder name | Change? | Outline of policy |
| --- | --- | --- |
| Wellcome Trust | Change (see new policy document) | 1. Policy covers original research articles; 2. Policy applies to papers submitted for publication after 1/1/2021; 3. Papers must be made immediately open access (no embargo allowed) in Europe PMC; 4. Papers must be published with a CC BY licence; 5. Papers must be published in a journal that is indexed in DOAJ (Wellcome will no longer cover APCs for subscription journals); 6. The authors must retain their copyright. |
| Cancer Research UK | Change (see new policy document) | 1. Policy covers original research articles; 2. Policy applies to all papers after 1/1/2021; 3. Papers must be made immediately open access (no embargo allowed) in Europe PMC; 4. Papers must be published with a CC BY licence. |
| Multiple funders acknowledged | | Any papers acknowledging Wellcome Trust or Cancer Research UK must be compliant in order to access funds. |

To summarise:

From 1 October 2020, authors should continue to submit their papers to the Open Access Team as usual via our website. The Open Access Team will continue to advise on the best course of action to meet funder requirements, but we may not always be able to pay APCs. 

The funders’ policies remain the same until 1st January 2021. We advise authors covered by Wellcome Trust and Cancer Research UK to familiarise themselves with the changes to their funder’s open access policies, which are outlined in COAF’s table.

Data sharing and reuse case study: the Mammographic Image Analysis Society database

The Mammographic Image Analysis Society (MIAS) database is a set of mammograms put together in 1992 by a consortium of UK academic institutions and archived on 8mm DAT tape, copies of which were made openly available and posted to applicants for a small administration fee. The mammograms themselves were curated from the UK National Breast Screening Programme, a major screening programme that was established in the late 80s, offering routine screening every three years to women aged between 50 and 64.

The motivations for creating the database were to make a practical contribution to computer vision research – which sought to improve the ability of computers to interpret images – and to encourage the creation of more extensive datasets. In the peer-reviewed paper bundled with the dataset, the researchers note that “a common database is a positive step towards achieving consistency in performance comparison and testing of algorithms”.

Due to increased demand, the MIAS database was made available online via third parties, albeit in a lower resolution than the original. Despite no longer working in this area of research, the lead author, John Suckling – now Director of Research in the Department of Psychiatry, part of Cambridge Neuroscience –  started receiving emails asking for access to the images at the original resolution. This led him to dig out the original 8mm DAT tapes with the intention of making the images available openly in a higher resolution. The tapes were sent to the University Information Service (UIS), who were able to access the original 8mm tape and download higher resolution versions of the images. The images were subsequently deposited in Apollo and made available under a CC BY license, meaning researchers are permitted to reuse them for further research as long as appropriate credit is given. This is the most commonly used license for open datasets and is recommended by the majority of research funding agencies.

Motivations for sharing the MIAS database openly

The MIAS database was created with open access in mind from the outset. When asked whether he had any reservations about sharing the database openly, the lead author John Suckling noted:

“There are two broad categories of data sharing; data acquired for an original purpose that is later shared for secondary use; data acquired primarily for sharing. This dataset is an example of the latter. Sharing data for secondary use is potentially more problematic especially in consortia where there are a number of continuing interests in using the data locally. However, most datasets are (or should be) superseded, and then value can only be extracted if they are combined to create something greater than the sum of the parts. Here, careful drafting of acknowledgement text can be helpful in ensuring proper credit is given to all contributors.”

This distinction – between data acquired for an original purpose that is later shared for secondary use and data acquired primarily for sharing – is one that is important and often overlooked. The true value of some data can only be fully realised if openly shared. In such cases, as Suckling notes, sufficient documentation can help ensure the original researchers are given credit where it is due, as well as ensuring it can be reused effectively. This is also made possible by depositing the data on an institutional repository such as Apollo, where it will be given a DOI and its reuse will be easier to track.

Impact of the MIAS database

As of August 2020, the MIAS database has received over 5500 downloads across 27 different countries, including some developing countries where breast cancer survival rates are lower. Google Scholar currently reports over 1500 citations for the accompanying article as well as 23 citations for the dataset itself. A review of a sample of the 1500 citations revealed that many were examples of the data being reused rather than simply citations of the article. Additionally, a systematic review published in 2018 cited the MIAS database as one of the most widely used for applying breast cancer classification methods in computer aided diagnosis using machine learning, and a benchmarking review of databases used in mammogram research identified it as the most easily accessible mammographic image database. The reasons cited for this included the quality of the images, the wide coverage of types of abnormalities, and the supporting data which provides the specific locations of the abnormalities in each image.

The high impact of the MIAS database is something Suckling credits to the open, unrestricted access to the database, which has been the case since it was first created. When asked whether he has benefited from this personally, Suckling stated: “Direct benefits have only been the citations of the primary article (on which I am first author). However, considerable efforts were made by a large number of early-career researchers using complex technologies and digital infrastructure that was in its infancy, and it is extremely gratifying to know that this work has had such an impact for such a large number of scientists.” Given that the database continues to be widely cited and has been downloaded from Apollo 1358 times since January 2020, it is clearly still having a wide impact.

The MIAS Database Reused

As mentioned above, the MIAS database has been widely reused by researchers working in the field of medical image analysis. While originally intended for use in computer vision research, one of the main ways in which the dataset has been used is in the area of computer aided diagnosis (CAD), for which researchers have used the mammographic images to experiment with and train deep learning algorithms. CAD aims to augment manual inspection of medical images by medical professionals in order to increase the probability of making an accurate diagnosis.

A 2019 review of recent developments in medical image analysis identified lack of good quality data as one of the main barriers researchers in this area face. Not only is good quality data a necessity but it must also be well documented as this review also identified inappropriately annotated datasets as a core challenge in CAD. The MIAS database is accompanied by a peer-reviewed paper explaining its creation and content as well as a read me PDF which explains the file naming convention used for the images as well as the annotations used to indicate the presence of any abnormalities and classify them based on their severity. The presence of this extensive documentation combined with it having been openly available from the outset could explain why the database continues to be so widely used.

Reuse example: Applying Deep Learning for the Detection of Abnormalities in Mammograms

This research, published in 2019 in Information Science and Applications, looked at improving some of the current methods used in CAD. It attempted to address some inherent shortcomings and to increase the competency of deep learning models when it comes to minimising false positives in mammographic imaging. The researchers used the MIAS database alongside another larger dataset in order to evaluate the performance of two existing convolutional neural networks (CNNs), which are deep learning models used specifically for classifying images. Using these datasets, they were able to demonstrate that versions of two prominent CNNs were able to detect and classify the severity of abnormalities on the mammographic images with a high degree of accuracy.

While the researchers were able to make good use of the MIAS database to carry out their experiments, due to the inclusion of appropriate documentation and labelling, they do note that since it is a relatively small dataset it is not possible to rule out “overfitting”, where a deep learning model is highly accurate on the data used to train the model, but may not generalise well to other datasets. This highlights the importance of making such data openly available as it is only possible to improve the accuracy of CAD if sufficient data is available for researchers to carry out further experiments and improve the accuracy of their models.

Reuse example: Computer aided diagnosis system for automatic two stages classification of breast mass in digital mammogram images

This research, published in 2019 in Biomedical Engineering: Applications, Basis and Communications, used the MIAS database along with the Breast Cancer Digital Repository to test a CAD system based on a probabilistic neural network – a machine learning model that predicts the probability distribution of a given outcome – developed to automate classification of breast masses on mammographic images. Unlike previously developed models, their model was able to segment and then carry out a two-stage classification of breast masses. This meant that rather than classifying masses into either benign or malignant, they were able to develop a system which carried out a more fine-grained classification consisting of seven different categories. Combining the two different databases allowed for an increased confidence level in the results gained from their model, again raising the importance of the open sharing of mammographic image datasets. After testing their model on images from these databases, they were able to demonstrate a significantly higher level of accuracy at detecting abnormalities than had been demonstrated by two similar models used for evaluation. On images from the MIAS Database and Breast Cancer Digital Repository their model was able to detect abnormalities with an accuracy of 99.8% and 97.08%, respectively. This was also accompanied by increased sensitivity (the proportion of actual positives that are correctly identified) and specificity (the proportion of actual negatives that are correctly identified).
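For readers less familiar with these evaluation measures, the short Python sketch below shows how accuracy, sensitivity and specificity are conventionally calculated from the four counts of a binary confusion matrix. It is only an illustration: the class name, function names and example counts are hypothetical and are not taken from the paper, which reports a finer-grained, seven-category classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier evaluated on labelled images."""
    true_positives: int   # abnormal images flagged as abnormal
    false_positives: int  # normal images flagged as abnormal
    true_negatives: int   # normal images flagged as normal
    false_negatives: int  # abnormal images flagged as normal


def sensitivity(c: ConfusionCounts) -> float:
    """Proportion of actual positives that were correctly identified."""
    return c.true_positives / (c.true_positives + c.false_negatives)


def specificity(c: ConfusionCounts) -> float:
    """Proportion of actual negatives that were correctly identified."""
    return c.true_negatives / (c.true_negatives + c.false_positives)


def accuracy(c: ConfusionCounts) -> float:
    """Proportion of all images that were classified correctly."""
    correct = c.true_positives + c.true_negatives
    total = correct + c.false_positives + c.false_negatives
    return correct / total


# Hypothetical counts, invented purely for illustration.
example = ConfusionCounts(
    true_positives=45, false_positives=5,
    true_negatives=40, false_negatives=10,
)

if __name__ == "__main__":
    print(f"sensitivity: {sensitivity(example):.3f}")
    print(f"specificity: {specificity(example):.3f}")
    print(f"accuracy:    {accuracy(example):.3f}")
```

Reporting all three measures matters because a model can score well on accuracy alone simply by favouring the more common class in a dataset.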

Conclusion

Many areas of research can only move forward if sufficient data is available and if it is shared openly. This, as we have seen, is particularly true in medical imaging where despite datasets such as the MIAS database being openly available, there is a data deficiency which needs to be addressed in order to improve the accuracy of the models used in computer-aided diagnosis. The MIAS database is a clear example of a dataset that has enabled an important area of research to move forward by enabling researchers to carry out experiments and improve the accuracy of deep learning models developed for computer-aided diagnosis in medical imaging. The sharing and reuse of the MIAS database provides an excellent model for how and why future researchers should make their data openly available.

Published 20th August 2020
Written by Dominic Dixon

CCBY icon

Cambridge response to the UKRI open access policy review

Open access is transforming scholarly communication, and both the University and its Press are fully committed to the transition to open access publishing without embargo. It is inspiring us to think more deeply about how the research publishing ecosystem can be improved to the benefit of all society.

The open access policy review being conducted by UK Research and Innovation (UKRI) will have a major impact on how publicly funded research in the UK is published. The UK already has a strong commitment to open access, and we look forward to the new UKRI policy dramatically speeding up the country’s transition to open access.

Cambridge unites a world-leading research university, with a world-renowned Press and Library. We believe there is strength in this partnership, including the ability to challenge and test solutions that must work for academics, funders, publishers and research institutions. Our joint response to the UKRI policy review reflects the range of perspectives across the University and highlights some of the challenges and opportunities we face as an academic university and publisher.

In brief:

  • There are many aspects of the proposed UKRI policy that we support without reservation. For example, authors should retain their copyright, journals and publishers should be more transparent about their services and costs, and key metadata, such as funder and grant information and author IDs, are vital for efficient scholarly communication and research evaluation infrastructures.
  • There is a conflict between the need for sustainable journal publishing models that provide access to the final published article and affordability for research-intensive universities. Collectively we believe that this contradiction in approach is not sustainable and necessitates a UKRI policy that is more flexible in the short term while supporting a much bolder shift in publishing practice that will require significant changes from all stakeholders. The Library and the Press are working together to explore bold innovation and disruption for scholarly communications built round a shared commitment to the goals of open research.
  • There are also areas where we agree that allowances must be made for the different needs of different research communities. While all research communities must be able to benefit from OA, flexibility on details such as Creative Commons licenses and third party content is needed to allow research, and international collaboration, to flourish. There are concerns from academics, Library and the Press, for example, about the potential for requiring open access to all monographs in the REF-after-REF 2021 in the absence of funding for publishing these monographs, around the cost implications of requiring open access to articles and monographs that include third party content and around unintended consequences for early career researchers in certain disciplines.
  • For books, we need the time and freedom to find scalable, sustainable approaches to OA. No model has been found so far that would allow us to publish large numbers of high-quality OA books at the global scale and reach of the Press. The impact of making pre-final versions of books open access after an embargo is inadequately understood, undesirable from the perspective of researchers in particular disciplines and may be economically unrealistic (because we believe book purchasing habits will change significantly under a delayed-OA approach). While new approaches are explored, we suggest a couple of options for UKRI to consider adopting: (i) broadening the definition of ‘open’ to include ‘free to read’ and (ii) allowing books to be published under a ‘transformative programme’, perhaps along the lines of the Subscribe To Open model for journals.
  • For journal articles, we cannot ignore an essential paradox. On the one hand, zero embargo Green OA depends upon subscriptions which are becoming ever more unsustainable as more content becomes OA. On the other hand, many research-intensive organizations are unable to pay the costs of their publishing without subsidies from subscribers around the world. Our academic University would need to comply with the proposed UKRI policy predominantly through the Green OA route, while CUP needs to transition to Gold OA. To resolve this paradox during a world-wide shift to full open access, UKRI must make two transitionary allowances: modest embargoes can be applied by publishers to support the subscriptions that sustain Green OA, and Gold OA in hybrid journals must continue to be supported. We want to see a scholarly communications landscape that has diversity reflecting the breadth of scholarship across the disciplines, including smaller publishers and learned societies that require support in the transition to Open Access.

As we said earlier, we look forward to the new UKRI policy dramatically speeding up the UK’s transition to OA. We hope that the fine details of the policy will allow us to fully play our part in the transformation.

This post has been developed jointly by Cambridge University Libraries and Cambridge University Press and has also been shared at https://www.cambridge.org/core/blog/?p=36924.

The Role of Open Data in Science Communication

Itamar Shatz has written a guest blog post for the Office of Scholarly Communication about how public trust in the scientific community increases when researchers make their data openly available to all. He also emphasizes that science communicators (e.g. press offices, journalists, publishers) have a responsibility to point attention directly at the primary source of the data. Itamar is a PhD candidate in the Department of Theoretical and Applied Linguistics at the University of Cambridge. He is also a member of the Cambridge Data Champion programme, having joined at the start of this year. He writes about science and philosophy that have practical applications at Effectiviology.com.

It’s no secret that the public’s view of the scientific community is far from ideal.

For example, a global survey published by the Wellcome Trust in 2019 showed that, on average, only 18% of people indicate that they have a high level of trust in scientists. Furthermore, the survey showed that there are stark differences between people living in different areas of the world; for instance, this rate was more than twice as high in Northern Europe (33%) and Central Asia (32%) as in Eastern Europe (15%), South America (13%), and Central Africa (12%).

Things do appear to be improving, to some degree, especially in light of the recent pandemic. For example, a recent survey in the UK, conducted by the Open Knowledge Foundation, has found that, following the COVID-19 pandemic, 64% of people are now “more likely to listen to expert advice from qualified scientists and researchers”. Similar increases in public confidence have been found in other countries, such as Germany and the USA. However, despite these recent increases, there is still much room for improvement.

Open data can help increase the public’s confidence in scientists

The public’s lack of confidence in scientists is a complex, multifaceted issue that is unlikely to be resolved by a single, neat solution. Nevertheless, one thing that can help alleviate this issue to some degree is open data, which is the practice of making data from scientific studies publicly accessible.

Research on the topic shows just how powerful this tool can be. For example, the recent survey by the Open Knowledge Foundation, conducted in the UK in response to the COVID-19 pandemic, found that 97% of those polled believed that it’s important for COVID-19 data to be openly available for people to check, and 67% believed that all COVID-19 related research and data should be openly available for anyone to use freely. Similarly, a 2019 US survey conducted before the pandemic found that 57% of Americans say that they trust the outcomes of scientific studies more if the data from the studies is openly available to the public.

Overall, such surveys strongly suggest that open data can help increase the public’s trust in scientists. However, it’s not enough for studies to just have open data for it to increase the public’s trust; if people don’t know about the open data, or if they don’t fully understand what it means, then open data is unlikely to be as beneficial as it could be. As such, in the following section we will see some guidelines on how to properly incorporate open data into science communication, in order to utilize this tool as effectively as possible.

How to incorporate open data into science communication

To properly incorporate open data into science communication, there are several key things that people who engage in science communication—such as journalists and scientists—should generally do:

  • Say that the study has open data. That is, you should explicitly mention that the researchers have made the data from their research openly available. Do not assume that people will go to the original study and then learn there about the data being open.
  • Explain what open data is. That is, you should briefly explain what it means for the data to be openly available, and potentially also mention the benefits of making the data available, for example in terms of making research more transparent, and in terms of helping other researchers reproduce the results.
  • Describe what sort of data has been made openly available. For example, you can include descriptions of the type of data involved (surveys, clinical reports, brain scans, etc.), together with some concrete examples that help the audience understand the data.
  • Explain where the data can be found. For example, this can be in the article’s “supplementary information” section, though data should preferably be available in a repository where the dataset has its own persistent identifier, such as a DOI. This ensures that the audience can find and access the data, which may otherwise be hidden behind a paywall, and offers other benefits, such as allowing researchers to directly access and cite the dataset, without navigating through the article.

These practices can help people better understand the concept of open data, particularly as it pertains to the study in question, and can help increase their trust in the openness of the data, especially if it is placed somewhere that they can access themselves.

For one example of how open data might be communicated effectively in a press release, consider the following:

“The researchers have made all the data from this study openly available; this means that all the results from their experiments can be freely accessed by anyone through a repository available at: https://www.doi.org/10.xxxxx/xxxxxxx. This can help other scientists verify and reproduce their results, and will aid future research on the topic.”

Open data in different types of scientific communications

It’s important to note that there’s no single right way to incorporate open data into scientific communications. This can be attributed to various factors, such as:

  • Differences between fields (e.g. biology, economics, or psychology)
  • Differences between types of studies (e.g. computational or experimental)
  • Differences between media (e.g. press release or social media post).

Nevertheless, the guidelines outlined earlier can be beneficial as initial considerations to take into account when deciding how to incorporate open data into science communication. It is up to communicators to make the final modifications, in order to use open data as effectively as possible in their particular situation.

Summarizing what we’ve learned

Though the public’s trust in science is currently growing, there is much room for improvement. One powerful tool that can aid the academic community is open data—the practice of making data from research studies openly available. However, to benefit as much as possible from the presence of open data, it’s not sufficient for a study to merely make its data open. Rather, the accessibility of the data needs to be promoted and explained in scientific communication, and the dataset needs to be cited appropriately (see the Joint Declaration of Data Citation Principles for guidelines regarding this latter point).

What is currently being done

It is important to note that much work is already being done to promote the concept of open data. For example, organizations such as the Research Data Alliance promote discussion of the topic and publish relevant material, as in the case of their recent guidelines and recommendations regarding COVID-19 data.

In addition, at the University of Cambridge, in particular, we can already see a substantial push for open data practices, where appropriate, and from many angles as outlined in the University’s Open Research position statement. Many funding bodies mandate that data be made available, and the University facilitates the process of sharing the data via Apollo, the institutional repository. Furthermore, there are the various training courses and publications—including this very blog—led by bodies such as the Office of Scholarly Communication (OSC), which help to promote Open Research practices at the University. Most notably, there is the OSC’s Data Champion programme, which deals, among other things, with supporting researchers with open data practices.

Moving forward

Promoting the use of open data in scientific communication is something that different stakeholders can do in different ways.

For example, those engaging in science communication—such as journalists and universities’ communication offices—can mention and explain open data when covering studies. Similarly, scientists can ask relevant communicators to cite their open data, and can also mention this information themselves when they engage in science communication directly. In addition, consumers of scientific communication and other relevant stakeholders—such as the general public, politicians, regulators, and funding bodies—can ask, whenever they hear about new research findings, whether the data was made openly available, and if not, then why.

Overall, such actions will lead to increased and more effective use of open data over time, which will help increase the trust people have in scientists. Furthermore, this will help promote the adoption of open data practices in the scientific community, by making more scientists aware of the concept, and by increasing their incentives for engaging in it.

Published 19 June 2020

Written by Itamar Shatz

CCBY icon

Clearing the final hurdle – automating embargo setting

One of the biggest issues facing the Open Access Team has been keeping up with the constant stream of accepted manuscripts that need to be processed. In many cases we receive notification of an accepted manuscript well before formal publication. This has presented a significant challenge over the last five years because although we know there is a publication forthcoming (or at least we trust that there is), we have no idea as to when an article may actually be published.

This means that we have many thousands of publication records in Apollo which have ‘placeholder’ embargoes because we simply did not know the publication date at the point of archiving and therefore could not set an accurate embargo. After archiving, many of the records in Apollo may have been supplemented with a publication date thanks to metadata supplied via Symplectic Elements, but we still need to set an accurate embargo.

In other cases we might be waiting for an article to be published gold open access so that we can update Apollo with the published version of record.

While we are now very adept at archiving manuscripts in Apollo (thanks in large part to Fast Track and Orpheus) it remains a challenge to properly and accurately update Apollo records with either correct embargoes for accepted manuscripts, or the open access version of record. It is a futile task to be constantly checking whether a manuscript has been published. While the Open Access Team keeps a list of every publication that requires updating, this is a thankless job that should be highly automatable.

To that end, we have recently leveraged Orpheus to do a lot of the heavy lifting for us. By interrogating every journal article in Apollo and comparing its metadata against Orpheus we can now quickly determine which items can be updated and take the necessary next steps, changing embargoes where appropriate or identifying opportunities to archive the published version of record.

To do this we created a DSpace curation task to check every “Article” type in Apollo that had at least one file that was currently under embargo. We then compared the publication metadata against the information held in Orpheus to determine what steps needed to be taken. In total we found 9,164 items in need of some attention. The results are displayed below in a Tableau Public visual and summarised in Table 1.

Of these items, 3,864 had a published open access version archived alongside the embargoed manuscript, so we skipped any further updating of these records. This is actually a very good sign, and indicates that the Open Access Team has been going back to records and supplementing them with the open access version of record.

Amongst the remaining items, 2,794 were successfully matched against Orpheus and had their embargoes verified: 1,862 records were updated with shorter embargoes and 412 had longer embargoes applied, leaving 520 items which were unchanged because they already had the correct embargo period.

The final 2,506 items were primarily composed of records with no publication date (1,132 items), publications that could potentially be supplemented by the open access version of record (537 items) or had no embargo information in Orpheus (434 items).

Table 1. Summary of outcomes after comparing Apollo records against Orpheus.

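Date archived in Apollo | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Total
The item has an open VoR version | 7 | 105 | 1200 | 1019 | 1300 | 226 | 7 | 3864
Accepted version – embargo updated | 0 | 2 | 145 | 76 | 132 | 2305 | 134 | 2794
No publication date available | 0 | 0 | 10 | 159 | 32 | 714 | 217 | 1132
Orpheus VoR embargo: 0 | 1 | 4 | 51 | 18 | 5 | 451 | 7 | 537
No AAM embargo information available | 3 | 6 | 64 | 39 | 33 | 264 | 25 | 434
Other outcome | 8 | 37 | 114 | 47 | 23 | 162 | 12 | 403
Total | 19 | 154 | 1584 | 1358 | 1525 | 4122 | 402 | 9164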
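As a rough illustration of the kind of checks described above, the following Python sketch sorts a single repository record into one of the outcomes listed in Table 1. It is not the DSpace curation task itself, which is not reproduced in this post. The Record and JournalPolicy classes, their field names, the order in which the checks are applied and the assumption that an embargo runs for a whole number of months from the publication date are all hypothetical choices made for the example.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`,
    clamping to the last day of the month where necessary."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


@dataclass
class Record:
    """Hypothetical view of an Apollo article record."""
    has_open_vor: bool                 # open version of record already archived?
    publication_date: Optional[date]   # None if not yet known
    current_embargo_end: date          # embargo currently set on the manuscript


@dataclass
class JournalPolicy:
    """Hypothetical view of the policy information held for a journal."""
    vor_embargo_months: Optional[int]  # 0 means the VoR can be open immediately
    aam_embargo_months: Optional[int]  # None means no information available


def classify(record: Record, policy: JournalPolicy) -> str:
    """Assumed decision order; the real task may apply its checks differently."""
    if record.has_open_vor:
        return "open VoR already archived"
    if record.publication_date is None:
        return "no publication date"
    if policy.vor_embargo_months == 0:
        return "VoR could be archived"
    if policy.aam_embargo_months is None:
        return "no AAM embargo information"
    correct_end = add_months(record.publication_date, policy.aam_embargo_months)
    if correct_end < record.current_embargo_end:
        return "embargo shortened"
    if correct_end > record.current_embargo_end:
        return "embargo lengthened"
    return "embargo unchanged"


if __name__ == "__main__":
    # Invented example: a placeholder embargo replaced by a six-month one.
    record = Record(False, date(2019, 8, 31), date(2021, 1, 1))
    policy = JournalPolicy(vor_embargo_months=None, aam_embargo_months=6)
    print(classify(record, policy))
```

In this sketch the three "embargo" outcomes together correspond to the "Accepted version – embargo updated" row, which counts verified records whether or not the embargo date actually changed.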

We plan to run this curation task on a regular basis and periodically check the outcomes. Any items that continually fail to update will be processed manually by the Open Access Team, but our intention and desire is to move away from manual processing wherever possible.

Published 3 April 2020

Written by Dr Arthur Smith

Image showing that this blog post is under CC-BY licence.

2019 That Was The Year That Was 

This is our traditional yearly blog about what we have been doing at the OSC in Cambridge. We are publishing it a little later than intended, but this is an indication of how busy the beginning of 2020 has been here in the Office of Scholarly Communication.

2019 saw us more in a ‘business as usual’ phase as we knuckled down and got on with supporting researchers in Cambridge. That aside, we still had some major developments in Open Research and this work will continue into 2020 and beyond.  

Policy changes 

2019 saw a number of happenings in the policy space at Cambridge. Most excitingly, the University’s Position Statement on Open Research was announced in February, making it one of the first UK universities to have such a statement. This demonstrates the University’s commitment to making open research a reality at Cambridge. 

Following on from this, in July 2019, the University together with Cambridge University Press announced that they had signed up to the San Francisco Declaration on Research Assessment (DORA). The newly created Open Research Steering Committee, headed by the University’s Pro-Vice-Chancellor for Research, will have oversight over the open research direction and the implementation of DORA. The Steering Committee and their working groups are currently looking into open research training, open research infrastructure (such as electronic research notebooks), Plan S and DORA.

In December, an updated version of the Research Data Management Policy Framework was released. This update brings the policy framework in alignment with funder requirements and acknowledges the important roles that Principal Investigators, research staff and students, and University support staff play in good data management practices. It sits beneath the Position Statement on Open Research, with the documents being closely aligned. 

Open access news 

The Open Access Service made great strides towards automating many of its processes this year, headlined by the introduction of Orpheus and Fast Track. Orpheus is a custom database of publisher open access policies, and when combined with Fast Track for manuscript processing, it allows the Open Access Service to reduce the number of steps required to archive a manuscript in Apollo. In 2019, 8,325 manuscript submissions were processed through Fast Track. In total, the Open Access Service responded to 13,609 submissions or enquiries in 2019, equal to 37 requests per day.

Our Request a Copy service received 7,626 requests in 2019. One of the most requested items was “HIV-1 remission following CCR5Δ32/Δ32 haematopoietic stem-cell transplantation” (DOI: 10.1038/s41586-019-1027-4), which received 77 requests. The authors of the paper responded to and fulfilled each request, enabling readers to obtain free access to the publication well ahead of the end of Nature’s six-month embargo. Since the accepted manuscript came out of embargo, it has received a further 326 downloads to date in Apollo. The success of the Request a Copy service once again demonstrates the need for access to scholarly research at the earliest opportunity. Embargoes, even ‘short’ six-month embargoes, are a needless barrier to the University’s research outputs.

Data news 

Aside from the update to the Research Data Management Policy Framework (see above), the most significant development from 2019 has been the continued evolution of the Data Champion Programme.

We welcomed 40 new Data Champions (DCs) from across several Schools increasing the size of our network to 86. With such a large cohort of Champions a new idea of creating departmental hubs was initiated to increase collaboration and the sharing of practices by Data Champions from the same areas. This has proved really successful in both Chemistry and Engineering, with a more coordinated approach having the effect of greater productivity from the Champions in those areas in engaging others with data management. 

In 2019, the Data Champions also tried out a mentoring scheme for the first time whereby established Champions support new Champions in finding their feet and give them ideas about how to provide support to their own community. This has proved to be a great success and the scheme is being run for a second year for the new cohort of Champions joining in early 2020. 

Finally, a new paper on the Data Champion community was published, Establishing, Developing and Sustaining a Community of Data Champions, by DC alumnus James Savage and our colleague Lauren Cadwallader in Data Science Journal. 

Thesis news 

The requirement to deposit an electronic copy of a PhD thesis in order to graduate has become normal business now. In 2019, 1,197 theses were deposited, with 47% being made fully open access. In addition, around 100 requests to digitise historical theses were received from their authors and 1,015 requests for scans of historical theses were received from requesters.

Training 

In 2019 we took a broad perspective and examined how training was contributing to promoting and supporting Open Research at Cambridge. The Task Group on Open Research Training, comprising representatives of several libraries and colleagues from other areas of the University, conducted a number of projects to understand where we are at the moment and to plan a strategy for the future. The details of that work will be presented at the RLUK 2020 conference in March but, as a ‘sneak peek’, here are some of the conclusions we drew:

  • We’re stronger together: researchers will benefit if we build stronger communication between training providers. 
  • Open Research training should not be seen in isolation to the rest of research, rather it should be a key component of the way students learn to do research. 
  • Postdocs and senior researchers want to learn independently; we can support them with better-presented information online and by facilitating events and dialogue.
  • We want to be able to constantly improve our training and demonstrate impact by exploring ways to evaluate ourselves, while also being aware of the lurking danger of irresponsible metrics in our own evaluation.  

Alongside the strategy work, we continued to expand the training we offer on Open Access, Research Data Management, publishing, copyright and more. A growing number of departments have requested sessions and we have partnered with PLOS and the Office for Postdoctoral Affairs to deliver a regular session on peer review. We delivered 56 sessions, reaching over 800 researchers and librarians. In addition, we have offered a session about complying with the REF Open Access requirements to departments; the Open Access team outdid themselves by delivering 20 sessions to individual departments in just over three months. 

Outreach activities 

In 2019 we hosted several events, from workshops to a one-day symposium dealing with open access monographs, FAIR data, preprints, reproducibility in social sciences, Plan S developments in the USA and open research in STEMM.  

Of notable interest is the Symposium on Open Monographs held in October at St Catharine’s College. This one-day event brought together researchers, funders, publishers and learned societies to discuss the benefits and challenges of an open landscape for academic books. The recordings are featured on the OSC YouTube channel and most of the presentations are available in our institutional repository, Apollo. A summary of the key themes that emerged from this symposium was later presented in Unlocking Research.

October would not have been complete without celebrating Open Access Week. During the week we shared various blogs and online resources and we were delighted to announce the launch of our popular Research Support Ambassador Programme as an open educational resource designed to give learners either an introduction or refresher on key elements of research support. 

Systems 

Apollo has participated in a joint pilot study with Jisc, Symplectic and Sheffield Hallam University to look at the best approaches to integrating the Jisc Publications Router and the research information system Symplectic Elements, via institutional repositories. This pilot has involved working together to look at how well Elements could capture details of articles that Router had sent to our repositories. Router currently works with EPrints and DSpace repositories, the platforms used by Sheffield Hallam and Cambridge respectively.

Symplectic’s Repository Tools 2 (RT2) integration module was used to harvest records from Apollo and de-duplicate them against any existing Elements records. We tested how well this worked for repository records deposited automatically by Router, looking in particular at the volume of duplicate publications and how early after acceptance notifications were received from Router. The study demonstrated that Router and Elements are technically compatible when used in this way. As a result of this pilot, Jisc and Symplectic are now happy to offer this solution to institutions more widely.

Some excellent work behind the scenes has resulted in Jisc publishing a series of blogs last November. Their third blog showcases the ORCID IDs in Research Data Management workflows at the University of Cambridge and how a workflow has been implemented in order to create seamless links between researchers and their works using identifiers and different services. Such solutions improve visibility and discoverability across systems, reduce duplication of effort in entering information and avoid identification errors.

This work was made possible by Agustina Martínez García of the Office of Scholarly Communication, Owen Roberson of the Research Office, and Dean Johnson of University Information Services (UIS) who were amongst the winners of the professional services recognition scheme two years ago for their effective collaborative work on the integration of Symplectic Elements and Apollo. 

According to the blog, as of September 2019, 25,550 articles, 1,329 conference proceedings and 1,100 datasets in Apollo have ORCID IDs. 

Saying a big thank you 

2019 saw the departure of the University’s first Head of Scholarly Communication, Dr Danny Kingsley. Many of the achievements of 2019 were due to the hard work Danny put in before her departure, and we’d like to thank her for all she contributed.

Published 26 February 

Compiled by: Maria Angelaki 

Image showing that this blog post is under CC-BY licence.

Contributions from Agustina Martínez-García, Bea Gini, Maria Angelaki, Lauren Cadwallader, Sacha Jones and Arthur Smith.