
Formatting the Future: Why Researchers Should Consider File Formats

Dr Kim Clugston, Research Data Coordinator, OSC
Dr Leontien Talboom, Technical Analyst, Digital Initiatives

Many funders and publishers now require data to be made openly available for reuse, supporting the open data movement and value for publicly funded research. But are all researchers aware of why they are being asked to share their data and how to do this appropriately? When researchers deposit their research data into Apollo (the University of Cambridge open access repository) they generally understand the benefits of sharing data and want to be a part of this. These researchers provide their data in open file formats accompanied by rich metadata so the data has the best chance of being discovered and reused most effectively. 

There are other researchers who deposit their data in a repository during the publication process; this often takes place within tight deadlines set by the publisher. For this reason, researchers often rush to upload their data and give little thought to how it will remain preserved and accessible for long-term use. The challenges around preserving open research data were highlighted in this article. The authors raised the concern that open research data can include a wide variety of file types, some of which may only be accessible with proprietary software or with software that is outdated or at risk of becoming outdated soon. How can we ensure that research data that is open now stays accessible and open for use for many years to come?

In this blog, we will discuss the importance of making data open and of ensuring that it remains accessible for future use (digital preservation). We will use some examples from datasets in Apollo and suggest recommendations for researchers that go beyond the normal FAIR principles to include considerations for the long term.

Why is it important for the future?

The move to open data, following the FAIR principles, has the potential to boost knowledge, research, collaboration, transparency and decision making. In Apollo alone, there are now thousands of datasets which are available openly worldwide to be used for reference or reused as secondary data. Apollo, however, is just one of thousands of data repositories. It is easy to see how this vast amount of archived data comes with great responsibility for long-term maintenance. A report outlined the pressing matter that FAIR data, whilst addressing metadata aspects well, doesn’t really address data preservation and the challenges that this brings, such as the risk of software and/or hardware becoming obsolete and the data reliant on them therefore becoming inaccessible.

Tracking the reuse of datasets could provide essential information on how different file formats are holding up, but there is an ongoing challenge to track dataset reuse. Datasets are not yet routinely cited in the established way that is seen for journal articles or other publication types. This is an area that is actively being developed through initiatives such as Make Data Count and it is hoped that at some point soon, data citation will become part of the routine practice of research to further enhance visibility on how data is being credited and reused. 

In Apollo, we see great interest in the available datasets as they are viewed and downloaded frequently. The most downloaded dataset in Apollo has been downloaded over 300,000 times since it was first deposited in 2015 and, interestingly, consists of open file formats. Other highly downloaded datasets in Apollo, such as the CBR Leximetric dataset, have been used by lawyers and social scientists and successfully cited as a data source to answer new research questions. The Mammographic Image Analysis Society database was deposited in Apollo in 2015 and has been frequently downloaded and reused by researchers working in the field of medical image analysis as discussed in a previous blog. To date, Google Scholar reports it has been cited 78 times. These datasets show the value of sharing and reusing data and all are in file formats that are accessible to everyone which will help to preserve them for as long as possible. 

Digital preservation is a discipline focused on providing and maintaining long-term access to digital materials. Obsolete software is a big problem in maintaining access to files in the future. PRONOM, a file format registry, keeps track of a large number of known file formats and provides additional information on these formats. Last year, a file format analysis of datasets in Apollo was conducted to highlight what file formats are represented in the repository. The results revealed a diverse array of file formats, which is a testament to the breadth of research conducted and the adoption of open data across many disciplines. Most of the file formats are common and can still be opened, but a large percentage of the material has not been identified or is in formats that are not immediately accessible without migrating to a different format or emulating the current file formats. Table 1 shows a few complex examples of file formats held in Apollo.

| File Format | Example in Apollo | Future Use |
| --- | --- | --- |
| .dx (Spectroscopic Data Exchange Format) | Link | This is not an open-source format, meaning that opening the file is dependent on the software being available |
| .mnova (Mestrelab file format) | Link | Proprietary file format, licence for the programme is expensive |
| .pzfx (Prism file format) | Link | Older format for a file software program called Prism. This is now considered legacy software. |

The Bit List, a list maintained by the Digital Preservation Coalition that includes contributions from members of the digital preservation community, outlines the “health” of different file formats and content types, including research data. In fact, unpublished research data (which is another issue outside the scope of this blog!) is classified as critically endangered, which highlights the problem that most researchers only make data open at the point of publication. But even research data published in repositories has its difficulties and is classified as vulnerable, mainly because many file formats depend on the appropriate software being available to open and use them. There are potential solutions on the horizon to address this problem, such as the open-source ReproZip, which packages research data with the necessary files, libraries and environments so they can be run by anybody. However, this still doesn’t address the issue of obsolete software. The gold standard would be to deposit research data in open formats, so viewing and using the files is not dependent on a particular piece of software; the files will be open and accessible as long as they are held available within a repository.

What researchers can do

What can researchers do to make sure that when they deposit data into a repository, it will be available for them and others in 10 or even 20 years’ time? Awareness is the first step. Researchers should consider submitting their data to a repository, one that is suitable for their files. Choose a trusted data repository. A recent blog highlighted the potential problem of disappearing data repositories, with approximately 6% of the repositories listed on re3data, the repository search registry, having shut down (most reasons are unknown, but some were listed as organisational or economic failure, obsolete software/hardware or external attacks). Approximately 47% of the repositories that had shut down did not provide an alternative solution to rescue the data and it is assumed that this data is lost. It may be that your funder or publisher decides the repository for you, but we have some guidance on what to look for in a trusted repository. If you are at Cambridge, you can deposit your data in Apollo, which has CoreTrustSeal certification.

The data itself is arguably the most important factor: we need to make sure the data files can be found and used by anyone at any time, forever. Ideally, this means using open file formats where possible as these don’t have any restrictions. The Library of Congress and the UK National Archives both maintain registries of file formats. There is some Cambridge University guidance on choosing file formats as well as some by the UKDS. Look up the file formats you use in the PRONOM database: are they seen as sustainable formats? If the data you are generating is from proprietary software, it is good practice to deposit this version as well as a version in an open format that does not require any specialist software to open it. This ensures that both options are available in case of any loss of formatting from converting to open formats. Examples are the software packages SPSS and NVivo, which are proprietary but have the option to export to open formats such as CSV.
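As a rough illustration of checking your own files before deposit, the Python sketch below counts the files in a list or folder by whether their extension is open, proprietary or unknown. The lookup table is our own small, hypothetical example built from formats mentioned in this blog; it is not taken from PRONOM or any other registry, and a real check should consult those registries directly.

```python
from collections import Counter
from pathlib import Path

# Illustrative lookup only (an assumption for this sketch, not registry data).
# A real audit should check each format against PRONOM or similar guidance.
FORMAT_STATUS = {
    ".csv": "open",
    ".txt": "open",
    ".sav": "proprietary",    # SPSS data file
    ".mnova": "proprietary",  # Mestrelab file format
    ".pzfx": "proprietary",   # Prism file format
}


def classify(filename):
    """Return 'open', 'proprietary' or 'unknown' based on the file extension."""
    suffix = Path(filename).suffix.lower()
    return FORMAT_STATUS.get(suffix, "unknown")


def audit(filenames):
    """Count how many of the given file names fall into each category."""
    return Counter(classify(name) for name in filenames)


def audit_folder(folder):
    """Run the same count over every file under a folder (searched recursively)."""
    return audit(p.name for p in Path(folder).rglob("*") if p.is_file())
```

Calling `audit_folder("my_dataset")` on a hypothetical folder gives a quick tally of how many files might need an open-format copy, or at least a note in the documentation, before you deposit.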

There may be information on how to convert your file types to open formats within your discipline. In the Chemistry department here at Cambridge, an initiative was started together with the Data Champion programme to provide a platform to allow researchers to add instructions for converting experimentally derived files into open formats. Open Babel is an open-source, collaborative project aimed at providing a “chemistry toolbox” with information on how to convert chemical file formats into other formats where needed. There is also some guidance on how to export from R to open formats such as txt and csv.

In some cases, it might not be possible to provide an open file format alternative. The files you use may be subject to discipline-specific standards or you may be restricted by the hardware and software you use in your research. For these, it is important to provide good documentation or a detailed README file alongside the files so researchers know how to access and use them. In fact, good file organisation, documentation and metadata are just as important as the files themselves, as data without any documentation is considered virtually meaningless. The more information you can provide the better, and it may save you time in the long run by heading off questions from other researchers in the future.
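To make the README suggestion concrete, here is a minimal Python sketch that assembles a plain-text README from a few details about a dataset. The function name and the fields it includes are our own assumptions for illustration; they are not an Apollo requirement or a formal README standard, so adapt them to your discipline.

```python
from datetime import date


def build_readme(title, creator, files, software_notes=""):
    """Assemble a plain-text README describing a dataset.

    `files` maps each file name to a one-line description. All field
    names here are hypothetical choices made for this sketch.
    """
    lines = [
        f"Dataset: {title}",
        f"Creator: {creator}",
        f"README written: {date.today().isoformat()}",
        "",
        "Files:",
    ]
    for name, description in files.items():
        lines.append(f"  {name}: {description}")
    if software_notes:
        # Record what is needed to open any non-open formats.
        lines += ["", "Software needed to open the files:", f"  {software_notes}"]
    return "\n".join(lines) + "\n"
```

The returned text can be saved as README.txt, itself an open format, and deposited alongside the data files.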

The future use of past research hinges on the thoughtful selection of file formats. By prioritising openness and longevity, we lay the foundation for collaboration and innovation. Choices that researchers make today shape the accessibility and integrity of data for generations to come.

Methods getting their chance to shine – Apollo wants your methods!

By Dr. Kim Clugston, Research Data Co-ordinator, Office of Scholarly Communication

Underlying all research data is always an effective and working method and this applies across all disciplines from STEMM to the Arts, Humanities and Social Sciences. Methods are a detailed description of the tools that are used in research and can come in many forms depending on the type of research. Methods are often overlooked rather than being seen as an integral research output in their own right. Traditionally, published journals include a materials and methods section, which is often a summary due to restrictions on word limits making it difficult for other researchers to reproduce the results or replicate the study. There can sometimes be an option to submit the method as “supplementary material”, but this is not always the case. There are specific journals that publish methods and may be peer-reviewed but not all are open access, rendering them hidden behind a paywall. The last decade has seen the creation of “protocol” repositories, some with the ability to comment, adapt and even insert videos. Researchers at the University of Cambridge, from all disciplines – arts, humanities, social sciences and STEMM fields – can now publish their method openly in Apollo, our institutional repository. In this blog, we discuss why it is important to publish methods openly and how the University’s researchers and students can do this in Apollo.

The protocol sharing repository, Protocols.io, was founded in 2012. Protocols can be uploaded to the platform or created within it; they can be shared privately with others or made public. The protocols can be dynamic and interactive (rather than a static document) and can be annotated, which is ideal for highlighting information that could be key to an experiment’s success. Collaboration, adaptation and reuse are possible by creating a fork (an editable clone of a version) that can be compared with any existing versions of the same protocol. Protocols.io currently hosts nearly 16,000 public protocols, showing that there is support for this type of platform. In July this year it was announced that Protocols.io was acquired by Springer Nature. Their press statement aims to reassure users that Protocols.io’s mission and vision will not change with the acquisition, despite Springer Nature already hosting the world’s largest collection of published protocols in the form of SpringerProtocols along with their own version of a free and open repository, Protocol Exchange. This raises the question of whether a major commercial publisher is monopolising the protocol space and, if so, whether this is or will become a problem. At the moment there do not appear to be any restrictions on exporting/transferring protocols from Protocols.io and hopefully this will continue. Such restrictions are a problem often faced by researchers using proprietary Electronic Research Notebooks (ERNs), where it can be difficult to disengage from one platform and laborious to transfer notebooks to another, all while ensuring that data integrity is maintained. Because of this, researchers may feel locked into using a particular product. Time will tell how the partnership between Protocols.io and Springer Nature develops and whether the original mission and vision of Protocols.io will remain.
Currently, their Open Research plan enables researchers to make an unlimited number of protocols public, with the number of private protocols limited to two (paid plans offer more options and features).

Bio-protocol exchange (under the umbrella of Bio-protocol Journal) is a platform for researchers to find, share and discuss life science protocols with protocol search and webinars. Protocols can be submitted either to Bio-protocol or as a preprint; researchers can ask authors questions, and can fork a protocol to modify and share it while crediting the original author. They also have an interesting ‘Request a Protocol’ (RaP) service that searches more than 6 million published research papers for protocols or allows you to request one if you are unable to find what you are looking for. A useful feature is that you can ask the community or the original authors of the protocol any question you may have about the protocol. Bio-protocol exchange had published all protocols free of charge to their authors since its launch in 2011, with substantial financial backing from its founders. Unfortunately, it was announced that protocol articles submitted to Bio-protocol after March 1 2023 will be charged an Article Processing Charge (APC) of $1200. Researchers who do not want to pay the APC can still post a protocol for free in the Bio-protocol Preprint Repository, where it will receive a DOI but will not have gone through the journal’s peer review process.

As methods are integral to successful research, it is a positive move to see the creation and growth of platforms supporting protocol development and sharing. Currently, these tend to cater for research in the sciences, and serve the important role of supporting research reproducibility. Yet, methods exist across all disciplines – arts, humanities, social sciences as well as STEMM – and we see the term ‘method’ rather than ‘protocol’ as more inclusive of all areas of research.

Apollo (Cambridge University’s repository) has now joined the growing movement within the research community to recognise the importance of detailing and sharing methodologies. Researchers at the University can now use their Symplectic Elements account to deposit a method into Apollo. Not only does this value the method as an output in its own right, it provides the researcher with a DOI and a publication that can be automatically added to their ORCID profile (if ORCID is linked to their Elements account). In May this year, Apollo was awarded CoreTrustSeal certification, reinforcing the University’s commitment to preserving research outputs in the long term, which should give researchers confidence that they are depositing their work in a trustworthy digital repository.

The first method to be deposited into Apollo in this way was authored by Professor John Suckling and colleagues. Professor Suckling is Director of Research in Psychiatric Neuroimaging in the Department of Psychiatry. His published method relates to an interesting project combining art and science to create artwork that aims to represent hallucinatory experiences in individuals with diagnosed psychotic or neurodegenerative disorders. He is no stranger to depositing in Apollo; in fact, he has one of the most downloaded datasets in Apollo, having deposited the Mammographic Image Analysis Society database there in 2015. This record contains the images of 322 digital mammograms from a database compiled in 1992. Professor Suckling is an advocate of open research and was a speaker at the Open Research at Cambridge conference in 2021.

An interesting and exciting new platform which aims to change research culture and the way researchers are recognised is Octopus. Founded by University of Cambridge researcher Dr Alexandra Freeman, Octopus is free to use for all and is funded by UKRI and developed by Jisc. Researchers can instantly publish all research outputs without the word limit constraints that can often stifle the details. Research outputs are not restricted to articles but also include, for example, code, methods, data, videos and even ideas or short pieces of work. This serves to recognise the importance of all research outputs. Octopus aims to correct the current skew toward publishing more sensationalist work and encourages publishing all work, such as negative findings, which are often of equal value to science but often get shelved in what is termed the ‘file drawer’ problem. A collaborative research community is encouraged to work together on pieces of a puzzle, with credit given to individual researchers rather than a long list of authors. The platform supports reproducibility, transparency and accountability, and aims to give research the best chance to advance more quickly. Through Octopus, authors retain copyright and apply a Creative Commons licence to their work; the only requirement is that published work is open access and allows derivatives. It is a breath of fresh air in the current rigid publishing structure.

Clear and transparent methods underpin research and are fundamental to the reliability, integrity and advancement of research. Is the research landscape beginning to change to allow open methods, freely published, to take centre stage and for methods to be duly recognised and rewarded as a standalone research output? We certainly hope so. The University of Cambridge is committed to supporting open research, and past and present members who have conducted research at the University can share these outputs openly in Apollo. If you would like to publish a method in Apollo, please submit it here or if you have any queries email us at info@data.cam.ac.uk.

There will be an Octopus workshop at the Open Research for Inclusion: Spotlighting Different Voices in Open Research at Cambridge on Friday 17th November 2023 at Downing College.

Should the UK make a deal with Springer Nature?

This is a guest post by Prof. Stephen J. Eglen on the concurrent negotiations between the UK academic sector and the publisher Springer Nature. Prof. Eglen is a Fellow of Magdalene College and Professor of Computational Neuroscience in the Department of Applied Mathematics and Theoretical Physics at the University of Cambridge. This post does not necessarily reflect the view of Cambridge University Libraries.

The UK academic sector is currently in discussion with Springer Nature around a renewed ‘read and publish’ deal for journal content. I understand that most institutions are likely to reject the current deal, but wish to continue negotiations. My position is that further discussions with Springer Nature are futile; we should stop accepting ‘transformative deals’. The likely effect of this deal would be that more of Springer Nature’s content may be openly available to read, but with the ‘paywall’ shifted to the publish side. Here I list my key objections:

  1. There is still no justification for the high APCs (9500 EUR + taxes) for Nature tier journals. Accepting a deal, regardless of the level of discounts that could be achieved, is implicitly accepting their business model. Springer Nature declined to engage with the Journal Comparison Service run by cOAlition S that aims to help understand how costs are determined.
  2. Springer Nature’s view is that ‘gold OA’ is the only viable way to open access. Other models for open access are available, and show promise, including diamond OA journals and Subscribe to Open. However, Springer Nature assert that “they haven’t found a way of making them financially sustainable”.  If we accept a gold-only view of open access,  how can we objectively assess the sustainability of alternative models?
  3. A move to a ‘gold only’ OA world would shift the barrier from reading to publishing content. Springer Nature recently announced a waiver policy for researchers from about 70 lower income countries. This still excludes many researchers worldwide e.g. from Brazil and South Africa, perpetuating neo-colonial attitudes towards the creation of scholarly content and reinforcing existing institutional inequalities within countries. Any waiver programme for APCs should be “no-questions-asked” regardless of where researchers are based. This would need to be properly costed and part of the justification of the APC (point 1).
  4. As of January 2023, several UK institutions have rights retention policies in place, with more expected to follow in the coming months. Individual researchers can also use a rights retention strategy by themselves. Rights retention statements allow researchers to meet UK funders’ requirements by depositing their author-accepted manuscript without embargo. I believe Springer Nature should publicly state that they will allow any author worldwide to maintain their rights on their own author-accepted manuscripts.
  5. Over half of Springer Nature’s hybrid journals failed to meet their 2021 targets for open access articles. Those hybrid journals that fail again this year to meet their targets will be removed from cOAlition S’s transformative journal program. Having some journals ineligible for cOAlition S funding but part of a UK read-and-publish deal would further complicate an already confusing system. It would also call into question Springer Nature’s commitment to open access.

A detailed public critique of the deal is not possible because of the confidential nature of the negotiations.  Finances aside, I feel there was one element that was simply unworkable and unethical due to it requiring scholars to keep one aspect confidential if the deal were accepted.

The UK is one of only a few countries with a heavy reliance on transformative agreements. Sweden has already decided that transformative agreements are not sustainable and the transition period should finish at the end of 2024. cOAlition S has also confirmed it will end its support of hybrid journals by the end of 2024. I would like to see the UK move away from transformative agreements. We could instead work internationally to promote more ethical and sustainable alternatives that put scholars at the heart of scholarly communication. In particular, the APC model has been tried, and has introduced as many headaches as it has tried to solve.

It is time instead to try new approaches. There are several interesting models being developed by forward-looking organizations that the UK could endorse. For example, MIT Press recently launched shift+OPEN as a way to flip subscription-based journals to a diamond open access model. Another interesting approach is Subscribe to Open, where journals drop their paywall if a threshold number of subscriptions is received. Money saved on dealing with legacy publishers like Springer Nature is better spent investing in our own infrastructure and new approaches.