Formatting the Future: Why Researchers Should Consider File Formats

Dr Kim Clugston, Research Data Coordinator, OSC
Dr Leontien Talboom, Technical Analyst, Digital Initiatives

Many funders and publishers now require data to be made openly available for reuse, supporting the open data movement and ensuring value from publicly funded research. But are all researchers aware of why they are being asked to share their data and how to do this appropriately? When researchers deposit their research data into Apollo (the University of Cambridge open access repository) they generally understand the benefits of sharing data and want to be a part of this. These researchers provide their data in open file formats accompanied by rich metadata, so the data has the best chance of being discovered and reused effectively.

There are other researchers who deposit their data in a repository during the publication process; this often takes place within tight deadlines set by the publisher. For this reason, researchers often rush to upload their data, with little thought for how that data will remain preserved and accessible for long-term use. The challenges around preserving open research data were highlighted in this article. The authors addressed the concern that open research data can include a wide variety of data files, some of which may only be accessible with proprietary software, or with software that is already outdated or at risk of becoming so. How can we ensure that research data that is open now stays accessible and open for use for many years to come?

In this blog, we will discuss the importance of making data open and of ensuring it remains usable in the future (digital preservation). We will use some examples from datasets in Apollo and suggest recommendations for researchers that go beyond the familiar FAIR principles to include considerations for the long term.

Why is it important for the future?

The move to open data, following the FAIR principles, has the potential to boost knowledge, research, collaboration, transparency and decision making. In Apollo alone, there are now thousands of datasets which are available openly worldwide to be used for reference or reused as secondary data. Apollo, however, is just one of thousands of data repositories. It is easy to see how this vast amount of archived data comes with great responsibility for long-term maintenance. A report outlined the pressing matter that the FAIR principles, whilst addressing metadata aspects well, do not really address data preservation and the challenges it brings, such as the risk of software and/or hardware becoming obsolete and of data reliant on them becoming inaccessible.

Tracking the reuse of datasets could provide essential information on how different file formats are holding up, but tracking reuse remains an ongoing challenge. Datasets are not yet routinely cited in the established way that is seen for journal articles or other publication types. This is an area that is actively being developed through initiatives such as Make Data Count, and it is hoped that data citation will soon become part of routine research practice, giving better visibility of how data is being credited and reused.

In Apollo, we see great interest in the available datasets, which are viewed and downloaded frequently. The most downloaded dataset in Apollo has been downloaded over 300,000 times since it was first deposited in 2015 and, interestingly, consists of open file formats. Other highly downloaded datasets in Apollo, such as the CBR Leximetric dataset, have been used by lawyers and social scientists and successfully cited as a data source to answer new research questions. The Mammographic Image Analysis Society database was deposited in Apollo in 2015 and has been frequently downloaded and reused by researchers working in the field of medical image analysis, as discussed in a previous blog. To date, Google Scholar reports it has been cited 78 times. These datasets show the value of sharing and reusing data, and all are in file formats that are accessible to everyone, which will help to preserve them for as long as possible.

Digital preservation is a discipline focused on providing and maintaining long-term access to digital materials. Obsolete software is a big problem in maintaining access to files in the future. PRONOM, a file format registry, keeps track of a large number of known file formats and provides additional information about them. Last year, a file format analysis of datasets in Apollo was conducted to highlight which file formats are represented in the repository. The results revealed a diverse array of file formats, a testament to the breadth of research conducted and the adoption of open data across many disciplines. Most of the file formats are common and can still be opened, but a large percentage of the material has not been identified or is in formats that are not immediately accessible without migrating to a different format or emulating the original software environment. Table 1 shows a few complex examples of file formats held in Apollo.

Table 1. Examples of complex file formats held in Apollo

File Format | Example in Apollo | Future Use
.dx (Spectroscopic Data Exchange Format) | Link | Not an open-source format, meaning that opening the file is dependent on the software being available
.mnova (Mestrelab file format) | Link | Proprietary file format; a licence for the program is expensive
.pzfx (Prism file format) | Link | Older format for the software program Prism; this is now considered legacy software
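
For readers curious about what a first-pass format survey involves, the minimal Python sketch below (an illustrative example only, not the script used for the Apollo analysis, and using a hypothetical folder name) simply tallies file extensions in a dataset folder. Signature-based tools such as DROID or fido, which identify files against PRONOM, give more reliable results than extensions alone.

# Minimal sketch of a first-pass file format survey by extension.
# Extensions are only a rough proxy; PRONOM-based tools such as DROID
# or fido identify formats by signature and are more reliable.
from collections import Counter
from pathlib import Path

def survey_formats(root):
    """Tally file extensions under a dataset folder."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts

# "my_dataset" is a hypothetical folder name used for illustration.
for ext, n in survey_formats("my_dataset").most_common():
    print(ext, n)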

The Bit List, a list maintained by the Digital Preservation Coalition that includes contributions from members of the digital preservation community, outlines the “health” of different file formats and content types, including research data. In fact, unpublished research data (which is another issue outside the scope of this blog!) is classified as critically endangered, reflecting the fact that most researchers only make data open at the point of publication. But even research data published in repositories has its difficulties and is classified as vulnerable, mainly because many file formats depend on the appropriate software being available to open and use them. There are potential solutions on the horizon to address this problem, such as the open-source ReproZip, which packages research data with the necessary files, libraries and environments so that it can be run by anybody. However, this still doesn’t address the issue of obsolete software. The gold standard would be to deposit research data in open formats, so that viewing and using the files does not depend on a particular piece of software; the files will then be open and accessible for as long as they are held in a repository.

What researchers can do

What can researchers do to make sure that when they deposit data into a repository, it will be available for them and others in 10 or even 20 years’ time? Awareness is the first step. Researchers should consider which repository is suitable for their files, and choose a trusted data repository. A recent blog highlighted the potential problem of disappearing data repositories: approximately 6% of repositories listed on the repository registry re3data have been shut down (most reasons are unknown, but some were listed as organisational or economic failure, obsolete software/hardware, or external attacks). Approximately 47% of the repositories that had shut down did not provide an alternative solution to rescue the data, and it is assumed that this data is lost. It may be that your funder or publisher decides the repository for you, but we have some guidance on what to look for in a trusted repository. If you are at Cambridge, you can deposit your data in Apollo, which has CoreTrustSeal certification.

The data itself is arguably the most important factor: we need to make sure the data files can be found and used by anyone at any time, forever. Ideally, this means using open file formats where possible, as these don’t have any restrictions. The Library of Congress and the UK National Archives both maintain registries of file formats. There is some Cambridge University guidance on choosing file formats, as well as some by the UKDS. Look up your file formats in the PRONOM database: are they considered sustainable formats? If the data you are generating comes from proprietary software, it is good practice to deposit this version as well as an open format that does not require any specialist software to open it. This ensures that both options are available in case any formatting is lost when converting to the open format. Examples are the software packages SPSS and NVivo, which are proprietary but can export to open formats such as CSV.
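
As a minimal illustration of depositing both versions, the Python sketch below (assuming the pandas package with its pyreadstat dependency, and a hypothetical file called survey.sav) exports an SPSS file to CSV so that an open copy can be deposited alongside the original:

# Minimal sketch: export an SPSS file to CSV so both versions can be deposited.
# Assumes pandas with the pyreadstat dependency; "survey.sav" is a hypothetical file name.
import pandas as pd

# Read the proprietary SPSS file.
df = pd.read_spss("survey.sav")

# Write an open CSV copy to deposit alongside the original .sav file.
df.to_csv("survey.csv", index=False)

Depositing both the .sav and the .csv files means nothing is lost if the conversion drops SPSS-specific details such as variable and value labels.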

There may be information on how to convert your file types to open formats within your discipline. In the Chemistry department here at Cambridge, an initiative was started together with the Data Champion programme to provide a platform that allows researchers to add instructions for converting experimentally derived files into open formats. Open Babel is an open-source, collaborative project that provides a “chemistry toolbox” for converting chemical file formats into other formats where needed. There is also some guidance on how to export from R to open formats such as .txt and .csv.
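
As a rough illustration (assuming the Open Babel Python bindings are installed, and using hypothetical file names), a conversion from a common chemical exchange format to an open, text-based one can be as short as:

# Minimal sketch using the Open Babel Python bindings (openbabel / pybel);
# "molecules.sdf" and the output names are hypothetical examples.
from openbabel import pybel

# Read structures from an SDF file and write each one out in an open,
# text-based format (here SMILES).
for i, mol in enumerate(pybel.readfile("sdf", "molecules.sdf")):
    mol.write("smi", f"molecule_{i}.smi", overwrite=True)

The same kind of conversion can also be done from the command line with Open Babel’s obabel tool, without writing any code.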

In some cases, it might not be possible to provide an open file format alternative. The files you use may be subject to discipline-specific standards, or you may be restricted by the hardware and software you use in your research. In these cases, it is important to provide good documentation or a detailed README file alongside the files so that researchers know how to access and use them. In fact, good file organisation, documentation and metadata are just as important as the files themselves, as data without any documentation is considered virtually meaningless. The more information you can provide the better; it may also save you time in the long run by pre-empting questions from other researchers in the future.
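
As a rough illustration, a README deposited alongside the data might cover points like the following (an illustrative outline only, not a prescribed template):

- Dataset title, authors, contact details and licence
- A short description of what the data are and how they were generated
- A list of the files included, with a note on what each contains
- The file formats used and the software (including version) needed to open them
- Any variable names, units, codes or abbreviations used in the files
- How the files relate to the associated publication, if there is one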

The future use of past research hinges on the thoughtful selection of file formats. By prioritising openness and longevity, we lay the foundation for collaboration and innovation. Choices that researchers make today shape the accessibility and integrity of data for generations to come.

The (exponential) thirst for data – The March 2024 Data Champions forum

The Data Champions were treated to a big data themed session for the March Data Champion forum, hosted at (and sponsored by) Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (AI) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing scientific research as a whole along with it.

Dr James Fergusson presenting to the Data Champions at the March forum

As he explained, over the last two decades cosmologists have seen a rapid increase in the data points on which to base their theorisation – from merely 14 data points in 2000 to 800 million data points in 2013! With the availability of these data points, the paradigm for research in cosmology started to shift completely – from being theory based to being data based. With several projects beginning soon that will see vast amounts of data generated daily for decades to come, this trend shows no signs of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with this growth has come the growth of machine learning (ML) capabilities, as the copious amounts of data demand not only immense computing power but also ML to process and analyse them. Together, these elements are fundamentally changing the story of scientific discovery. What was once a story of an individual researcher having an intellectual breakthrough is becoming the story of machine-led, automated discovery. While it used to be the case that an idea, put through the rigour of the scientific method, would lead to the generation of data, the reverse is now not only possible but increasingly likely. Data is now generated before a theory is discovered, and the discovery may come from AI and not a scientist. This, for James, can be considered the new scientific method.

Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH), where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects that she was involved with at CDH. One was called Ghost Fictions, a series of workshops with the aim of encouraging critical thinking about automated text generation using AI methods, both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, which was intended as a provocation about a future where books, study guides and examinations are created by Large Language Models (LLMs) (perhaps a not so distant future). Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet perhaps the most worrying aspect of the transformative forms of AI is their immediate and consequential impact on the environment. Quenching the thirst for the exponential amounts of data needed to train and run AI chatbots, LLMs and image generation systems requires vast computing power, which in turn generates a lot of heat and needs large amounts of water to keep it cool. As Anne demonstrated, this could be increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water stressed, but is also home to the University’s data centre and to the new DAWN AI supercomputer. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?

Dr James Fergusson and Dr Anne Alexander answering questions from the Data Champions at the March forum

The themes that Anne concluded her presentation with formed the basis of the Q&A between the Data Champions and the speakers. The topic of the potential biases of AI and ML was put to James, who agreed that his field of study could not escape them. That said, unlike in the humanities, bias in physics can potentially be helpful, as it may help make the scientific process as objective as possible. However, this could clearly be problematic for humanities research, which tends to deal with social systems, relations and views of the world. The topic of the environmental cost of AI was also touched on, to which James commented that the energy demand is a problem and getting harder to justify, and that solutions might only create new problems as the demand for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted on this: the environment is a social problem, so society should have a say in what risks it is willing to accept. The question of the automation of science was also put to James, who admitted that preparing early-career physicists for research now involves developing their software skills rather than subject expertise in physics or mathematics.

What we can learn from the ‘promise and pitfalls of preregistration’ meeting

Dr Mandy Wigdorowitz, Open Research Community Manager, Cambridge University Libraries

The ‘promise and pitfalls of preregistration’ meeting was held at the Royal Society in March 2024. It was organised to address the utility of preregistration and initiate an interdisciplinary dialogue about its epistemic and pragmatic aims. The goal of the meeting was to explore the limitations associated with preregistration, and to conceive of a practical way to guide future research that can make the most of its implementation.

Preregistration is the practice of publicly declaring a study’s hypotheses, methods, and analyses before the study is conducted. Researchers are encouraged to be as specific as possible when writing preregistration plans, detailing every aspect of the research methodology and analyses, including, for instance, the study design, sample size, procedure for dealing with outliers, blinding and manipulation of conditions, and how multiple analyses will be controlled for. By doing so, researchers commit to a time-stamped study plan, which reduces the potential for flexibility in analysis and interpretation that may lead to biased results. Preregistration is a community-led response to the replication crisis and aims to mitigate questionable research practices (QRPs) that have come to light in recent years, some of which include HARKing (Hypothesising After Results are Known), p-hacking (the inappropriate manipulation of data analysis to enable a favoured result to be presented as statistically significant), and publication bias (the unbalanced publication of statistically significant or positive results over null and/or unexpected findings) (Simmons et al., 2011; Stefan & Schönbrodt, 2023).

The meeting brought together scholars and publishers from a range of disciplines and institutions to discuss whether preregistration has indeed lived up to these aims and whether and to what extent it has solved the problems it was envisioned to address.

It became clear that the problems associated with QRPs have not simply disappeared with the uptake and implementation of preregistration. From the perspective of meta-research, the success of preregistration appears to depend largely on discipline and legal context, with some disciplines mandating and normalising it (e.g., clinical trial registration in biomedical research), others greatly encouraging and (sometimes) requiring it (e.g., psychological science), and others having no expectations about its use (e.g., economics). The effectiveness of preregistration was shown to be linked to these dependencies, but also to the quality and detail of the preregistration plan itself. Researchers are the arbiters of their research choices, and if they choose to write vague or ambiguous preregistration plans, the problems that preregistration is assumed to address will inevitably persist.

Various preregistration templates exist (such as on the Open Science Framework, OSF) and there are some recognised incentives for preregistration, such as the preregistration badges awarded by some journals, making it a systematic and straightforward exercise. In practice, however, sufficient information is not always provided, and even where preregistered plans are detailed, they are not always followed, for various pragmatic or other (not always nefarious) reasons. As such, the research community is cautioned not to assume that preregistration equates to better or more trustworthy research. Rather, the preregistration plan needs to be critically reviewed as a standalone document in conjunction with the published study. This is important because preregistration plans, which are usually deposited in repositories (e.g., OSF, the National Library of Medicine’s Clinical Trials Registry), are seldom evaluated as entities of their own or against their corresponding research articles. Note that this is unlike registered reports, a type of journal article in which a study’s protocol is peer reviewed before data are collected and, if reviewed favourably, given in-principle acceptance regardless of the study outcomes.

Other discussions centred around the utility of preregistration in exploratory versus confirmatory research, whether preregistration can improve our theories, and how the process of conducting multiple but slightly varied analyses and selecting the most desired outcome (also referred to as the ‘garden of forking paths’) affects the claims we make.

The overall sentiment from the meeting was that while preregistration does not solve all the issues that have arisen from QRPs, it ultimately leads to greater transparency of the research process and accountability on the part of the researchers, and it facilitates deeper engagement with one’s own research prior to any collection or analysis of data.

Since attending the meeting, I have taken away valuable insights that have made me critically reflect on my own research choices, and from a practice perspective, I have downloaded the OSF preregistration template and am documenting the plans for a research project.

Given the strides that have been taken toward improving the transparency, credibility and reproducibility of research, researchers at Cambridge need to consider whether preregistration plans should be included as another type of output that can be deposited in the institutional repository, Apollo. We have recently added Methods and preprints as output types, which has broadened the options for sharing and aligns with open research practices. Including preregistration could be a valuable and timely addition.

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: a compendium and simulation of p-hacking strategies. Royal Society Open Science, 10(2), 220346. https://doi.org/10.1098/rsos.220346