
Formatting the Future: Why Researchers Should Consider File Formats

Dr Kim Clugston, Research Data Coordinator, OSC
Dr Leontien Talboom, Technical Analyst, Digital Initiatives

Many funders and publishers now require data to be made openly available for reuse, supporting the open data movement and ensuring value from publicly funded research. But are all researchers aware of why they are being asked to share their data, and of how to do this appropriately? When researchers deposit their research data into Apollo (the University of Cambridge open access repository), they generally understand the benefits of sharing data and want to be a part of this. These researchers provide their data in open file formats accompanied by rich metadata, so the data has the best chance of being discovered and reused effectively.

Other researchers deposit their data in a repository during the publication process, often within tight deadlines set by the publisher. For this reason, researchers often rush to upload their data, and questions about how the data will remain preserved and accessible for long-term use go unconsidered. The challenges around preserving open research data were highlighted in this article. The authors addressed the concern that open research data can include a wide variety of file types, some of which may only be accessible with proprietary software, or with software that is outdated or at risk of becoming so. How can we ensure that research data that is open now stays accessible and usable for many years to come?

In this blog, we will discuss the importance of making data open and of ensuring it remains usable in the future (digital preservation). We will use some examples from datasets in Apollo and suggest recommendations for researchers that go beyond the familiar FAIR principles to include considerations for the long term.

Why is it important for the future?

The move to open data, following the FAIR principles, has the potential to boost knowledge, research, collaboration, transparency and decision making. In Apollo alone, there are now thousands of datasets which are available openly worldwide to be used for reference or reused as secondary data. Apollo, however, is just one of thousands of data repositories. It is easy to see how this vast amount of archived data comes with great responsibility for long-term maintenance. A report outlined the pressing issue that the FAIR principles, whilst addressing metadata well, do not really address data preservation and the challenges it brings, such as the risk of software and/or hardware becoming obsolete and data reliant on them becoming inaccessible.

Tracking the reuse of datasets could provide essential information on how different file formats are holding up, but doing so remains an ongoing challenge. Datasets are not yet routinely cited in the established way that is seen for journal articles or other publication types. This is an area that is actively being developed through initiatives such as Make Data Count, and it is hoped that at some point soon, data citation will become part of the routine practice of research, further enhancing visibility of how data is being credited and reused.

In Apollo, we see great interest in the available datasets as they are viewed and downloaded frequently. The most downloaded dataset in Apollo has been downloaded over 300,000 times since it was first deposited in 2015 and, interestingly, consists of open file formats. Other highly downloaded datasets in Apollo, such as the CBR Leximetric dataset, have been used by lawyers and social scientists and successfully cited as a data source to answer new research questions. The Mammographic Image Analysis Society database was deposited in Apollo in 2015 and has been frequently downloaded and reused by researchers working in the field of medical image analysis as discussed in a previous blog. To date, Google Scholar reports it has been cited 78 times. These datasets show the value of sharing and reusing data and all are in file formats that are accessible to everyone which will help to preserve them for as long as possible. 

Digital preservation is a discipline focused on providing and maintaining long-term access to digital materials. Obsolete software is a major obstacle to maintaining access to files in the future. PRONOM, a file format registry, keeps track of a large number of known file formats and provides additional information about them. Last year, a file format analysis of datasets in Apollo was conducted to establish which file formats are represented in the repository. The results revealed a diverse array of file formats, a testament to the breadth of research conducted and the adoption of open data across many disciplines. Most of the file formats are common and can still be opened, but a large percentage of the material has not been identified, or is in formats that are not immediately accessible without migrating to a different format or emulating the original software environment. Table 1 shows a few complex examples of file formats held in Apollo.

| File format | Example in Apollo | Future use |
| --- | --- | --- |
| .dx (Spectroscopic Data Exchange Format) | Link | Not an open-source format, so opening the file depends on the software being available |
| .mnova (Mestrelab file format) | Link | Proprietary file format; the licence for the software is expensive |
| .pzfx (Prism file format) | Link | Older format for the Prism software package, which is now considered legacy |

The Bit List, maintained by the Digital Preservation Coalition with contributions from the digital preservation community, outlines the “health” of different file formats and content types, including research data. In fact, unpublished research data (another issue, outside the scope of this blog!) is classified as critically endangered, reflecting the fact that most researchers only make data open at the point of publication. But even research data published in repositories has its difficulties and is classified as vulnerable, mainly because many file formats depend on the appropriate software being available to open and use them. There are potential solutions on the horizon, such as the open-source ReproZip, which packages research data with the necessary files, libraries and environments so it can be run by anybody. However, this still doesn’t address the issue of obsolete software. The gold standard is to deposit research data in open formats, so that viewing and using the files does not depend on particular software; the files will remain open and accessible for as long as they are held in a repository.

What researchers can do

What can researchers do to make sure that when they deposit data into a repository, it will be available for them and others in 10 or even 20 years’ time? Awareness is the first step. Researchers should choose a trusted data repository that is suitable for their files. A recent blog highlighted the problem of disappearing data repositories: approximately 6% of repositories listed on the repository registry re3data have been shut down (most for unknown reasons, though some were attributed to organisational or economic failure, obsolete software or hardware, or external attacks). Approximately 47% of the repositories that shut down did not provide an alternative home for the data, which is assumed to be lost. It may be that your funder or publisher decides the repository for you, but we have some guidance on what to look for in a trusted repository. If you are at Cambridge, you can deposit your data in Apollo, which has CoreTrustSeal certification.

The data itself is arguably the most important factor: we need to make sure the data files can be found and used by anyone, at any time, now and in the future. Ideally, this means using open file formats where possible, as these don’t carry any restrictions. The Library of Congress and the UK National Archives both maintain registries of file formats. There is some Cambridge University guidance on choosing file formats, as well as some by the UKDS. Look up your file formats in the PRONOM database: are they considered sustainable? If the data you are generating comes from proprietary software, it is good practice to deposit that version as well as a copy in an open format that does not require any specialist software to open. This ensures that both options are available in case of any loss of formatting when converting to open formats. The statistical software packages SPSS and NVivo, for example, are proprietary but can export to open formats such as CSV.
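As a small illustration of what an open-format deposit copy looks like in practice, here is a minimal Python sketch that writes tabular results to CSV using only the standard library. The data and file name are hypothetical; the point is simply that a CSV copy can be opened by anyone, with no specialist software:

```python
import csv

# Hypothetical results, as might be exported from a proprietary package
# such as SPSS; in reality you would export from the package itself.
rows = [
    {"participant": "P01", "age": 34, "score": 7.5},
    {"participant": "P02", "age": 41, "score": 6.0},
]

# Write an open-format (CSV) copy to deposit alongside the original file.
with open("survey_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["participant", "age", "score"])
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file is plain text, so even if CSV readers somehow vanished, the values would still be legible in any text editor, which is precisely the resilience open formats offer.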

There may be discipline-specific information on how to convert your file types to open formats. In the Chemistry department here at Cambridge, an initiative was started together with the Data Champion programme to provide a platform where researchers can add instructions for converting experimentally derived files into open formats. Open Babel is an open-source, collaborative project that provides a “chemistry toolbox” with information on how to convert chemical file formats into other formats where needed. There is also some guidance on how to export from R to open formats such as txt and csv.

In some cases, it might not be possible to provide an open file format alternative. The files you use may be subject to discipline-specific standards, or you may be restricted by the hardware and software you use in your research. In these cases, it is important to provide good documentation or a detailed README file alongside the files so researchers know how to access and use them. In fact, good file organisation, documentation and metadata are just as important as the files themselves, as data without any documentation is considered virtually meaningless. The more information you can provide, the better; it may well save you time answering questions from other researchers in the future.
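A README of this kind can even be generated alongside the data as part of an export script. The sketch below is a minimal example; the field names and values are our own hypothetical suggestions, not a formal standard, so adapt them to your discipline’s conventions:

```python
from datetime import date

# Hypothetical descriptive fields for a dataset in a proprietary format.
# Adapt these to your repository's and discipline's conventions.
readme_fields = {
    "Title": "NMR spectra of compound series A",
    "Creator": "J. Bloggs, Department of Chemistry",
    "Date deposited": date(2024, 3, 1).isoformat(),
    "File formats": ".mnova (Mestrelab); open .csv copies in /csv",
    "Software needed": "Mnova 14 or later for the .mnova files",
    "Licence": "CC BY 4.0",
}

readme_text = "\n".join(f"{key}: {value}" for key, value in readme_fields.items())

# Write the README as plain text, itself an open, future-proof format.
with open("README.txt", "w", encoding="utf-8") as f:
    f.write(readme_text + "\n")
```

Because the README is plain text, it will remain readable long after the software named inside it has become obsolete, which is exactly what a future reuser will need.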

The future use of past research hinges on the thoughtful selection of file formats. By prioritising openness and longevity, we lay the foundation for collaboration and innovation. Choices that researchers make today shape the accessibility and integrity of data for generations to come.

The (exponential) thirst for data – The March 2024 Data Champions forum

The Data Champions were treated to a big-data-themed session for the March Data Champion forum, hosted at (and sponsored by) Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (AI) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing with it scientific research as a whole.

Dr James Fergusson presenting to the Data Champions at the March forum

As he explained, over the last two decades cosmologists have seen a rapid increase in the data points on which to base their theorisation: from merely 14 data points in 2000 to 800 million data points in 2013! With the availability of these data points, the paradigm for research in cosmology started to shift completely, from being theory-driven to being data-driven. With several projects beginning soon that will generate vast amounts of data daily for decades to come, this trend shows no sign of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with these areas of growth is the growth of machine learning (ML) capabilities, as the copious amounts of data necessitate not only immense computing power but also ML capabilities to process and analyse it all. Together, these elements are fundamentally changing the story of scientific discovery. What was once the story of an individual researcher having an intellectual breakthrough is becoming the story of machine-led, automated discovery. Where an idea, put through the rigour of the scientific method, once led to the generation of data, the reverse is now not only possible but increasingly likely: data is generated first, and the theory is discovered afterwards, possibly by AI rather than by a scientist. This, for James, can be considered the new scientific method.

Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH), where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects she was involved with at CDH. One was Ghost Fictions, a series of workshops aimed at encouraging critical thinking about automated text generation using AI methods, both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, intended as a provocation about a future where books, study guides and examinations are created by Large Language Models (LLMs) (perhaps a not so distant future). Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet perhaps the most worrying aspect of the transformative forms of AI is their immediate and consequential impact on the environment. The thirst for the exponential amounts of data needed to train and run AI chatbots, LLMs and image generation systems requires vast computing power, which in turn generates a lot of heat and requires large amounts of water to operate. As Anne demonstrated, this could become increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water-stressed but also home to the University’s data centre and the new DAWN AI supercomputer. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?

Dr James Fergusson and Dr Anne Alexander answering questions from the Data Champions at the March forum

The themes with which Anne concluded her presentation formed the basis of the Q&A between the Data Champions and the speakers. The topic of the potential biases of AI and ML was put to James, who agreed that his field of study could not escape it. That said, unlike in the humanities, awareness of bias in physics can potentially be helpful, as it may help make the scientific process as objective as possible; for humanities research, which tends to deal with social systems, relations and views of the world, bias is clearly more problematic. The environmental cost of AI was also touched on: James commented that the energy demand is a problem that is getting harder to justify, and that solutions might only create new problems, as the appetite for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted, as the environment is a social problem and society should have a say in what risks it is willing to accept. The question of the automation of science was also put to James, who admitted that preparing early career physicists for research now involves developing their software skills rather than subject knowledge in physics or mathematics.

Dear Data,…

Valentine’s week for the international data community is not only a time for expressing your love to the significant others in your life. As it is also Love Data Week, it is a time to reflect on your love for all things data! That was the goal for the Research Data team this year. The theme of this year’s Love Data Week was “My Kind of Data”, suggesting that data workers, researchers and analysts alike, have a relationship with data that is personal, often idiosyncratic, and almost always heartfelt. The Research Data team, as supporters of the University’s researchers, are interested in such relationships and are always eager to discover the distinctive needs created by the disciplinary differences between the University’s departments. This year, the Research Data team decided to find out from students and researchers in the Arts, Humanities and Social Sciences (AHSS) what their kind of data was.

To do so, the Research Data team positioned themselves in the foyer of the Alison Richard Building on the University’s Sidgwick Site, home to several AHSS departments, for two mornings on Monday the 12th and Thursday the 15th of February. Across the city, Data Champion Lizzie Sparrow was leading the charge with science, technology, engineering, mathematics and medicine (STEMM) students and researchers by holding her own pop-up at the West Hub. Like the Research Data team, and as a Research Support Librarian (Engineering) herself, Lizzie is also interested in the relationships that researchers have with data, though her approach would likely be different. Unlike for researchers in the STEMM subjects, the term “data” can sometimes feel exclusionary to AHSS students and researchers, as they may not consider what they generate through research to be data. From our perspective, on the other hand, any material that goes on to form any part of their research is their data. To bring attention to this, the team tried to engage passers-by with the provocation “you have research data, change our minds!” The provocation was successful and many conversations were had about the different ways that members of the Sidgwick community understood data in their research.

The Research Data Team from the Office of Scholarly Communication (Cambridge University Library), from left to right: Clair Castle, Lutfi Othman, Kim Clugston.

The team was pleased to find that there was a general interest in the services of the Research Data team among the Sidgwick community, and we were happy to be able to share with others how we can help them with their data management and planning.

Some treats for those who stop by.
Our Open Research poster, designed by Clair Castle.

The team tried to capture the sentiments of these conversations by asking the Sidgwick community to take part in two short activities as they departed our pop-up (in exchange for Love Hearts sweets!), to help us better understand their relationship with data. Firstly, we asked them to describe what data was to them, a question that we are extremely fond of asking! As usual, the answers were informative and helped us to gain a sense of the varying data types that the Sidgwick community worked with, from political tracts and archival materials to balance sheets and land deeds from the early modern era.

Activity 1: Lots of different data types in the AHSS community!

For the second activity, we asked them what term best captured the materials that formed the basis of their scholarly work: data, research materials, or other? To our surprise, the majority of people we spoke to over both days saw themselves as working with data, more than double the number who saw themselves as working with research materials, with a small number seeing themselves as working with both, interchangeably. This finding illustrated something that has been increasingly discussed in the Research Data team office: finding alternatives to the term “data” may make our services and initiatives more appealing to members of the AHSS community. This is something we will take into account when targeting our outreach in the future. Yet one thing is certain: our Research Data services are needed by the AHSS community just as much as they are by the STEMM community.

Activity 2: More generators of ‘data’ than we expected!

The pop-ups at the Alison Richard Building were encouraging, and we hope that fruitful relationships will develop from these events; it is something we may hold again soon. It was a good way to communicate our message and make others aware of the Research Data team’s services. Over at the West Hub, Lizzie was not as encouraged, having only managed to have in-depth chats with a couple of people. She reported that lots of people were very determinedly on their way somewhere and not up for stopping to talk; the time and/or location did not seem right for the intended audience. I suppose we shouldn’t stand between a student and their food. In any case, there was a lot to take away from this year’s Love Data Week pop-ups, and a lot to reflect on when we plan our next one, be it for Love Data Week 2025 or as a periodic service to the research community here at Cambridge. Perhaps when the weather is nicer in the summer we will hold a pop-up outdoors in the middle of the Sidgwick Site, or at research events throughout the University. If you have any ideas on where it would be good for us to hold such a pop-up, do let us know!