Dr Agustina Martínez-García, Head of Open Research Systems, Digital Initiatives
We are pleased to announce that our Diamond Open Access Journals at Cambridge platform launched in May and can be accessed at https://diamond-oa.lib.cam.ac.uk/home. The service will initially be available as part of a one-year pilot project undertaken by the Open Research Systems (ORS) and Office of Scholarly Communication (OSC) teams within Cambridge University Library (CUL).
Project overview
The main aim of the Diamond project is to support Cambridge’s research community in the context of a changing open research and scholarly publishing environment. To meet the increasing demand to share research findings, we are scoping, assessing, and implementing future services and systems that meet researchers’ needs, while contributing to a growing wider open research community and ecosystem. The pilot builds on an earlier project to understand the community-led publishing landscape at Cambridge (findings to be shared soon). Researchers in the Office of Scholarly Communication uncovered a vibrant ecosystem of DIY publishing projects at Cambridge, and the Library is now exploring how to support these through technical and resource-based approaches.
As part of the project, we are engaging with Cambridge researchers and exploring whether open and community-developed platforms meet their needs around institutional publishing and can be used as the basis for service development in this area. We are using the DSpace repository platform to support this pilot. DSpace is a widely adopted, open-source repository platform, and it is currently the solution underpinning Apollo, Cambridge’s Institutional Repository. In its newest version, it offers advanced functionality and features that can potentially make it a suitable platform for journal publishing, an area we are keen to explore with this pilot.
Where we are at
The main activities of the project focus on:
Exploring the implementation of suitable infrastructure, built on interoperable, open, and widely adopted platforms.
Gathering use cases of community-led open access journals at Cambridge, focusing on discipline, journal type, frequency of publication, and production standards.
Gathering insights to inform future service development in this area by a) assessing the suitability of the DSpace open-source repository platform as a journal publishing platform; and b) estimating the associated costs and resourcing requirements, both in terms of service management and infrastructure (long-term access, storage, and preservation costs).
The following four Cambridge student-led journals have initially agreed to participate in the pilot, and we are also exploring opening participation to additional journals in the coming months.
Cambridge Journal of Climate Research (Climate Research Society, first issue now available in the Diamond platform)
Cambridge Journal of Human Behaviour (Anthropology)
Cambridge Journal of Visual Culture (History of Art)
Scroope (Architecture)
What’s next
The next iteration of work for the pilot will focus on assessing the resources and costs involved in transitioning from pilot to service. Ensuring long-term preservation and access comes with several associated costs, and it is critical to assess these when evaluating sustainable approaches to service development. Examples of cost elements that we will consider include onboarding (initial implementation) fees, hosting and maintenance fees, volume of content and storage costs, and the costs of minting persistent identifiers (DOIs and ISSNs) and of indexing in publisher databases. We will also explore suitable long-term content preservation options, including integration with existing preservation services such as CLOCKSS (https://clockss.org/), and assessing in-house preservation via the services currently being developed as part of CUL’s Digital Preservation Programme.
Dr Kim Clugston, Research Data Coordinator, OSC
Dr Leontien Talboom, Technical Analyst, Digital Initiatives
Many funders and publishers now require data to be made openly available for reuse, supporting the open data movement and the value of publicly funded research. But are all researchers aware of why they are being asked to share their data and how to do this appropriately? When researchers deposit their research data into Apollo (the University of Cambridge open access repository), they generally understand the benefits of sharing data and want to be a part of this. These researchers provide their data in open file formats accompanied by rich metadata, so the data has the best chance of being discovered and reused effectively.
Other researchers deposit their data in a repository during the publication process, which often takes place within tight deadlines set by the publisher. For this reason, researchers often rush to upload their data and give little thought to how it will remain preserved and accessible for long-term use. The challenges around preserving open research data were highlighted in this article. The authors raised the concern that open research data can include a wide variety of file types, some of which may only be accessible with proprietary software, or with software that is outdated or at risk of becoming outdated soon. How can we ensure that research data that is open now stays accessible and open for use for many years to come?
In this blog, we will discuss the importance of making data open and of keeping it usable in the future (digital preservation). We will use some examples from datasets in Apollo and offer recommendations for researchers that go beyond the familiar FAIR principles to include considerations for the long term.
Why is it important for the future?
The move to open data, following the FAIR principles, has the potential to boost knowledge, research, collaboration, transparency and decision making. In Apollo alone, there are now thousands of datasets openly available worldwide to be used for reference or reused as secondary data. Apollo, however, is just one of thousands of data repositories. It is easy to see how this vast amount of archived data comes with great responsibility for long-term maintenance. A report outlined the pressing issue that FAIR data, whilst addressing metadata aspects well, does not really address data preservation and the challenges this brings, such as the risk of software and/or hardware becoming obsolete and data reliant on them becoming inaccessible.
Tracking the reuse of datasets could provide essential information on how different file formats are holding up, but tracking reuse remains an ongoing challenge. Datasets are not yet routinely cited in the established way that journal articles or other publication types are. This is an area being actively developed through initiatives such as Make Data Count, and it is hoped that data citation will soon become part of routine research practice, further enhancing visibility of how data is credited and reused.
In Apollo, we see great interest in the available datasets, as they are viewed and downloaded frequently. The most downloaded dataset in Apollo has been downloaded over 300,000 times since it was first deposited in 2015 and, interestingly, consists of open file formats. Other highly downloaded datasets in Apollo, such as the CBR Leximetric dataset, have been used by lawyers and social scientists and successfully cited as a data source to answer new research questions. The Mammographic Image Analysis Society database was deposited in Apollo in 2015 and has been frequently downloaded and reused by researchers working in the field of medical image analysis, as discussed in a previous blog. To date, Google Scholar reports it has been cited 78 times. These datasets show the value of sharing and reusing data, and all are in file formats that are accessible to everyone, which will help to preserve them for as long as possible.
Digital preservation is a discipline focused on providing and maintaining long-term access to digital materials. Obsolete software is a big problem in maintaining access to files in the future. PRONOM, a file format registry, keeps track of a large number of known file formats and provides additional information on them. Last year, a file format analysis of datasets in Apollo was conducted to highlight which file formats are represented in the repository. The results revealed a diverse array of file formats, which is a testament to the breadth of research conducted and the adoption of open data across many disciplines. Most of the file formats are common and can still be opened, but a large percentage of the material has not been identified or is in formats that are not immediately accessible without migrating to a different format or emulating the original environment. Table 1 shows a few complex examples of file formats held in Apollo.
Table 1. Example: an older file format from the software program Prism, which is now considered legacy software.
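For a sense of how such a survey might be approached, the short Python sketch below is purely illustrative (it is not the analysis used for Apollo): it walks a local folder of downloaded dataset files and tallies file extensions. Extensions are only a rough proxy for formats; proper identification would use a PRONOM-based tool such as DROID or Siegfried, which match files against format signatures.

    # Minimal sketch: tally file extensions across a folder of dataset files.
    # Extensions are only a rough proxy for formats; PRONOM-based tools such as
    # DROID or Siegfried identify formats from signatures instead.
    from collections import Counter
    from pathlib import Path

    def tally_extensions(root: str) -> Counter:
        counts = Counter()
        for path in Path(root).rglob("*"):
            if path.is_file():
                counts[path.suffix.lower() or "(no extension)"] += 1
        return counts

    if __name__ == "__main__":
        # "datasets" is a hypothetical local folder of downloaded dataset files.
        for ext, n in tally_extensions("datasets").most_common(20):
            print(f"{ext}: {n}")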
The Bit List, a list maintained by the Digital Preservation Coalition with contributions from members of the digital preservation community, outlines the “health” of different file formats and content types, including research data. In fact, unpublished research data (another issue, outside the scope of this blog!) is classified as critically endangered, highlighting the problem that the majority of researchers only make data open at the point of publication. But even research data published in repositories has its difficulties and is classified as vulnerable, mainly because many file formats depend on the appropriate software being available to open and use them. There are potential solutions on the horizon, such as the open-source ReproZip (see the short sketch after this paragraph), which packages research data with the necessary files, libraries and environments so that it can be run by anybody. However, this still does not address the issue of obsolete software. The gold standard would be to deposit research data in open formats, so that viewing and using the files does not depend on particular software; the files will remain open and accessible for as long as they are held in a repository.
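As an illustration of what using ReproZip involves, here is a minimal sketch, driven from Python for consistency with the other examples in this post. It assumes ReproZip is installed on a Linux machine, and "analysis.py" is a hypothetical script; a real project should follow the ReproZip documentation.

    # Minimal sketch: trace and package an analysis with ReproZip (Linux only).
    # Assumes ReproZip is installed; "analysis.py" is a hypothetical script.
    import subprocess

    # Run the analysis under ReproZip's tracer to record the files, libraries
    # and environment it touches.
    subprocess.run(["reprozip", "trace", "python", "analysis.py"], check=True)

    # Bundle everything recorded into a single shareable .rpz package.
    subprocess.run(["reprozip", "pack", "analysis.rpz"], check=True)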
What researchers can do
What can researchers do to make sure that when they deposit data into a repository, it will be available for them and others in 10 or even 20 years’ time? Awareness is the first step. Researchers should consider which repository is suitable for their files and choose a trusted data repository. A recent blog highlighted the potential problem of disappearing data repositories, with approximately 6% of the repositories listed on the repository registry re3data having been shut down (most reasons are unknown, but some were listed as organisational or economic failure, obsolete software/hardware, or external attacks). Approximately 47% of the repositories that had shut down did not provide an alternative solution to rescue the data, and it is assumed that this data is lost. It may be that your funder or publisher decides the repository for you, but we have some guidance on what to look for in a trusted repository. If you are at Cambridge, you can deposit your data in Apollo, which has CoreTrustSeal certification.
The data itself is arguably the most important factor: we need to make sure the data files can be found and used by anyone at any time, forever. Ideally, this means using open file formats where possible, as these do not have any restrictions. The Library of Congress and the UK National Archives both maintain registries of file formats. There is some Cambridge University guidance on choosing file formats, as well as some by the UKDS. Have a look at your file formats in the PRONOM database: are they considered sustainable formats? If the data you are generating comes from proprietary software, it is good practice to deposit that version as well as an open format that does not require any specialist software to open. This ensures that both options are available in case any formatting is lost when converting to open formats. Examples are SPSS and NVivo, proprietary packages that offer the option to export to open formats such as CSV.
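As a minimal sketch of what this can look like in practice (assuming Python with pandas and the pyreadstat package installed, and a hypothetical SPSS file called survey.sav), the proprietary file can be read and an open CSV copy written to sit alongside it:

    # Minimal sketch: keep the original SPSS file and also deposit an open CSV copy.
    # Assumes pandas with pyreadstat installed; "survey.sav" is a hypothetical file.
    import pandas as pd

    df = pd.read_spss("survey.sav")        # read the proprietary SPSS file
    df.to_csv("survey.csv", index=False)   # write an open, software-agnostic copy
    print(df.dtypes)                       # check that variables survived the conversion

Labelled or categorical variables may not convert perfectly, which is exactly why depositing both versions is good practice.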
There may be information on how to convert your file types to open formats within your discipline. In the Chemistry department here at Cambridge, an initiative was started together with the Data Champion programme to provide a platform where researchers can add instructions for converting experimentally derived files into open formats. Open Babel is an open-source, collaborative project that provides a “chemistry toolbox” for converting chemical file formats into other formats where needed (a minimal example follows below). There is also some guidance on how to export from R to open formats such as txt and csv.
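As an illustrative sketch only (assuming Open Babel is installed and using a hypothetical input file called molecule.sdf), the obabel command-line tool can be called from Python to produce an open-format copy of a structure file:

    # Minimal sketch: convert a chemical structure file to an open format with Open Babel.
    # Assumes the obabel command-line tool is installed; "molecule.sdf" is hypothetical.
    import subprocess

    # Open Babel infers the formats from the file extensions: SDF in, XYZ out.
    subprocess.run(["obabel", "molecule.sdf", "-O", "molecule.xyz"], check=True)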
In some cases, it might not be possible to provide an open file format alternative. The files you use may be subject to discipline-specific standards, or you may be restricted by the hardware and software you use in your research. In these cases, it is important to provide good documentation or a detailed README file alongside the files, so researchers know how to access and use them. In fact, good file organisation, documentation and metadata are just as important as the files themselves, as data without any documentation is considered virtually meaningless. The more information you can provide, the better; it may also save you time in the long run by pre-empting questions from other researchers in the future.
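Purely as an illustration (the exact headings will depend on your discipline and any community standards), a README accompanying a dataset might cover points such as:

    README for [dataset title]
    Creators, contact details and funding information
    Date of data collection and geographic/temporal coverage
    File listing: each file name, its format, and the software (and version) needed to open it
    Methods: how the data were generated, including instruments, software and settings
    Data dictionary: what each variable, column or field means, with units
    Licence and preferred citation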
The future use of past research hinges on the thoughtful selection of file formats. By prioritising openness and longevity, we lay the foundation for collaboration and innovation. Choices that researchers make today shape the accessibility and integrity of data for generations to come.
The Data Champions were treated to a big data themed session for the March Data Champion forum, hosted at (and sponsored by) Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (AI) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing scientific research as a whole along with it.
As he explained, over the last two decades cosmologists have seen a rapid increase in the data points on which to base their theorisation: from merely 14 data points in 2000 to 800 million data points in 2013! With the availability of these data points, the paradigm for research in cosmology started to shift completely, from being theory-based to being data-based. With several projects beginning soon that will see vast amounts of data generated daily for decades to come, this trend shows no signs of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with these is the growth of machine learning (ML) capabilities, as the copious amounts of data require not only immense computing power but also ML to process and analyse them. Together, these elements are fundamentally changing the story of scientific discovery. What was once a story of an individual researcher having an intellectual breakthrough is becoming a story of machine-led, automated discovery. While it used to be the case that an idea, put through the rigour of the scientific method, would lead to the generation of data, the reverse is now not only possible but increasingly likely: data is generated first, before a theory is discovered, and the discovery may come from AI rather than a scientist. This, for James, can be considered the new scientific method.
Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH), where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects that she was involved with at CDH. One was called Ghost Fictions, a series of workshops aimed at encouraging critical thinking about automated text generation using AI methods, both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, which was intended as a provocation about a future where books, study guides and examinations are created by Large Language Models (LLMs) (perhaps a not so distant future). Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet perhaps the most worrying aspect of these transformative forms of AI is their immediate and consequential impact on the environment. Training and running AI chatbots, LLMs and image generation systems on ever-growing amounts of data requires vast computing power, which in turn generates a lot of heat and requires large amounts of water to operate. As Anne demonstrated, this could become increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water-stressed but is also home to the University’s data centre, where the new DAWN AI supercomputer is located. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?
The themes with which Anne concluded her presentation formed the basis of the Q&A between the Data Champions and the speakers. The topic of potential biases in AI and ML was put to James, who agreed that his field of study could not escape it. That said, unlike in the humanities, biases in physics can potentially be helpful, as they may help make the scientific process as objective as possible. However, this could clearly be problematic for humanities research, which tends to deal with social systems, relations and views of the world. The topic of the environmental cost of AI was also touched on; James commented that energy consumption is a problem that is getting harder to justify, and that solutions might only create new problems, as demand for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted on this: the environment is a social problem, so society should have a say in what risks it is willing to accept. The question of the automation of science was also put to James, who admitted that preparing early-career physicists for research now involves developing their software skills rather than their subject expertise in physics or mathematics.