5000 datasets now in Apollo

Written by Clair Castle, Dr Kim Clugston, Dr Lutfi Bin Othman, Dr Agustina Martínez-García. 

How the ‘second life’ of datasets is impacting the research world. Researchers share their stories.

“Research data is the evidence that underpins all research findings. It’s important across disciplines: arts, humanities, social sciences, and STEMM. Preserving and sharing datasets, through Apollo, advances knowledge across research, not only in Cambridge, but across the world – furthering Cambridge’s mission for society and our mission as a national research library.”

Dr Jessica Gardner, University Librarian & Director of Library Services

The research data produced and collected during research takes many different forms: numerical data, digital images, sound recordings, films, interview transcripts, survey data, artworks, texts, musical scores, maps, and fieldwork observations. Apollo collects them all.  

Apollo is the University of Cambridge repository for research datasets. Managed by the Research Data team at Cambridge University Library, Apollo stores and preserves the University’s research outputs.  

The Research Data team guides researchers through all aspects of research data management – how to plan, create, organise, curate and share research materials, whatever form they take – and assists researchers in meeting funders’ expectations and good research practice.  

In this blog post, upon reaching our 5000 datasets milestone, we share researcher stories about the impact their datasets have had, and continue to have, across research – and explain how researchers at the University can benefit from depositing their datasets on Apollo.

“Sharing data propels research forward. It recognises the importance of the original datasets in their own right, and the researchers who worked on them. Many of the research funders, supporting work at the University of Cambridge, require that research data is made openly available with as few restrictions as possible. Our researchers are fully supported to do this with Apollo and the Research Data team. I’m really excited that Apollo has reached the 5000 dataset milestone.” 

Professor Sir John Aston, Pro Vice-Chancellor for Research at the University of Cambridge 

Why should researchers share their research outputs on a repository?  

Making research data openly available is recognised as an important aspect of research integrity and in recent years has garnered support from funders, publishers and researchers. Open data supports the FAIR principles and many funders now include data sharing practices within their policies as part of the application process. Publishers and funders often require a data availability statement (DAS) to be included in publications. It is worth mentioning (including in a DAS) that there are situations where data cannot be shared, particularly if data contains personal or sensitive information or where there is no permission to share it. But a lot of data can be shared and this movement towards open data promotes greater trust, both among researchers and for engagement with the general public.   

Illustration of why it is good to share research data. The illustration is explained in the text of the blog immediately below.

In the UK, funding bodies often mandate openly sharing the data supporting their research grants. A large proportion of funding for research is from taxpayers’ money or charity donations so making data available openly for reuse provides value for money. It also allows the data behind claims to be accessed for traceability, transparency and reproducibility. Open data increases efficiency, as it prevents work being repeated that may have already been done; for this reason, researchers are encouraged to publish negative results too. Publishing data gives researchers credit for the work they have done, giving them more visibility in their field, and increasing the discoverability of their research which could lead to potential collaborations and increased citations. Open data also means that researchers have access to valuable datasets that could educate, enhance and further their research when applied by practitioners worldwide.  

The second life of data   

Apollo supports data from all disciplines, and this is reflected in the variety of formats the repository holds – from movie files, images, audio recordings and code to the more common text and CSV files. The repository now also hosts methods. Researchers are encouraged to deposit these outputs in the repository to facilitate the impact and re-use of the data underlying their research, and so that their research data can be cited as a scholarly output in its own right. In 2023, there were over 95,000 views of datasets and software and associated metadata items on Apollo, and over 37,000 files were downloaded (source: IRUS). These figures suggest that datasets and software deposited on Apollo are easy to discover and are widely used.  

One example is a dataset deposited by Douglas Brion at the end of his PhD in the Engineering department. Brion’s dataset, titled Data set for “Generalisable 3D printing error detection and correction via multi-head neural networks” has been downloaded 2,600 times. This dataset has also been featured in 20 online news publications (including in the University’s Research blog) and has an Altmetric attention score of 151. Brion’s dataset is also one of the larger outputs on Apollo, comprising over 1.2 million labelled images and over 900,000 pre-filtered images.   

The open availability of Brion’s data that can be used to train AI (a significant trajectory for research currently) is welcomed by researchers such as AI specialist Bill Marino, a PhD candidate and Data Champion from the Department of Computer Science and Technology: “It’s really important that AI researchers are able to reproduce each other’s findings. The opaque nature of some of today’s AI models means that access to data is a key ingredient of AI reproducibility. This effort really helps get us there.”   

Brion considers that sharing his data “has significantly enhanced the impact and reach” of his research and that “it has increased the visibility and credibility of my work, as other scientists can validate and build upon my findings.” On the benefits of depositing data on a repository, he says that sharing “ensures that the data is preserved and accessible for the long term, which is crucial for reproducibility and transparency in research”. He adds, “Repositories often provide metadata and tools that make it easier for other researchers to find and use the data”, which “promotes a culture of openness and collaboration, which can accelerate scientific discovery and innovation.”  

Photo of a researcher searching Apollo, the University of Cambridge repository, on a computer.

Research data supporting “Regime transitions and energetics of sustained stratified shear flows” is a dataset from another depositor, Adrien Lefauve, from the Department of Applied Mathematics and Theoretical Physics, and consists of MATLAB codes and accompanying movie files. Lefauve is, in fact, a frequent dataset depositor with 10 datasets published in Apollo. He considers that data sharing gives his data “a second life” by allowing researchers to reuse his data in pursuit of new projects but admits that “there is also a selfish reason for doing it!”. He explains that “After several months or years without having worked on a dataset, I sometimes need to go back to it, either by myself or when I hand it over to a colleague or student to test new ideas. Having a well-structured, user-friendly and thoroughly documented dataset is invaluable and will save you a lot of time and frustration when you need to resurrect your own research.”  

Lefauve’s dataset has been cited in other publications and he encourages other researchers to look at his datasets and reuse them: “When people see that datasets can be cited in their own right and attract citations, it can encourage them to make the extra effort to deposit their data”. Lefauve is an advocate for sharing data on a repository and in his view data sharing is: “not only important for research integrity and reproducibility, but it also ensures that research funds are used efficiently. My datasets are usually from laboratory experiments which can take a lot of time and resources to perform. Hence, I feel there is a duty to ensure the data can be used to the fullest by the community. It also helps build a researcher’s profile and credentials as a valuable contributor to the community, beyond simply publication output, which often only use a small fraction of a dataset.”  

Lefauve describes his field (fluid mechanics) as one that has benefited from the explosion of open data that is made available to the research community, but he is also aware that for a dataset to be reused, it requires comprehensive documentation and curation. Lefauve hopes that sharing data in a repository “will become increasingly commonplace as the next generation is taught that this is an essential part of data-intensive research.”  

How to deposit data on Apollo, and why choose Apollo 

There are thousands of data repositories to submit data to, so how to choose the right one? Funders may specify a disciplinary or institutional repository (see re3data.org for a directory of all repositories). Members of the University of Cambridge can deposit their data on the institutional repository, Apollo. Apollo has CoreTrustSeal certification, which means it has met the 16 requirements to be a sustainable and trustworthy infrastructure. Research outputs can be deposited as several types, such as dataset, code or method.  

We have a step by step guide to uploading a dataset, which is submitted through Symplectic Elements, the University’s research information management system. There is also a helpful information guide about Symplectic Elements on the Research Information Sharepoint site. The Research Data team are on hand to help researchers with any queries they might have during this process.    

The importance of good metadata  

Researchers may think that the files are the most important aspect when depositing a dataset, but we cannot emphasise enough the importance of providing good metadata (data about data) to go alongside the files. This is the area that we find researchers need some encouragement with, but we hope that the experiences of the researchers we have featured above highlight the importance that good metadata has for their data. No one knows their data better than the person who generated it, so they are in the best position to describe it. A good description of a dataset enables users with no prior knowledge about the dataset to discover, understand and reuse the data correctly, avoiding misinterpretation, without having access to the paper it supports. Be aware that others may discover datasets in isolation from the papers they support: we recommend that researchers avoid referring to the paper or using the abstract of the paper to describe their dataset. An article abstract describes the contents of the article, not of the dataset. It can also be really useful for researchers to describe their methods and how their files are organised, for example by providing README files. These give the dataset context as to how the data was generated, collected and processed. Good metadata will also enhance a dataset’s discoverability.  
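As a minimal sketch of what a README might cover, the short Python script below lists the files in a dataset folder and writes a plain-text README skeleton for the depositor to fill in. The section headings, the `build_readme` function and the `README.txt` filename are assumptions made for illustration; they are not an official Apollo or Research Data team template.

```python
from pathlib import Path

# Hypothetical section headings, chosen for illustration only;
# they are not an official Apollo or Research Data team template.
SECTIONS = [
    "Title of dataset",
    "Description (what the data is, independent of any paper)",
    "Methods (how the data was generated, collected and processed)",
    "File organisation",
    "Licence and contact",
]


def build_readme(dataset_dir: Path) -> str:
    """Return README text with empty sections and an inventory of files."""
    lines = []
    for heading in SECTIONS:
        lines.append(heading)
        lines.append("-" * len(heading))
        if heading == "File organisation":
            # List every file relative to the dataset folder, sorted for stable output.
            for path in sorted(p for p in dataset_dir.rglob("*") if p.is_file()):
                if path.name == "README.txt":
                    continue  # do not list the README inside itself
                rel = path.relative_to(dataset_dir).as_posix()
                lines.append(f"{rel}: TODO describe")
        else:
            lines.append("TODO")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    folder = Path(".")  # assumption: the script is run from inside the dataset folder
    (folder / "README.txt").write_text(build_readme(folder), encoding="utf-8")
```

Each "TODO" is a prompt for the person who generated the data: the script can only list the files, not explain what they contain or how they were produced.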

Another benefit of sharing data on Apollo is that our datasets are indexed on Google’s Dataset Search, a search engine for datasets. It is best practice to cite any datasets used in research in the bibliography/reference list of the paper, thesis etc. In fact, there is new guidance for Data Reuse on the Apollo website which describes how to use Apollo to discover research data and how to cite it. We advise that researchers start doing this now (if they don’t already) so they get into a good habit: it will encourage others to do the same and make it a lot easier for others to reuse data and for researchers to receive recognition for it. Citation data for datasets are displayed on Apollo and alongside this it is possible to track the attention that a dataset receives via an Altmetric Attention Score.   

Apollo repository key milestones  

Illustration of Apollo repository key milestones represented as a timeline. The illustration is explained in the text of the blog immediately below.

Since its inception in 2016, when it started minting DOIs (Digital Object Identifiers), Apollo has continued to hit milestones and develop into the robust, safe and resilient repository infrastructure that it is today.  

Apollo has continued to support FAIR principles by incorporating new and critical functionality to further enhance discovery, access and long-term preservation of the University research outputs it holds. For example, integration with our CRIS (Current Research Information System), Symplectic Elements, to streamline the depositing process, and integration with JISC Publications Router to automatically deposit metadata-rich records in Apollo (2016, 2019, 2021).  

2000 datasets were deposited in Apollo by 2020. DOI versioning was enabled in 2023, as well as accepting more research output types than ever before, such as methods and pre-prints. A major milestone was hit in 2023 when Apollo achieved CoreTrustSeal certification and status as a trustworthy repository.  

The latest milestone will be for research outputs published within Octopus, a novel approach to publication, to be preserved together with associated publications and underpinning research datasets in Apollo to facilitate sharing and re-use (2024-25). In future we want to develop our ability to collect and interpret data citation statistics for Apollo so we can better assess the impact of the research data generated at the University.  

How we can support researchers  

The Research Data team is here to help!   

We can be contacted by email at info@data.cam.ac.uk. Researchers can also request a consultation with us to discuss any aspect of their research data management (RDM) needs, including data management plans, data storage and backup, data organisation, data deposition and sharing, funder data policies, or to request bespoke training.   

Remember that there is also an amazing network of Data Champions that can be called upon for advice, particularly from a disciplinary perspective.  

We deliver regular RDM training as part of the Research Skills Programme.   

Finally, there is our Research Data website for comprehensive advice and information.   

Flipping academic journals to diamond open access: Notes on community governance

In this blog post, Dr Caroline Edwards, Executive Director, Open Library of Humanities and Senior Lecturer in Contemporary Literature & Culture, Birkbeck, University of London asks: How do we ensure that a flipped diamond open access journal can remain independent? How do we prepare for the long-term financial security of flipped journals and protect against their potential vulnerability to commercial acquisition in the decades to come?

Flipping academic journals to diamond open access (OA) presents a series of challenges to an academic publisher. You need certain niche competencies. Firstly, nothing happens without the complete trust of an editorial team that shares your appetite for risk. Then, you need the backing of an entire academic community, willing to follow the editorial team to a new journal (in cases where editors don’t own the journal IP, which is most cases) and undertake a boycott of the old “zombie” title. Underpinning all of this, you need the financial and technological resources to provide the necessary infrastructure for the flipped journal in perpetuity, to offer it a safe home with a long-term future that doesn’t require any author fees. This involves things like setting up and maintaining a new journal site, running a digital publishing platform for managing submission, review, and production processes, having the capacity to manage metadata integration with university library catalogues and discoverability databases, providing memberships at robust digital preservation organisations, and ongoing research and development to stay abreast of rapid changes within the digital publishing landscape. The list goes on.

The growing list of journals flipping to diamond OA from their commercial publishers is well known. Retraction Watch keeps an up-to-date list of editorial boards that have resigned from for-profit models and moved their titles to not-for-profit, community-governed models. Each has its own story, told across published statements, academic blogs, and in newspaper articles covering high-profile editorial resignations and academic boycotts. But what gets talked about less frequently is the community governance structure that will support the journal moving forwards. How do we ensure that a flipped journal can remain independent? How do we prepare for its long-term financial security and protect against future vulnerability to commercial acquisition in the decades to come?

At the Open Library of Humanities (OLH), I spend much of my time talking to editors about their journals. There is a depressingly common story. It usually goes something like this. Many academic journals were launched between the 1960s and 1980s, in a collaboration between university professors and small or independent publishing houses. Things worked pretty well until their small publisher was bought out in the 1990s or 2000s by a larger company, often overseen by a global parent company. They muddled along with a high turnover of staff on the publisher side. Over time, the publishing managers became harder to get hold of, production was outsourced overseas, and editors became increasingly aware of a decline in production quality. 

With the acceleration to open access in the 2010s, editors came under pressure to double or triple their article acceptance rates – with a drop in subscriptions revenue, commercial publishers had to recoup costs via article processing charges (APCs). The more volume they could pump out, the better their profit margins. Even when journal editors rejected unsuitable or poor-quality articles, publishers found a way to fast-track this academic content by surreptitiously channelling it through their digital platforms to their hundreds of other journals using the same platform. Not all editors were even aware that the transfer of rejected articles had taken place.

If the Editor(s)-in-Chief had the temerity to stand up for their academic principles and refuse to increase their journal’s article acceptance rates, at this point they could face legal challenges or dismissal. In several explosive cases in recent years, Editors-in-Chief have been fired by their commercial publishers after refusing to back down over these issues. Sacking an internationally renowned editor whose reputation has become synonymous with the journal’s own reputation isn’t for the faint-hearted. It says something about the desperation of commercial publishers and their shrinking profits that they would be willing to trash a journal’s reputation so comprehensively – among the very academic communities whose uncompensated labour produced that reputation in the first place.

At this point in my conversations with editors, I ask a difficult question: Who owns the journal? “The publisher” they say, or “we don’t know.” Sometimes they reply: “The founding editor has passed away; we’ve asked their children, but no one can find any paperwork.” Without the rights to the journal title, its name, and logo, editors must set up a new journal. Ensuring the continuity between the old (now trashed) journal title and the new journal title requires coordinating a mass resignation of editors and authors from the old journal, preferably along with a boycott by peer reviewers for the foreseeable future.

At the OLH we’ve spent almost a decade flipping academic journals to diamond OA, supported by a growing number of libraries worldwide who share our vision for a not-for-profit academic publishing future. It wasn’t called “diamond” when we launched in 2015, but the term has come to mean not-for-profit and community-governed OA. Our publishing model is inspired by an explicitly political project – if the OLH and similar university-owned journal publishers are to thrive, they need to divert university library funding away from the big 5 commercial publishers (Elsevier, Wiley, Taylor & Francis, Springer, and SAGE). This happens hand-in-hand with library advocacy. When expensive journals flip to diamond OA, librarians are empowered to cancel individual journal subscriptions. In the age of big bundles and journal packages, journal flipping allows them to renegotiate extortionate deals with commercial publishers in light of the shrinking number of titles in each package.

Since launching as a publisher in 2015, the OLH has flipped 20 journals in this way. It hasn’t always been easy, and we have learned a lot along the way. In cases where journals own their own intellectual property (IP), usually via a scholarly association or legal governing body, the process of migrating decades of back content requires highly complex, skilled technical work. In cases where journals don’t own their IP, editors are unable to take the journal title with them. In these cases, a new journal needs to be established to continue the mission of the original title. This leaves behind zombie journals: the undead husks of formerly respected titles, which commercial publishers refuse to close but cannot run when the entire scholarly community has agreed to boycott them. The case of Wiley’s Journal of Political Philosophy, which relaunched with the OLH as Political Philosophy in February 2024, is a case in point.

Some of the journals that the OLH has flipped to diamond OA have set up a nonprofit organisation to protect themselves. Zygon: Journal of Religion and Science, a former Wiley journal that dates back to 1966, was able to do this because the original editors had the foresight to protect their IP before their publisher Blackwell was taken over by Wiley in 2007 (the journal had previously been published by two different university presses, the University of Chicago Press (1966–1978) and Wilfrid Laurier University Press (1979–1989)). The Zygon editorial team set up its own not-for-profit scholarly corporation in Chicago in 2019, following a joint venture established in 1965 among founding partners. As a 501(c)(3) organization, the Zygon: Journal of Religion and Science NFP not-for-profit scholarly corporation is a charitable organisation exempt from federal income tax. This route is being taken by other OLH journals including Theory & Social Inquiry (formerly Theory & Society), Political Philosophy (formerly the Journal of Political Philosophy), and Free & Equal: A Journal of Ethics and Public Affairs (formerly Philosophy & Public Affairs).

Several of the OLH’s journals have been owned by scholarly associations since their inception, including Quaker Studies (founded by the Quaker Studies Research Association (QSRA)), Architectural Histories (founded by the European Architectural History Network (EAHN)), Digital Studies / Le champ numérique (founded by the Canadian Society for Digital Humanities/Société canadienne des humanités numériques (CSDH/SCHN)), Marvell Studies (founded by the Andrew Marvell Society), Open Screens (founded by the British Association of Film, Television and Screen Studies (BAFTSS)), and The Parish Review (founded by the International Flann O’Brien Society).

In other cases, independent journals joining the OLH have made the decision to affiliate themselves with scholarly societies. This has been the case for [in]Transition: Journal of Videographic Film & Moving Image Studies, which has become the official video essay journal of the Society for Cinema and Media Studies (SCMS), and C21 Literature: Journal of 21st-century Writings, which became the official journal of the British Association for Contemporary Literary Studies (BACLS) when the new association was founded in 2017. This kind of affiliation secures the community governance of journals. Scholarly associations have articles of association that usually include the criteria for appointing journal editors, terms of office, and processes for collectively undertaking decisions about the journal’s functioning and health. 

Another route to long-term protection against commercial acquisition is for journals to join forces. This was the approach taken by three of the OLH’s journals whose editors resigned en masse from Elsevier in 2015 – Lingua (which relaunched as Glossa), LabPhon, and the Journal of Portuguese Linguistics. Editors of these titles set up a community organisation, LingOA: Linguistics in Open Access, as a Dutch Stichting (literally a “foundation”), a not-for-profit legal entity with limited liability similar to a trust, which is controlled by a board of directors and cannot have any shareholders. 

With support from the Center of Science and Technology Studies at the University of Leiden, Radboud University Library, the Netherlands Organisation for Scientific Research (NWO), the Association of Dutch Universities (VSNU), and the Royal Netherlands Academy of Arts and Sciences (KNAW), LingOA was able to provide financial support for the journals beyond their funding agreement with the OLH. One of the OLH’s newest journals, Syntactic Theory and Research (STAR, which left Wiley) has also joined the LingOA Stichting.

Other journals that have joined the OLH in 2023-2024 will need to establish their own legal and ownership entities, and we continue to offer help and advice to editorial teams undertaking this important work. Our goal at the OLH is to liberate university research from commercial control. Flipping journals to diamond OA is the first step; enshrining community governance is the crucial next step. As more funding bodies mandate diamond OA and not-for-profit academic publishing infrastructure (such as this recent announcement by the NWO), the tide is turning against commercial actors. Now is the time for editors and scholarly communities to regain control of their scholarship.

This post does not necessarily reflect the view of Cambridge University Libraries.

Data Diversity Podcast #3 – Dr Nick H. Wise (4/4)

Thank you for staying with us throughout this four-part series with Dr Nick Wise, a scientist and engineer who has made his name as a scientific sleuth. By now, we hope that he needs no introduction (though if you would like one, please look back at the previous posts).

In this final post, we get Nick’s take on what he thinks the repercussions should be for engaging in fraud, and we get a parting tip from Nick on what researchers should do when performing a literature search on papers in their field. Below are some excerpts from the conversation, which can be listened to in full here.


Repercussions for research fraud 

LO: You have mentioned that some editors have been let go from their positions as editors – are there any other repercussions for getting involved with fraud? 

NW: Often, institutions are the worst in terms of responding. Recently, I was at the World Conference on Research Integrity in Athens and spoke to other investigators like me, including publishers and people in the research integrity space. Some publishers have informed me that even when they want to make a retraction and have gone to the author’s or editor’s institution to inform them that a staff member has been involved with fraud, often the institution doesn’t reply at all, or even if they do, they will not do anything. They are very defensive, and they do not want any bad publicity for the institution and so they will not respond at all. Even in a well-regarded Western university where someone has been caught fabricating their data, the response could just be that they have been relieved of teaching duties for six months, but they’ve kept their job and there will be no publicity that we know of.  

In Spain, a professor who has just been made Rector, the Head of the University of Salamanca, the oldest university in Spain, has been linked to questionable publication practices for the last decade or so. He was found to have his name on an incredible number of papers which have been cited an incredible number of times, including by people who don’t exist. There has been a fight in the Spanish press to try to highlight this. But despite all this press, including national press in Spain, this person has become the Rector of the University of Salamanca. And it’s basically the same the world over: institutions very much go into protection mode even if publishers have agreed on retracting the papers. Often there are no career repercussions at all. Sometimes, they will just go and be editor of a different journal or for a different publisher. 

LO: In your opinion, what should happen to an academic or researcher who has engaged in fraud? 

NW: I think it really depends on the nature of the fraud and the position that the researcher holds. If a PhD student has done something and if they have been caught after, say, the first offence, then I think there should be leniency. Regardless of whether they have bought an authorship or tried to fake some data, they still have a way out and it should be offered to them. Again, a lot of the drive for PhD students faking some data is because their P.I. (Principal Investigator) is demanding results, demanding that things happen faster, or demanding ground-breaking results. At some point, people become desperate. Most people don’t go into science wanting to fake stuff. With such cases, it can often be a sign that there’s a real problem in the lab or in the group. Why else would someone feel so compelled to do this? If the pressure is coming from the university demanding papers from them, then it’s the problem with the university. A lot of this drive is external to researchers. But if you have someone that is a tenured professor who has been doing this for a long time and they have been caught out on a decade or more of fabricated results, that feels like it should be the end of the road. It really depends on the nature of what has been done, the stage of career of the person, and how much fraud has been committed. 

LO: Do you ever worry about being called out or being sued for defamation? 

NW: I have thought about it, and I try to err on the side of caution and make sure that there is fairly hard evidence for anything I say publicly. You can have suspicions without saying anything publicly – you would just go to the publisher. But when I find an advert for a named paper and then six months later a paper with that same title is published, then it is clear cut that someone should investigate. But fortunately, so far, I have not been threatened with anything. 

I think it is also partly due to the fact that accusing people of making up their data is more personal. When authorship is bought, by the time I find it, some of these people would have already got what they needed. If they needed to have a publication in order to graduate, once they have graduated, they do not care if the publication is retracted. Often when you read a retraction notice after the authorship has been sold, they will normally say that none of the authors responded. This may also be down to the fact that they know that they have been caught but there is nothing to defend. But when you are accusing someone of making up data, I think that is a far more personal attack. When someone has bought authorship, they do not have a personal connection to the paper, so they move on. They are probably annoyed, but they cannot do anything about it. 

Parting advice

LO: To end, are there any takeaways that you would like to share? 

NW: I would encourage all researchers to download the PubPeer plugin, which means that whenever they are looking at a paper, it will flag whether there are any comments about that paper on PubPeer, or indeed any comments about the papers it references. If someone else has found a problem with that paper, they can just quickly go and check and be more informed. 


We are grateful to Dr Nick Wise for sharing his perspective on aspects of the publishing industry and research culture that many of us are not privy to. Nick has highlighted many issues which raise pressing concerns for research integrity. We thank him for his time speaking with us and we hope that readers will take his advice on using PubPeer when they embark on literature searching (and of course, refrain from committing fraud, lest they have Nick on their case).