All posts by Lutfi Bin Othman

5000 datasets now in Apollo

Written by Clair Castle, Dr Kim Clugston, Dr Lutfi Bin Othman, Dr Agustina Martínez-García. 

 How the ‘second life’ of datasets is impacting the research world. Researchers share their stories.

“Research data is the evidence that underpins all research findings. It’s important across disciplines: arts, humanities, social sciences, and STEMM. Preserving and sharing datasets, through Apollo, advances knowledge across research, not only in Cambridge, but across the world – furthering Cambridge’s mission for society and our mission as a national research library.”

Dr Jessica Gardner, University Librarian & Director of Library Services

The data produced and collected during research takes many different forms: numerical data, digital images, sound recordings, films, interview transcripts, survey data, artworks, texts, musical scores, maps, and fieldwork observations. Apollo collects them all.

Apollo is the University of Cambridge repository for research datasets. Managed by the Research Data team at Cambridge University Library, Apollo stores and preserves the University’s research outputs.  

The Research Data team guides researchers through all aspects of research data management – how to plan, create, organise, curate and share research materials, whatever form they take – and assists researchers in meeting funders’ expectations and good research practice.  

In this blog post, upon reaching our 5000 datasets milestone, we share researcher stories about the impact their datasets have had, and continue to have, across research – and explain how researchers at the University can benefit from depositing their datasets on Apollo.

“Sharing data propels research forward. It recognises the importance of the original datasets in their own right, and the researchers who worked on them. Many of the research funders supporting work at the University of Cambridge require that research data is made openly available with as few restrictions as possible. Our researchers are fully supported to do this with Apollo and the Research Data team. I’m really excited that Apollo has reached the 5000 dataset milestone.”

Professor Sir John Aston, Pro Vice-Chancellor for Research at the University of Cambridge 

Why should researchers share their research outputs on a repository?  

Making research data openly available is recognised as an important aspect of research integrity and in recent years has garnered support from funders, publishers and researchers. Open data supports the FAIR principles and many funders now include data sharing practices within their policies as part of the application process. Publishers and funders often require a data availability statement (DAS) to be included in publications. It is worth mentioning (including in a DAS) that there are situations where data cannot be shared, particularly if data contains personal or sensitive information or where there is no permission to share it. But a lot of data can be shared and this movement towards open data promotes greater trust, both among researchers and for engagement with the general public.   

Illustration of why it is good to share research data. The illustration is explained in the text of the blog immediately below.

In the UK, funding bodies often mandate openly sharing the data supporting their research grants. A large proportion of funding for research is from taxpayers’ money or charity donations so making data available openly for reuse provides value for money. It also allows the data behind claims to be accessed for traceability, transparency and reproducibility. Open data increases efficiency, as it prevents work being repeated that may have already been done; for this reason, it is encouraged to publish negative results too. Publishing data gives researchers credit for the work they have done, giving them more visibility in their field, and increasing the discoverability of their research which could lead to potential collaborations and increased citations. Open data also means that researchers have access to valuable datasets that could educate, enhance and further their research when applied by practitioners worldwide.  

The second life of data   

Apollo supports data from all disciplines, and this is reflected in the variety of formats that the repository holds in its collection – from movie files, images, audio recordings and code to the more common text and CSV files. The repository now also hosts methods. Researchers are encouraged to deposit these outputs in the repository to facilitate the impact and reuse of the data underlying their research, and so that their research data can be cited as a form of scholarly output in its own right. In 2023, there were over 95,000 views of datasets, software and associated metadata items on Apollo, and over 37,000 files were downloaded (source: IRUS). This shows that datasets and software deposited on Apollo are easy to discover and are highly used.

One example is a dataset deposited by Douglas Brion at the end of his PhD in the Engineering department. Brion’s dataset, titled Data set for “Generalisable 3D printing error detection and correction via multi-head neural networks” has been downloaded 2,600 times. This dataset has also been featured in 20 online news publications (including in the University’s Research blog) and has an Altmetric attention score of 151. Brion’s dataset is also one of the larger outputs on Apollo, comprising over 1.2 million labelled images and over 900,000 pre-filtered images.   

The open availability of Brion’s data that can be used to train AI (a significant trajectory for research currently) is welcomed by researchers such as AI specialist Bill Marino, a PhD candidate and Data Champion from the Department of Computer Science and Technology: “It’s really important that AI researchers are able to reproduce each other’s findings. The opaque nature of some of today’s AI models means that access to data is a key ingredient of AI reproducibility. This effort really helps get us there.”   

Brion considers that sharing his data “has significantly enhanced the impact and reach” of his research and that “it has increased the visibility and credibility of my work, as other scientists can validate and build upon my findings.” On the benefits of depositing data on a repository, he says that sharing “ensures that the data is preserved and accessible for the long term, which is crucial for reproducibility and transparency in research”. He adds, “Repositories often provide metadata and tools that make it easier for other researchers to find and use the data”, which “promotes a culture of openness and collaboration, which can accelerate scientific discovery and innovation.”  

Photo of a researcher searching Apollo, the University of Cambridge repository, on a computer.

Research data supporting “Regime transitions and energetics of sustained stratified shear flows” is a dataset from another depositor, Adrien Lefauve, from the Department of Applied Mathematics and Theoretical Physics, and consists of MATLAB code and accompanying movie files. Lefauve is, in fact, a frequent dataset depositor, with 10 datasets published in Apollo. He considers that data sharing gives his data “a second life” by allowing researchers to reuse his data in pursuit of new projects but admits that “there is also a selfish reason for doing it!”. He explains: “After several months or years without having worked on a dataset, I sometimes need to go back to it, either by myself or when I hand it over to a colleague or student to test new ideas. Having a well-structured, user-friendly and thoroughly documented dataset is invaluable and will save you a lot of time and frustration when you need to resurrect your own research.”

Lefauve’s dataset has been cited in other publications and he encourages other researchers to look at his datasets and reuse them: “When people see that datasets can be cited in their own right and attract citations, it can encourage them to make the extra effort to deposit their data”. Lefauve is an advocate for sharing data on a repository and in his view data sharing is: “not only important for research integrity and reproducibility, but it also ensures that research funds are used efficiently. My datasets are usually from laboratory experiments which can take a lot of time and resources to perform. Hence, I feel there is a duty to ensure the data can be used to the fullest by the community. It also helps build a researcher’s profile and credentials as a valuable contributor to the community, beyond simply publication output, which often only use a small fraction of a dataset.”  

Lefauve describes his field (fluid mechanics) as one that has benefited from the explosion of open data that is made available to the research community, but he is also aware that for a dataset to be reused, it requires comprehensive documentation and curation. Lefauve hopes that sharing data in a repository “will become increasingly commonplace as the next generation is taught that this is an essential part of data-intensive research.”  

How to deposit data on Apollo, and why choose Apollo 

There are thousands of data repositories to submit data to, so how do you choose the right one? Funders may specify a disciplinary or institutional repository (see re3data.org for a directory of research data repositories). Members of the University of Cambridge can deposit their data in the institutional repository, Apollo. Apollo has CoreTrustSeal certification, which means it has met the 16 requirements to be a sustainable and trustworthy infrastructure. Research outputs can be deposited as several types, such as dataset, code or method.

We have a step-by-step guide to uploading a dataset, which is submitted through Symplectic Elements, the University’s research information management system. There is also a helpful information guide about Symplectic Elements on the Research Information SharePoint site. The Research Data team are on hand to help researchers with any queries they might have during this process.

The importance of good metadata  

Researchers may think that the files are the most important aspect when depositing a dataset, but we cannot emphasise enough the importance of providing good metadata (data about data) alongside the files. This is an area where we find researchers need some encouragement, but we hope that the experiences of the researchers featured above highlight the importance of good metadata for their data. No one knows their data better than the person who generated it, so they are in the best position to describe it. A good description of a dataset enables users with no prior knowledge of it to discover, understand and reuse the data correctly, avoiding misinterpretation, without having access to the paper it supports.

Be aware that others may discover a dataset in isolation from the paper it supports: we recommend that researchers avoid referring to the paper, or using the paper’s abstract, to describe their dataset. An article abstract describes the contents of the article, not of the dataset. It can also be really useful for researchers to describe their methods and how their files are organised, for example by providing README files. These give the dataset context as to how the data was generated, collected and processed. Good metadata will also enhance a dataset’s discoverability.
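As an illustration, a minimal README accompanying a dataset might look something like the sketch below. The headings, file names and folder structure here are hypothetical, not a prescribed template – adapt them to your own data:

```
Dataset title:  [full title as it appears on Apollo]
Authors:        [names and ORCID iDs]
Description:    Brief summary of what the data are and why they were collected,
                written so it can be understood without reading the paper.

Methods:        How the data were generated, collected and processed
                (instruments, software and versions, experimental conditions).

File organisation (example):
  raw/          unprocessed measurements (CSV)
  processed/    cleaned data used in the analysis
  scripts/      code used to process and plot the data

Variable definitions:  column names, units and allowed values for each file.
Licence:               e.g. CC BY 4.0
Related publication:   citation and DOI of the supporting article, if any.
```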

Another benefit of sharing data on Apollo is that our datasets are indexed in Google’s Dataset Search, a search engine for datasets. It is best practice to cite any datasets used in research in the bibliography/reference list of the paper, thesis, etc. In fact, there is new guidance for Data Reuse on the Apollo website which describes how to use Apollo to discover research data and how to cite it. We advise that researchers start doing this now (if they don’t already) so they get into a good habit: it will encourage others to do the same, make it a lot easier for others to reuse data, and help researchers receive recognition for it. Citation data for datasets are displayed on Apollo, and alongside this it is possible to track the attention that a dataset receives via an Altmetric Attention Score.
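A dataset citation in a reference list generally follows the same pattern as an article citation, with the resource type, repository and DOI included. The entry below is purely illustrative (the author, title and DOI suffix are invented), but Apollo DOIs do use the 10.17863/CAM prefix:

```
Smith, J. (2023). Research data supporting "Example article title" [Dataset].
Apollo - University of Cambridge Repository. https://doi.org/10.17863/CAM.00000
```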

Apollo repository key milestones  

Illustration of Apollo repository key milestones represented as a timeline. The illustration is explained in the text of the blog immediately below.

Since its inception in 2016, when it started minting DOIs (Digital Object Identifiers), Apollo has continued to hit milestones and develop into the robust, safe and resilient repository infrastructure that it is today.  

Apollo has continued to support the FAIR principles by incorporating new and critical functionality to further enhance discovery, access and long-term preservation of the University research outputs it holds. For example, integration with our CRIS (Current Research Information System), Symplectic Elements, has streamlined the depositing process, and integration with the Jisc Publications Router automatically deposits metadata-rich records in Apollo (2016, 2019, 2021).

By 2020, 2,000 datasets had been deposited in Apollo. DOI versioning was enabled in 2023, and Apollo began accepting more research output types than ever before, such as methods and preprints. A major milestone was reached in 2023 when Apollo achieved CoreTrustSeal certification and status as a trustworthy repository.

The latest milestone (2024–25) is for research outputs published within Octopus, a novel approach to publication, to be preserved in Apollo together with associated publications and the underpinning research datasets, to facilitate sharing and reuse. In future we want to develop our ability to collect and interpret data citation statistics for Apollo so we can better assess the impact of the research data generated at the University.

How we can support researchers  

The Research Data team is here to help!   

We can be contacted by email at info@data.cam.ac.uk. Researchers can also request a consultation with us to discuss any aspect of their research data management (RDM) needs, including data management plans, data storage and backup, data organisation, data deposition and sharing, funder data policies, or to request bespoke training.   

Remember that there is also an amazing network of Data Champions that can be called upon for advice, particularly from a disciplinary perspective.  

We deliver regular RDM training as part of the Research Skills Programme.   

Finally, there is our Research Data website for comprehensive advice and information.   

Data Diversity Podcast #3 – Dr Nick H. Wise (4/4)

Thank you for staying with us throughout this four-part series with Dr Nick Wise, a scientist and engineer who has made his name as a scientific sleuth. By now, it is hoped that he needs no introduction (though if you would like one, please look back at the previous posts).

In this final post, we get Nick’s take on what he thinks the repercussions should be for engaging in fraud, and we get a parting tip from Nick on what researchers should do when performing a literature search on papers in their field. Below are some excerpts from the conversation, which can be listened to in full here.


Most people don’t go into science wanting to fake stuff. With such cases, it can often be a sign that there’s a real problem in the lab or in the group. Why else would someone feel so compelled to do this? If the pressure is coming from the university demanding papers from them, then it’s the problem with the university. 


Repercussions for research fraud 

LO: You have mentioned that some editors have been let go from their positions as editors – are there any other repercussions for getting involved with fraud? 

NW: Often, institutions are the worst in terms of responding. Recently, I was at the World Conference on Research Integrity in Athens and spoke to other investigators like me, including publishers and people in the research integrity space. Some publishers have told me that even when they want to make a retraction and have gone to the author’s or editor’s institution to inform them that a staff member has been involved in fraud, often the institution doesn’t reply at all, or if they do, they will not do anything. They are very defensive; they do not want any bad publicity for the institution, and so they will not respond at all. Even in a well-regarded Western university where someone has been caught fabricating their data, the response could just be that they are relieved of teaching duties for six months, but they keep their job and there is no publicity that we know of.

In Spain, a professor who has just been made Rector, the head of the University of Salamanca, the oldest university in Spain, has been linked to questionable publication practices for the last decade or so. He was found to have his name on an incredible number of papers which have been cited an incredible number of times, including by people who don’t exist. There has been a fight in the Spanish press to try to highlight this. But despite all of this press, including national press in Spain, this person has become the Rector of the University of Salamanca. And it’s basically the same the world over: institutions very much go into protection mode even if publishers have agreed to retract the papers. Often there are no career repercussions at all. Sometimes they will just go and be editor of a different journal or for a different publisher.

LO: In your opinion, what should happen to an academic or researcher who has engaged in fraud? 

NW: I think it really depends on the nature of the fraud and the position that the researcher holds. If a PhD student has done something and they have been caught after, say, the first offence, then I think there should be leniency. Regardless of whether they have bought an authorship or tried to fake some data, they still have a way out and it should be offered to them. Again, a lot of the drive for PhD students faking data is that their P.I. (Principal Investigator) is demanding results, demanding that things happen faster, or demanding ground-breaking results. At some point, people become desperate. Most people don’t go into science wanting to fake stuff. With such cases, it can often be a sign that there’s a real problem in the lab or in the group. Why else would someone feel so compelled to do this? If the pressure is coming from the university demanding papers from them, then it’s the problem with the university. A lot of this drive is external to researchers. But if you have someone who is a tenured professor, who has been doing this for a long time and has been caught out on a decade or more of fabricated results, that feels like it should be the end of the road. It really depends on the nature of what has been done, the stage of career of the person, and how much fraud has been committed.

LO: Do you ever worry about being called out, or being sued for defamation?

NW: I have thought about it, and I try to err on the side of caution and make sure that there is fairly hard evidence for anything I say publicly. You can have suspicions without saying anything publicly – you would just go to the publisher. But when I find an advert for a named paper and then six months later a paper with that same title is published, then it is clear cut that someone should investigate. But fortunately, so far, I have not been threatened with anything. 

I think it is also partly due to the fact that accusing people of making up their data is more personal. When authorship is bought, by the time I find it, some of these people will have already got what they needed. If they needed to have a publication in order to graduate, then once they have graduated, they do not care if the publication is retracted. Often, when you read a retraction notice after authorship has been sold, it will say that none of the authors responded. This may also be down to the fact that they know they have been caught and there is nothing to defend. But when you are accusing someone of making up data, I think that is a far more personal attack. When someone has bought authorship, they do not have a personal connection to the paper, so they move on. They are probably annoyed, but they cannot do anything about it.

Parting advice

LO: To end, are there any takeaways that you would like to share? 

NW: I would encourage all researchers to download the PubPeer plugin, which means that whenever they are looking at a paper, it will flag whether there are any comments about that paper on PubPeer, or indeed about any of the papers it references. If someone else has found a problem with a paper, they can quickly go and check and be more informed.


We are grateful to Dr Nick Wise for sharing his perspective on a side of the publishing industry and research culture that many of us are not privy to. Nick has highlighted many issues which raise pressing concerns for research integrity. We thank him for his time speaking with us and we hope that readers will take his advice on using PubPeer when they embark on literature searching (and of course, refrain from committing fraud, lest you have Nick on your case).

Data Diversity Podcast #3 – Dr Nick H. Wise (3/4)

Welcome back to the penultimate post featuring Dr Nick H. Wise, Research Associate in Architectural Fluid Mechanics at the Department of Engineering, University of Cambridge. If you have been with us for the previous two posts, you will know that besides being a scientist and an engineer, Nick has made his name as a scientific sleuth who, according to a 2022 article on the blog Retraction Watch, is responsible for more than 850 retractions, leading Times Higher Education to dub him a research fraudbuster. Since then, through his X account @Nickwizzo, he has continued his investigations, tracking cases of fraud and, in some cases, naming and shaming the charlatans. In this four-part series, we learn from Nick about some of the shady activities that taint the scientific publishing industry today.

In part three, we learn from Nick how researchers try to generate more citations from a single piece of research through a trick called ‘salami slicing’, and about the blurred line between illegality and desperately trying to meet the unrealistic expectations of academia (to the point of engaging in fraud). Below are some excerpts from the conversation, which can be listened to in full here.


Citation count was once a proxy for quality and now it is citation count regardless of quality. People are only looking at the citation count, and not the actual quality. Actually assessing quality takes a lot more effort. 


‘Salami slicing’ and the Game of Citations

LO: What do you think is better for science? A slower, more thoughtful process of publishing and everything in between? Or more information, more research, but then things like fraud slip through and occur more frequently?

NW: I don’t think there’s necessarily more research. Another phenomenon that paper mills take advantage of is salami slicing. Imagine you have completed a research project. Now you could write this up as one thirty-page paper or two twenty-page papers. You could write two comprehensive papers, or try to put out multiple ten-page papers where some minor parameters are changed. I see this happening in nanofluids research because it is an area of research close to mine. A nanofluid is simply a base liquid – it might be water, it might be ethanol – into which you mix very small nanoscale particles of some other material, such as gold, silver, or iron oxide. And in this sort of mixture of liquid and particles, you want to investigate its fluid flow and describe this with some differential equations. You can use computers to solve the differential equations and then plot some results about velocity profiles and heat transfer coefficients, etcetera. Now, you could write a paper for a given situation where you say: I’m not going to specify the liquid, but here is a general density and viscosity for this liquid. If you want to apply this to your own research, you plug in the density and viscosity of your liquid, and likewise the particles. I’m not going to specify which particles are used, because all that changes is their density and their heat transfer properties. So that’s one way you could do it.

Another way to do it is to say: I’m going to write a paper about water and gold particles; that’s one paper. Then you can write another paper which has water and silver particles, and then you can write one with ethanol and iron oxide, and there are so many varieties. You can also vary the geometry that this flow is going around, and you can add in an electric field and a magnetic field, etcetera. You can build up in this n-factorial way. There are thirty possible liquids multiplied by a hundred possible particles, multiplied by however many geometric configurations. You can see that this is what they are doing. Rather than writing a few quite general, comprehensive papers, they are writing hundreds of very specific papers, which enables them to produce more papers, sell more authorships and put more citations in. But with this overwhelming number of papers being produced, there are still only so many peer reviewers and so many editors. And this phenomenon happens in lots of fields: they find something where there are just these variables that let them keep writing almost the same paper. Yet the paper is original. It has not been done before. It is incredibly derivative, but that is not necessarily a barrier to publication.

LO: What I’m getting from this is that this is part of the whole system, and the issue at hand is definitely enabled by certain motivations, like getting more citations. You can take one big piece of salami and publish it whole, or you can slice the salami thirty ways. And if they are in the position to slice the salami, they say why not, I suppose, right? A game is there to be played.

NW: Right, they are playing the game that is in front of them. And again, there are people who do this who are not from a paper mill. They just want to maximize the number of citations and publications. The question is why are they doing this? Why do they want to maximize their publications? Because they want a promotion, or they want a tenured job. There are also countries where you get a cash reward for publishing a paper in a good journal so the more papers you publish, the more money you get paid. Your government might have told all the universities that they need to increase their ranking in the World University rankings. How do you do that? By increasing your research output and the citations you get. That is another driver. These drivers come from all sorts of places but there is always an emphasis on numbers. Citation count was once a proxy for quality and now it is citation count regardless of quality. People are only looking at the citation count and not the actual quality. Assessing quality takes a lot more effort.

LO: Citations used to be a proxy for quality, but that is not the case anymore. But it still implies the quality of the research, or you would hope.

NW: You would hope, but only because there is an assumption that the only reason something has a lot of citations is that it is good quality. Citations are also easier to count; quality is much harder to assess, and that incentivizes people to do things like cite their colleagues. Again, you could still track it if people from the same university were citing each other. But then you get bigger-scale things, with middlemen who organize people from across the world to cite each other, or who just do it for cash. If you are producing papers to order, each one of those papers has a reference section, which is real estate. You can throw in some genuine references which are relevant to the paper, but you can also throw in some irrelevant references that someone paid you to include. You can also pay someone to include references that are actually relevant to a topic.

LO: If it is relevant to a topic, it is almost like merely encouraging someone to be aware of certain work as opposed to a scam, which sounds like a gray area.

NW: Well, I would say that as soon as someone is paying money, then it starts to be illegitimate. But I mean if someone emails you and says “I’ve just published this paper, I think you might be interested, it’s in your research field: maybe read it or maybe you do cite it”, it’s different from someone emailing you to say “I’ll pay you £50 if you cite my paper” and you do. Then I would say that you have crossed a line. So, it does get very gray. Then there are these organized paper mills who are doing this as a business and that is where I think it becomes quite clear that it is probably not legitimate.

Facebook (authorship) marketplace

NW: You could go on Facebook and there are people selling authorship of their paper as a one off. There are PhD students in some country with no research funding who say “it costs $2500 for the article processing charge for me to publish where I would like to publish, I do not have $2500 so if you pay the $2500, you can be first author on the paper” and that is the only way they can get their paper published. They’re not doing this as a business, they’re just doing this once for this one paper. And you get people responding. Quite often professors or more established academics with access to budgets are the ones who will say yes. And the only thing that the person has done is to provide the funding for the publication.

The minimum thing that one is supposed to have done to be considered an author is to have either written the draft or reviewed and edited the paper. You might have also done data analysis or conceptualization. I think we would agree that if all this person does is just pay the fee for publication, then that is not acceptable. But what if they read the paper and then made a couple of comments? Now they have reviewed and edited it, and so now they have done review, editing and funding. There are many big labs around the world that have some very senior scientist whose name is on every single paper that comes out of the lab. And what have they done? Well, they provided all the funding, and they have reviewed the paper. I bet there are some who have barely glanced at the paper. But let’s say that they have reviewed the paper, and they provided the funding for the publication. Is that what makes it different to the person on Facebook who has found some random professor from another country to pay for their publication? Where is the difference? I don’t think it is an easy line to draw. In this way, the move to Open Access publishing requiring large fees for publication has also driven quite a bit of this phenomenon.

LO: It also seems like you have developed a bit of empathy. Maybe you’ve looked at so many cases and you see that it’s not always clear.

NW: Absolutely. Again, if you have the people running a paper mill, or if you have some professor who is being bribed and waving through dozens of papers, I don’t have much empathy for them. But the Masters or PhD student who has been told that they have to publish papers to get their PhD or even their Masters, who has this demand placed on them, or who has produced a paper but needs all this money to get it published, I don’t blame them for what they’re doing. It’s the situation they’ve been placed in. It is the system that they are part of. I have a lot of empathy for them.


Look out for the final post coming next week, where we get Nick’s take on what he thinks should be the repercussions for engaging in fraud, and we get a parting tip from Nick on what researchers should do when performing a literature search on papers in their field.