Data Diversity Podcast #3 – Dr Nick H. Wise (2/4)

We are back again with our second blog post featuring Dr Nick H. Wise, Research Associate in Architectural Fluid Mechanics at the Department of Engineering, University of Cambridge. As is the theme of the Data Diversity podcast, we spoke to Nick about his experience as a researcher, but this is a special edition of the podcast. Besides being a scientist and an engineer, Nick has made his name as a scientific sleuth who, according to a 2022 article on the blog Retraction Watch, is responsible for more than 850 retractions, leading Times Higher Education to dub him a research fraudbuster. Since then, through his X account @Nickwizzo, he has continued his investigations, tracking cases of fraud and, in some cases, naming and shaming the charlatans.

In this four-part series, we will learn from Nick about some of the shady activities that taint the scientific publishing industry today. In this second part, we get Nick’s take on the peer review process and fake research data, and I ask his opinion on where the fault lies in the publication of fraudulent research. Below are some excerpts from the conversation, which can be listened to in full here.




On the peer review process

LO: As an Early Career Researcher, scientist, engineer, and researcher yourself, is your trust in the whole system still intact? Do you still see value in the peer review process? 

NW: It has absolutely changed how I read a paper and how I view particular journals. When you see a problem happening in a journal that you have read in your research or a journal you have considered submitting to, it really gives you pause for thought. There is an entire ecosystem of journals, right from the very good down to the very bad, that are implicated. There are indices like Scopus or Web of Science or SCI, all these different bodies who claim journals are trustworthy, but every journal is going to get attacked by fraud and some will slip through. It is what you do afterwards that matters. Another phenomenon that particularly happens with publishers with a wide list of journals is that the paper mill will legitimately buy the journal. They may even take it over in a hostile way: they will make a clone of the journal and the website, and they will even redirect the publisher’s link to a different website. They now control a journal that is officially on this trustworthy list. Now they have a short period of time before someone notices and in that time, they will try to publish as many papers as possible and charge everyone for publication. They will absolutely cram this journal with any content. It does not even have to be relevant to the topic because they’re fully in control of the whole process up until the publisher notices and removes the journal from the list. An author who needs a paper published in a well-regarded journal has achieved what they needed, but as soon as the journal is removed from the list, the publication becomes worthless. But there is a large supply of these journals, and they will keep trying to take them over. This tends to happen with low tier journals, but there are also paper mills which are targeting journals with an impact factor of over five, over ten – the supposedly absolute top tier journals. 

Between incompetence and conspiracy

LO: These days, fraud is so convincing, scams are so rampant, and they always target your insecurities, the insecurity here being authors who want citations. 

NW: I would say that it is not a scam or fraud for the researcher, in the normal sense. These people are selling citations, and the buyer gets citations as opposed to someone getting cheated for their money and getting nothing in return. They are scamming the publishers and scamming the scientific community, but they are not scamming an actual person paying the money. It is a business that is operating as it says it is.  

LO: What does it say, though, that fraudulent papers are still getting through the peer review process? It’s still quite a long way from first draft to publication, and we have seen some cases where remnants of text from ChatGPT replies like “as a large language model…” get through the review process. In your mind, what does it say about the industry? What’s happening here? 

NW: I think that it is somewhere between incompetence, people in a rush, and peer reviewers being bypassed or being paid. They could also be colluding with authors or the paper mill. To be fair, there are dodgy things that get through a legitimate peer review in the first place. All the peer reviewers are independent but how many people read every single word of a paper they peer review? Not everyone. People have different standards that they hold themselves to. There is no agreed standard of what you are supposed to do to peer review a paper. As I’m sure anyone who has received peer review reports would know, sometimes you receive a five-page PDF document with hundreds of bullet points, and sometimes you receive a paragraph which maybe took them half an hour to put together. Legitimate peer reviewers could just not do a good job. Then there are also people who pride themselves on doing a load of peer reviews, and in fact you can get certificates from the publisher about how many peer reviews you do. There are people who say they peer review nearly a paper a day – I doubt that they are doing a great job at it.  

Even if someone is reading the text, how much is a peer reviewer supposed to be checking the data? Should someone be trying to run statistical analysis to see if the data have been fudged? Should they be spotting that the image is manipulated? Is that something we should expect the peer reviewer to be doing? Or should a peer reviewer go into a review assuming the work is honest? It becomes a different process if you are also thinking about whether a piece of work is fraudulent or not. The easiest things to find are the people who are very lazy or very incompetent and there is just something that is so blatant that it is hard to miss. But if most people are trying to cover their tracks, then it comes down to just how well they have managed to do that. Again, if you are including remnants of ChatGPT like “as a large language model” in your text, you are either extremely lazy, or maybe you don’t read English. But if someone got rid of that bit, you would not notice from reading the abstract. You might think this is a bit bland, but people can write bland text; that is allowed. 

Sometimes peer reviewers are definitely compromised, and I don’t know what the balance is. When you see a bad paper, say a paper with an obvious problem or with ChatGPT remnants lying around: is that bad peer reviewing or have they been paid not to notice, or even not to do it? I don’t know what the balance is there. I suspect it is more on the bad peer reviewing side than the criminal or the fraudulent to be honest, but I don’t know. There are times when you think OK, well, maybe they were paying the peer reviewers but did the editor look through this? Did the copy editor? We might want to think that copy editors and typesetters are going through and questioning things like this. It really depends on the journal. I have had things come back where they have gone through and changed from a comma to a dash, so they are clearly going through everything character by character. And there are other journals where the typesetter is clearly just taking everything with no thought. Their job is just to transfer what they have been given into the journal paper and they don’t do any spell checking or checking for grammar or anything. But should that be their job? I don’t know. Then there are journals where the only priority appears to be publishing as many papers as quickly as possible. And if you have made that your priority, even if everyone is acting in good faith, you are going to let a lot more things through. If you are just trying to push everything out the door and do things as quickly as possible, you are not going to give the things as much scrutiny. 

Fake research data

Even from doing my own research, I’ve realized that it would be very easy to fake some data. It would be very hard for anyone who wasn’t in the lab to know if data has been faked. There is no real way for someone to check. Even if you go open data, one experiment might need a few gigabytes of video footage to produce one data point. You can say what you have done to produce that data point, but for someone to go and check its validity, they would in theory need access to gigabytes and gigabytes of data that is not shared. But yes, there have been some things where it has been very easy to check. For instance, in materials science, there are lots of experiments which result in a spectrum diagram, basically producing a squiggly line on a graph. One thing that would always be true, and you don’t need any subject expertise to know this, is that the line should not double back on itself. Every X value should have one Y value. Well, if you are faking this by drawing it by hand with a mouse, it is quite hard to not double back and there are plenty of published spectra which have bits where a peak bends over. And it is clearly because someone has drawn it by hand, and some of them are very bad. And that is again where you question what is happening with peer review because it is obvious that something is wrong. Sometimes they will even go outside the lines of the bounding box. I do see some of those because they are quite easy to spot. 
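
As a rough illustration of the "every X value should have one Y value" check described above, here is a minimal Python sketch. It assumes the curve has already been traced from a figure into an ordered list of (x, y) points; the function name and the sample points are hypothetical and are not taken from any tool mentioned in the conversation.

```python
def is_single_valued(points):
    """Return True if a traced curve never doubles back along the x-axis.

    `points` is a sequence of (x, y) pairs in the order the line was drawn
    (a hypothetical digitisation of a published spectrum).
    """
    xs = [x for x, _ in points]
    # Differences between consecutive x values along the trace.
    steps = [b - a for a, b in zip(xs, xs[1:])]
    # A real measurement moves steadily in one direction along x, so the
    # steps are either all positive or all negative.
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


# Made-up example traces: the second one has a peak that "bends over".
clean_trace = [(0, 0), (1, 2), (2, 5), (3, 1)]
bent_trace = [(0, 0), (1, 2), (2, 5), (1.8, 6), (2.5, 1)]

print(is_single_valued(clean_trace))
print(is_single_valued(bent_trace))
```

A check like this only catches the blatant hand-drawn cases; a curve that passes is not thereby shown to be genuine.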


Stay tuned as we release the third conversation with Nick next week. In the penultimate post, we learn from Nick about how researchers try to generate more citations from a single piece of research using a trick called ‘salami slicing’, and about the blurred line between illegality and desperately trying to meet the unrealistic expectations of academia, to the point of engaging in fraud.

Data Diversity Podcast #3 – Dr Nick H. Wise (1/4)

In our third instalment of the Data Diversity Podcast, we are joined by Dr Nick H. Wise, Research Associate in Architectural Fluid Mechanics at the Department of Engineering, University of Cambridge. As is the theme of the podcast, we spoke to Nick about his experience as a researcher, but this is a special edition of the podcast. Besides being a scientist and an engineer, Nick has made his name as a scientific sleuth who, according to a 2022 article on the blog Retraction Watch, is responsible for more than 850 retractions, leading Times Higher Education to dub him a research fraudbuster. Since then, through his X account @Nickwizzo, he has continued his investigations, tracking cases of fraud and, in some cases, naming and shaming the charlatans. Nick was kind enough to share with us many great insights over a 90-minute conversation, and as such we have decided to release a four-part series dedicated to the topic of research integrity. 

In this four-part series, we will learn from Nick about some of the shady activities that taint the scientific publishing industry today. In part one, we learn how Nick was introduced into the world of publication fraud and how that led him to investigate the industry behind it. Below are some excerpts from the conversation, which can be listened to in full here.




Tortured Phrases and PubPeer: Nick’s beginnings as a Scientific Sleuth  

My background is in fluid dynamics where I mostly think about fluid dynamics within buildings. For instance, I think about the air flows generated by different heating systems and things like pollutant transport such as smells or COVID which can travel with the air and interact with each other. That was my PhD and the post-doc in the Engineering department.

About three years ago whilst trying to avoid writing my thesis, I saw a tweet from the great Elizabeth Bik, who is possibly the most famous research fraud investigator. She mostly looks at biomedical images and her great skill is that she can look through a paper and see photos of Western blots or microscopy slides and see if parts of an image are identical to other parts, or if the image overlaps with images from different papers. She has an incredible memory and ability to spot these images. She’s been doing this for over 10 years and has caused many retractions. I was aware of her work but there was no way for me to assist with that because it is not my area of research. I don’t have an appreciation of what these images should look like.

But about three years ago she shared a preprint written by three computer scientists on her Twitter account about a phenomenon they called ‘tortured phrases’. In doing their research and reading the literature, these computer scientists noticed that there were papers with very weird language in them. What they surmised was that to overcome plagiarism checks by software like Turnitin, people would run text through paraphrasing software. This software was very crude in that it would go word by word. For instance, it would look at a word and replace it with the first synonym it found in a thesaurus. It would do this word for word, which makes the text barely readable. However, the text is novel and so it will not be flagged by any plagiarism checking software. Eventually, if you as a publisher have outsourced the plagiarism checks to some software, and neither your editor nor your peer reviewer reads the text to check if it makes sense, then this will get through the peer review process without any problem and the paper would get published.  

For an example of tortured phrases: sometimes there’s not only one way to say something. Particularly if English is not someone’s first language, you don’t want to be too harsh on anyone who’s just chosen a word which just isn’t what a native speaker would pick. But there are some phrases where there’s only one right way to say it. For instance, artificial intelligence is the phrase for the phenomenon you want to talk about, and if instead you use “man-made consciousness”, that’s not the phrase you need to use, particularly if the original text said artificial intelligence brackets AI, and your text says “man-made consciousness” brackets AI. It’s going to be very clear what has happened.  
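
To make the word-by-word substitution concrete, here is a toy Python sketch of how such a crude paraphraser might behave. The two-entry thesaurus is a made-up stand-in built around the “man-made consciousness” example above; real paraphrasing widgets are not documented here and may work differently.

```python
# Hypothetical mini-thesaurus: each word maps to the first synonym
# a crude paraphrasing tool might find for it.
FIRST_SYNONYM = {
    "artificial": "man-made",
    "intelligence": "consciousness",
}


def crude_paraphrase(text):
    """Swap each word for its first listed synonym, ignoring context."""
    words = text.split()
    # Words without an entry (including "(AI)") pass through unchanged,
    # which is why the original abbreviation survives as a giveaway.
    return " ".join(FIRST_SYNONYM.get(word.lower(), word) for word in words)


print(crude_paraphrase("artificial intelligence (AI) is widely used"))
# prints: man-made consciousness (AI) is widely used
```

Because each word is replaced in isolation, a fixed technical term comes out as a phrase no specialist would write, while the untouched abbreviation still points back to the original.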

The three computer scientists highlighted this phenomenon of ‘tortured phrases’, but entirely from within the computer science field. I wondered if a similar phenomenon was happening in my own field in fluid dynamics. Samples of this paraphrasing software are freely available online as little widgets, so I took some standard phrases from fluid dynamics, the kind that would not make sense if you swapped the words around, and generated a few of these tortured phrases. I googled them and up popped hundreds of papers featuring these phrases. That was the beginning for me. 

I started reporting papers with these phrases on a website called PubPeer, which is a website for post-publication peer review. I commented on these papers and started being in conversation with the computer scientists who wrote the paper on ‘tortured phrases’ because they built a tool to scrape the literature and automatically tabulate these papers featuring these phrases. They basically had a dictionary of phrases which they knew would be spat out by the software, because some of this paraphrasing software is so crude that if you put in “artificial intelligence”, you are always going to get out “man-made consciousness” or a handful of variants. It didn’t come up with a lot of different things. If you can just search for “man-made consciousness” and it brings up many papers, you know what has been going on. I contributed a lot of new ‘fingerprints’, which is what they call the dictionary entries that they search the literature for. That is my origin story. 
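
The idea of searching text for a dictionary of known ‘fingerprints’ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the single dictionary entry comes from the example above, the function name is hypothetical, and it is not the tool built by the computer scientists.

```python
# Hypothetical fingerprint dictionary: tortured phrase -> expected term.
FINGERPRINTS = {
    "man-made consciousness": "artificial intelligence",
}


def find_tortured_phrases(text, fingerprints=FINGERPRINTS):
    """Return (tortured phrase, expected term) pairs found in the text."""
    lowered = text.lower()
    # A plain case-insensitive substring match is enough for a sketch,
    # because the crude software emits the same few variants every time.
    return [
        (phrase, expected)
        for phrase, expected in fingerprints.items()
        if phrase in lowered
    ]


sample = "We apply Man-made consciousness (AI) to predict air flow."
print(find_tortured_phrases(sample))
```

In practice the hard part is building the dictionary and getting access to full text at scale; the matching itself can stay this simple.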

On Paper Mills and the Sale of Authorships 

There is also the issue of meta-science, which has nothing to do with the text of the paper or with the data itself, but more to do with how someone may add a load of references through the paper which are not relevant, or they are all references to one person or a colleague. In that way you would be gaming the system to boost profiles, careers, and things like H-index. Because having more publications and more citations is so desirable, there is a market for this. It is easy to find online advertisements for authorship of scientific papers ranging from $100 to over $1000, depending on the impact factor of the journal, and the position of authorship you want: first authorship, seventh authorship, or whether you want to be the corresponding author, these sorts of factors. Likewise, you can buy citations.  

There are also organizations known as paper mills. For example, as an author I might have written the paper and want, or need, to make some money and so I go to this broker and say: I want to sell authorships, I’ll be author number six, but I can sell the first five authorships. Can you put me in touch with someone buying authorships? At the same time, there are people who go to them saying I want to buy an authorship, and they put two and two together acting as a middleman. Also, some of these paper mills do not want to wait for someone to come to them with a paper – they will write papers to order. They have an in-house team of scientific writers who produce papers. This does not necessarily mean that the paper is bad. Depending on where they want the paper to be published, the paper might have to be good if it is to get published. So, they will employ people with degrees, qualified people or PhD students who need to earn some money, and then they will sell the authorships and get the papers published. This is a big business. 

There is a whole industry behind it, and something I have moved onto investigating quite a lot is where these papers are going. When I identify these papers, I try to find out where they are being published, how they’re being published, who is behind them, who is running these paper mills, who is collaborating with them. Something I found out which resulted in an article in Science was that paper mills want to guarantee acceptance as much as they can. If a paper is not accepted, it creates a lot of work for them and it means a longer time before their customers get what they paid for. For example, if a paper that they wrote and sold authorships for gets rejected, they’re going to have to resubmit it to another journal. So something paper mills will do is they will submit a paper to 10 journals at once and publish with whichever journal gave them the easiest time. But still, they want to try and guarantee acceptance and one way to do that is to simply bribe the editor. I have found evidence of a paper mill bribing some editors and there have been many, at least tens, if not hundreds, of editors that have been let go or told to stop being editors by journals in the last year because they have been found to be compromised. This could be because of bribery or some other way of being compromised. This is what I try to uncover.

Although I’m not fighting this alone, it can feel like that. Publishers are doing things to some extent and they’re doing things that they can’t tell you about as well. And then there are other people like me investigating this in their free time or as a side project. Not enough of us are doing it because it is a multi-million-dollar industry that is generating these papers. More papers are being published than ever before so it is a big fight.


Stay tuned as we release the rest of the conversation with Nick over the next month. In the next post, we get Nick’s take on the peer review process and fake research data, and I ask his opinion on where the fault lies in the publication of fraudulent research. 

Data Diversity Podcast #2 – Dr Alfredo Cortell-Nicolau

In our second instalment of the Data Diversity Podcast, we are joined by archaeologist Dr Alfredo Cortell-Nicolau, a Senior Teaching Associate in Quantitative and Computational Methods in Archaeology and Biological Anthropology at the McDonald Institute for Archaeological Research and a Data Champion.

As is the theme of the podcast, we spoke to Alfredo about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the different interpersonal, and even diplomatic, skills that an archaeologist must possess to carry out their research, and how one’s relationship with individuals such as landowners and government agents might impact their access to data. Alfredo also sheds light on some of the considerations that archaeologists must go through when storing physical data and discusses some ways that artificial intelligence is impacting the field. Below are some excerpts from the conversation, which can be listened to in full here.

Lutfi Othman (LO): What is data to you?

Alfredo Cortell-Nicolau (ACN): In archaeology in general, there are two ways to see the data. In my case for example, one way to see it is that the data is the arrowhead and that’s the primary data. But then when I conduct my studies, I extract lots of morphometric measures and I produce a second level of data, which are CSV files with all of these measurements and different information about the arrowheads. So, what is the data? Is it the arrowhead or is it the file with information about the arrowhead? This raises some issues in terms of who owns the data and how you are going to treat the data because it’s not the same. In my case, I always share my data and make everything reproducible. But when I share my data, I’m sharing the data that I collected from the arrowheads. I’m not sharing the arrowheads because they are not mine to share.

This is kind of a second layer of thought when you’re working in archaeology. When you’re studying, for example, pottery residues, then you’re sharing the information of the residues and not the pot that you used to obtain those residues. There are two levels of data. Which is the actual data itself? The data which can be reanalyzed in different ways by different people, or the data that you extracted only for your specific analysis? I see data in this twofold way. This implies that there are different ways to liaise with the data. When you’re talking about the actual arrowhead or the actual pot, then you would need to liaise with all the different regional and national laws regarding heritage and how they want you to treat the data because it’s going to be different for every country and even for every region. Then, of course, when you’re using all this morphometric information, all the CSV files, the way to liaise with the data becomes different. You have to think of data in this twofold way.

On some of the barriers to sharing of archaeological data

ACN: There are some issues in how you would acknowledge that the field archaeologist is the one who got the data. Say that you might have excavated a site in the 1970s and some other researcher comes later, and they may be doing many publications after that excavation, but they are not always giving the proper attribution to the field archaeologist because they cited the first excavation in the first publication, and they’re done. Sometimes, that makes field archaeologists reluctant to share the data because they don’t feel that their work is acknowledged enough. This is one issue which we need to try to solve. Take for example a huge radiocarbon database of 5000 dates: if I use that database, I will cite whoever produced that database, but I will not be citing everyone who actually contributed indirectly to that database. How do I include all of these citations? Maybe we can discuss something like meta-citations, but there must be some way in which everyone feels they are getting something out of sharing the data. Otherwise, there might be a reaction where they think “well, I just won’t share. There’s nothing in it for me, so why should I share my data”, which would be understandable.

On dealing with local communities, archaeological site owners and government officials

ACN: When we have had to deal with private owners, local politicians and different heritage caretakers, not everyone feels the same way about everything, and you do need a lot of diplomatic skills to navigate through this because to excavate the site you need all kinds of permits. You need the permit of the owner of the site, the municipality, the regional authorities, the museum where you’re going to store the material. You need all of these to work and you need the money, of course. Different levels of discussion with indigenous communities are another layer of complexity which you have to deal with. In some cases, like in the site where we’re excavating now, the owner is the sweetest person in the world, and we are so lucky to have him. I called him two days ago because we were going to go to the site, and I was just joking with him, saying I’ll try not to break anything in your cave, and he was like, “this is not my cave. This is heritage for everyone. This is not mine. This is for everyone to know and to share”. It is so nice to find people like that. That may happen also with some kinds of indigenous communities. The levels of politics and negotiation are probably different in every case.

On how archaeologists are perceived

LO: When you approach a field or people, how do they view the archaeologists and the work?

ACN: It really depends on the owner. The one that we’re working with now, he’s super happy because he didn’t know that he had archaeology in his cave. When we told him, he was happy because he’s able to bring something to the community and he wants his local community to be aware that there is something valuable in terms of heritage. This is one good example. But we have also had other examples, for instance, where the owner of the cave was a lawyer and the first thing he thought was “are there going to be legal problems for me? If something happens in the cave, whose legal responsibility is it?” In another case there was another person that just didn’t care, she said “you want to come? Fine. The field is there, just do whatever you want.” So, there are different sensibilities to this. Some people are really happy about the heritage and don’t see it as a nuisance that they have to deal with. 

LO: How about yourself as a researcher and archaeologist: do you see yourself as a custodian of sorts, or someone who’s trying to contribute to the local heritage of the place? Or is it almost scientific and you’re there to dig?

ACN: When I approach the different owners, I think the most important thing is to let them know that they have something valuable to the local community and they can be a part of that. They can be a part of being valuable to the local community. Also, you must make it clear that it’s not going to be a nuisance for them and they don’t have to do anything. I think the most important part is letting them know how it can be valuable for the community. I usually like them to be involved, and they can come and see the cave and see what we are doing. In the end it’s their land and if they see that we are producing something that is valuable to the community then it is good for them. In this case, the type of data that we produce is the primary type of data, that is, the actual different pottery sherds, the different arrowheads, et cetera. In this current excavation, we got an arrowhead that is probably some 4,000 or 5,000 years old and you get (the land owners) to touch this arrowhead that no one in 5,000 years has seen. If you can get the owners to think of it in this way, that they’re doing something valuable for their community, then they will be happier to participate in this whole thing and to just let us do whatever we want to do, which is science.

LO: How do you store physical data? Or do you let the landowner store it?

ACN: That depends on the national and regional laws, and different countries have different laws about this. The cave where I’m working right now is in Spain, so I’m going to talk about the Spanish law, which is the one that I follow, and it’s going to be different in every country. In our case, with the different assemblages that you find, you have a period of up to 10 years where you can store them yourself in your university and that period is for you to do your research with them. After that period, they go to whichever museum they are supposed to go to, which depends on the law that says that it has to be the museum that is the closest to the cave or site where they were excavated. Here, the objects can then be displayed and the museum is the one responsible for managing them and storing them long term.

There is one additional thing: if you are excavating a site that has already been excavated, then there is a principle of keeping the objects and assemblages together. For example, there is this cave that was excavated in the 1950s and they stored all the assemblages in the Museum of Prehistory of Valencia, which was the only museum in the whole region. Now, they excavated it again a few years ago and now there are museums that are closer to the cave, but because the bulk of the assemblages are in Valencia and they don’t want to have them separated in two museums, the new finds still have to go to Valencia. This is the principle of not having the assemblages separated and it is the most important one.


As always, we learn so much by engaging with our researchers about their relationship with data, and we thank Alfredo for joining us for this conversation. Please let us know how you think the podcast is going and if there are any questions relating to research data that you would like us to ask!

Thomas Roulet on sustainable publishing models

Knowledge Rights 21 recently published a short video by Thomas Roulet, Professor of Organisational Sociology and Leadership at the Judge Business School at the University of Cambridge. In it, Prof. Roulet discusses the operations of M@n@gement, the no-fee open access journal published by L’Association Internationale de Management Stratégique (AIMS). The journal is a good example of the turn to community-led forms of open access publishing and how publishing can be organised by communities and sustained by professional associations.

This video is reproduced under a CC BY licence and with the permission of Prof. Roulet. The original video was shared on the Knowledge Rights 21 blog here: https://www.knowledgerights21.org/video/sustainable-publishing-models-thomas-roulet/

Thoth Archiving Network goes live at Cambridge 

Dr Agustina Martínez-García, Head of Open Research Systems, Digital Initiatives

Cambridge University Library (CUL) is piloting participation in the Thoth Archiving Network, which allows small presses to use a simple deposit option to archive their publications in multiple repository locations, creating the opportunity to safeguard against the complete loss of their open books catalogue, should they cease to operate. 

Participation in the pilot has allowed us to explore the implementation of suitable infrastructure, built on interoperable, open, and widely adopted platforms to support discovery, access, and long-term availability of open scholarly works. 

Work done so far 

We are pleased to share that the Cambridge repository platform participating in the Thoth network is now live at https://thoth-arch.lib.cam.ac.uk/home, and now includes a full back catalogue of two open monograph publishers. This repository is based on the open-source DSpace software.

Through the implementation phase, we have worked very closely with the Thoth technical team to support the implementation and testing of standard and automated deposit mechanisms into DSpace-based repositories. This work has allowed us to further our knowledge and expertise on scholarly and research platforms by using well adopted repository platforms (DSpace) in a new area: open access books and monographs. It has also provided us with the opportunity to test the implementation of additional infrastructure to support discovery, access, and dissemination of such open access content, and potentially experiment with other types of scholarly work. 

What’s next 

Now that the repository platform is live, we would like to gather insights about volume of content, required storage and staff resources (both infrastructure and user support). This will help us estimate the associated costs of providing such a service, as well as preservation costs for the longer term, during the 3-year pilot.

In terms of long-term preservation, we will explore several preservation options, including preserving the content in-house as part of the Libraries’ wider Digital Preservation Programme. The types of material hosted in this platform can provide an exemplary use case of scholarly content that is “preservation ready”, uses open and standard file formats (i.e., PDF and epub) and is accompanied by rich, high quality descriptive metadata. 

See this post by the Open Book Futures Team for more details about the pilot:  

https://copim.pubpub.org/pub/thoth-archiving-network-goes-live-at-university-of-cambridge/release/1

Diamond Open Access Journals platform launch at Cambridge

Dr Agustina Martínez-García, Head of Open Research Systems, Digital Initiatives

We are pleased to announce that our Diamond Open Access Journals at Cambridge platform has launched in May and can be accessed at https://diamond-oa.lib.cam.ac.uk/home. This service will be available initially as part of a one-year pilot project undertaken by the Open Research Systems (ORS) and Office of Scholarly Communication (OSC) teams within Cambridge University Library (CUL).  

Project overview

The main aim of the Diamond project is to support Cambridge’s research community in the context of a changing open research and scholarly publishing environment. To meet increasing demand to share research findings we are scoping, assessing, and implementing future services and systems that meet those needs, while contributing to a growing wider open research community and ecosystem. The pilot is being launched off the back of a project to understand the community-led publishing landscape at Cambridge (findings to be shared soon). Researchers in the Office of Scholarly Communication uncovered a vibrant ecosystem of DIY publishing projects at Cambridge that the library is exploring how to support through technical and resource-based approaches.  

As part of the project, we are engaging with Cambridge researchers and exploring whether open and community-developed platforms meet their needs around institutional publishing and can be used as the basis for service development in this area. We are using the DSpace repository platform to support this pilot. DSpace is a widely adopted, open-source repository platform, and it is currently the solution underpinning Apollo, Cambridge’s Institutional Repository. In its newest version, it offers advanced functionality and features that can potentially make it a suitable platform for journal publishing, an area we are keen to explore with this pilot. 

Where we are at

Main activities of the project are focusing on: 

  • Exploring the implementation of suitable infrastructure, built on interoperable, open, and widely adopted platforms. 
  • Gathering use cases of community-led open access journals at Cambridge, focusing on discipline, journal type, frequency of publication, production standards. 
  • Gathering insights to inform future service development in this area by a) assessing the suitability of the DSpace open-source repository platform as a journal publishing platform; and b) estimating the associated costs and resourcing requirements, both in terms of service management and infrastructure (long-term access, storage, and preservation costs). 

The following four Cambridge student-led journals have agreed initially to participate in the pilot, and we are also exploring opening participation to additional journals in the upcoming months. 

  • Cambridge Journal of Climate Research (Climate Research Society, first issue now available in the Diamond platform)
  • Cambridge Journal of Human Behaviour (Anthropology) 
  • Cambridge Journal of Visual Culture (History of Art) 
  • Scroope (Architecture) 

What’s next

The next iteration of work for the pilot will focus on assessing the resources and costs involved in transitioning from pilot to service. Ensuring long-term preservation and access comes with several associated costs and it is critical to assess these when evaluating sustainable approaches to service development. Examples of cost elements that we will consider include onboarding (initial implementation) fees, hosting and maintenance fees, volume of content and storage costs, and the costs of persistent identifier (DOI and ISSN) minting and of indexing services for publisher databases. We will also explore suitable long-term content preservation options, such as integration with existing preservation services like CLOCKSS (https://clockss.org/), or in-house preservation via the services that are currently being developed as part of CUL’s Digital Preservation Programme. 

Formatting the Future: Why Researchers Should Consider File Formats

Dr Kim Clugston, Research Data Coordinator, OSC
Dr Leontien Talboom, Technical Analyst, Digital Initiatives

Many funders and publishers now require data to be made openly available for reuse, supporting the open data movement and value for publicly funded research. But are all researchers aware of why they are being asked to share their data and how to do this appropriately? When researchers deposit their research data into Apollo (the University of Cambridge open access repository) they generally understand the benefits of sharing data and want to be a part of this. These researchers provide their data in open file formats accompanied by rich metadata so the data has the best chance of being discovered and reused most effectively. 

There are other researchers who deposit their data in a repository during the publication process; this often takes place within tight deadlines set by the publisher. For this reason, researchers often rush to upload their data, and thoughts about how this data will remain preserved and accessible for long-term use are not considered. The challenges around preserving open research data were highlighted in this article. The authors addressed the concerns that open research data can include a wide variety of different types of data files, some of which may only be accessible with proprietary software or software that is outdated or at risk of being outdated soon. How can we ensure that research data that is open now stays accessible and open for use for many years to come? 

In this blog, we will discuss the importance of making data open and of ensuring that it stays open for future use (digital preservation). We will use some examples from datasets in Apollo and suggest recommendations for researchers that go beyond the normal FAIR principles to include considerations for the long term. 

Why is it important for the future?

The move to open data, following the FAIR principles, has the potential to boost knowledge, research, collaboration, transparency and decision making. In Apollo alone, there are now thousands of datasets which are available openly worldwide to be used for reference or reused as secondary data. Apollo, however, is just one of thousands of data repositories. It is easy to see how this vast amount of archived data comes with great responsibility for long term maintenance. A report outlined the pressing matter that FAIR data, whilst addressing metadata aspects well, doesn’t really address data preservation and the challenges that this brings such as the risk of software and/or hardware becoming obsolete, and therefore data reliant on these becoming inaccessible.

Tracking the reuse of datasets could provide essential information on how different file formats are holding up, but there is an ongoing challenge to track dataset reuse. Datasets are not yet routinely cited in the established way that is seen for journal articles or other publication types. This is an area that is actively being developed through initiatives such as Make Data Count and it is hoped that at some point soon, data citation will become part of the routine practice of research to further enhance visibility on how data is being credited and reused. 

In Apollo, we see great interest in the available datasets as they are viewed and downloaded frequently. The most downloaded dataset in Apollo has been downloaded over 300,000 times since it was first deposited in 2015 and, interestingly, consists of open file formats. Other highly downloaded datasets in Apollo, such as the CBR Leximetric dataset, have been used by lawyers and social scientists and successfully cited as a data source to answer new research questions. The Mammographic Image Analysis Society database was deposited in Apollo in 2015 and has been frequently downloaded and reused by researchers working in the field of medical image analysis as discussed in a previous blog. To date, Google Scholar reports it has been cited 78 times. These datasets show the value of sharing and reusing data and all are in file formats that are accessible to everyone which will help to preserve them for as long as possible. 

Digital preservation is a discipline focused on providing and maintaining long-term access to digital materials. Obsolete software is a big problem in maintaining access to files in the future. PRONOM, a file format registry, keeps track of a large number of known file formats and provides additional information on these formats. Last year, a file format analysis of datasets in Apollo was conducted to highlight what file formats are represented in the repository. The results revealed a diverse array of file formats, which is a testament to the breadth of research conducted and the adoption of open data across many disciplines. Most of the file formats are common and can still be opened, but a large percentage of the material has not been identified or is in formats that are not immediately accessible without migrating to a different format or emulating the current file formats. Table 1 shows a few complex examples of file formats held in Apollo. 

File Format | Example in Apollo | Future Use
.dx (Spectroscopic Data Exchange Format) | Link | This is not an open-source format, meaning that opening the file is dependent on the software being available
.mnova (Mestrelab file format) | Link | Proprietary file format, licence for the programme is expensive
.pzfx (Prism file format) | Link | Older format for a file software program called Prism. This is now considered legacy software.
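
For readers who want a rough picture of their own files before depositing, here is a minimal Python sketch of a format inventory. It is not the method used for the Apollo analysis: it only counts file extensions, whereas registry-based tools identify formats from the file contents. The function names and the short OPEN_EXTENSIONS list are our own illustrative assumptions, not an official list.

```python
from collections import Counter
from pathlib import Path

# Illustrative assumption: a short list of extensions treated as "open" here.
# A real assessment would consult a registry such as PRONOM instead.
OPEN_EXTENSIONS = {".csv", ".txt", ".pdf", ".json", ".xml"}


def inventory_formats(root):
    """Count the files under `root` by lower-cased file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


def flag_for_review(counts):
    """Return the extensions that are not on the illustrative open list."""
    return sorted(ext for ext in counts if ext not in OPEN_EXTENSIONS)
```

Calling flag_for_review(inventory_formats("my_dataset")) on a hypothetical folder lists the extensions worth a closer look. An extension alone says nothing certain about the format inside the file, so treat the result as a prompt for checking rather than a verdict.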

The Bit List, a list maintained by the Digital Preservation Coalition that includes contributions from members of the digital preservation community, outlines the “health” of different file formats and content types, including research data. In fact, unpublished research data (which is another issue outside the scope of this blog!) is classified as critically endangered, which highlights the problem that the majority of researchers generally only make data open at the point of publication. But even research data published in repositories has its difficulties and is classified as vulnerable, mainly because many file formats depend on the availability of the appropriate software to open and use them. There are potential solutions on the horizon to address this problem, such as the open-source ReproZip, which packages research data with the necessary files, libraries and environments so they can be run by anybody. However, this still doesn’t address the issue of obsolete software. The gold standard would be to deposit research data in open formats, so viewing and using the files is not dependent on a particular piece of software; the files will be open and accessible as long as they are held available within a repository.  

What researchers can do

What can researchers do to make sure that when they deposit data into a repository, it will be available for them and others in 10 or even 20 years’ time? Awareness is the first step. Researchers should consider submitting their data to a repository, one that is suitable for their files. Choose a trusted data repository. A recent blog highlighted the potential problem of disappearing data repositories, with approximately 6% of the repositories listed on the repository registry re3data having been shut down (most reasons are unknown but some were listed as organisational or economic failure, obsolete software/hardware or external attacks). Approximately 47% of the repositories that had shut down did not provide an alternative solution to rescue the data and it is assumed that this data is lost. It may be that your funder or publisher decides the repository for you, but we have some guidance on what to look for in a trusted repository. If you are at Cambridge, you can deposit your data in Apollo, which has CoreTrustSeal certification.

The data itself is arguably the most important factor: we need to make sure the data files can be found and used by anyone at any time, forever. Ideally, this means using open file formats where possible as these don’t have any restrictions. The Library of Congress and the UK National Archives both maintain registries of file formats. There is some Cambridge University guidance on choosing file formats as well as some by the UKDS. Have a look at the file formats you have on the PRONOM database: are they seen as sustainable formats? If the data you are generating is from proprietary software, it is good practice to deposit this version as well as a copy in an open format that does not require any specialist software to open. This ensures that both options are available in case of any loss of formatting from converting to open formats. Examples are the software packages SPSS and NVivo, which are proprietary but have the option to export to open formats such as CSV. 

There may be information on how to convert your file types to open formats within your discipline. In the Chemistry department here at Cambridge, an initiative was started together with the Data Champion programme to provide a platform to allow researchers to add instructions for converting experimentally derived files into open formats. Open Babel is an open-source, collaborative project aimed at providing a “chemistry toolbox” with information on how to convert chemical file formats into other formats where needed. There is also some guidance on how to export from R to open formats such as txt and csv.
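
To make the idea of an open copy concrete, here is a minimal Python sketch using only the standard library. It assumes the records have already been read out of the proprietary file into memory (that step depends on your software and is not shown), and the function names are hypothetical. It writes a UTF-8 CSV and then checks that every value reads back unchanged when compared as text.

```python
import csv


def write_open_copy(rows, fieldnames, csv_path):
    """Write a list of dict records to a UTF-8 CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def values_survive(original_rows, csv_path):
    """Check that every value reads back unchanged when compared as text.

    CSV stores text only, so numbers come back as strings; data types and
    any formatting held in the proprietary file are not preserved.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        restored = list(csv.DictReader(handle))
    expected = [
        {key: "" if value is None else str(value) for key, value in row.items()}
        for row in original_rows
    ]
    return restored == expected
```

A check like this only covers the values. Anything the proprietary format stores beyond the table itself (labels, styling, embedded analyses) still lives only in the original file, which is why depositing both versions is worthwhile.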

In some cases, it might not be possible to provide an open file format alternative. The files you use may be subject to discipline-specific standards or you may be restricted by the hardware and software you use in your research. For these, it is important to provide good documentation or a detailed README file alongside the files so researchers know how to access and use them. In fact, good file organisation, documentation and metadata are just as important as the files themselves, as data without any documentation is considered virtually meaningless. The more information you can provide the better, and it might save you time in the long run by pre-empting questions from other researchers. 
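
As a starting point, a README can be as simple as a plain text file listing what each file is and what is needed to open it. The Python sketch below generates one. The field names, layout and example values are illustrative assumptions, not a required template.

```python
def build_readme(title, creator, files, software_note=""):
    """Return plain-text README content describing a deposited dataset.

    `files` maps each file name to a one-line description.
    """
    lines = [f"Title: {title}", f"Creator: {creator}", "", "Files:"]
    for name, description in sorted(files.items()):
        lines.append(f"  {name}: {description}")
    if software_note:
        lines += ["", f"Software needed: {software_note}"]
    return "\n".join(lines) + "\n"


# Hypothetical example values, for illustration only.
readme_text = build_readme(
    "Example spectra",
    "A. Researcher",
    {"run1.mnova": "raw spectrum", "run1.csv": "exported peak table"},
    software_note="Mnova is required to open the .mnova file",
)
```

Saving readme_text as README.txt next to the data keeps the documentation itself in an open format.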

The future use of past research hinges on the thoughtful selection of file formats. By prioritising openness and longevity, we lay the foundation for collaboration and innovation. Choices that researchers make today shape the accessibility and integrity of data for generations to come.

The (exponential) thirst for data – The March 2024 Data Champions forum

The Data Champions were treated to a big data themed session for the March Data Champion forum, hosted at (and sponsored by) the Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (A.I.) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing with it scientific research as a whole.  

Dr James Fergusson presenting to the Data Champions at the March forum

As he explained, over the last two decades cosmologists have seen a rapid increase of data points on which to base their theorisation: from merely 14 data points in 2000 to 800 million data points in 2013! Through the availability of these data points, the paradigm for research in cosmology started to shift completely, from being theory based to being based on data. With several projects beginning soon that will see vast amounts of data generated daily for decades to come, this trend is showing no signs of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with these sectors of growth is the growth of machine learning (ML) capabilities, as the copious amount of data necessitates not only immense amounts of computing power but also ML capabilities to process and analyse all of the data. Together, these elements are fundamentally changing the story of scientific discovery. What was once a story of an individual researcher having an intellectual breakthrough is becoming the story of machine-led, automated discovery. While it used to be the case that an idea, put through the rigour of the scientific method, would lead to the generation of data, now the reverse is not only possible but becoming increasingly likely. Data is now generated first before a theory is discovered, and the discovery may come from AI and not a scientist. This, for James, can be considered the new scientific method. 

Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH) where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects that she was involved with at CDH. One was called Ghost Fictions, a series of workshops with the aim of encouraging critical thinking about automated text generation using AI methods both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, which was intended as a provocation of a future where books, study guides and examinations are created by Large Language Models (LLM) (perhaps a not so distant future). Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet, perhaps the most worrying aspect of the transformative forms of AI is the immediate and consequential impact they have on the environment. Quenching the thirst for the exponential amounts of data needed to train and progress AI chatbots, LLMs and image generation systems requires vast computing power, which in turn generates a lot of heat and requires large amounts of water to operate. As Anne demonstrated, this could be increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water stressed, but also home to the University’s data centre and where the new DAWN AI supercomputer is located. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?    

Dr James Fergusson and Dr Anne Alexander answering questions from the Data Champions at the March forum

The themes that Anne concluded her presentation with formed the basis of the Q&A between the Data Champions and the speakers. The topic of the potential biases of AI and ML was put forward to James, who agreed that his field of study could not escape it. That said, unlike in the humanities, biases in physics can potentially be helpful as they may help make the scientific process as objective as possible. However, this could clearly be problematic for humanities research, which tends to deal with social systems and relations, and views of the world. The topic of the environmental cost of AI was also touched on, on which James commented that energy insufficiency is a problem and getting harder to justify, and solutions might only create new problems as the demand for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted on this, as the environment is a social problem and society should therefore have a say on what risk they are willing to be a part of. The question of the automation of science was also raised with James, who admitted that preparing early career physicists for research now involves developing their software skills rather than subject knowledge expertise in physics or mathematics. 

What we can learn from the ‘promise and pitfalls of preregistration’ meeting

Dr Mandy Wigdorowitz, Open Research Community Manager, Cambridge University Libraries

The promise and pitfalls of preregistration meeting was held at the Royal Society in March 2024. It was organised to address the utility of preregistration and initiate an interdisciplinary dialogue about its epistemic and pragmatic aims. The goal of the meeting was to explore the limitations associated with preregistration, and to conceive of a practical way to guide future research that can make the most of its implementation.

Preregistration is the practice of publicly declaring a study’s hypotheses, methods, and analyses before conducting a research study. Researchers are encouraged to be as specific as possible when writing preregistration plans, detailing every aspect of the research methodology and analyses, including, for instance, the study design, sample size, procedure for dealing with outliers, blinding and manipulation of conditions, and how multiple analyses will be controlled for. By doing so, researchers commit to a time-stamped study plan which will reduce the potential for flexibility in analysis and interpretation that may lead to biased results. Preregistration is a community-led response to the replication crisis and aims to mitigate questionable research practices (QRPs) that have come to light in recent years, some of which include HARKing (Hypothesising After Results are Known), p-hacking (the inappropriate manipulation of data analysis to enable a favoured result to be presented as statistically significant), and publication bias (the unbalanced publication of statistically significant findings or positive results over null and/or unexpected findings) (Simmons et al., 2011; Stefan & Schönbrodt, 2023).

The meeting brought together scholars and publishers from a range of disciplines and institutions to discuss whether preregistration has indeed lived up to these aims and whether and to what extent it has solved the problems it was envisioned to address.

It became clear that the problems associated with QRPs have not simply disappeared with the uptake and implementation of preregistration. From the perspective of meta-research, the success of preregistration appears to be largely disciplinary and legally dependent, with some disciplines mandating and normalising it (e.g., clinical trial registration in biomedical research), others greatly encouraging and (sometimes) requiring it (e.g., psychological science research), and others having no expectations about its use (e.g., economics research). The effectiveness of preregistration was shown to be linked to these dependencies, but also related to the quality and detail of the preregistration plan itself. Researchers are the arbiters of their research choices and if they choose to write vague or ambiguous preregistration plans, the problems that preregistration is assumed to address will inevitably persist.

Various preregistration templates exist (such as on the Open Science Framework, OSF) and some incentives for preregistration are recognised, such as the preregistration badges awarded by some journals, making it a systematic and straightforward exercise. In practice, however, it is not always the case that sufficient information is provided, and even in cases where preregistered plans are detailed, they are not always followed for various pragmatic or other (not always nefarious) reasons. As such, the research community is cautioned not to assume that preregistration equates to better or more trustworthy research. Rather, the preregistration plan needs to be critically reviewed as a standalone document in conjunction with the published study. This is important because preregistration plans that are usually deposited into repositories (e.g., OSF, National Library of Medicine’s Clinical Trials Registry) are seldom evaluated as entities of their own or against their corresponding research articles. Note that this is unlike registered reports, which are a type of journal article that details a study’s protocol, is peer reviewed before data is collected and, if reviewed favourably, is given an in-principle acceptance regardless of the study outcomes.

Other discussions centred around the utility of preregistration in exploratory versus confirmatory research, whether preregistration can improve our theories, and how the process of conducting multiple but slightly varied analyses and selecting the most desired outcome (also referred to as the ‘garden of forking paths’) affects the claims we make.
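
A small worked example, not presented at the meeting, shows why choosing among several analyses matters. It rests on a simplifying assumption: each analysis is an independent test with a 5% false-positive rate when there is no real effect. Slightly varied analyses of the same data are usually correlated, so real figures will differ, but the direction of the effect is the same. The function name is our own.

```python
def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of `n_analyses` tests is significant
    by chance alone, assuming independent tests and no true effect."""
    return 1 - (1 - alpha) ** n_analyses


# How the chance grows as more analysis variants are tried.
for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

Under these assumptions a single analysis gives a 5% chance of a spurious significant result, five give about 23%, and twenty give about 64%. This is the flexibility that a specific, time-stamped analysis plan is meant to constrain.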

The overall sentiment from the meeting was that while preregistration does not solve all the issues that have arisen from QRPs, it ultimately leads to more transparency of the research process, accountability on the part of the researchers conducting the research, and it facilitates deeper engagement with one’s own research prior to any collection or analysis of data.

Since attending the meeting, I have taken away valuable insights that have made me critically reflect on my own research choices, and from a practice perspective, I have downloaded the OSF preregistration template and am documenting the plans for a research project.

Given the strides that have been taken toward improving the transparency, credibility and reproducibility of research, researchers at Cambridge need to consider whether preregistration plans should be included as another type of output that can be deposited on the institutional repository, Apollo. We have recently added Methods and preprints as output types which have broadened the options for sharing and which align with open research practices. Including preregistration could be a valuable and timely addition.  

References

Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: a compendium and simulation of p-hacking strategies. Royal Society Open Science, 10(2), 220346. https://doi.org/10.1098/rsos.220346

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632

Dear Data,…

Valentine’s day week for the international data community is not only a time for expressing your love to the significant others in your life. As it is also Love Data Week, it is also a time to reflect on your love for all things data! That was the goal for the Research Data team this year! The theme of this year’s Love Data Week was “My Kind of Data”, suggesting that data workers – researchers and analysts alike – have a relationship to data that is personal, often idiosyncratic, and almost always heartfelt. The Research Data team, as supporters of the University’s researchers, are interested in such relationships and are always eager to discover the distinctive needs that the disciplinary differences between the University’s departments create. This year, the Research Data team decided that they wanted to find out from students and researchers from the Arts, Humanities and Social Sciences (AHSS) what was their kind of data.

To do so, the Research Data team positioned themselves at the Foyer of the Alison Richard Building on the University’s Sidgwick Site, which is home to several AHSS departments, for two mornings on Monday the 12th and Thursday the 15th of February. Across the city, Data Champion Lizzie Sparrow was leading the charge with science, technology, engineering, mathematics and medicine (STEMM) students and researchers by holding her own pop-up at the West Hub. Like the Research Data team, and as a Research Support Librarian (Engineering) herself, Lizzie is also interested in the relationships that researchers have with data. Her approach, however, would likely be different. Unlike for researchers in the STEMM subjects, the term data can sometimes feel exclusionary for AHSS students and researchers, as they may not consider what they generate through research as data. From our perspective, on the other hand, any material that goes on to form any part of one’s research is data. To bring attention to this, the team tried to engage passers-by with the provocation “you have research data, change our minds!” The provocation was successful and many conversations were had on the different ways that members of the Sidgwick community understood data in their research.

The Research Data Team from the Office of Scholarly Communication (Cambridge University Library), from left to right: Clair Castle, Lutfi Othman, Kim Clugston.

The team was pleased to find that there was a general interest in the services of the Research Data team among the Sidgwick community, and we were happy to be able to share with others how we can help them with their data management and planning.

Some treats for those who stop by.
Our Open Research poster, designed by Clair Castle.

The team tried to capture the sentiments of the conversations had by asking the Sidgwick community to partake in 2 short activities as they departed our pop-up to better understand  their relationship with data (in exchange for Love Hearts sweets!). Firstly, we asked them to describe to us what data was to them, a question that we are extremely fond of asking! As usual, the answers were informative and they helped us to gain a sense of the varying data types that the Sidgwick community worked with – from political tracts and archival materials to balance sheets and land deeds from the early modern era.

Activity 1: Lots of different data types in the AHSS community!

For the second activity, we asked them which term best captured the materials that formed the basis of their scholarly work: data, research materials, or other? To our surprise, the majority of people we spoke to over both days saw themselves as working with data, more than double the number who saw themselves as working with research materials, with a small number seeing themselves as working with both, interchangeably. This finding bears on something that has been increasingly discussed in the Research Data team office: that using alternatives to the term data may make our services and initiatives more appealing to members of the AHSS community. This is something we will take into account when targeting our outreach in the future. Yet, one thing is certain – our Research Data services are needed by the AHSS community just as much as they are by the STEMM community.

Activity 2: More generators of ‘data’ than we expected!

The pop-ups at the Alison Richard Building were encouraging, and we hope that fruitful relationships will develop from these events. This is something that we may hold again soon. It was a good way to communicate our message and make others aware of the services of the Research Data team. Over at the West Hub, Lizzie was not as encouraged, having only managed to have in-depth chats with a couple of people. She reported that lots of people were very determinedly on their way somewhere and not up for stopping to talk. The time and/or location did not seem right for the intended audience. I suppose we shouldn’t stand between a student and their food. In any case, there was lots to take away from these Love Data Week pop-ups, and lots to reflect on when we plan our next pop-up, be it for Love Data Week 2025 or just as a periodic service to the research community here at Cambridge. Perhaps when the weather is nicer in the summer, we will do a pop-up outdoors in the middle of the Sidgwick Site, or at research events throughout the University. If you have any ideas on where it would be good for us to hold such a pop-up, do let us know!