All posts by Lutfi Bin Othman

Data Diversity Podcast #3 – Dr Nick H. Wise (1/4)

In our third instalment of the Data Diversity Podcast, we are joined by Dr Nick H. Wise, Research Associate in Architectural Fluid Mechanics at the Department of Engineering, University of Cambridge. As is the theme of the podcast, we spoke to Nick about his experience as a researcher, but this is a special edition of the podcast. Besides being a scientist and an engineer, Nick has made his name as a scientific sleuth who, according to a 2022 article on the blog Retraction Watch, is responsible for more than 850 retractions, leading Times Higher Education to dub him a research fraudbuster. Since then, through his X account @Nickwizzo, he has continued his investigations, tracking cases of fraud and, in some cases, naming and shaming the charlatans. Nick was kind enough to share many great insights with us over a 90-minute conversation, and we have decided to release a four-part series dedicated to the topic of research integrity.

In this four-part series, we will learn from Nick about some of the shady activities that taint the scientific publishing industry today. In part one, we learn how Nick was introduced to the world of publication fraud and how that led him to investigate the industry behind it. Below are some excerpts from the conversation, which can be listened to in full here.


I have found evidence of a paper mill bribing some editors, and there have been many editors – at least tens, if not hundreds – that have been let go or told to stop being editors by journals in the last year because they were found to be compromised. This could be because of bribery or some other way of being compromised. This is what I try to uncover. – Dr Nick H. Wise


Tortured Phrases and PubPeer: Nick’s Beginnings as a Scientific Sleuth

My background is in fluid dynamics, where I mostly think about fluid dynamics within buildings. For instance, I think about the air flows generated by different heating systems and things like pollutant transport, such as smells or COVID, which can travel with the air and interact with each other. That was my PhD and my post-doc in the Engineering Department.

About three years ago, whilst trying to avoid writing my thesis, I saw a tweet from the great Elisabeth Bik, who is possibly the most famous research fraud investigator. She mostly looks at biomedical images, and her great skill is that she can look through a paper, see photos of Western blots or microscopy slides, and spot whether parts of an image are identical to other parts, or whether an image overlaps with images from different papers. She has an incredible memory and ability to spot these images. She’s been doing this for over 10 years and has caused many retractions. I was aware of her work, but there was no way for me to assist with it because it is not my area of research. I don’t have an appreciation of what these images should look like.

But about three years ago she shared on her Twitter account a preprint written by three computer scientists about a phenomenon they called ‘tortured phrases’. In doing their research and reading the literature, these computer scientists noticed that there were papers with very weird language in them. What they surmised was that, to overcome plagiarism checks by software like Turnitin, people would run text through paraphrasing software. This software was very crude in that it would go word by word. For instance, it would look at a word and replace it with the first synonym it found in a thesaurus. It would do this word for word, which makes the text barely readable. However, the text is novel, so it will not be flagged by any plagiarism-checking software. Eventually, if you as a publisher have outsourced the plagiarism checks to some software, and neither your editor nor your peer reviewers read the text to check if it makes sense, then it will get through the peer review process without any problem and the paper will get published.

For an example of tortured phrases: often there is more than one way to say something. Particularly if English is not someone’s first language, you don’t want to be too harsh on anyone who has chosen a word that just isn’t what a native speaker would pick. But there are some phrases where there is only one right way to say it. For instance, artificial intelligence is the phrase for the phenomenon you want to talk about, and if instead you use “man-made consciousness”, that’s not the phrase you need to use, particularly if the original text said “artificial intelligence (AI)” and your text says “man-made consciousness (AI)”. It’s going to be very clear what has happened.
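To make the mechanism concrete, here is a minimal sketch of the word-by-word substitution Nick describes, using a small made-up thesaurus (the actual paraphrasing widgets and their synonym lists will differ):

```python
# Minimal sketch of crude word-by-word paraphrasing, assuming a tiny
# made-up thesaurus. Real paraphrasing widgets differ, but the principle
# is the same: each word is swapped for the first synonym found,
# producing "tortured phrases".
TOY_THESAURUS = {
    "artificial": "man-made",
    "intelligence": "consciousness",
    "neural": "neuronal",
    "network": "organization",
}

def crude_paraphrase(text: str) -> str:
    # Replace each word with its first thesaurus entry, if any.
    return " ".join(TOY_THESAURUS.get(word, word) for word in text.lower().split())

print(crude_paraphrase("artificial intelligence with a neural network"))
# -> man-made consciousness with a neuronal organization
```

The output is novel text that slips past a plagiarism checker, yet the mangled fixed phrases give it away to a human reader.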

The three computer scientists highlighted this phenomenon of ‘tortured phrases’, but entirely from within the computer science field. I wondered if a similar phenomenon was happening in my own field of fluid dynamics. Samples of this paraphrasing software are freely available online as little widgets, so I took some standard phrases from fluid dynamics, the kind that would not make sense if you swapped the words around, and generated a few of these tortured phrases. I googled them and up popped hundreds of papers featuring these phrases. That was the beginning for me.

I started reporting papers with these phrases on a website called PubPeer, which is a website for post-publication peer review. I commented on these papers and got into conversation with the computer scientists who wrote the paper on ‘tortured phrases’, because they had built a tool to scrape the literature and automatically tabulate papers featuring these phrases. They basically had a dictionary of phrases which they knew would be spat out by the software, because some of this paraphrasing software is so crude that if you put in “artificial intelligence”, you will always get out “man-made consciousness” or a handful of variants. It didn’t come up with a lot of different things. If you could just search for “man-made consciousness” and it brought up many papers, you knew what had been going on. I contributed a lot of new ‘fingerprints’, which is what they call the dictionary of phrases they search the literature for. That is my origin story.
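As a rough illustration of how such a fingerprint dictionary can be used, here is a minimal sketch (the fingerprint list and abstracts below are illustrative; the real tool scrapes the published literature at scale):

```python
# Minimal sketch of fingerprint matching against paper text. The
# fingerprints and abstracts here are illustrative examples; the real
# dictionary is far larger.
FINGERPRINTS = [
    "man-made consciousness",    # tortured form of "artificial intelligence"
    "counterfeit consciousness", # another variant of the same phrase
]

def find_tortured_phrases(text: str) -> list[str]:
    # Return every known fingerprint that appears in the text.
    lowered = text.lower()
    return [fp for fp in FINGERPRINTS if fp in lowered]

abstracts = {
    "paper A": "We apply man-made consciousness (AI) to predict air flows.",
    "paper B": "We measure pollutant transport in ventilated rooms.",
}
for paper, abstract in abstracts.items():
    hits = find_tortured_phrases(abstract)
    if hits:
        print(f"{paper}: possible tortured phrases {hits}")
```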

On Paper Mills and the Sale of Authorships 

There is also the issue of meta-science, which has nothing to do with the text of the paper or with the data itself, but more to do with how someone may add a load of references throughout the paper which are not relevant, or which are all references to one person or a colleague. In that way you would be gaming the system to boost profiles, careers, and things like the H-index. Because having more publications and more citations is so desirable, there is a market for this. It is easy to find online advertisements for authorship of scientific papers ranging from $100 to over $1,000, depending on the impact factor of the journal and the authorship position you want: first author, seventh author, or corresponding author, these sorts of factors. Likewise, you can buy citations.
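For context, an author’s H-index is the largest number h such that they have h papers with at least h citations each; the hypothetical citation counts in this sketch show how a few purchased papers and citations can shift the metric:

```python
# The H-index is the largest h such that an author has h papers with at
# least h citations each. The citation counts below are hypothetical.
def h_index(citations: list[int]) -> int:
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # still have `rank` papers with >= `rank` citations
        else:
            break
    return h

honest_record = [10, 8, 5, 4, 3, 0]
padded_record = honest_record + [6, 6]  # two bought authorships with bought citations
print(h_index(honest_record), h_index(padded_record))  # 4 5
```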

There are also organizations known as paper mills. For example, as an author I might have written a paper and want, or need, to make some money, so I go to a broker and say: I want to sell authorships. I’ll be author number six, but I can sell the first five authorships. Can you put me in touch with someone buying authorships? At the same time, there are people who go to them saying they want to buy an authorship, and the broker puts two and two together, acting as a middleman. Also, some of these paper mills do not want to wait for someone to come to them with a paper – they will write papers to order. They have an in-house team of scientific writers who produce papers. This does not necessarily mean that the paper is bad. Depending on where they want the paper published, it might have to be good to get accepted. So they will employ qualified people with degrees, or PhD students who need to earn some money, and then they will sell the authorships and get the papers published. This is a big business.

There is a whole industry behind it, and something I have moved on to investigating quite a lot is where these papers are going. When I identify these papers, I try to find out where they are being published, how they’re being published, who is behind them, who is running these paper mills, and who is collaborating with them. Something I found out, which resulted in an article in Science, was that paper mills want to guarantee acceptance as much as they can. If a paper is not accepted, it creates a lot of work for them and means a longer time before their customers get what they paid for. For example, if a paper that they wrote and sold authorships for gets rejected, they’re going to have to resubmit it to another journal. So something paper mills will do is submit a paper to 10 journals at once and publish with whichever journal gives them the easiest time. But still, they want to try to guarantee acceptance, and one way to do that is to simply bribe the editor. I have found evidence of a paper mill bribing some editors, and there have been many editors – at least tens, if not hundreds – that have been let go or told to stop being editors by journals in the last year because they were found to be compromised. This could be because of bribery or some other way of being compromised. This is what I try to uncover.

Although I’m not fighting this alone, it can feel like that. Publishers are doing things to some extent, and they’re doing things that they can’t tell you about as well. And then there are other people like me investigating this in their free time or as a side project. There are not enough of us doing it, because the industry generating these papers is worth millions of dollars. More papers are being published than ever before, so it is a big fight.


Stay tuned as we release the rest of the conversation with Nick over the next month. In the next post, we get Nick’s take on the peer review process and fake research data, and I ask his opinion on where the fault lies in the publication of fraudulent research. 

Data Diversity Podcast #2 – Dr Alfredo Cortell-Nicolau

In our second instalment of the Data Diversity Podcast, we are joined by archaeologist Dr Alfredo Cortell-Nicolau, a Senior Teaching Associate in Quantitative and Computational Methods in Archaeology and Biological Anthropology at the McDonald Institute for Archaeological Research, and a Data Champion.

As is the theme of the podcast, we spoke to Alfredo about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the different interpersonal, and even diplomatic, skills that an archaeologist must possess to carry out their research, and how one’s relationship with individuals such as landowners and government agents might impact their access to data. Alfredo also shed light on some of the considerations that archaeologists go through when storing physical data and discussed some of the ways artificial intelligence is impacting the field. Below are some excerpts from the conversation, which can be listened to in full here.

I see data in a twofold way. This implies that there are different ways to liaise with the data. When you’re talking about the actual arrowhead or the actual pot, then you need to liaise with all the different regional and national laws regarding heritage and how they want you to treat the data, because it’s going to be different for every country and even for every region. Then, of course, when you’re using all this morphometric information, all the CSV files, the way to liaise with the data becomes different. You have to think of data in this twofold way.

– Dr Alfredo Cortell-Nicolau

Lutfi Othman (LO): What is data to you?

Alfredo Cortell-Nicolau (ACN): In archaeology in general, there are two ways to see the data. In my case, for example, one way to see it is that the arrowhead itself is the primary data. But then, when I conduct my studies, I extract lots of morphometric measures and produce a second level of data: CSV files with all of these measurements and different information about the arrowheads. So, what is the data? Is it the arrowhead or is it the file with information about the arrowhead? This raises some issues in terms of who owns the data and how you are going to treat the data, because it’s not the same. In my case, I always share my data and make everything reproducible. But when I share my data, I’m sharing the data that I collected from the arrowheads. I’m not sharing the arrowheads, because they are not mine to share.

This is kind of a second layer of thought when you’re working in archaeology. When you’re studying, for example, pottery residues, you’re sharing the information about the residues and not the pot that you used to obtain those residues. There are two levels of data. Which is the actual data itself? The data which can be reanalyzed in different ways by different people, or the data that you extracted only for your specific analysis? I see data in this twofold way. This implies that there are different ways to liaise with the data. When you’re talking about the actual arrowhead or the actual pot, then you need to liaise with all the different regional and national laws regarding heritage and how they want you to treat the data, because it’s going to be different for every country and even for every region. Then, of course, when you’re using all this morphometric information, all the CSV files, the way to liaise with the data becomes different. You have to think of data in this twofold way.

On some of the barriers to sharing archaeological data

ACN: There are some issues in how you acknowledge that the field archaeologist is the one who got the data. Say a site was excavated in the 1970s and some other researcher comes later and produces many publications based on that excavation; the field archaeologist is not always given proper attribution, because you cite the original excavation in the first publication and you’re done. Sometimes that makes field archaeologists reluctant to share their data, because they don’t feel that their work is acknowledged enough. This is one issue which we need to try to solve. Take, for example, a huge radiocarbon database of 5,000 dates: if I use that database, I will cite whoever produced it, but I will not be citing everyone who contributed indirectly to it. How do I include all of these citations? Maybe we can discuss something like meta-citations, but there must be some way in which everyone feels they are getting something out of sharing the data. Otherwise, there might be a reaction where they think “well, I just won’t share. There’s nothing in it for me, so why should I share my data?”, which would be understandable.

On dealing with local communities, archaeological site owners and government officials

ACN: When we have had to deal with private owners, local politicians and different heritage caretakers, not everyone feels the same way. Not everyone feels the same way about everything, and you need a lot of diplomatic skills to navigate this, because to excavate a site you need all kinds of permits. You need the permit of the owner of the site, the municipality, the regional authorities, and the museum where you’re going to store the material. You need all of these to work, and you need the money, of course. Different levels of discussion with indigenous communities are another layer of complexity which you have to deal with. In some cases, like the site where we’re excavating now, the owner is the sweetest person in the world, and we are so lucky to have him. I called him two days ago because we were going to go to the site, and I was just joking with him, saying I’ll try not to break anything in your cave, and he said, “this is not my cave. This is heritage for everyone. This is not mine. This is for everyone to know and to share”. It is so nice to find people like that. That may also happen with some indigenous communities. The levels of politics and negotiation are probably different in every case.

On how archaeologists are perceived

LO: When you approach a field or people, how do they view archaeologists and their work?

ACN: It really depends on the owner. The one that we’re working with now is super happy, because he didn’t know that he had archaeology in his cave. When we told him, he was happy because he’s able to bring something to the community, and he wants his local community to be aware that there is something valuable in terms of heritage. This is one good example. But we have also had other examples. For instance, one cave owner was a lawyer, and the first thing he thought was “are there going to be legal problems for me? If something happens in the cave, who bears the legal responsibility?” In another case, there was a person who just didn’t care; she said, “you want to come? Fine. The field is there, just do whatever you want.” So, there are different sensibilities to this. Some people are really happy about the heritage and don’t see it as a nuisance that they have to deal with.

LO: How about yourself as a researcher and archaeologist: do you see yourself as a custodian of sorts, or someone who’s trying to contribute to the local heritage of the place? Or is it almost purely scientific, and you’re there to dig?

ACN: When I approach the different owners, I think the most important thing is to let them know that they have something valuable to the local community and that they can be a part of that. You must also make it clear that it’s not going to be a nuisance for them and that they don’t have to do anything. I think the most important part is letting them know how it can be valuable for the community. I usually like them to be involved; they can come and see the cave and see what we are doing. In the end it’s their land, and if they see that we are producing something that is valuable to the community, then it is good for them. In this case, the type of data that we produce is the primary type of data, that is, the actual pottery sherds, arrowheads, etcetera. In this current excavation, we found an arrowhead that is probably some 4,000 or 5,000 years old, and you get the landowners to touch this arrowhead that no one in 5,000 years has seen. If you can get the owners to think of it in this way, that they’re doing something valuable for their community, then they will be happier to participate in this whole thing and to just let us do what we want to do, which is science.

LO: How do you store physical data? Or do you let the landowner store it?

ACN: That depends on the national and regional laws, and different countries have different laws about this. The cave where I’m working right now is in Spain, so I’m going to talk about the Spanish law, which is the one that I follow, and it’s going to be different in every country. In our case, with the different assemblages that you find, you have a period of up to 10 years during which you can store them yourself in your university, and that period is for you to do your research with them. After that period, they go to whichever museum they are supposed to go to, which, under the law, has to be the museum closest to the cave or site where they were excavated. There, the objects can be displayed, and the museum is responsible for managing them and storing them long term.

There is one additional thing: if you are excavating a site that has already been excavated, there is a principle of keeping the objects and assemblages together. For example, there is a cave that was excavated in the 1950s, and all the assemblages were stored in the Museum of Prehistory of Valencia, which was the only museum in the whole region at the time. The cave was excavated again a few years ago, and although there are now museums closer to it, because the bulk of the assemblages are in Valencia and they don’t want them separated between two museums, the new finds still have to go to Valencia. This principle of not having the assemblages separated is the most important one.


As always, we learn so much by engaging with our researchers about their relationship with data, and we thank Alfredo for joining us for this conversation. Please let us know how you think the podcast is going and whether there are any questions relating to research data that you would like us to ask!

The (exponential) thirst for data – The March 2024 Data Champions forum

The Data Champions were treated to a big-data-themed session for the March Data Champion forum, hosted at (and sponsored by) Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (AI) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing scientific research as a whole with it.

Dr James Fergusson presenting to the Data Champions at the March forum

As he explained, over the last two decades cosmologists have seen a rapid increase in the number of data points on which to base their theorisation – from merely 14 data points in 2000 to 800 million data points in 2013! With the availability of these data points, the paradigm for research in cosmology started to shift completely – from being theory-driven to being data-driven. With several projects beginning soon that will see vast amounts of data generated daily for decades to come, this trend shows no signs of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with these areas of growth is the growth of machine learning (ML) capabilities, as the copious amounts of data require not only immense computing power but also ML capabilities to process and analyse them. Together, these elements are fundamentally changing the story of scientific discovery. What was once the story of an individual researcher having an intellectual breakthrough is becoming the story of machine-led, automated discovery. While it used to be the case that an idea, put through the rigour of the scientific method, would lead to the generation of data, the reverse is now not only possible but increasingly likely. Data is generated first, before a theory is discovered, and the discovery may come from AI rather than a scientist. This, for James, can be considered the new scientific method.

Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH), where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects that she was involved with at CDH. One was called Ghost Fictions, a series of workshops aimed at encouraging critical thinking about automated text generation using AI methods, both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, which was intended as a provocation about a future where books, study guides and examinations are created by Large Language Models (LLMs) – perhaps a not-so-distant future. Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet perhaps the most worrying aspect of these transformative forms of AI is their immediate and consequential impact on the environment. Training and running AI chatbots, LLMs and image-generation systems on exponentially growing amounts of data requires vast computing power, which in turn generates a lot of heat and requires large amounts of water to operate. As Anne demonstrated, this could become increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water-stressed but is also home to the University’s data centre, where the new DAWN AI supercomputer is located. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?

Dr James Fergusson and Dr Anne Alexander answering questions from the Data Champions at the March forum

The themes with which Anne concluded her presentation formed the basis of the Q&A between the Data Champions and the speakers. The topic of the potential biases of AI and ML was put to James, who agreed that his field of study could not escape it. That said, unlike in the humanities, identifying biases in physics can potentially be helpful, as it may help make the scientific process as objective as possible. It is clearly more problematic for humanities research, which tends to deal with social systems, relations, and views of the world. The topic of the environmental cost of AI was also touched on, to which James commented that energy consumption is a problem that is getting harder to justify, and that solutions might only create new problems, as the demand for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted on this: the environment is a social problem, so society should have a say in what risks it is willing to be a part of. The question of the automation of science was also put to James, who admitted that preparing early-career physicists for research now involves developing their software skills rather than their subject expertise in physics or mathematics.