The (exponential) thirst for data – The March 2024 Data Champions forum

The Data Champions were treated to a big-data-themed session for the March Data Champion forum, hosted at (and sponsored by) Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (AI) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing with it scientific research as a whole.

Dr James Fergusson presenting to the Data Champions at the March forum

As he explained, over the last two decades cosmologists have seen a rapid increase in the number of data points on which to base their theorisation – from merely 14 data points in 2000 to 800 million data points in 2013! Through the availability of these data points, the paradigm for research in cosmology started to shift completely – from being theory based to being based on data. With several projects beginning soon that will see vast amounts of data generated daily for decades to come, this trend is showing no signs of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with these areas of growth is the growth of machine learning (ML) capabilities, as the copious amount of data necessitates not only immense amounts of computing power but also ML capabilities to process and analyse all of the data. Together, these elements are fundamentally changing the story of scientific discovery. What was once a story of an individual researcher having an intellectual breakthrough is becoming the story of machine-led, automated discovery. While it used to be the case that an idea, put through the rigour of the scientific method, would lead to the generation of data, now the reverse is not only possible but becoming increasingly likely. Data is now generated first, before a theory is discovered, and the discovery may come from AI and not a scientist. This, for James, can be considered the new scientific method.

Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH), where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects that she was involved with at CDH. One was called Ghost Fictions, a series of workshops with the aim of encouraging critical thinking about automated text generation using AI methods both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, which was intended as a provocation of a future where books, study guides and examinations are created by Large Language Models (LLMs) (perhaps a not so distant future). Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet perhaps the most worrying aspect of the transformative forms of AI is the immediate and consequential impact they have on the environment. Quenching the thirst for the exponential amounts of data needed to train and progress AI chatbots, LLMs and image generation systems requires vast computing power, which in turn generates a lot of heat and requires large amounts of water to operate. As Anne demonstrated, this could be increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water stressed, but is also home to the University’s data centre and the new DAWN AI supercomputer. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?

Dr James Fergusson and Dr Anne Alexander answering questions from the Data Champions at the March forum

The themes that Anne concluded her presentation with formed the basis of the Q&A between the Data Champions and the speakers. The topic of the potential biases of AI and ML was put to James, who agreed that his field of study could not escape it. That said, unlike in the humanities, biases in physics can potentially be helpful as they may help make the scientific process as objective as possible. However, this could clearly be problematic for humanities research, which tends to deal with social systems and relations, and views of the world. The topic of the environmental cost of AI was also touched on, on which James commented that energy insufficiency is a problem and is getting harder to justify, and that solutions might only create new problems as the demand for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted on this: as the environment is a social problem, society should have a say on what risks it is willing to be a part of. The question of the automation of science was also raised with James, who admitted that preparing early career physicists for research now involves developing their software skills rather than subject knowledge expertise in physics or mathematics.

What we can learn from the ‘promise and pitfalls of preregistration’ meeting

Dr Mandy Wigdorowitz, Open Research Community Manager, Cambridge University Libraries

The promise and pitfalls of preregistration meeting was held at the Royal Society in March 2024. It was organised to address the utility of preregistration and initiate an interdisciplinary dialogue about its epistemic and pragmatic aims. The goal of the meeting was to explore the limitations associated with preregistration, and to conceive of a practical way to guide future research that can make the most of its implementation.

Preregistration is the practice of publicly declaring a study’s hypotheses, methods, and analyses before conducting a research study. Researchers are encouraged to be as specific as possible when writing preregistration plans, detailing every aspect of the research methodology and analyses, including, for instance, the study design, sample size, procedure for dealing with outliers, blinding and manipulation of conditions, and how multiple analyses will be controlled for. By doing so, researchers commit to a time-stamped study plan which will reduce the potential for flexibility in analysis and interpretation that may lead to biased results. Preregistration is a community-led response to the replication crisis and aims to mitigate questionable research practices (QRPs) that have come to light in recent years, some of which include HARKing (Hypothesising After Results are Known), p-hacking (the inappropriate manipulation of data analysis to enable a favoured result to be presented as statistically significant), and publication bias (the unbalanced publication of statistically significant findings or positive results over null and/or unexpected findings) (Simmons et al., 2011; Stefan & Schönbrodt, 2023).

The meeting brought together scholars and publishers from a range of disciplines and institutions to discuss whether preregistration has indeed lived up to these aims and whether and to what extent it has solved the problems it was envisioned to address.

It became clear that the problems associated with QRPs have not simply disappeared with the uptake and implementation of preregistration. From the perspective of meta-research, the success of preregistration appears to depend largely on discipline and legal requirements, with some disciplines mandating and normalising it (e.g., clinical trial registration in biomedical research), others greatly encouraging and (sometimes) requiring it (e.g., psychological science research), and others having no expectations about its use (e.g., economics research). The effectiveness of preregistration was shown to be linked to these dependencies, but also related to the quality and detail of the preregistration plan itself. Researchers are the arbiters of their research choices and if they choose to write vague or ambiguous preregistration plans, the problems that preregistration is assumed to address will inevitably persist.

Various preregistration templates exist (such as on the Open Science Framework, OSF) and some incentives for preregistration are recognised, such as the preregistration badges awarded by some journals, making it a systematic and straightforward exercise. In practice, however, it is not always the case that sufficient information is provided, and even in cases where preregistered plans are detailed, they are not always followed for various pragmatic or other (not always nefarious) reasons. As such, the research community is cautioned not to assume that preregistration equates to better or more trustworthy research. Rather, the preregistration plan needs to be critically reviewed as a standalone document in conjunction with the published study. This is important because preregistration plans that are usually deposited into repositories (e.g., OSF, National Library of Medicine’s Clinical Trials Registry) are seldom evaluated as entities of their own or against their corresponding research articles. Note that this is unlike registered reports, a type of journal article detailing a study’s protocol, which does get peer reviewed before data is collected and, if reviewed favourably, is given an in-principle acceptance regardless of the study outcomes.

Other discussions centred around the utility of preregistration in exploratory versus confirmatory research, whether preregistration can improve our theories, and how the process of conducting multiple but slightly varied analyses and selecting the most desired outcome (also referred to as the ‘garden of forking paths’) affects the claims we make.

The overall sentiment from the meeting was that while preregistration does not solve all the issues that have arisen from QRPs, it ultimately leads to more transparency of the research process, accountability on the part of the researchers conducting the research, and it facilitates deeper engagement with one’s own research prior to any collection or analysis of data.

Since attending the meeting, I have taken away valuable insights that have made me critically reflect on my own research choices, and from a practice perspective, I have downloaded the OSF preregistration template and am documenting the plans for a research project.

Given the strides that have been taken toward improving the transparency, credibility and reproducibility of research, researchers at Cambridge need to consider whether preregistration plans should be included as another type of output that can be deposited on the institutional repository, Apollo. We have recently added Methods and preprints as output types which have broadened the options for sharing and which align with open research practices. Including preregistration could be a valuable and timely addition.  

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. https://doi.org/10.1177/0956797611417632

Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: a compendium and simulation of p-hacking strategies. Royal Society Open Science, 10(2), 220346. https://doi.org/10.1098/rsos.220346

Dear Data,…

Valentine’s Day week for the international data community is not only a time for expressing your love to the significant others in your life. As it is also Love Data Week, it is a time to reflect on your love for all things data! That was the goal for the Research Data team this year! The theme of this year’s Love Data Week was “My Kind of Data”, suggesting that data workers – researchers and analysts alike – have a relationship to data that is personal, often idiosyncratic, and almost always heartfelt. The Research Data team, as supporters of the University’s researchers, are interested in such relationships and are always eager to discover the distinctive needs that the disciplinary differences between the University’s departments create. This year, the Research Data team decided that they wanted to find out from students and researchers in the Arts, Humanities and Social Sciences (AHSS) what their kind of data was.

To do so, the Research Data team positioned themselves in the foyer of the Alison Richard Building on the University’s Sidgwick Site, which is home to several AHSS departments, for two mornings on Monday the 12th and Thursday the 15th of February. Across the city, Data Champion Lizzie Sparrow was leading the charge with science, technology, engineering, mathematics and medicine (STEMM) students and researchers by holding her own pop-up at the West Hub. Like the Research Data team, and as a Research Support Librarian (Engineering) herself, Lizzie is also interested in the relationships that researchers have with data. Her approach, however, would likely be different. Unlike for researchers in the STEMM subjects, the term data can sometimes feel exclusionary for AHSS students and researchers, as they may not consider what they generate through research as data. From our perspective, on the other hand, any material that goes on to form any part of one’s research is data. To bring attention to this, the team tried to engage passers-by with the provocation “you have research data, change our minds!” The provocation was successful and many conversations were had on the different ways that members of the Sidgwick community understood data in their research.

The Research Data Team from the Office of Scholarly Communication (Cambridge University Library), from left to right: Clair Castle, Lutfi Othman, Kim Clugston.

The team was pleased to find that there was a general interest in the services of the Research Data team among the Sidgwick community, and we were happy to be able to share with others how we can help them with their data management and planning.

Some treats for those who stop by.
Our Open Research poster, designed by Clair Castle.

The team tried to capture the sentiments of the conversations had by asking the Sidgwick community to take part in two short activities as they departed our pop-up, to better understand their relationship with data (in exchange for Love Hearts sweets!). Firstly, we asked them to describe to us what data was to them, a question that we are extremely fond of asking! As usual, the answers were informative and they helped us to gain a sense of the varying data types that the Sidgwick community worked with – from political tracts and archival materials to balance sheets and land deeds from the early modern era.

Activity 1: Lots of different data types in the AHSS community!

For the second activity, we asked them what term best captured the materials that formed the basis of their scholarly work: data, research materials, or other? To our surprise, the majority of people we spoke to over both days saw themselves as working with data, more than double the number that saw themselves as working with research materials, with a small number seeing themselves as working with both, interchangeably. This finding illustrated something that has been increasingly discussed in the Research Data team office: that finding alternatives to the term data may make our services and initiatives more appealing to members of the AHSS community. This is something we will take into account when targeting our outreach in the future. Yet one thing is certain – our Research Data services are needed by the AHSS community just as much as they are by the STEMM community.

Activity 2: More generators of ‘data’ than we expected!

The pop-ups at the Alison Richard Building were encouraging and it is hoped that fruitful relationships will transpire from these events. This is something that we may hold again soon. It was a good way to communicate our message and make others aware of the services of the Research Data team. Over at the West Hub, Lizzie was not as encouraged, having only managed to have in-depth chats with a couple of people. She reported that lots of people were very determinedly on their way somewhere and not up for stopping to talk. The time and/or location did not seem right for the intended audience. I suppose we shouldn’t stand in between a student and their food. In any case, there was lots to take away from these Love Data Week pop-ups, and lots to reflect on when we plan for our next pop-up, be it for Love Data Week 2025 or just as a periodic service to the research community here at Cambridge. Perhaps when the weather is nicer in the summer, we will do a pop-up outdoors in the middle of the Sidgwick Site, or at research events throughout the University. If you have any ideas on where it would be good for us to hold such a pop-up, do let us know!