All posts by Lutfi Bin Othman

The (exponential) thirst for data – The March 2024 Data Champions forum

The Data Champions were treated to a big-data-themed session for the March Data Champion forum, hosted at (and sponsored by) Cambridge University Press and Assessment in their amazing Triangle building. First up was Dr James Fergusson, course director for the MPhil in Data Intensive Science, who described how the exponential growth in data accumulation, computing and artificial intelligence (AI) capabilities has led to a paradigm shift in the world of cosmological theorisation and research, potentially changing with it scientific research as a whole.

Dr James Fergusson presenting to the Data Champions at the March forum

As he explained, over the last two decades cosmologists have seen a rapid increase in the number of data points on which to base their theorisation – from merely 14 data points in 2000 to 800 million data points in 2013! Through the availability of these data points, the paradigm for research in cosmology started to shift completely – from being theory based to being based on data. With several projects beginning soon that will see vast amounts of data generated daily for decades to come, this trend is showing no signs of slowing down. The only way to cope with this exponential increase in data generation is with computing power, which has also been growing exponentially. In tandem with these sectors of growth is the growth of machine learning (ML) capabilities, as the copious amount of data necessitates not only immense amounts of computing power but also ML capabilities to process and analyse all of the data. Together, these elements are fundamentally changing the story of scientific discovery. What was once a story of an individual researcher having an intellectual breakthrough is becoming the story of machine-led, automated discovery. While it used to be the case that an idea, put through the rigour of the scientific method, would lead to the generation of data, now the reverse is not only possible but becoming increasingly likely. Data is now generated first before a theory is discovered, and the discovery may come from AI and not a scientist. This, for James, can be considered the new scientific method.

Dr Anne Alexander has been familiarising herself with AI, especially in her capacity as Director of Learning at Cambridge Digital Humanities (CDH), where she has been incorporating critique of AI into a methodology of research in the digital humanities, particularly in the area of Critical Pedagogy. In her work, Anne addresses how structural inequalities can be reinforced, rather than challenged, by AI systems. She demonstrated this through two projects that she was involved with at CDH. One was called Ghost Fictions, a series of workshops with the aim of encouraging critical thinking about automated text generation using AI methods both in scholarly work and in social life. The project resulted in a (free to download) book titled Ghost, Robots, Automatic Writing: an AI level study guide, which was intended as a provocation of a future where books, study guides and examinations are created by Large Language Models (LLMs) (perhaps a not so distant future). Another project involved using AI to create characters for a new novel, which revealed the racial biases of ChatGPT when prompted with certain names. Yet perhaps the most worrying aspect of the transformative forms of AI is the immediate and consequential impact they have on the environment. Quenching the thirst for the exponential amounts of data needed to train and advance AI chatbots, LLMs and image generation systems requires vast computing power, which in turn generates a lot of heat and requires large amounts of water to operate. As Anne demonstrated, this could be increasingly problematic for many places as the global climate crisis continues. Locally, we have the case of West Cambridge, which is already water stressed, but is also home to the University’s data centre and the new DAWN AI supercomputer. Through these examples, she posed the questions: does AI perpetuate further harm and inequality? Are the environmental costs of AI too high?

Dr James Fergusson and Dr Anne Alexander answering questions from the Data Champions at the March forum

The themes that Anne concluded her presentation with formed the basis of the Q&A between the Data Champions and the speakers. The topic of the potential biases of AI and ML was put to James, who agreed that his field of study could not escape them. That said, unlike in the humanities, biases in physics can potentially be helpful as they may help make the scientific process as objective as possible. However, this could clearly be problematic for humanities research, which tends to deal with social systems and relations, and views of the world. The topic of the environmental cost of AI was also touched on, on which James commented that energy insufficiency is a problem and getting harder to justify, and solutions might only create new problems as the demand for this technology is not slowing down. Anne expressed her concern and suggested that society at large should be consulted on this: as the environment is a social problem, society should have a say on what risk they are willing to be a part of. The question of the automation of science was also raised with James, who admitted that preparing early career physicists for research now involves developing their software skills rather than subject knowledge expertise in physics or mathematics.

Dear Data,…

Valentine’s Day week for the international data community is not only a time for expressing your love to the significant others in your life. As it is also Love Data Week, it is a time to reflect on your love for all things data! That was the goal for the Research Data team this year! The theme of this year’s Love Data Week was “My Kind of Data”, suggesting that data workers – researchers and analysts alike – have a relationship to data that is personal, often idiosyncratic, and almost always heartfelt. The Research Data team, as supporters of the University’s researchers, are interested in such relationships and are always eager to discover the distinctive needs that the disciplinary differences between the University’s departments create. This year, the Research Data team decided that they wanted to find out from students and researchers from the Arts, Humanities and Social Sciences (AHSS) what their kind of data was.

To do so, the Research Data team positioned themselves in the foyer of the Alison Richard Building on the University’s Sidgwick Site, which is home to several AHSS departments, for two mornings on Monday the 12th and Thursday the 15th of February. Across the city, Data Champion Lizzie Sparrow was leading the charge with science, technology, engineering, mathematics and medicine (STEMM) students and researchers by holding her own pop-up at the West Hub. Like the Research Data team, and as a Research Support Librarian (Engineering) herself, Lizzie is also interested in the relationships that researchers have with data. Her approach, however, would likely be different. Unlike for researchers in the STEMM subjects, the term data can sometimes feel exclusionary for AHSS students and researchers, as they may not consider what they generate through research as data. From our perspective, on the other hand, any material that goes on to form any part of a researcher’s work is their data. To bring attention to this, the team tried to engage passers-by with the provocation “you have research data, change our minds!” The provocation was successful and many conversations were had on the different ways that members of the Sidgwick community understood data in their research.

The Research Data Team from the Office of Scholarly Communication (Cambridge University Library), from left to right: Clair Castle, Lutfi Othman, Kim Clugston.

The team was pleased to find that there was a general interest in the services of the Research Data team among the Sidgwick community, and we were happy to be able to share with others how we can help them with their data management and planning.

Some treats for those who stop by.
Our Open Research poster, designed by Clair Castle.

The team tried to capture the sentiments of the conversations had by asking the Sidgwick community to partake in two short activities as they departed our pop-up to better understand their relationship with data (in exchange for Love Hearts sweets!). Firstly, we asked them to describe to us what data was to them, a question that we are extremely fond of asking! As usual, the answers were informative and they helped us to gain a sense of the varying data types that the Sidgwick community worked with – from political tracts and archival materials to balance sheets and land deeds from the early modern era.

Activity 1: Lots of different data types in the AHSS community!

For the second activity, we asked them what term best captured the materials that formed the basis of their scholarly work: data, research materials, or other? To our surprise, the majority of people we spoke to over both days saw themselves as working with data, more than double the number that saw themselves working with research materials, with a small number seeing themselves as working with both, interchangeably. This finding illustrated something that has been increasingly discussed in the Research Data team office: that finding alternatives to the term data may make our services and initiatives more appealing to members of the AHSS community. This is something we will take into account when targeting our outreach in the future. Yet one thing is certain – our Research Data services are needed by the AHSS community just as much as they are by the STEMM community.

Activity 2: More generators of ‘data’ than we expected!

The pop-ups at the Alison Richard Building were encouraging and it is hoped that fruitful relationships will transpire from these events. This is something that we may hold again soon. It was a good way to communicate our message and make others aware of the services of the Research Data team. Over at the West Hub, Lizzie was not as encouraged, having only managed to have in-depth chats with a couple of people. She reported that lots of people were very determinedly on their way somewhere and not up for stopping to talk. The time and/or location did not seem right for the intended audience. I suppose we shouldn’t stand in between a student and their food. In any case, there was lots to take away from these Love Data Week pop-ups, and lots to reflect on when we plan for our next pop-up, be it for Love Data Week 2025 or just as a periodic service to the research community here at Cambridge. Perhaps when the weather is nicer in the summer, we will do a pop-up outdoors in the middle of the Sidgwick Site, or at research events throughout the University. If you have any ideas on where it would be good for us to hold such a pop-up, do let us know!

Mapping the world through data – The November 2023 Data Champion Forum 

The November Data Champion forum was a geography/geospatial data themed edition of the bi-monthly gathering, this time hosted by the Physiology department. As usual, the Data Champions in attendance were treated to two presentations. Up first was Martin Lucas-Smith from the Department of Geography, who introduced the audience to the OpenStreetMap (OSM) project, a global community mapping project using crowdsourcing. Just as Wikipedia does for textual information, OSM results in a worldwide map created by everyday people who map the world themselves. The resulting maps can vary in their focus: examples include the transport map, which shows public transport lines such as railways, buses and trams worldwide, and the humanitarian map, which is an initiative dedicated to humanitarian action through open mapping. Martin is personally involved in a project called CycleStreets which, as the name implies, uses open mapping of bicycle infrastructure. The Department of Geography uses OSM as a background for its Cambridge Air Photos websites. Projects like these, Martin highlighted, demonstrate how community gets generated around open data.

CycleStreets: Martin at the November 2023 Data Champion Forum

In his presentation, Martin explained the mechanics of OSM such as its data structure, how the maps are edited, and how data can be used in systems like routing engines. Editing the maps, and the decision-making processes that lie behind how a path is represented visually on the map, are where the OSM community comes into action. While the data in OSM consists primarily of geometric points (called ‘Nodes’) and lines (called ‘Ways’) coupled with tags which denote metadata values, the norms about how to define this information can only come about by consensus from the OSM community. This is perhaps different to more formal database structures that might be employed within corporate efforts such as Google. Because of its widespread crowdsourced nature, OSM tends to be more detailed than other maps for less well-served communities such as people cycling or walking, and its metadata is richer, as it is created by people who are intimately familiar with the areas that they are mapping. A map by users for users.
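
As a rough illustration of the data structure described above, the sketch below models Nodes, Ways and tags in plain Python. It is not OSM's actual schema or API: the class names, IDs, coordinates and helper functions are all hypothetical, and the tag values simply follow the key/value style that the OSM community uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A single geometric point (illustrative, not OSM's real schema)."""
    id: int
    lat: float
    lon: float
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids forming a line, plus descriptive tags."""
    id: int
    node_ids: list[int]
    tags: dict[str, str] = field(default_factory=dict)


# Hypothetical ids and coordinates, made up for this example.
nodes = {
    1: Node(1, 52.2000, 0.1300),
    2: Node(2, 52.2010, 0.1320),
    3: Node(3, 52.2020, 0.1340, tags={"amenity": "bicycle_parking"}),
}

# Tags are free-form key/value pairs; their meaning rests on community consensus.
ways = [
    Way(10, [1, 2, 3], tags={"highway": "cycleway", "surface": "asphalt"}),
    Way(11, [1, 3], tags={"highway": "primary"}),
]


def ways_with_tag(ways, key, value):
    """Return the ways whose tags contain the given key/value pair."""
    return [w for w in ways if w.tags.get(key) == value]


def way_coordinates(way, nodes):
    """Resolve a way's node ids into (lat, lon) pairs, in order."""
    return [(nodes[n].lat, nodes[n].lon) for n in way.node_ids]


cycleways = ways_with_tag(ways, "highway", "cycleway")
```

A cycle routing engine could, under these assumptions, start from a filter like `ways_with_tag` to pick out the lines that matter to people cycling, then use `way_coordinates` to recover the geometry of each one.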

Next up was Dr Rachel Sippy, a Research Associate with the Department of Genetics, who presented how geospatial data factored into epidemiological research. In her work, the questions of ‘who’, ‘when’, and ‘where’ a disease outbreak occurred are important, and it is the ‘where’ that gives her research a geographical focus. Maps, however, are often not detailed enough to provide information about an outbreak of disease among a population or community, as maps can only mark out the incident site, the place, whereas the spatial context of that place, which she denotes as space, is equally important in understanding disease outbreaks.

Of ‘Space’ and ‘Place’: Rachel at the November 2023 Data Champion forum

It can be difficult, however, to understand what a researcher is measuring and what types of data can be used to measure space and/or place. Spatial data, as Rachel pointed out, can be difficult to work with and the researcher has to decide if spatial data is a burden or fundamental to the understanding of a disease outbreak in a particular setting. Rachel discussed several aspects of spatial data which she has considered in her research such as visualisation techniques, data sources and methods of analysis. They all come with their own sets of challenges and researchers have to navigate them to decide how best to tell the fundamental story that answers the research question. This essentially comes down to an act of curation of spatial data, as Rachel pointed out, quoting Mark Monmonier, that “not only is it easy to lie with maps, it’s essential”. In doing so, researchers working with spatial data have to navigate the political and cultural hierarchies that are explicitly and implicitly inherent to places, and any ethical considerations relating to both the human and non-human (animal) inhabitants of those geographical locations. Ultimately, how data owners choose to model the spatial data will affect the analysis of the research, and with it, its utility for public health.

After lunch, Martin and Rachel sat together to hold a combined Q&A session and a discussion emerged around the topic of subjectivity. A question was raised to Rachel regarding mapping and subjectivity, as it was noticed that how she described place, which included socio-cultural meanings and personal preferences of the inhabitants of the place, could be considered subjective. Rachel agreed and alluded back to her presentation, where she mentioned that these aspects of mapping can get fuzzy as researchers have to deal with matters relating to identity, political affiliations and personal opinions, such as how safe an individual may feel in a particular place. Martin added that with the OSM project the data must be as objective as possible, yet the maps themselves are subjective views of objective data.

Rachel and Martin answering questions from the Data Champions at the November 2023 forum

Martin also pointed out that maps are contested spaces because spaces can be political in nature. Rachel added that sometimes maps do not appropriately represent the contested nature of her field sites, which she only learned through time in the field. In this way, context is very important for “real mapping”. As an example, Martin discussed his “UK collision data” map, created outside the University, which shows where collisions have happened, giving the example of one of central Cambridge’s busiest streets, Mill Road: without contextual information such as what time these collisions occurred, what vehicles were involved, and the environmental conditions at the time of the accident, a collision map may not be that valuable. To this end, it was asked whether ethnographic research could provide useful data in the act of mapping, and the speakers agreed.

Data Diversity Podcast #1 – Danny van der Haven

Last week, the Research Data Team at the Office of Scholarly Communication recorded the inaugural Data Diversity Podcast with Data Champion Danny van der Haven from the Department of Materials Science and Metallurgy.

As is the theme of the podcast, we spoke to Danny about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the differences between lab research and working with human participants, his views on objectivity in scientific research, and how unexpected findings can shed light on datasets that were previously insignificant. We also learned about Danny’s current PhD research studying the properties of pharmaceutical powders to enhance the production of medication tablets.

Click here to listen to the full conversation.

If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter.
– Danny van der Haven

   ***  

What is data to you?  

Danny: I think I’m going to go for a very general description. I think that you have data as soon as you record something in any way. If it’s a computer signal or if it’s something written down in your lab notebook, I think that is already data. So, it can be useless data, it can be useful data, it can be personal data, it can be sensitive data, it can be data that’s not sensitive, but I would consider any recording of any kind already data. The experimental protocol that you’re trying to draft? I think that’s already data.   

If you’re measuring something, I don’t think it’s necessarily data when you’re measuring it. I think it becomes data only when it is recorded. That’s how I would look at it. Because that’s when you have to start thinking about the typical things that you need to consider when dealing with regular data, sensitive data or proprietary data etc.   

When you’re talking about sensitive data, I would say that it is any data or information of which the public disclosure or dissemination may be undesirable for any given reason. That’s really when I start to draw the distinction between data and sensitive data. That’s more my personal view on it, but there’s also definitely a legal or regulatory view. Looking for example at the ECG, the electrocardiogram, you can take the electrical signal from one of the 12 stickers on a person’s body. I think there is practically nobody that’s going to call that single electrical signal personal data or health data, and most doctors wouldn’t bat an eye.

But if you would take, for example, the heart rate per minute that follows from the full ECG, then it becomes not only personal data but also becomes health data, because then it starts to say something about your physiological state, your biology, your body. So there’s a transition here that is not very obvious. Because I would say that heart rate is obviously health data and the electrical signal from one single sticker is quite obviously not health data. But where is the change? Because what if I have the electrical signal from all 12 stickers? Then I can calculate the heart rate from the signal of all the 12 stickers. In this case, I would start labelling this as health data already. But even then, before it becomes health data, you also need to know where the stickers are on the body.   

So when is it health data? I would say that somebody with decent technical knowledge, if they know where the stickers are, can already compute the heart rate. So then it becomes health data, even if it’s not on the surface. A similar point is when metadata becomes real data. For example, your computer always saves the date and time you modified files. But sometimes, if you have sensitive systems or you have people making appointments, even such simple metadata can actually become sensitive data.

On working within the constraints of GDPR  

Danny: We struggled with that because with our startup Ares Analytics, we also ran into the issues with GDPR. In the Netherlands at the time, GDPR was interpreted really stringently by the Dutch government. Data was not anonymous if you could, in any way, no matter how difficult, retrace the data to the person. Some people are not seeing these possibilities, but just to take it really far: if I would be a hacker with infinite resources, I could say I’m going to hack into the dataset and see the moments that the data were recorded. And then I can hack into the calendar of everybody whose GPS signal was at the hospital on this day, and then I can probably find out who at that time was taking the test… I mean is that reasonable? Is anybody ever going to do that? If you put those limitations on data because that is a very, very remote possibility; is that fair or are you going to hinder research too much? I understand the cautionary principle in this case, but it ends up being a struggle for us in that sense.

Lutfi: Conceivably, data will lose its value. If you really go to the full extent on how to anonymise something, then you will be dataless really because the only true way to anonymise and to protect the individual is to delete the data.  

Danny: You can’t. You’re legally not allowed to because you need to know what data was recorded with certain participants. Because if some accident happens to this person five years later, and you had a trial with this person, you need to know if your study had something to do with that accident. This is obvious when you’re testing drugs. So in that sense, the hospital must have a non-anonymised copy, they must. But if they have a non-anonymised copy and I have an anonymised copy… If you overlay your data sets, you can trace back the identity. So, this is of course where you end up with a deadlock.

What is your relationship to data?  

Danny: I see my relationship to data more as a role that I play with respect to the data, and I have many roles that I cycle through. I’m the data generator in the lab. Then at some point, I’m the data processor when I’m working on it, and then I am the data manager when I’m storing it and when I’m trying to make my datasets Open Access. To me, that varies, and it seems more like a functional role. All my research depends on the data.  

Lutfi: Does the data itself start to be more or less humanised along the way, or do you always see it as you’re working on someone, a living, breathing human being, or does that only happen toward the end of that spectrum?   

Danny: Well, I think I very much have the stereotypical scientist mindset in that way. To me, when I’m working on it, in the moment, I guess it’s just numbers to me. When I am working on the data and it eventually turns into personal and health data, then I also become the data safeguarder or protector. And I definitely do feel that responsibility, but I am also trying to avoid bias. I try not to make a personal connection with the data in any sense. When dealing with people and human data, data can be very noisy. To control tests properly, you would like to be double blind. You would like not to know who did a certain test, you would like not to know the answer beforehand, more or less, as in who’s more fit or less fit. But sometimes you’re the same person as the person who collected the data, and you actually cannot avoid knowing that. But there are ways that you can trick yourself to avoid that. For example, you can label the data in a certain clever way and you make sure that the labelling is only something that you see afterwards.

Even in very dry physical lab data, for example microscopy of my powders, the person recording it can introduce a significant bias because of how they tap the microscopy slide when there’s powder on it. Now, suddenly, I’m making an image of two particles that are touching instead of two separate particles. I think it’s also kind of my duty, that when I do research, to make the data, how I acquire it, and how it’s processed to be as independent of the user as possible. Because otherwise user variation is going to overlap with my results and that’s not something I want, because I want to look at the science itself, not who did the science. 

Lutfi: In a sentence, in terms of the sort of accuracy needed for your research, the more dehumanised the data is, the more accurate the data so to speak.   

Danny: I don’t like the phrasing of the word “dehumanised”. I guess I would say that maybe we should be talking about not having person-specific or operator-specific data. If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter. 

             ***  

If you would like to be a guest on the Data Diversity Podcast and have some interesting data related stories to share, please get in touch with us at info@data.cam.ac.uk and state your interest. We look forward to hearing from you!  

The September 2023 Data Champion Forum

The Cambridge Data Champions had a fantastic September Forum at the West Hub. The forum started with an introduction to the West Hub by Library Manager Daniele Campello, and we welcomed Clair Castle as the new interim Research Data Manager with the Office of Scholarly Communication (University Library).

Dr Mandy Wigdorowitz kicked off the presentations by sharing with the Data Champions what she aims to achieve as the University’s Open Research Community Manager. This includes raising the profile of Open Research at the University and ensuring that scholarly and research outputs that are deemed to be open are indeed accessible and interoperable in accordance with FAIR principles. As Open Research Community Manager, Mandy advocates for Open Research among University researchers from both the STEMM and AHSS (Arts, Humanities and Social Sciences) disciplines. The latter proves to be more challenging as researchers in AHSS may often have valid reasons for refraining from making their research data open, such as working with sensitive data or working with interlocutors who object to their data being shared. Such issues will be addressed at the Cambridge Open Research Conference that she is organising, which takes place on 17th November 2023 at Downing College, Cambridge as well as online. To end, Mandy invited the Data Champions to join her Open Research initiative, a community of advocates for Open Research across the University.

Before lunch, Madeleine Taylor (Information Security Risk and Governance Manager with University Information Services, UIS) presented a follow-up to a webinar session on monitoring the Information and Cybersecurity (ICS) risks for research data across the University, which she conducted with the Data Champions a couple of weeks prior. After a brief introduction to what she has done so far to protect Cambridge’s research communities against ICS threats, she asked the Data Champions for help in her task of securing research data against ICS risks. They can do so by providing her with a sense of what data their own research communities are working with and how they are storing it. As the Data Champions ate the delicious lunch of sandwiches and cakes provided by the West Hub caterers, they provided feedback to Madeleine on two forms that she proposed as methods of gathering the information she needed: a 3-minute research data impact assessment form and a research data cyber security risk form. Maddy will continue to work with the Research Data Team and the Data Champions to refine these forms and gather information through them.

Thank you to the West Hub and Daniele Campello for hosting the Data Champions Forum in your welcoming building!

If you are a member of the University of Cambridge and are interested in attending the Data Champions Forum, please join us as a Data Champion. If you are passionate about research data management and data sharing or you would like to find out more about what being a Data Champion entails, please visit the Data Champions webpage. We welcome applications from those working in all academic subjects across AHSS and STEMM disciplines. If you are unsure about how being a Data Champion would impact your research, please get in touch with the Research Data Team!

Cartoon by Clare Trowell CC-BY-NC-ND