
Data Diversity Podcast (#4) – Dr Stefania Merlo (1/2) 

Welcome back to the fourth instalment of Data Diversity, the podcast where we speak to Cambridge University Data Champions about their relationship with research data and highlight their unique data experiences and idiosyncrasies in their journeys as researchers. In this edition, we speak to Data Champion Dr Stefania Merlo from the McDonald Institute for Archaeological Research, Remote Sensing Digital Data Coordinator and project manager of the Mapping Africa’s Endangered Archaeological Sites and Monuments (MAEASaM) project and coordinator of the Metsemegologolo project. This is the first of a two-part series, and in this first post, Stefania shares with us her experiences of working with research data and outputs that are part of heritage collections, and how her thoughts about research data and the role of the academic researcher have changed throughout her projects. She also shares her thoughts about what funders can do to ensure that research participants, and the data that they provide to researchers, can speak for themselves.



I’ve been thinking for a while about the etymology of the word data. Datum in Latin means ‘given’. Whereas when we are collecting data, we always say we’re “taking measurements”. Upon reflection, it has made me come to a realisation that we should approach data more as something that is given to us and we hold responsibility for, and something that is not ours, both in terms of ownership, but also because data can speak for itself and tell a story without our intervention – Dr Stefania Merlo


Data stories (whose story is it, anyway?) 

LO: How do you use data to tell the story that you want to tell? To put it another way, as an archaeologist, what is the story you want to tell and how do you use data to tell that story?

SM: I am currently working on two quite different projects. One is Mapping Africa’s Endangered Archaeological Sites and Monuments, funded by Arcadia to create an Open Access database of information on endangered archaeological sites and monuments in Africa. In the project, we define “endangered” very broadly because ultimately, all sites are endangered. We’re doing this with a number of collaborators, and the objective is to create a database that is mainly going to be used by national authorities for heritage management. There’s a little less storytelling there; it has more to do with intellectual property: who are the custodians of the sites and the custodians of the data? A lot of questions are asked about Open Access, which is something that the funders of the project have requested, but something that our stakeholders have raised a lot of issues with. The issues centre on where the digital data will be stored, because currently it is stored temporarily in Cambridge. Ideally, all our stakeholders would like to see it stored on a server in the African continent at the least, if not actually in their own country. There are a lot of questions around this. 

The other project stems out of the work I’ve been doing in Southern Africa for almost the past 20 years, and is about asking: how do you articulate knowledge of the African past that is not represented in history textbooks? This is a history that is rarely taught at university and is rarely discussed. How do you make knowledge available to publics that are not academic publics? That’s where the idea came from of creating a multimedia archive and a platform where digital representations of archaeological, archival, historical, and ethnographic data could be used to put together stories that are not the mainstream stories. It is a work in progress. The datasets that we deal with are very diverse, because we are required to tell a history in a place, and in periods, for which we don’t have written sources.  

It’s so mesmerising and so different from what we do in contexts where history is written. It gives us the opportunity to put together so many diverse types of sources: from oral histories to missionary accounts (with all the issues around colonial reports and representations of others as they were perceived at the time), to information on the past environment combined with archaeological data. We have a collective of colleagues who work in universities and museums. Each performs different bits and pieces of research, and we are trying to see how we would put together these types of datasets. How much do we curate them to make them available to other audiences? We’ve used the concept of data curation very heavily, and we use it purposefully, because there is an impression of the objectivity of data, and we know, especially as social scientists, that this just doesn’t exist. 

I’ve been thinking for a while about the etymology of the word data. Datum in Latin means ‘given’. Whereas when we are collecting data, we always say we’re taking measurements. Upon reflection, it has made me come to a realisation that we should approach data more as something that is given to us and we hold responsibility for, and something that is not ours, both in terms of ownership, but also because data can speak for itself and tell a story without our intervention. That’s the kind of thinking surrounding data that we’ve been going through with the project. If data are given, our work is an act of restitution, and we should also acknowledge that we are curating it. We are picking and choosing what we’re putting together and in which format and framework. We are intervening a lot in the way these different records are represented so that they can be used by others to tell stories that are perhaps of more relevance to us. 

So there’s a lot of work in this project that we’re doing about representation. We are explaining (not justifying, but explaining) the choices that we have made in putting together information that we think could be useful to re-create histories and tell stories. The project will benefit us because we are telling our own stories using digital storytelling, and in particular story mapping, but it could also become useful for others as a resource that they can use to tell their own stories. It’s still a work in progress, because we also work in low-resourced environments. The way in which people can access digital repositories and then use online resources is very different in Botswana and in South Africa, which are the two countries where I mainly work in this project. We also dedicate time to thinking about how useful the digital platform will be for the audiences that we would like to engage. 

The intended output is an archive that can be used in a digital storytelling platform. We have tried to narrow down our target audience to secondary school and early university students of history (and archaeology). We hope that the platform will eventually be used more widely, but we realised that we had to identify an audience to be able to prepare the materials. We have also realised that we need to give guidance on how to use such a platform, so in the past year we have worked with museums and learnt from museum education departments about using the museum as a space for teaching and learning, where some of these materials could become useful. Teachers and museum practitioners don’t have a lot of time to create their own teaching and learning materials, so we’re trying to find a way of engaging with practitioners and teachers that doesn’t overburden them. For these reasons, more intervention needs to come from our side in pre-packaging some of these curations, but we’re trying to do it in collaboration with them so that it’s not something that is solely produced by us academics. We want this to be something that is negotiated. As archaeologists and historians, we have expertise in a particular part of African history that the communities that live in that space may not know about, and cannot know, because they were never told. They may have learned about the history of these spaces from their families and their communities, but they have learned only certain parts of the history of that land, whereas we can go much deeper into the past. So the question becomes: how do you fill the gaps of knowledge without imposing your own worldview? It needs to be negotiated, but it’s a very difficult process to establish. There is a lot of trial and error, and we still don’t have an answer. 

Negotiating communities and funders 

LO: Have you ever had to navigate funders’ policies and stakeholder demands?  

SM: These kinds of projects need to be long and they need continuous funding, but they have outputs that are not always necessarily valued by funding bodies. This brings to the fore what funding bodies are interested in: is it solely data production, as it is called, and then the writing up of certain academic content? Or can we start to acknowledge that there are other ways of creating and sharing knowledge? As we know, there has been a drive, especially with UK funding bodies, to acknowledge that there are different ways in which information and knowledge are produced and shared. There are alternative ways of knowledge production, from artistic ones to creative ones and everything in between, but it’s still so difficult to account for the types of knowledge production that these projects may have. When I’m reporting on projects, I still find it cumbersome and difficult to represent these types of knowledge production. There’s so much more that you need to do to justify an output of alternative knowledge compared to traditional outputs. I think there needs to be change to make it easier, rather than more difficult, for researchers who produce alternative forms of knowledge to justify it compared to the mainstream. 

One thing I would say is that there’s a lot we’ve learned with the Mapping Africa’s Endangered Archaeological Sites and Monuments project, because there we engage directly with the custodians of the sites and of the analogue data. When they realise that the funders of the project expect to have this data openly accessible, then the questions come and the pushback comes, and it’s a pushback on a variety of different levels. The consequence is that we still haven’t been able to finalise our agreements with the custodians of the data. They trust us, so they have informed us that in the interim we can have the data as a project, but we haven’t been able to come to an agreement on what is going to happen to the data at the end of the project. In fact, the agreement at the moment is that the data are not going to go into a completely Open Access sphere. The negotiation now is about what they would be willing to make public, and what advantages they would have as custodians of the data in making part, or all, of these data public.

This has created a disjuncture between what the funders thought they were doing and what our stakeholders actually want. I’m sure the funders thought they were doing good by mandating that the data be Open Access, but perhaps they didn’t consider that in other parts of the world, Open Access may not be desirable, or wanted, or acceptable, for a variety of very valid reasons. It’s a knot that we still haven’t untangled, and it makes me wonder: when funders are asking for Open Access, have they really thought about work outside of UK contexts with communities outside of the UK context? Have they considered these communities’ rights to data and their right to say, “we don’t want our data to be shared”? There’s a lot of work that has happened in North America in particular, because indigenous communities are the ones that put forward the concept of C.A.R.E., but in the UK we are still very much discussing F.A.I.R. and not C.A.R.E. I think the funders may have started thinking about it, but we’re not quite there. There is still this impression that Open Data and Open Access are a universal good, without having considered that this may not be the case. It puts researchers who don’t work in the UK or the Global North in an awkward position. This is definitely something that we are still grappling with very heavily. My hope is that this work is going to help highlight that when it comes to Open Access, there are no universals. We should revisit these policies in light of the fact that we are interacting with communities globally, not only those in some countries of the world. Who is Open Access for? Who does it benefit? Who wants it and who doesn’t want it, and for what reasons? These are questions that we need to keep asking ourselves. 

LO: Have you been in a position where you had to push back on funders or Open Access requirements before? 

SM: Not necessarily a pushback, but our funders have funded a number of similar projects in South Asia, Mongolia, Nepal and the MENA region, and we have come together as a collective to discuss issues around the ethics and sustainability of the projects. We have engaged with representatives of our funders to try to explain that what they wanted initially, which is full Open Access, may not be practicable. In fact, there has already been a change in the terminology used by the funders: from Open Access, they changed the concept to Public Access, and they have come back to us to say that they can change their contractual terms to be more nuanced and acknowledge the fact that we are in negotiation with national and other stakeholders about what should happen to the data. Some of this has been articulated in various meetings, but some of it was trial and error on our side. In other words, with our new proposal for renewal of funding, which was approved, we simply included these nuances in the proposal and in our commitments, and they were accepted. So in the course of the past four years, through lobbying by the funded projects, we have been able to bring nuance to the way in which the funders themselves think about Open Access. 


Stay tuned for part two of this conversation where Stefania will share some of the challenges of managing research data that are located in different countries!


Mapping the world through data – The November 2023 Data Champion Forum 

The November Data Champion forum was a geography/geospatial data themed edition of the bi-monthly gathering, this time hosted by the Physiology department. As usual, the Data Champions in attendance were treated to two presentations. Up first was Martin Lucas-Smith from the Department of Geography, who introduced the audience to the OpenStreetMap (OSM) project, a global community mapping project built on crowdsourcing. Just as Wikipedia does for textual information, OSM produces a worldwide map created by everyday people who map the world themselves. The resulting maps can vary in focus: for example, the transport map, which shows public transport routes such as railways, buses and trams worldwide, and the humanitarian map, an initiative dedicated to humanitarian action through open mapping. Martin is personally involved in a project called CycleStreets which, as the name implies, uses open mapping of bicycle infrastructure. The Department of Geography uses OSM as a background for its Cambridge Air Photos website. Projects like these, Martin highlighted, demonstrate how community gets generated around open data. 

CycleStreets: Martin at the November 2023 Data Champion Forum

In his presentation, Martin explained the mechanics of OSM, such as its data structure, how the maps are edited, and how the data can be used in systems like routing engines. Editing the maps, and the decision-making processes behind how a path is represented visually on the map, is where the OSM community comes into action. While the data in OSM consists primarily of geometric points (called ‘Nodes’) and lines (called ‘Ways’) coupled with tags that denote metadata values, the norms about how to define this information can only come about by consensus from the OSM community. This is perhaps different to the more formal database structures that might be employed within corporate efforts such as Google’s. Because of its widespread crowdsourced nature, OSM tends to be more detailed than other maps for less well-served communities, such as people cycling or walking, and its metadata is richer, as it is created by people who are intimately familiar with the areas that they are mapping. A map by users, for users. 
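To make the Node/Way/tag structure concrete, here is a minimal illustrative sketch in Python. This is not actual OSM tooling, and the IDs, coordinates and tags are invented for the example; it only shows the shape of the model: Nodes are points, Ways are ordered lists of node references, and both carry free-form key-value tags whose meanings are set by community convention.

```python
# Nodes: points with coordinates and optional tags.
nodes = {
    1: {"lat": 52.2053, "lon": 0.1218, "tags": {}},
    2: {"lat": 52.2060, "lon": 0.1225, "tags": {"highway": "traffic_signals"}},
}

# Ways: ordered lists of node IDs plus tags; the order defines the geometry.
ways = {
    10: {
        "node_ids": [1, 2],
        "tags": {"highway": "cycleway", "surface": "asphalt"},
    },
}

def way_coords(way_id):
    """Resolve a way's node references into (lat, lon) coordinate pairs."""
    return [(nodes[n]["lat"], nodes[n]["lon"]) for n in ways[way_id]["node_ids"]]

print(way_coords(10))
```

A routing engine or renderer works from exactly this kind of resolution step: it follows a way's node references to recover the line, then reads the tags to decide how to treat it.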

Next up was Dr Rachel Sippy, a Research Associate in the Department of Genetics, who presented on how geospatial data factors into epidemiological research. In her work, the questions of ‘who’, ‘when’, and ‘where’ a disease outbreak occurred are important, and it is the ‘where’ that gives her research a geographical focus. Maps, however, are often not detailed enough to provide information about an outbreak of disease among a population or community: maps can only mark out the incident site, the place, whereas the spatial context of that place, which she denotes as space, is equally important in understanding disease outbreaks.  

Of ‘Space’ and ‘Place’: Rachel at the November 2023 Data Champion forum

It can be difficult, however, to understand what a researcher is measuring and what types of data can be used to measure space and/or place. Spatial data, as Rachel pointed out, can be difficult to work with, and the researcher has to decide whether spatial data is a burden or fundamental to the understanding of a disease outbreak in a particular setting. Rachel discussed several aspects of spatial data that she has considered in her research, such as visualisation techniques, data sources and methods of analysis. They all come with their own sets of challenges, and researchers have to navigate them to decide how best to tell the fundamental story that answers the research question. This essentially comes down to an act of curation of spatial data; as Rachel pointed out, quoting Mark Monmonier, “not only is it easy to lie with maps, it’s essential”. In doing so, researchers working with spatial data have to navigate the political and cultural hierarchies that are explicitly and implicitly inherent to places, and any ethical considerations relating to both the human and non-human (animal) inhabitants of those geographical locations. Ultimately, how data owners choose to model spatial data will affect the analysis of the research, and with it, its utility for public health. 

After lunch, Martin and Rachel sat together for a combined Q&A session, and a discussion emerged around the topic of subjectivity. A question was put to Rachel regarding mapping and subjectivity, as it was noticed that the way she described place, which included socio-cultural meanings and the personal preferences of the inhabitants of the place, could be considered subjective in nature. Rachel agreed and alluded back to her presentation, where she mentioned that these aspects of mapping can get fuzzy, as researchers have to deal with matters relating to identity, political affiliations and personal opinions, such as how safe an individual may feel in a particular place. Martin added that with the OSM project the data must be as objective as possible, yet the maps themselves are subjective views of objective data.  

Rachel and Martin answering questions from the Data Champions at the November 2023 forum

Martin also brought to attention that maps are contested spaces, because spaces can be political in nature. Rachel added that sometimes maps do not appropriately represent the contested nature of her field sites, which she only learned through time in the field. In this way, context is very important for “real mapping”. As an example, Martin discussed his “UK collision data” map, created outside the University, which shows where collisions have happened, giving the example of one of central Cambridge’s busiest streets, Mill Road: without contextual information such as what time these collisions occurred, what vehicles were involved, and the environmental conditions at the time of the accident, a collision map may not be that valuable. To this end, it was asked whether ethnographic research could provide useful data in the act of mapping, and the speakers agreed. 

Data Diversity Podcast #1 – Danny van der Haven

Last week, the Research Data Team at the Office of Scholarly Communication recorded the inaugural Data Diversity Podcast with Data Champion Danny van der Haven from the Department of Material Science and Metallurgy.

As is the theme of the podcast, we spoke to Danny about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the differences between lab research and working with human participants, his views on objectivity in scientific research, and how unexpected findings can shed light on datasets that previously seemed insignificant. We also learned about Danny’s current PhD research studying the properties of pharmaceutical powders to enhance the production of medication tablets.   

Click here to listen to the full conversation.

If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter.
– Danny van der Haven

   ***  

What is data to you?  

Danny: I think I’m going to go for a very general description. I think that you have data as soon as you record something in any way. If it’s a computer signal or if it’s something written down in your lab notebook, I think that is already data. So, it can be useless data, it can be useful data, it can be personal data, it can be sensitive data, it can be data that’s not sensitive, but I would consider any recording of any kind already data. The experimental protocol that you’re trying to draft? I think that’s already data.   

If you’re measuring something, I don’t think it’s necessarily data when you’re measuring it. I think it becomes data only when it is recorded. That’s how I would look at it. Because that’s when you have to start thinking about the typical things that you need to consider when dealing with regular data, sensitive data or proprietary data etc.   

When you’re talking about sensitive data, I would say that it is any data or information whose public disclosure or dissemination may be undesirable for any given reason. That’s really when I start to draw the distinction between data and sensitive data. That’s more my personal view on it, but there’s also definitely a legal or regulatory view. Looking for example at the ECG, the electrocardiogram: you can take the electrical signal from one of the 12 stickers on a person’s body. I think there is practically nobody who is going to call that single electrical signal personal data or health data, and most doctors wouldn’t bat an eye.   

But if you would take, for example, the heart rate per minute that follows from the full ECG, then it becomes not only personal data but also becomes health data, because then it starts to say something about your physiological state, your biology, your body. So there’s a transition here that is not very obvious. Because I would say that heart rate is obviously health data and the electrical signal from one single sticker is quite obviously not health data. But where is the change? Because what if I have the electrical signal from all 12 stickers? Then I can calculate the heart rate from the signal of all the 12 stickers. In this case, I would start labelling this as health data already. But even then, before it becomes health data, you also need to know where the stickers are on the body.   

So when is it health data? I would say that somebody with decent technical knowledge, if they know where the stickers are, can already compute the heart rate. So then it becomes health data, even if it’s not on the surface. A similar point is when metadata becomes real data. For example, your computer always saves the date and time you modified files. But sometimes, if you have sensitive systems or you have people making appointments, even such simple metadata can actually become sensitive data.   
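Danny’s point about the transition from raw signal to health data can be sketched in a few lines of Python. The function below is purely illustrative (a hypothetical threshold-crossing counter, not a clinical R-peak detector): the raw samples on their own look like anonymous numbers, but the beats-per-minute figure derived from them is exactly the kind of value that starts to read as health data.

```python
def heart_rate_bpm(samples, sample_rate_hz, threshold=0.5):
    """Count rising threshold crossings (stand-ins for R-peaks) and
    convert the count to beats per minute."""
    beats = 0
    above = False
    for s in samples:
        if s >= threshold and not above:
            beats += 1       # rising edge: one beat
            above = True
        elif s < threshold:
            above = False
    duration_min = len(samples) / sample_rate_hz / 60.0
    return beats / duration_min

# Synthetic signal: one spike per second, sampled at 10 Hz, for 30 seconds.
signal = ([1.0] + [0.0] * 9) * 30
print(heart_rate_bpm(signal, sample_rate_hz=10))  # 60.0
```

The derivation itself is trivial; the regulatory question Danny raises is whether the inputs to it (one electrode’s signal, all twelve, plus sticker placement) should already be treated as health data because the computation is possible.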

On working within the constraints of GDPR  

Danny: We struggled with that, because with our startup Ares Analytics we also ran into issues with GDPR. In the Netherlands at the time, GDPR was interpreted really stringently by the Dutch government: data was not anonymous if you could, in any way, no matter how difficult, retrace the data to the person. Some people do not see these possibilities, but just to take it really far: if I were a hacker with infinite resources, I could say I’m going to hack into the dataset and see the moments that the data were recorded. And then I can hack into the calendar of everybody whose GPS signal was at the hospital on that day, and then I can probably find out who was taking the test at that time… I mean, is that reasonable? Is anybody ever going to do that? If you put those limitations on data because of a very, very remote possibility, is that fair, or are you going to hinder research too much? I understand the precautionary principle in this case, but it ends up being a struggle for us in that sense.  

Lutfi: Conceivably, data will lose its value. If you really go to the full extent of anonymising something, then you will be dataless, really, because the only true way to anonymise data and protect the individual is to delete the data.  

Danny: You can’t. You’re legally not allowed to, because you need to know what data was recorded with certain participants. If some accident happens to a person five years later, and you had a trial with this person, you need to know if your study had something to do with that accident. This is obvious when you’re testing drugs. So in that sense, the hospital must have a non-anonymised copy; they must. But if they have a non-anonymised copy and I have an anonymised copy… if you overlay your datasets, you can trace back the identity. So this is of course where you end up with a deadlock.  

What is your relationship to data?  

Danny: I see my relationship to data more as a role that I play with respect to the data, and I have many roles that I cycle through. I’m the data generator in the lab. Then at some point, I’m the data processor when I’m working on it, and then I am the data manager when I’m storing it and when I’m trying to make my datasets Open Access. To me, that varies, and it seems more like a functional role. All my research depends on the data.  

Lutfi: Does the data itself start to be more or less humanised along the way, or do you always see it as you’re working on someone, a living, breathing human being, or does that only happen toward the end of that spectrum?   

Danny: Well, I think I very much have the stereotypical scientist mindset in that way. To me, when I’m working on it, in the moment, I guess it’s just numbers. When I am working on the data and it eventually turns into personal and health data, then I also become the data safeguarder or protector. And I definitely do feel that responsibility, but I am also trying to avoid bias. I try not to make a personal connection with the data in any sense. When dealing with people and human data, data can be very noisy. To control tests properly, you would like to be double-blind: you would like not to know who did a certain test, and you would like not to know the answer beforehand, more or less, as in who’s more fit or less fit. But sometimes you are the same person as the person who collected the data, and you actually cannot avoid knowing that. But there are ways that you can trick yourself to avoid that. For example, you can label the data in a certain clever way and make sure that the labelling is only something that you see afterwards.   

Even in very dry physical lab data, for example microscopy of my powders, the person recording it can introduce a significant bias because of how they tap the microscopy slide when there’s powder on it. Now, suddenly, I’m making an image of two particles that are touching instead of two separate particles. I think it’s also my duty, when I do research, to make the data, how I acquire it, and how it’s processed, as independent of the user as possible. Because otherwise user variation is going to overlap with my results, and that’s not something I want, because I want to look at the science itself, not who did the science. 

Lutfi: So in a sentence, in terms of the sort of accuracy needed for your research: the more dehumanised the data is, the more accurate the data, so to speak.   

Danny: I don’t like the phrasing of the word “dehumanised”. I guess I would say that maybe we should be talking about not having person-specific or operator-specific data. If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter. 

             ***  

If you would like to be a guest on the Data Diversity Podcast and have some interesting data related stories to share, please get in touch with us at info@data.cam.ac.uk and state your interest. We look forward to hearing from you!