Mapping the world through data – The November 2023 Data Champion Forum 

The November Data Champion Forum was a geography and geospatial data themed edition of the bi-monthly gathering, this time hosted by the Physiology Department. As usual, the Data Champions in attendance were treated to two presentations. Up first was Martin Lucas-Smith from the Department of Geography, who introduced the audience to the OpenStreetMap (OSM) project, a global community mapping project built on crowdsourcing. Just as Wikipedia does for textual information, OSM produces a worldwide map created by everyday people who map the world themselves. The resulting maps vary in focus: the transport map shows public transport routes such as railways, bus routes and trams worldwide, while the humanitarian map supports an initiative dedicated to humanitarian action through open mapping. Martin is personally involved in a project called CycleStreets which, as the name implies, uses open mapping of bicycle infrastructure. The Department of Geography uses OSM as a background for its Cambridge Air Photos websites. Projects like these, Martin highlighted, demonstrate how communities form around open data. 

CycleStreets: Martin at the November 2023 Data Champion Forum

In his presentation, Martin explained the mechanics of OSM, such as its data structure, how the maps are edited, and how the data can be used in systems like routing engines. Editing the maps, and the decision-making processes behind how a path is represented visually on the map, is where the OSM community comes into action. While the data in OSM consists primarily of geometric points (called ‘Nodes’) and lines (called ‘Ways’), coupled with tags that denote metadata values, the norms for how to define this information can only come about by consensus within the OSM community. This is perhaps different from the more formal database structures that might be employed in corporate efforts such as Google’s. Because of its widespread crowdsourced nature, OSM tends to be more detailed than other maps for less well-served communities, such as people cycling or walking, and its metadata is richer, as it is created by people who are intimately familiar with the areas they are mapping. A map by users, for users. 
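To make the Node/Way/tag model concrete, here is a minimal sketch in Python of how a short stretch of cycle path might be represented in OSM terms. The coordinates, ids and tag values below are invented for illustration; real OSM data would come from the project's API or data extracts.

```python
# Minimal sketch of OpenStreetMap's core data model (illustrative only):
# Nodes are points with an id and coordinates; Ways are ordered lists of
# node ids; both carry free-form key/value tags whose meanings are agreed
# by community convention rather than a fixed schema.

nodes = {
    1: {"lat": 52.2053, "lon": 0.1218, "tags": {}},
    2: {"lat": 52.2060, "lon": 0.1225, "tags": {"barrier": "bollard"}},
    3: {"lat": 52.2068, "lon": 0.1231, "tags": {}},
}

ways = [
    {
        "id": 100,
        "nodes": [1, 2, 3],      # ordered list of node ids (the geometry)
        "tags": {                # community-agreed metadata
            "highway": "cycleway",
            "surface": "asphalt",
            "lit": "yes",
        },
    },
]

def way_point_count(way):
    """Number of points making up a way's geometry."""
    return len(way["nodes"])

for way in ways:
    print(way["id"], way["tags"]["highway"], way_point_count(way))
```

A routing engine would consume exactly this kind of structure: the tags (e.g. `highway=cycleway`) decide whether a way is usable for a given mode of travel, and the node ids link ways together into a network.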

Next up was Dr Rachel Sippy, a Research Associate in the Department of Genetics, who presented on how geospatial data factors into epidemiological research. In her work, the questions of ‘who’, ‘when’ and ‘where’ a disease outbreak occurred are important, and it is the ‘where’ that gives her research a geographical focus. Maps, however, are often not detailed enough to provide information about an outbreak of disease among a population or community: a map can only mark out the incident site, the place, whereas the spatial context of that place, which she denotes as space, is equally important in understanding disease outbreaks.  

Of ‘Space’ and ‘Place’: Rachel at the November 2023 Data Champion forum

It can be difficult, however, to know what a researcher is measuring and what types of data can be used to measure space and/or place. Spatial data, as Rachel pointed out, can be difficult to work with, and the researcher has to decide whether spatial data is a burden or fundamental to understanding a disease outbreak in a particular setting. Rachel discussed several aspects of spatial data that she has considered in her research, such as visualisation techniques, data sources and methods of analysis. They all come with their own sets of challenges, and researchers have to navigate them to decide how best to tell the fundamental story that answers the research question. This essentially comes down to an act of curation of spatial data; as Rachel pointed out, quoting Mark Monmonier, “not only is it easy to lie with maps, it’s essential”. In doing so, researchers working with spatial data have to navigate the political and cultural hierarchies that are explicitly and implicitly inherent to places, as well as any ethical considerations relating to both the human and non-human (animal) inhabitants of those geographical locations. Ultimately, how data owners choose to model spatial data will affect the analysis of the research and, with it, its utility for public health. 

After lunch, Martin and Rachel sat together for a combined Q&A session, and a discussion emerged around the topic of subjectivity. A question was put to Rachel regarding mapping and subjectivity, as her description of place, which included socio-cultural meanings and the personal preferences of a place’s inhabitants, could be considered subjective. Rachel agreed and referred back to her presentation, where she mentioned that these aspects of mapping can get fuzzy, as researchers have to deal with matters relating to identity, political affiliations and personal opinions, such as how safe an individual may feel in a particular place. Martin added that in the OSM project the data must be as objective as possible, yet the maps themselves are subjective views of that objective data.  

Rachel and Martin answering questions from the Data Champions at the November 2023 forum

Martin also drew attention to the fact that maps are contested spaces, because spaces can be political in nature. Rachel added that maps sometimes do not appropriately represent the contested nature of her field sites, which she only learned through time in the field. In this way, context is very important for “real mapping”. As an example, Martin discussed his “UK collision data” map, created outside the University, which shows where collisions have happened, giving the example of one of central Cambridge’s busiest streets, Mill Road: without contextual information such as what time the collisions occurred, what vehicles were involved, and the environmental conditions at the time of each accident, a collision map may not be that valuable. To this end, it was asked whether ethnographic research could provide useful data in the act of mapping, and the speakers agreed. 

Data Diversity Podcast #1 – Danny van der Haven

Last week, the Research Data Team at the Office of Scholarly Communication recorded the inaugural Data Diversity Podcast with Data Champion Danny van der Haven from the Department of Materials Science and Metallurgy.

As is the theme of the podcast, we spoke to Danny about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the differences between lab research and working with human participants, his views on objectivity in scientific research, and how unexpected findings can shed light on datasets that previously seemed insignificant. We also learned about Danny’s current PhD research, which studies the properties of pharmaceutical powders to enhance the production of medication tablets.   

Click here to listen to the full conversation.

If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter.
– Danny van der Haven

   ***  

What is data to you?  

Danny: I think I’m going to go for a very general description. I think that you have data as soon as you record something in any way. If it’s a computer signal or if it’s something written down in your lab notebook, I think that is already data. So, it can be useless data, it can be useful data, it can be personal data, it can be sensitive data, it can be data that’s not sensitive, but I would consider any recording of any kind already data. The experimental protocol that you’re trying to draft? I think that’s already data.   

If you’re measuring something, I don’t think it’s necessarily data when you’re measuring it. I think it becomes data only when it is recorded. That’s how I would look at it. Because that’s when you have to start thinking about the typical things that you need to consider when dealing with regular data, sensitive data or proprietary data etc.   

When you’re talking about sensitive data, I would say that it’s any data or information of which the public disclosure or dissemination may be undesirable for any given reason. That’s really where I start to draw the distinction between data and sensitive data. That’s more my personal view on it, but there’s also definitely a legal or regulatory view. Looking, for example, at the ECG, the electrocardiogram: you can take the electrical signal from one of the 12 stickers on a person’s body. I think there is practically nobody who is going to call that single electrical signal personal data or health data, and most doctors wouldn’t bat an eye.   

But if you would take, for example, the heart rate per minute that follows from the full ECG, then it becomes not only personal data but also becomes health data, because then it starts to say something about your physiological state, your biology, your body. So there’s a transition here that is not very obvious. Because I would say that heart rate is obviously health data and the electrical signal from one single sticker is quite obviously not health data. But where is the change? Because what if I have the electrical signal from all 12 stickers? Then I can calculate the heart rate from the signal of all the 12 stickers. In this case, I would start labelling this as health data already. But even then, before it becomes health data, you also need to know where the stickers are on the body.   

So when is it health data? I would say that somebody with decent technical knowledge, if they know where the stickers are, can already compute the heart rate. So then it becomes health data, even if it’s not there on the surface. A similar point is when metadata becomes real data. For example, your computer always saves the date and time you modified files. But sometimes, if you have sensitive systems or people making appointments, even such simple metadata can actually become sensitive data.   
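The transition Danny describes, from a raw electrical signal to derived health data, can be sketched in a few lines. This is an illustrative toy example, not anything from the interview: we fabricate a signal with a spike (standing in for an R peak) every 0.8 seconds, then count peaks to estimate beats per minute.

```python
# Illustrative sketch: a derived quantity (heart rate) computed from a
# raw signal. All numbers are fabricated; real ECG processing is far
# more involved.

SAMPLE_RATE = 100          # samples per second (assumed)
DURATION_S = 10
BEAT_INTERVAL_S = 0.8      # one spike every 0.8 s

signal = [0.0] * (SAMPLE_RATE * DURATION_S)
step = int(BEAT_INTERVAL_S * SAMPLE_RATE)   # 80 samples between spikes
for i in range(0, len(signal), step):
    signal[i] = 1.0                          # place a spike at each "beat"

def estimate_bpm(samples, sample_rate, threshold=0.5):
    """Count upward threshold crossings and scale to beats per minute."""
    peaks = 0
    above = False
    for s in samples:
        if s > threshold and not above:
            peaks += 1
        above = s > threshold
    duration_min = len(samples) / sample_rate / 60.0
    return peaks / duration_min

bpm = estimate_bpm(signal, SAMPLE_RATE)
```

The raw list of voltages arguably says little about a person; the single number `bpm` that falls out of a few lines of arithmetic is exactly the kind of derived value that starts to look like health data.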

On working within the constraints of GDPR  

Danny: We struggled with that, because with our startup Ares Analytics we also ran into issues with GDPR. In the Netherlands at the time, GDPR was interpreted really stringently by the Dutch government: data was not anonymous if you could, in any way, no matter how difficult, retrace the data to the person. Some people do not see these possibilities, but just to take it really far: if I were a hacker with infinite resources, I could say I’m going to hack into the dataset and see the moments the data were recorded. And then I can hack into the calendar of everybody whose GPS signal was at the hospital on that day, and then I can probably find out who was taking the test at that time… I mean, is that reasonable? Is anybody ever going to do that? If you put those limitations on data because of a very, very remote possibility, is that fair, or are you going to hinder research too much? I understand the precautionary principle in this case, but it ended up being a struggle for us in that sense.  

Lutfi: Conceivably, data will lose its value. If you really go to the full extent of anonymising something, then you will be dataless, really, because the only true way to anonymise data and protect the individual is to delete the data.  

Danny: You can’t. You’re legally not allowed to, because you need to know what data was recorded with which participants. If some accident happens to a person five years later, and you had a trial with this person, you need to know whether your study had something to do with that accident. This is obvious when you’re testing drugs. So in that sense, the hospital must have a non-anonymised copy; they must. But if they have a non-anonymised copy and I have an anonymised copy… if you overlay the two datasets, you can trace back the identity. So this is, of course, where you end up with a deadlock.  
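The overlay problem Danny describes is a classic linkage attack, and a toy version fits in a few lines. Everything below is hypothetical: two datasets that share a quasi-identifier (here, a test timestamp) can be joined to re-identify the "anonymised" records.

```python
# Hypothetical sketch of re-identification by overlaying datasets.
# An anonymised dataset and a named dataset share a quasi-identifier
# (the test timestamp); joining on it recovers the identities.

anonymised = [
    {"pid": "A1", "timestamp": "2023-03-01T09:00", "heart_rate": 71},
    {"pid": "A2", "timestamp": "2023-03-01T10:30", "heart_rate": 88},
]
named = [
    {"name": "Alice", "timestamp": "2023-03-01T09:00"},
    {"name": "Bob", "timestamp": "2023-03-01T10:30"},
]

def link(anon_rows, named_rows):
    """Map anonymised participant ids back to names via the timestamp."""
    by_time = {row["timestamp"]: row["name"] for row in named_rows}
    return {rec["pid"]: by_time.get(rec["timestamp"]) for rec in anon_rows}

print(link(anonymised, named))  # {'A1': 'Alice', 'A2': 'Bob'}
```

This is why regulators treat data as only pseudonymised, not anonymous, whenever a second dataset could plausibly supply the missing link.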

What is your relationship to data?  

Danny: I see my relationship to data more as a role that I play with respect to the data, and I have many roles that I cycle through. I’m the data generator in the lab. Then at some point, I’m the data processor when I’m working on it, and then I am the data manager when I’m storing it and when I’m trying to make my datasets Open Access. To me, that varies, and it seems more like a functional role. All my research depends on the data.  

Lutfi: Does the data itself start to be more or less humanised along the way, or do you always see it as you’re working on someone, a living, breathing human being, or does that only happen toward the end of that spectrum?   

Danny: Well, I think I very much have the stereotypical scientist mindset in that way. To me, when I’m working on it, in the moment, I guess it’s just numbers. When I am working on the data and it eventually turns into personal and health data, then I also become the data safeguarder or protector. And I definitely do feel that responsibility, but I am also trying to avoid bias. I try not to make a personal connection with the data in any sense. When dealing with people and human data, data can be very noisy. To control tests properly, you would like to be double-blind. You would like not to know who did a certain test, and you would like not to know the answer beforehand, more or less, as in who’s more fit or less fit. But sometimes you are the same person as the person who collected the data, and you actually cannot avoid knowing that. But there are ways you can trick yourself into avoiding it. For example, you can label the data in a certain clever way and make sure that the labelling is only something you see afterwards.   
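The labelling trick Danny alludes to can be sketched as a simple blinding step. The function and data below are hypothetical, not from his workflow: each participant gets a random code, and the code-to-identity key is set aside until the analysis is done.

```python
# Hypothetical sketch of blinding data before analysis: each participant
# is given a random code, the code->name key is sealed away, and the
# analyst works only with the coded records.
import random

def blind(records, seed=None):
    """Replace participant names with random codes.

    Returns (coded_records, key), where `key` maps code -> name and is
    meant to be kept unseen until the analysis is finished.
    """
    rng = random.Random(seed)
    codes = [f"P{n:03d}" for n in rng.sample(range(1000), len(records))]
    key = {}
    coded = []
    for code, rec in zip(codes, records):
        key[code] = rec["name"]
        coded.append({"id": code, "value": rec["value"]})
    return coded, key

records = [{"name": "Alice", "value": 72}, {"name": "Bob", "value": 65}]
coded, key = blind(records, seed=42)
# The analyst sees only `coded`; `key` is consulted only afterwards.
```

In practice the key would be held by a colleague or stored separately, so the person analysing the data cannot unblind it mid-analysis even by accident.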

Even in very dry physical lab data, for example microscopy of my powders, the person recording it can introduce a significant bias through how they tap the microscopy slide when there’s powder on it. Now, suddenly, I’m making an image of two particles that are touching instead of two separate particles. I think it’s also my duty, when I do research, to make the data, how I acquire it, and how it’s processed as independent of the user as possible. Because otherwise user variation is going to overlap with my results, and that’s not something I want, because I want to look at the science itself, not at who did the science. 

Lutfi: In a sentence: in terms of the sort of accuracy needed for your research, the more dehumanised the data is, the more accurate the data, so to speak.   

Danny: I don’t like the phrasing of the word “dehumanised”. I guess I would say that maybe we should be talking about not having person-specific or operator-specific data. If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter. 

             ***  

If you would like to be a guest on the Data Diversity Podcast and have some interesting data related stories to share, please get in touch with us at info@data.cam.ac.uk and state your interest. We look forward to hearing from you!  

Methods getting their chance to shine – Apollo wants your methods!

By Dr. Kim Clugston, Research Data Co-ordinator, Office of Scholarly Communication

Underlying all research data is an effective and working method, and this applies across all disciplines, from STEMM to the Arts, Humanities and Social Sciences. Methods are a detailed description of the tools and procedures used in research and can take many forms depending on the type of research. Yet methods are often overlooked rather than being seen as an integral research output in their own right. Traditionally, published journals include a materials and methods section, but word limits often reduce it to a summary, making it difficult for other researchers to reproduce the results or replicate the study. There is sometimes an option to submit the method as “supplementary material”, but not always. There are journals dedicated to publishing methods, which may be peer-reviewed, but not all are open access, leaving some methods hidden behind a paywall. The last decade has seen the creation of “protocol” repositories, some with the ability to comment on, adapt and even insert videos into protocols. Researchers at the University of Cambridge, from all disciplines – arts, humanities, social sciences and STEMM fields – can now publish their methods openly in Apollo, our institutional repository. In this blog, we discuss why it is important to publish methods openly and how the University’s researchers and students can do this in Apollo.

The protocol-sharing repository Protocols.io was founded in 2012. Protocols can be uploaded to the platform or created within it; they can be shared privately with others or made public. Protocols can be dynamic and interactive (rather than static documents) and can be annotated, which is ideal for highlighting information that could be key to an experiment’s success. Collaboration, adaptation and reuse are possible by creating a fork (an editable clone of a version) that can be compared with any existing versions of the same protocol. Protocols.io currently hosts nearly 16,000 public protocols, showing that there is support for this type of platform. In July this year it was announced that Protocols.io had been acquired by Springer Nature. Their press statement aims to reassure users that Protocols.io’s mission and vision will not change with the acquisition, despite Springer Nature already hosting the world’s largest collection of published protocols in the form of SpringerProtocols, along with its own free and open repository, Protocol Exchange. This begs the question of whether a major commercial publisher is monopolising the protocol space and, if so, whether this is or will become a problem. At the moment there do not appear to be any restrictions on exporting or transferring protocols from Protocols.io, and hopefully this will continue. Lock-in is a problem often faced by researchers using proprietary Electronic Research Notebooks (ERNs), where it can be difficult to disengage from one platform and laborious to transfer notebooks to another, all while ensuring that data integrity is maintained. Because of this, researchers may feel locked into using a particular product. Time will tell how the partnership between Protocols.io and Springer Nature develops and whether the original mission and vision of Protocols.io will remain. 
Currently, their Open Research plan enables researchers to make an unlimited number of protocols public, with the number of private protocols limited to two (paid plans offer more options and features).

Bio-protocol Exchange (under the umbrella of the Bio-protocol Journal) is a platform for researchers to find, share and discuss life science protocols, with protocol search and webinars. Protocols can be submitted either to Bio-protocol or as a preprint; researchers can ask authors questions, and can fork a protocol to modify and share it while crediting the original author. There is also an interesting ‘Request a Protocol’ (RaP) service that searches more than 6 million published research papers for protocols, or allows you to request one if you are unable to find what you are looking for. A useful feature is that you can ask the community or the original authors any question you may have about a protocol. Bio-protocol Exchange had published all protocols free of charge to their authors since its launch in 2011, with substantial financial backing from its founders. Unfortunately, it was announced that protocol articles submitted to Bio-protocol after 1 March 2023 will be charged an Article Processing Charge (APC) of $1,200. Researchers who do not want to pay the APC can still post a protocol for free in the Bio-protocol Preprint Repository, where it will receive a DOI but will not have gone through the journal’s peer review process.

As methods are integral to successful research, it is a positive move to see the creation and growth of platforms supporting protocol development and sharing. Currently, these tend to cater for research in the sciences, and serve the important role of supporting research reproducibility. Yet, methods exist across all disciplines – arts, humanities, social sciences as well as STEMM – and we see the term ‘method’ rather than ‘protocol’ as more inclusive of all areas of research.

Apollo (Cambridge University’s repository) has now joined the growing movement within the research community to recognise the importance of detailing and sharing methodologies. Researchers at the University can now use their Symplectic Elements account to deposit a method into Apollo. Not only does this value the method as an output in its own right, it provides the researcher with a DOI and a publication that can be automatically added to their ORCID profile (if ORCID is linked to their Elements account). In May this year, Apollo was awarded CoreTrustSeal certification, reinforcing the University’s commitment to preserving research outputs in the long term, which should give researchers confidence that they are depositing their work in a trustworthy digital repository.

The first method deposited into Apollo in this way was authored by Professor John Suckling and colleagues. Professor Suckling is Director of Research in Psychiatric Neuroimaging in the Department of Psychiatry. His published method relates to an interesting project combining art and science to create artwork that aims to represent hallucinatory experiences in individuals with diagnosed psychotic or neurodegenerative disorders. He is no stranger to depositing in Apollo; in fact, he has one of the most downloaded datasets in Apollo, having deposited the Mammographic Image Analysis Society database there in 2015. This record contains the images of 322 digital mammograms from a database compiled in 1992. Professor Suckling is an advocate of open research and was a speaker at the Open Research at Cambridge conference in 2021.

An interesting and exciting new platform that aims to change research culture and the way researchers are recognised is Octopus. Founded by University of Cambridge researcher Dr Alexandra Freeman, Octopus is free for all to use; it is funded by UKRI and developed by Jisc. Researchers can instantly publish all research outputs without word-limit constraints, which can often stifle detail. Research outputs are not restricted to articles but also include, for example, code, methods, data, videos and even ideas or short pieces of work. This recognises the importance of all research outputs. Octopus aims to correct the current skew toward publishing sensationalist work and encourages publishing all work, including negative findings, which are often of equal value to science but often get shelved in what is termed the ‘file drawer’ problem. A collaborative research community is encouraged to work together on pieces of a puzzle, with credit given to individual researchers rather than a long list of authors. The platform supports reproducibility, transparency and accountability, and aims to give research the best chance to advance more quickly. Through Octopus, authors retain copyright and apply a Creative Commons licence to their work; the only requirement is that published work is open access and allows derivatives. It is a breath of fresh air in the current rigid publishing structure.

Clear and transparent methods underpin research and are fundamental to the reliability, integrity and advancement of research. Is the research landscape beginning to change to allow open methods, freely published, to take centre stage and for methods to be duly recognised and rewarded as a standalone research output? We certainly hope so. The University of Cambridge is committed to supporting open research, and past and present members who have conducted research at the University can share these outputs openly in Apollo. If you would like to publish a method in Apollo, please submit it here or if you have any queries email us at info@data.cam.ac.uk.

There will be an Octopus workshop at the Open Research for Inclusion: Spotlighting Different Voices in Open Research at Cambridge on Friday 17th November 2023 at Downing College.