Tag Archives: research data

Data Diversity Podcast #1 – Danny van der Haven

Last week, the Research Data Team at the Office of Scholarly Communication recorded the inaugural Data Diversity Podcast with Data Champion Danny van der Haven from the Department of Materials Science and Metallurgy.

As is the theme of the podcast, we spoke to Danny about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the differences between lab research and working with human participants, his views on objectivity in scientific research, and how unexpected findings can shed light on datasets that were previously insignificant. We also learn about Danny’s current PhD research studying the properties of pharmaceutical powders to enhance the production of medication tablets.   

Click here to listen to the full conversation.

If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter.
– Danny van der Haven


What is data to you?  

Danny: I think I’m going to go for a very general description. I think that you have data as soon as you record something in any way. If it’s a computer signal or if it’s something written down in your lab notebook, I think that is already data. So, it can be useless data, it can be useful data, it can be personal data, it can be sensitive data, it can be data that’s not sensitive, but I would consider any recording of any kind already data. The experimental protocol that you’re trying to draft? I think that’s already data.   

If you’re measuring something, I don’t think it’s necessarily data when you’re measuring it. I think it becomes data only when it is recorded. That’s how I would look at it. Because that’s when you have to start thinking about the typical things that you need to consider when dealing with regular data, sensitive data or proprietary data etc.   

When you’re talking about sensitive data, I would say that it is any data or information whose public disclosure or dissemination may be undesirable for any given reason. That’s really when I start to draw the distinction between data and sensitive data. That’s more my personal view on it, but there’s also definitely a legal or regulatory view. Looking for example at the ECG, the electrocardiogram, you can take the electrical signal from one of the 12 stickers on a person’s body. I think there is practically nobody that’s going to call that single electrical signal personal data or health data, and most doctors wouldn’t bat an eye.

But if you would take, for example, the heart rate per minute that follows from the full ECG, then it becomes not only personal data but also becomes health data, because then it starts to say something about your physiological state, your biology, your body. So there’s a transition here that is not very obvious. Because I would say that heart rate is obviously health data and the electrical signal from one single sticker is quite obviously not health data. But where is the change? Because what if I have the electrical signal from all 12 stickers? Then I can calculate the heart rate from the signal of all the 12 stickers. In this case, I would start labelling this as health data already. But even then, before it becomes health data, you also need to know where the stickers are on the body.   

So when is it health data? I would say that somebody with decent technical knowledge, if they know where the stickers are, can already compute the heart rate. So then it becomes health data, even if it’s not on the surface. A similar point is when metadata becomes real data. For example, your computer always saves the date and time you modified files. But sometimes, if you have sensitive systems or you have people making appointments, even such simple metadata can actually become sensitive data.

On working within the constraints of GDPR  

Danny: We struggled with that because with our startup Ares Analytics, we also ran into issues with GDPR. In the Netherlands at the time, GDPR was interpreted really stringently by the Dutch government. Data was not anonymous if you could, in any way, no matter how difficult, retrace the data to the person. Some people are not seeing these possibilities, but just to take it really far: if I would be a hacker with infinite resources, I could say I’m going to hack into the dataset and see the moments that the data were recorded. And then I can hack into the calendar of everybody whose GPS signal was at the hospital on this day, and then I can probably find out who at that time was taking the test… I mean is that reasonable? Is anybody ever going to do that? If you put those limitations on data because of that very, very remote possibility, is that fair or are you going to hinder research too much? I understand the cautionary principle in this case, but it ends up being a struggle for us in that sense.

Lutfi: Conceivably, data will lose its value. If you really go to the full extent on how to anonymise something, then you will be dataless really because the only true way to anonymise and to protect the individual is to delete the data.  

Danny: You can’t. You’re legally not allowed to because you need to know what data was recorded with certain participants. Because if some accident happens to this person five years later, and you had a trial with this person, you need to know if your study had something to do with that accident. This is obvious when you’re testing drugs. So in that sense, the hospital must have a non-anonymised copy, they must. But if they have a non-anonymised copy and I have an anonymised copy… If you overlay your data sets, you can trace back the identity. So, this is of course where you end up with a deadlock.

What is your relationship to data?  

Danny: I see my relationship to data more as a role that I play with respect to the data, and I have many roles that I cycle through. I’m the data generator in the lab. Then at some point, I’m the data processor when I’m working on it, and then I am the data manager when I’m storing it and when I’m trying to make my datasets Open Access. To me, that varies, and it seems more like a functional role. All my research depends on the data.  

Lutfi: Does the data itself start to be more or less humanised along the way, or do you always see it as you’re working on someone, a living, breathing human being, or does that only happen toward the end of that spectrum?   

Danny: Well, I think I very much have the stereotypical scientist mindset in that way. To me, when I’m working on it, in the moment, I guess it’s just numbers to me. When I am working on the data and it eventually turns into personal and health data, then I also become the data safeguarder or protector. And I definitely do feel that responsibility, but I am also trying to avoid bias. I try not to make a personal connection with the data in any sense. When dealing with people and human data, data can be very noisy. To control tests properly, you would like to be double blind. You would like not to know who did a certain test, you would like not to know the answer beforehand, more or less, as in who’s more fit or less fit. But sometimes you’re the same person as the person who collected the data, and you actually cannot avoid knowing that. But there are ways that you can trick yourself to avoid that. For example, you can label the data in a certain clever way and you make sure that the labelling is only something that you see afterwards.
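As an editorial illustration (not something described in the interview), the kind of blinded labelling Danny mentions could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function names `blind_labels` and `blind_records`, the record layout, and the example participant names are all hypothetical, and the lookup key is assumed to be stored separately and only consulted after the analysis is finished.

```python
import secrets


def blind_labels(participant_ids):
    """Map each participant ID to a random, opaque code (hypothetical helper)."""
    key = {}
    used = set()
    for pid in participant_ids:
        code = secrets.token_hex(4)
        # Regenerate in the unlikely event that a code repeats.
        while code in used:
            code = secrets.token_hex(4)
        used.add(code)
        key[pid] = code
    return key


def blind_records(records, key):
    """Replace the participant ID in each (participant_id, value) record with its code."""
    return [(key[pid], value) for pid, value in records]


# Made-up example records: (participant, heart rate in beats per minute).
records = [("alice", 62), ("bob", 75), ("alice", 64)]

# The key would be kept apart from the analysis and opened only afterwards.
key = blind_labels({pid for pid, _ in records})
blinded = blind_records(records, key)
```

The analysis would then run on `blinded` alone, so the person processing the data sees consistent but meaningless labels until the key is opened.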

Even in very dry physical lab data, for example microscopy of my powders, the person recording it can introduce a significant bias because of how they tap the microscopy slide when there’s powder on it. Now, suddenly, I’m making an image of two particles that are touching instead of two separate particles. I think it’s also kind of my duty, when I do research, to make the data, how I acquire it, and how it’s processed as independent of the user as possible. Because otherwise user variation is going to overlap with my results and that’s not something I want, because I want to look at the science itself, not who did the science.

Lutfi: In a sentence, in terms of the sort of accuracy needed for your research, the more dehumanised the data is, the more accurate the data so to speak.   

Danny: I don’t like the phrasing of the word “dehumanised”. I guess I would say that maybe we should be talking about not having person-specific or operator-specific data. If you have heart rate data, you do not want to get a different diagnosis if you go to a different doctor. Ideally, you would get the same diagnosis with every doctor, so the operator or the doctor should not matter, but only the data should matter. 


If you would like to be a guest on the Data Diversity Podcast and have some interesting data-related stories to share, please get in touch with us at info@data.cam.ac.uk and state your interest. We look forward to hearing from you!

Open Research 101

Dr. Sacha Jones and Dr. Samuel Moore, Office of Scholarly Communication, Cambridge University Libraries

The Open Research at Cambridge conference took place between 22–26 November 2021. In a series of talks, panel discussions and interactive Q&A sessions, researchers, publishers, and other stakeholders explored how Cambridge can make the most of the opportunities offered by open research. This blog is part of a series summarising each event. 

As part of the Cambridge Open Research conference, the Office of Scholarly Communication hosted a ‘101’ session on open research, covering the basics and answering queries for the audience on all aspects of open access publication and open data. With over 80 participants, we were thrilled with the response and wanted to recap some of the topics we covered in this post.

Firstly, as we discussed in the session, it is easy to assume that open research is simply an issue for the sciences rather than all academic disciplines. Practices such as open access and open data have been taken up widely in the sciences, although in different ways, and there is a common association between science and openness. This is compounded by the fact that in many European countries Open Science is inclusive of arts and humanities scholarship and so is functionally equivalent to open research. At the OSC, we are keen to support open practices across all disciplines while being sensitive to different ways of working. We are guided by the university’s Open Research Position Statement that requires work to be ‘as open as possible, as closed as necessary’.

After an introduction to open research, Sam then outlined the key issues in open access, including the different licences for making your research open access, the differences between green and gold open access, and the many and various reasons for making your work open access. Open access allows us to reach new audiences, improve the economics of research access, and reassess knowledge production and dissemination in a digital world. We also learned about open access monographs, the complex policy landscape and the various ways in which you can make your research open access through repositories and journals. The OSC’s Open Access webpages are an excellent set of resources for learning more.

We then moved on to open data – research data shared publicly – and how this fits into open research (see the University’s policy framework on research data). After highlighting that all research regardless of discipline generates or uses data of one kind or another (e.g. text, audio-visual, numerical, etc.), Sacha posed a series of questions with answers, anticipating what the audience might want to know more about. Do I have to share my data? What data do I share – is it meant to be everything from my research? My data contains sensitive information so I can’t share my data, or can I? How do I share my data? I don’t want to be criticised after making my data open, so how can I prevent this? How can I stop someone else from taking my data, using it, and getting all the credit? The OSC’s Research Data website contains information about data management and data sharing, and our list of Cambridge Data Champion experts shows whether anyone has volunteered to be a local source of data-related advice in your department or discipline.

We are always available as a source of support and guidance in all matters relating to open research and encourage you to contact us if you have any questions. The OSC has webpages on open research and sites dedicated to both open access and research data. For general open research enquiries, we can be emailed at info@osc.cam.ac.uk, for open access at info@openaccess.cam.ac.uk and for data at info@data.cam.ac.uk. There are also a number of training sessions provided throughout the year and online that relate to the topics covered in this session. If you think that those in your department or institute at Cambridge would like to know more about the topics covered here then please do get in touch as we’d be happy to speak to them and answer any questions they may have.

Research Data at Cambridge – highlights of the year so far

By Dr Sacha Jones, Research Data Coordinator

This year we have continued, as always, to provide support and services for researchers to help with their research data management and open data practices. So far in 2020, we have approved more than 230 datasets into our institutional repository, Apollo. This includes Apollo’s 2000th dataset on the impact of health warning labels on snack selection, which represents a shining example of reproducible research, involving the full gamut: preregistration, and sharing of consent forms, code, protocols and data. There are other studies that have sparked media interest for which the data are also openly available in Apollo, such as the data supporting research that reports the development of a wireless device that can convert sunlight, carbon dioxide and water into a carbon-neutral fuel, or the data supporting a study that used computational modelling to explain why blues and greens are the brightest colours in nature. Also, in the year of COVID, a dataset was published in April on the ability of common fabrics to filter ultrafine particles, associated with an article in BMJ Open. Sharing data associated with publications is critical for the integrity of many disciplines and best practice in the majority of studies, but there is also an important responsibility of science communication in particular to bring research datasets to the forefront. This point was discussed eloquently this summer in a guest blog post in Unlocking Research by Itamar Shatz, a researcher and Cambridge Data Champion. Making datasets open permits their reuse, and if you have wondered how research data is reused, then read this comprehensive data sharing and reuse case study written by the Research Data team’s Dominic Dixon. This centres on the use and value of the Mammographic Image Analysis Society database, published in Apollo five years ago.

This year has seen the necessary move from our usual face-to-face Research Data Management (RDM) training to provision of training online. This has led us to produce an online training session in RDM, covering topics such as data organisation, storage, back up and sharing, as well as data management plans. This forms one component of a broader Research Skills Guide – an online course for Cambridge researchers on publishing, managing data, finding and disseminating research – developed by Dr Bea Gini, the OSC’s training coordinator. We have also contributed to a ‘Managing your study resources’ CamGuide for Master’s students, providing guidance on how to work reproducibly. In collaboration with several University stakeholders, last month we released new guidance on the use of electronic research notebooks (ERNs), providing information on the features of ERNs and guidance to help researchers select one that is suitable.

At the start of this year we invited members of the University to apply to become Data Champions, joining the pre-existing community of 72 Data Champions. The 2020 call was very successful, with us welcoming 56 new Data Champions to the programme. The community has expanded this year, not only in terms of numbers of volunteers but also in terms of disciplinary focus, where there are now Data Champions in several areas of the arts, humanities and social sciences in particular where there were none previously. During this year, we have held forums in person and then online, covering themes such as how to curate manual research records, ideas for RDM guidance materials, data management in the time of coronavirus, and data practices in the arts and humanities and how these can be best supported. We look forward to further supporting and advocating the fantastic work of the Cambridge Data Champions in the months and years to come.