
Data Diversity Podcast #5 – Abdulwahab Alshallal

Welcome back to another edition of the Data Diversity Podcast, the Research Data podcast from the University of Cambridge Office of Scholarly Communication (OSC). If this is your first time here, in this podcast I speak to Cambridge Data Champions about their journeys in acquiring and working with data in their research, in the hope of highlighting interesting facets of data work, and of academic research in general. In this episode, I spoke to Cambridge PhD student Abdulwahab Alshallal from the MRC Epidemiology Unit, who is part of the Physical Activity Epidemiology research group.

Currently, for his PhD, he is exploring associations of physical activity, behaviour and fitness with cardiometabolic risk in different global populations. Abdulwahab recently presented at a Data Champion Forum, where he talked about working with datasets from international sources, specifically from non-Western nations, and discussed the barriers to collaboration and differences in the flexibility of institutions regarding data access and sharing. In this episode, we discussed those matters and also went into his aspirations for public health policy making and how his data-driven mindset applies to this endeavour.


I am of the mind that your social and physical environments are a big determinant of your physical activity and your general lifestyle behaviours. For example, it is unfair to compare the UK and India because it is much easier to cycle in streets and walk around in the UK than it is in India, or Mexico or even Kuwait, and the barriers can be different. It could be pedestrian access, it could be heat; in my case it would be humidity. All of these factors matter, and we need to get data to represent those populations and use that data in such studies. – Abdulwahab Alshallal


The overrepresentation of data from Western studies in global understandings of fitness 

LO: Is it true to say that most of the data that is available now is all based on Western data sources and is it problematic then to use that to represent a global understanding of fitness?

AA: I would rephrase that. It is not that the data does not exist; rather, it is that its representation in the literature is absent. The data exists, but when it comes to the data making it into the literature and influencing policy guidelines, this is not yet prevalent. Take for example physical activity guidelines: every few years, data from much of the published literature is gathered and used to make new recommendations for physical activity. It is through these guidelines that it was recommended that people do, for example, 30 to 60 minutes of physical activity per day. Now, the guidelines say that it is 150 minutes of physical activity per week, no matter which day you do it. But the data that influences these policies is mostly data from North America, Europe and Australia (because these are the data used in the literature cited for the creation of these guidelines). This implies that we do not think that it matters much to look at data from other places, because humans are humans. But I am of the mind that your social and physical environments are a big determinant of your physical activity and your general lifestyle behaviours. For example, it is unfair to compare the UK and India because it is much easier to cycle in streets and walk around in the UK than it is in India, or Mexico or even Kuwait, and the barriers can be different. It could be pedestrian access, it could be heat; in my case, it would be humidity. All of these factors matter, and we need to get data to represent those populations and use that data in such studies.

The data does exist and thankfully I have made an effort to include it in my research. One of the places where you can acquire this data is the World Health Organisation (WHO). That is the most wide-ranging data source. The few others that I’m using are the South Asia Biobank, which covers four countries in South Asia (India, Sri Lanka, Pakistan and Bangladesh), the biobank from the UAE Healthy Future Study, which would cover the Gulf populations, and the Qatar Biobank.

Data in his research 

LO: What are the research questions that you’re asking, and how is data used, or what data is needed, to answer those questions?

AA: I am interested in physical activity and asking: are the associations of physical activity in the different ethnic populations different or the same? Does it matter where you live in the world? And we have made progress in this discovery. You would be the first to hear this, actually, but we have finished up our analysis for my first paper, and this is using the WHO data. We are close to submitting the manuscript. This is a bit of a segue, but it is worth mentioning because it highlights one of the problems of the literature: this paper touches on one of the controversies in my field. What the paper addresses is that all physical activity is good for you. For some context: there has been a recent phenomenon that we found in the current literature, which uses mostly European data, of viewing occupational and non-occupational physical activity separately. These studies show that non-occupational physical activity is good for you, but occupational physical activity either has no effect or is actually bad for you in terms of mortality outcomes. What alarmed us, and prompted us to investigate and frame a paper, is that in low- and middle-income countries outside of Europe, there is very little concept of non-occupational leisure time physical activity. Most of your activity is going to be in travel behaviour or activity during your occupation, for example if you are doing heavy manual labour like construction and farming. So, we had to investigate that and I’m glad to report that, at least in terms of our findings, we found that occupational physical activity is not bad for you. Non-occupational physical activity is also good for you and it doesn’t matter what type of activity you do. We were also able to control for the proportion accumulated in either occupational or non-occupational physical activity and, based on what we found, any physical activity, wherever you do it, is good for you.

We need to understand physical activity in different parts of the world. The types of activity you’re doing in one part of the world are going to be different to other parts of the world, so one guideline is not going to be appropriate. We currently have one guideline from the WHO for the whole world, which has 150 minutes of moderate to vigorous physical activity as the goal. Does that seem appropriate for the whole world? It might not be in terms of different countries or even different population subgroups such as young versus old or men versus women, or different occupations or different activity levels. And what really is the barrier between light physical activity and moderate activity? It is going to be relative and likely complex. This is a shift of mindset that hopefully I will be able to contribute to through my research.

The experience of acquiring data for his research from global data banks

LO: What has your experience of acquiring data from different sources been? From what I understand, there are different barriers in place to getting the data. 

AA: Just to put it out there, I think it’s completely understandable that these barriers are in place. The data that these organisations produce is particularly high-quality, high-resolution data. Besides the WHO data, the studies from the biobanks that I have mentioned plan on collecting data every few years from the same participants, so the data really tells you about the health of the population because these cohorts are meant to be representative of the population. To put this data in the hands of researchers that you do not properly vet can be quite a risk, even if it means using anonymised data, so I completely understand the barriers.

In terms of the difficulty of getting that data, it has varied. In regard to WHO data – and this is not my experience, but the experience of a researcher before me, a postdoc who worked on the same data set – a few years back she had to go all the way to Geneva and perform the data analysis there, because they did not have an online infrastructure to allow researchers from abroad to use the data. That has since changed, and I was able to request it through the WHO microdata Repository.

For the South Asia Biobank, after going through the data request, researchers are given a link to the data. The data request process itself is very comprehensive and can cause delays. It takes a lot of time, and there is a lot of emphasis on the protocol. They want to make sure that you have a proper protocol to say what you’re authorized to do. If you want to make small changes, even small changes, you have to rectify them before submitting the proposal and that can cause delays. In my case it took around six months and we just received the data, so we have not had a chance to use it. 

For the UAE Healthy Future Study, it is actually a bit more secure than that. You do go through that process of the back and forth of going through the protocol. In terms of getting the data, from what I understand, you are using it locally. I know this from a researcher that I spoke to who works between Cambridge and the UAE. To work with the UAE Healthy Future Study data, she’s given a laptop by the University (NYU Abu Dhabi), and she must be connected to a VPN. While connected to the VPN, she uses a secure platform called NYU-Box. I believe NYU uses this platform in all of its institutions, such as Shanghai and Abu Dhabi. I have been told that it is very secure and you can use it offline as well.

Regarding the Qatar Biobank, I don’t know much about its data security measures. Through my experiences, I only know about trying to get that data. They are willing to work with foreign institutions, which is good, but the main PI of the project must be based in Qatar and the analysis must be conducted in Qatar. However, I think going through that effort and that process is very much worth it because it is one of the most comprehensive data sources in all of the Middle East that is available in recent times. It was established around 2014 and they now have up to 47,000 participants and counting. 30,000 of them are Qatar nationals and around 17,000 are foreign nationals who are long-term residents. You have people from various populations, including participants who are Indian, Egyptian and Lebanese. So, you get to look at migrant workers, you get to look at other Arabs that are living in a specific environment, meaning that you can parse genetics out of social and physical environments. There is so much you can do and, in addition to that, what makes it special, for my PhD at least, is that they have treadmill data. This is where they put people through a treadmill test and look at their heart rate response to exercise instead of just going through self-reported physical activity or through wearables. The Qatar Biobank is the only study in that region that actually uses heart rate data, so we can definitely estimate fitness in that population. For this reason, it is very much worth the effort of trying to push for it.


One thing I am grappling with at the moment is policy development, which is a bit of a departure from data. On one end, I’m gathering the evidence in order to understand the different populations of the world through physical activity, to look at the different trends in fitness. Then, once we have the physical activity data, how do we know where to allocate resources? Who should we target? Fitness can tell us that. In terms of policy, who needs it the most might not necessarily be reflected in the volume of activity. – Abdulwahab Alshallal


On the difference between self-reported fitness data and objective data

LO: Are self-reported fitness data less valuable than objective data obtained from wearables? 

AA: It is important to understand that for a long time, it was difficult to get objective data. If you spoke to a researcher from 30 or 40 years ago and told them about a cohort study that would be using wearables, they would not believe you; they wouldn’t think it would be scalable and they would think it too expensive, and so self-reported data was the only resource that we had. Also, there are downsides to data from wearables. For example, there is going to be noise and glitches with data obtained from accelerometry. So, I wouldn’t say that self-reported data is useless.

I am a big critic of self-reported data and the dependence of the literature on self-reported data, and my supervisor has made me mellow about it by reminding me that it gives you context. One of the things that we haven’t been able to overcome with accelerometry is knowing what is actually happening. We can tell that they are being active, but what are they doing? When are they doing it? For example, in the questionnaires (that are used to generate self-reported data), we don’t ask people when they leave work or when they start work or commuting; we ask them to estimate their physical strain while doing those things in those specific contexts. This removes from the researcher the burden of trying to estimate what activity is happening.

In terms of accuracy of the numbers and their influence on policy? That is a good question, and I think accelerometry would answer those questions. Using wearables and attaining objective data, in terms of specific numbers, is much more valuable. But policies in the past are not necessarily based on numbers, and self-reports have benefited us and there is still continued benefit. It is about data points which have a degree of relativity. There are people who are going to misreport because they don’t remember accurately how much activity they were doing, or they might be lying because they feel self-conscious or they want people to think that they are more active, or there might be a recall bias or a social desirability bias, which could all lead to misclassifications. We asked for moderate and vigorous activity, but what is moderate to me might be light to you. It’s different and relative. While there are accuracy problems in self-reported data, for the most part it tells us something that is relative to people. Take for example someone who reports 30 minutes of activity throughout the whole week versus someone who is reporting 200 or 300 minutes of physical activity per week. We could tell that the person who was reporting more minutes of activity is more likely to be someone who’s more physically active. It’s going to be aligned more with a better blood profile than the person reporting less activity and so, in a relative sense, it is helpful. Having the resources that we have now and the ability to use wearable data, we should be making a transition towards that, but self-reported data still has value. I think they can complement each other and provide context for the type of activity that you’re doing.

On data and policy making

AA: One thing I am grappling with at the moment is policy development, which is a bit of a departure from data. On one end, I’m gathering the evidence in order to understand the different populations of the world through physical activity, to look at the different trends in fitness. Then, once we have the physical activity data, how do we know where to allocate resources? Who should we target? Fitness can tell us that. In terms of policy, who needs it the most might not necessarily be reflected in the volume of activity. For example, we may have some barriers to fitness such as environmental factors like heat and humidity, and also infrastructure factors such as pedestrian access and green spaces, and these differ in different parts of the world. But how can we use these data to influence policy development? This is something I’m starting to understand and trying to get a grip on. Soon, I will begin a policy internship so I will hopefully learn more about that. I’ve had some conversations with people in physical activity policy, and I’ve learned that in terms of what would actually influence policy, I should be looking for a shared problem and a shared solution. Take, for example, cycling lanes. Say you want to create more cycling lanes, but then the government says they don’t have enough money for cycling lanes so they decide against it. But then, you also have a congestion problem and you want to achieve net zero, and you also have an obesity problem. You know what can fix that? Cycling lanes. More cycling lanes means more people are going to be actively commuting and fewer cars on the road, so there will be less carbon. Then, they will be interested to get on board. So it’s about framing it, and that’s what I’ve realized, because framing it in terms of health is not going to take you very far. But in terms of money, or the overall goal, matching them up is going to be helpful. And it’s quite a departure from the way that I’ve been doing things, which is being driven by data and what is good for health.


We thank Abdulwahab for speaking with us. We are certainly excited to see how he gets on with policy making. It would be comforting to know that there is a data-driven thinker in the world of policy making, especially one that is aware of, and takes into consideration, the contextual, environmental and behavioural differences of people in different communities and parts of the world when integrating data into public health policy decisions.

Data Diversity Podcast (#4) – Dr Stefania Merlo (2/2)

We return with another post featuring our Data Diversity conversation with University of Cambridge Data Champion, archaeologist Dr Stefania Merlo from the McDonald Institute for Archaeological Research, the Remote Sensing Digital Data Coordinator and project manager of the Mapping Africa’s Endangered Archaeological Sites and Monuments (MAEASaM) project and coordinator of the Metsemegologolo project. This post is short in word count but not in importance, as it offers two reflections on the challenges of data management for a researcher who works in a global context, touching on two aspects of present-day academia that may be relevant to many readers. This edition follows on from the previous post, where Stefania talks about the challenges of extending UK-based Open Data policies to non-UK communities that may not share the same enthusiasm for making their cultural heritage artefacts available Open Access.

In this post, Stefania reflects on how she conducts herself as a European researcher working in the African continent, where her intentions may sometimes be misaligned with those of the local data co-creators. Stefania also shares the challenge of academic mobility, where migrating from one academic institution to another results in data that is left behind, provoking an uncomfortable thought: what would happen to your data if you were suddenly rendered uncontactable?


One would like to think that this is a rare situation, but I suspect that the situation where somebody passes away unexpectedly or even not, or somebody retires and has not made a plan for what happens to an entire career’s data set happens more often than we know. I think it is an individual’s responsibility to make plans, but I think support should be given by the institutions and people should be accompanied through this path. – Dr Stefania Merlo


Working in the African continent and being honest about the objectives of research 

Working in Africa and in African countries gives somebody coming from a European background, and an Italian background like me, a particular set of challenges and opportunities, because you encounter a different set-up with everything – with life, and with research. Living and working in this context in various African countries allows a researcher coming from a different background to question and challenge themselves on how they do their work. Many things that are taken for granted in other settings cannot be taken for granted in that setting, in particular the relationship with the land, with nature, and with the past. Any archaeologist that works in this setting would tell you that there are certain things that you just know from very early on that you should do. For example, although we’re dealing with the past of archaeological landscapes, you don’t just go and do your work there without acknowledging that these landscapes lie in spaces and areas occupied by people today, and that those people are the custodians of the land and of the archaeology today. So there needs to be a deep engagement with communities and with people even before you put your spade in the ground. And it takes time to build relationships of trust, and relationships that then allow you to do work on your own or together, depending on what the aim of your research is.

When I do work that fulfills certain academic goals that may not be of interest to the communities that I work with, I think it is better to be honest and tell them that I’m doing this piece of work because there is an archaeological question that probably only archaeologists are interested in, and this is the part of work that I’m doing. At the same time, I think it is also important to acknowledge that you work in a setting that includes other people, and start thinking about what work you can do with the people that are custodians of or inhabit a particular part of the world. Then you start thinking, OK, there’s a different set of activities that I can do with people, that people want to do with me, and let’s do that. I think that it is important to have this honesty of saying that particular things are of interest to me and to my academic community that I would like to do, and then we can negotiate together. You have to engage with the community, and I think we should be a bit more honest and a bit more specific about what the expectations from both parties are, and about the setting we’re coming from and the setting we’re going to.

There are certain academic activities that I’m expected to do that are of no interest whatsoever for the communities that I’m working with, such as the academic publications on which my career rests. Then, there are other things that the communities are interested in that will give me no weight whatsoever in my academic career but contribute to building a relationship with the local community. These give me so much fulfillment because I realise that I am doing research work that is useful not only for my academic community, but for other people, be it students, colleagues elsewhere in the world, or the building of policies around archaeological heritage. 

Global researcher, global data 

LO: As someone who has engaged in research all over the globe, how do you deal with data that is in various places around the world? 

SM: How do I deal with my data? – poorly. I may be a digital data champion, but it has been a difficult road, and it is still a difficult road, that of even managing and curating my own data. Just to give you an example, a lot of the data I’ve collected for the past 20 years is in both analog and digital format for the same project. I have some data with me here (in Cambridge) and I still have data backed up in hard drives that I haven’t opened in a long time. The majority of my analogue data sets, maps, drawings and diaries I left behind in South Africa when I moved here, and I haven’t been able to bring them with me. Some of my materials are in Italy with my family. Some of my diaries I had left back in Cambridge when I left to go to Botswana in 2006, and they somehow got lost. So, it has been messy and I’m not proud of it. But I’m saying it because it is a problem for a lot of researchers that have become highly mobile and have migrated from one place to another, in some cases without sufficient funding to bring all of the paperwork with them. I have been a messy data collector since my undergraduate and PhD days, and I’ve been trying to train myself to be better. I’m still not there yet, and in part it’s just me. But I think it also has to do with this very high mobility and having to change institutions in my career so many times. And what changed is not only the location, but the requirements for what you do with data, where you put it, and how you make it available to yourself and to others.

And so yes, I’m not very good at it but I’m trying very hard to find a way of now putting everything together because I do feel the responsibility that comes with collecting data in different countries. Some of it is actually information that was given to me from community members or friends, or colleagues that I work with and it’s with me. It’s their work, it’s with me and if anything ever happens to me – if I were to change institutions, or if anything were to happen to me, including losing my memory – let me put it like that – what’s going to happen? I’ve never really thought of what would happen if I were to move or to shift. I left my previous institution quite abruptly and during COVID, and I was able to take some materials out, but some other materials I didn’t get access to and they are still all over the place.

And then I started thinking: I have never made a plan for this kind of situation. So what am I going to do now in order to make sure that these data are usable and useful for me, but perhaps also for others when I’m not present as the curator who will be able to tell you what each data asset is? I’m not even talking about the creation of metadata. Most of my photographs, digital photographs for example, have got metadata that have been ordered. But archaeological datasets are complex, fragmented and can be dispersed, so the main challenge is how you would connect the photographs with the drawings within my diary. Of course, there are dates, but it’s going to take so much time for somebody else to put all of it together, especially because half of it is in digital format and half of it is in analog format. That is going to be a nightmare and may not even be doable. And so, I’ve become acutely aware of the fact that we never think of this situation. We rarely think about handing over data to others in a particular form that will give them the accessibility and ability to still reuse this complex interrelated data if they wished to do so.

Worst case (data) scenario

I have another example. One of my collaborators and mentors in South Africa passed away quite suddenly a couple of years ago. They had never made a plan for what would happen to their materials. They published prolifically, so we know a lot of the research that was done over 50 years, but I am aware that they had so much more material, both physical material and files in computers. Their physical collection was transferred from their house to the University by another colleague but, to the best of my knowledge, to date, no one has been able to get access to the digital data, stored in a password protected computer. One would like to think that this is a rare situation, but I suspect that the situation where somebody passes away unexpectedly or even not, or somebody retires and has not made a plan for what happens to an entire career’s data set happens more often than we know. I think it is an individual’s responsibility to make plans, but I think support should be given by the institutions and people should be accompanied through this path. In particular, perhaps academics from other generations that may not be so knowledgeable about how to deal with data management. In particular of digital data, but also of analog data. 

Once upon a time, archaeologists used to just put everything into a library or an archive so at least we have the analog records. But again, putting them together and having them make sense is extremely difficult if we don’t think of a framework for doing so. Another issue that I’ve mentioned before is mobility. You know, how do we assist researchers that have got high mobility to deal with this every time they move? I don’t have an exact formula, but when I changed institutions before, neither the institution that I was leaving nor the ones that were accepting me ever asked, ‘do you need any financial or other kind of help to transfer your data?’ I was asked to fill in forms for transferring my goods, I was given money for my visa, but nobody ever asked about my academic research and the related data.


We once again thank Stefania for taking the time to speak to us and giving us food for thought. Stefania raises, we believe, a very important question – are we taking for granted that we will always be at hand to ensure that the data that we produce will be understood? Researchers tend to wait until a project is completed before supplying their data with the information needed to make them understood and reusable. If there’s one thing that Stefania brings to mind, it is that data FAIR-ness needs to be implemented from the outset of a project and then at every juncture of the project’s lifecycle, as the research unfolds. That way, the research data will be reusable in a self-contained manner.

The Research Data Sustainability Workshop – November 2024

The rapid advance of computing and data centres means there is an increasing amount of generated and stored research data worldwide, leading to an emerging awareness that this may have an impact on the environment. Wellcome have recently published their Environmental sustainability policy, which stipulates that any Wellcome funded research projects must be conducted in an environmentally sustainable way. Cancer Research UK have also updated their environmental sustainability in research policy and it is anticipated that more funders will begin to adopt similar policies in the near future. 

In November we held our first Research Data Sustainability Workshop in collaboration with Cambridge University Press & Assessment (CUP&A). The aim was to address some of the areas common to researchers with a focus on how research data can impact the environment. The workshop was attended by Cambridge Data Champions and other interested researchers at the University of Cambridge. This blog summarises some of the presentations and group activities that took place at the workshop to help us to better understand the impact that processing and storing data can have on the environment and identify what steps researchers could take in their day-to-day research to help minimise their impact.  

The Invisible Cost to Storing Data 

Our first speaker at the workshop was Dr Loïc Lannelongue, Research Associate at the University of Cambridge. Loïc leads on the Green Algorithms initiative, which aims to promote more environmentally sustainable computational science and has developed an online calculator to check computational carbon footprint. Loïc suggested that the aim is not that we shouldn’t have data, as we all use computing, just that we should be more aware of the work we do and the impact it has on the environment so we can make informed choices. Loïc emphasised that computing is not free, even though it might look like that to the end user. There is an invisible cost to storing data; whilst the exact costs are largely unknown, the estimates calculated for data centres suggest that they emit around 126 Mt of CO2e per year. Loïc further explained that there are many more aspects to the footprint than just greenhouse gas emissions, such as water use, toxicity, land use, minerals, metals and human toxicity. For example, there is a huge amount of water consumption needed to cool data centres, and you often find that cheaper data centres tend to use larger amounts of water.

Loïc continued to discuss how there are a wide range of carbon footprints in research with some datasets having a large footprint. The estimate for storing data is ~10 kg CO2 per TB per year, although there are many varying factors that could affect this figure. Loïc pointed out that the bottom line is – don’t store useless data! He suggested we shouldn’t stop doing research, we just have to do it better. Incentivising and addressing sustainability in Data Management Plans from the outset of projects could help. Artificial Intelligence (AI) is predicted to help to combat the impact on the environment in the future, although as AI comes at a large environmental cost, whether any benefit will outweigh the impact is still unknown. Loïc has written a paper on the Ten simple rules to make your computing more sustainable, and he recommends looking at the Green DiSC Certification which is a free, open-access roadmap for anyone working in research (dry lab) to learn how to be more sustainable.
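
To give a feel for what the storage estimate means in practice, here is a minimal back-of-the-envelope sketch in Python. It simply multiplies the rough figure quoted at the workshop (~10 kg CO2 per TB per year) by dataset size, retention period and number of copies. The function name and the `copies` parameter are our own illustrative choices, not part of the Green Algorithms calculator, and the real figure will vary with the data centre and storage medium.

```python
# Rough workshop estimate; the true value depends on hardware, location and energy mix.
KG_CO2_PER_TB_YEAR = 10.0


def storage_footprint_kg(size_tb, years, copies=1, kg_per_tb_year=KG_CO2_PER_TB_YEAR):
    """Estimate the carbon footprint (kg CO2) of keeping data in storage.

    size_tb: size of one copy of the dataset in terabytes
    years:   how long the data is kept
    copies:  number of stored copies (backups and mirrors count too)
    """
    if size_tb < 0 or years < 0 or copies < 1:
        raise ValueError("size_tb and years must be >= 0 and copies must be >= 1")
    return size_tb * years * copies * kg_per_tb_year


if __name__ == "__main__":
    # A 5 TB dataset kept for 10 years, as a single copy and as three copies.
    print(storage_footprint_kg(5, 10))
    print(storage_footprint_kg(5, 10, copies=3))
```

Even a simple estimate like this can help when deciding, for example in a Data Management Plan, how many copies of a dataset to keep and for how long.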

The Shift to Digital Publishing 

Next to present was Andri Johnston, Digital Sustainability Lead at CUP&A. Andri discussed how her role was newly created to address the carbon footprint within the digital publishing environment at CUP&A. In publishing, there has been a shift from print to digital, but after publishing digitally, what can be done to make it more sustainable? CUP&A are committed to being carbon zero by 2048, aiming for a 72% reduction by 2030. As 43% of all their digital emissions for the wider technology sector come from digital products such as software, CUP&A have been looking at how they can create their digital products more sustainably. They have been investigating methods to calculate digital emissions by looking at their hardware and cloud hosting, which is mostly Amazon Web Services (AWS) but they use some Cambridge data centres. Andri explained that it has been hard to find information on AWS data centre emissions, and that it is hard to pinpoint whether users are on a fixed-line or cellular internet connection (some cellular network towers use backup diesel generators, which have a higher environmental impact). AWS doesn’t supply accurate information on the emissions of using their services and Andri is fully aware that they are using data to get data!

Andri introduced the DIMPACT project (digital impact), where they are using the DIMPACT tool to report and better understand the carbon emissions of platforms serving digital media products. Carbon emissions of the academic publishing websites at CUP&A have reduced in the last year as the team looked at where they can make improvements. At CUP&A, they want to publish more and allow more to access their content globally, but this needs to be done in a sustainable way to not increase the websites’ carbon emissions. The page weight of web pages is also something to consider; heavy web pages due to media such as videos can be difficult to download for people in areas with low bandwidth so this needs to be taken into account when designing them. The Sustainable web design guide for 2023 has been produced with Wholegrain Digital, and can be downloaded for free. Andri mentioned that in the future they need to be aware of the impact of AI as it is becoming a significant part of publishing and academia and will increase energy consumption. 

Andri concluded by summarising that in academic publishing, they will always be adding more content such as videos and articles for download. It is likely that researchers may need to report on the carbon impact of research in the future, but the question of how best to do this is still to be decided. The impact of downloaded papers is also a question that the industry is struggling with, for example how many of these papers are read and stored. 

Digital Preservation: Promising for the Environment and Scholarship  

Alicia Wise, who is Executive Director at CLOCKSS, gave us an overview of the infrastructure in place to preserve scholarship for the long-term. This is vital to be able to reliably refer to research from the past. Alicia explained that there is an awareness to consider sustainability during preservation. When print publishing was the main research output, preservation was largely taken care of by librarians; in a digital world this is now undertaken by digital archives such as CLOCKSS. The plan is to prepare to archive research for future generations 200-500 years from now!

CLOCKSS was founded in the 1990s to solve the problem of digital preservation. There is now a growing collection of digitally archived books, articles, data, software and code. CLOCKSS consists of 12 mirror repository sites located across the world, all of which contain the same copies. The 12 sites are in constant communication, using network self-healing to restore the original if a problem is detected. CLOCKSS currently store 57.5 million journal articles and 530,500 books.  
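
As a rough illustration of the idea behind a self-healing network of mirrors, the Python sketch below compares checksums of the copies held at each site, treats the checksum held by the majority as the intact version, and overwrites any copy that disagrees. This is a deliberately simplified toy model and not the protocol CLOCKSS actually runs; the function names and the site labels in the example are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a copy as a hex string."""
    return hashlib.sha256(data).hexdigest()


def heal(copies: dict[str, bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Repair damaged copies by majority vote over checksums.

    copies maps a (hypothetical) site name to the bytes it currently holds.
    Returns the healed mapping and the list of sites that were repaired.
    """
    if not copies:
        raise ValueError("no copies to compare")
    digests = {site: sha256_hex(blob) for site, blob in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority; cannot decide which copy is intact")
    # Take the content from any site whose checksum matches the majority.
    good = next(blob for site, blob in copies.items() if digests[site] == winner)
    repaired = [site for site, digest in digests.items() if digest != winner]
    return {site: good for site in copies}, repaired


if __name__ == "__main__":
    sites = {"site_a": b"article", "site_b": b"article", "site_c": b"artXcle"}
    healed, repaired = heal(sites)
    print(repaired)
```

Checks like this have to be repeated continually across every site, which helps explain why integrity checking, rather than storage alone, can dominate an archive's energy use.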

CLOCKSS are a dark archive, which means they don’t provide access unless it is needed, such as when a publisher goes out of business, or a repository goes down. If this happens, the lost material is made open access. CLOCKSS have been working with the DIMPACT project to map and calculate their carbon footprint. They have looked at the servers at all their 12 repository sites to estimate the environmental impact. It became clear that not all their sites are equal. The best was their site at Stanford University, where the majority of the CLOCKSS machines are located. Stanford has a high renewable energy profile, largely due to their climate, and even have their own solar power plant! They also have a renewable, recirculating, chilled underground water system for cooling the servers. The site at Indiana University was their worst performing as this is supplied by 70% coal. The carbon emissions at the Indiana University site are estimated to be 9 tonnes of carbon per month (equivalent to a fleet of 20 petrol cars). 

Alicia explained that most of the carbon emissions come from the integrity checking (self-healing network). CLOCKSS’s mission is to reduce the emissions, and they are looking into whether, with the number of repository sites reduced to 6 copies, they could still predict that preserved content will be available in 500 years’ time. They are reviewing what they need to keep and informing publishers of their contribution so they can consider this impact.  

Alicia summarised by saying that it appears that digital preservation may have a lower carbon footprint than print preservation. CLOCKSS are working with the Digital Preservation Coalition to help other digital archives reduce their footprint too (with DIMPACT), and they are finalising a general tool for calculation of emissions that can be used by other archives. They don’t want to discourage long-term preservation, as currently, 25% of academic journals are not preserved anywhere. This risks access to scholarship in the future. They want to encourage preservation, but in an environmentally friendly way. 

Preserving for the future at the University of Cambridge 

There are many factors that could impact data remaining accessible now and over time. Digital Preservation maintains the integrity of digital files and ensures ongoing access to content for as long as necessary. Caylin Smith, Head of Digital Preservation at Cambridge University Libraries, gave an overview of the CUL Digital Preservation Programme that is changing how the Libraries manage their digital collection materials to ensure they can be accessed for teaching, learning, and research. These include the University’s research outputs in a wide range of content types and formats; born-digital special collections, including archives; and digitised versions of print and physical collection items.  

Preserving and providing access to data, as well as using cloud services and storing multiple copies of files and metadata, all impact the environment. Monitoring usage of cloud services and appraising the content are two ways of working towards more responsible Digital Preservation. Within the Programme, the team is delivering a Workbench, which is a web user interface for curatorial staff to undertake collection management activities, including appraising files and metadata deposited to archives. This work will help confirm that any deposited files, whether these are removed from a storage carrier or securely uploaded, must be preserved long term. Curatorial staff will also be alerted to any potential duplicate files, and will be able to export metadata for creating archival records and create an audit trail of appraisal activities before the files are moved to the preservation workflow and storage.  
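
Spotting duplicate files before they are preserved is something researchers can also try on their own project folders. The Python sketch below is a generic illustration, not the Workbench's implementation: it groups the files under a folder by their SHA-256 checksum and reports any group with more than one member. The function names are our own.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so that large files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_sha256(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Scan the current folder and list each group of identical files.
    for group in find_duplicates(Path(".")):
        print(", ".join(str(path) for path in group))
```

A checksum match only shows that two files are identical; deciding which copy to keep, and whether either needs long-term preservation, remains an appraisal decision for a person.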

Within the University Library, where the Digital Preservation team is based, there may be additional carbon emissions from computers kept on overnight to run workflows and e-waste (some of the devices that become obsolete may still have a use for reading data from older carriers e.g. floppy disk drives). Caylin explained that CUL pays for the cloud services and storage used by the Digital Preservation infrastructure, which means you can scale up and scale down as needed. They are considering whether there is a need for an offline backup and weighing up if the benefit to having such a backup would outweigh costs and energy consumption.  

Caylin discussed what they and other researchers could do to reduce the impact on the environment: use tools available to estimate personal carbon footprint and associated costs of research; minimise access to data where necessary to minimise use of computing. Ideally data centres and cloud computing suppliers should have green credentials so researchers can make informed choices. There is also a choice to be made about using second-hand equipment and repairing equipment where possible. At Cambridge we have the Research Cold Store which is low energy as it uses tapes and robots to store dark data, but the question remains as to whether this is really more energy efficient in the long term.   

What could help reduce the impact of research data on the environment? 

The afternoon session at the workshop involved group work to discuss two extreme hypothetical mandated scenarios for research data preservation. It allowed the pros and cons of each scenario to be addressed, how this could impact sustainability and problems that could arise. We will use the information gathered in this group session to consider what is possible right now to help researchers at the University of Cambridge make informed choices for research data sustainability. Some of the suggestions that could reduce research data storage (and carbon footprint) include improving documentation and metadata of files, regularly appraising files as part of weekly tasks and making data open to prevent duplication of research. It could also be helpful to address environmental sustainability at the start of projects such as in a Data Management Plan.  

We have learned in this workshop that research data can have an environmental impact and, as computing capabilities expand, this impact is likely to increase in the future. There are now tools available to help estimate research carbon footprints. We also need stakeholders (e.g. publishers, funders) to work together to advocate that relevant companies provide transparent information so researchers can make informed choices on managing their research data more sustainably.  


Data Diversity Podcast (#4) – Dr Stefania Merlo (1/2) 

Welcome back to the fourth instalment of Data Diversity, the podcast where we speak to Cambridge University Data Champions about their relationship with research data and highlight their unique data experiences and idiosyncrasies in their journeys as a researcher. In this edition, we speak to Data Champion Dr Stefania Merlo from the McDonald Institute for Archaeological Research, the Remote Sensing Digital Data Coordinator and project manager of the Mapping Africa’s Endangered Archaeological Sites and Monuments (MAEASaM) project and coordinator of the Metsemegologolo project. This is the first of a two-part series and in this first post, Stefania shares with us her experiences of working with research data and outputs that are part of heritage collections, and how her thoughts about research data and the role of the academic researcher have changed throughout her projects. She also shares her thoughts about what funders can do to ensure that research participants, and the data that they provide to researchers, can speak for themselves.   


I’ve been thinking for a while about the etymology of the word data. Datum in Latin means ‘given’. Whereas when we are collecting data, we always say we’re “taking measurements”. Upon reflection, it has made me come to a realisation that we should approach data more as something that is given to us and we hold responsibility for, and something that is not ours, both in terms of ownership, but also because data can speak for itself and tell a story without our intervention – Dr Stefania Merlo


Data stories (whose story is it, anyway?) 

LO: How do you use data to tell the story that you want to tell? To put it another way, as an archaeologist, what is the story you want to tell and how do you use data to tell that story?

SM: I am currently working on two quite different projects. One is Mapping Africa’s Endangered Archaeological Sites and Monuments (funded by Arcadia), which aims to create an Open Access database of information on endangered archaeological sites and monuments in Africa. In the project, we define “endangered” very broadly because ultimately, all sites are endangered. We’re doing this with a number of collaborators and the objective is to create a database that is mainly going to be used by national authorities for heritage management. There’s a little bit less storytelling there, but it has more to do with intellectual property: who are the custodians of the sites and the custodians of the data? A lot of questions are asked about Open Access, which is something that the funders of the projects have requested, but something that our stakeholders have got a lot of issues with. The issues surround where the digital data will be stored because currently, it is stored in Cambridge temporarily. Ideally all our stakeholders would like to see it stored in a server in the African continent at the least, if not actually in their own country. There are a lot of questions around this. 

The other project stems out of the work I’ve been doing in Southern Africa for almost the past 20 years, and is about asking how do you articulate knowledge of the African past that is not represented in history textbooks? This is a history that is rarely taught at university and is rarely discussed. How do you avail knowledge to publics that are not academic publics? That’s where the idea came from of creating a multimedia archive and a platform where digital representations of archaeological, archival, historical, and ethnographic data could be used to put together stories that are not the mainstream stories. It is a work in progress. The datasets that we deal with are very diverse because it is required to tell a history in a place and in periods for which we don’t have written sources.  

It’s so mesmerizing and so different from what we do in contexts where history is written. It gives us the opportunity to put together so many diverse types of sources. From oral histories to missionary accounts with all the issues around colonial reports and representations of others as they were perceived at the time, putting together information on the past environment combining archaeological data. We have a collective of colleagues that work in universities and museums. Each performs different bits and pieces of research, and we are trying to see how we would put together these types of data sets. How much do we curate them to avail them to other audiences? We’ve used the concept of data curation very heavily, and we use it purposefully because there is an impression of the objectivity of data, and we know, especially as social scientists, that this just doesn’t exist. 

I’ve been thinking for a while about the etymology of the word data. Datum in Latin means ‘given’. Whereas when we are collecting data, we always say we’re taking measurements. Upon reflection, it has made me come to a realisation that we should approach data more as something that is given to us and we hold responsibility for, and something that is not ours, both in terms of ownership, but also because data can speak for itself and tell a story without our intervention. That’s the kind of thinking surrounding data that we’ve been going through with the project. If data are given, our work is an act of restitution, and we should also acknowledge that we are curating it. We are picking and choosing what we’re putting together and in which format and framework. We are intervening a lot in the way these different records are represented so that they can be used by others to tell stories that are perhaps of more relevance to us. 

So there’s a lot of work in this project that we’re doing about representation. We are explaining – not justifying but explaining – the choices that we have made in putting together information that we think could be useful to re-create histories and tell stories. The project will benefit us because we are telling our own stories using digital storytelling, and in particular story mapping, but it could become useful for others as resources that can be used to tell their own stories. It’s still a work in progress because we also work in low resourced environments. The way in which people can access digital repositories and then use online resources is very different in Botswana and in South Africa, which are the two countries where I mainly work in this project. We also dedicate time to thinking about how useful the digital platform will be for the audiences that we would like to get an engagement from. 

The intended output is an archive that can be used in a digital storytelling platform. We have tried to narrow down our target audience to secondary school and early university students of history (and archaeology). We hope that the platform will eventually be used more widely, but we realised that we had to identify an audience to be able to prepare the materials. We have also realised that we need to give guidance on how to use such a platform so in the past year, we have worked with museums and learnt from museum education departments about using the museum as a space for teaching and learning, where some of these materials could become useful. Teachers and museum practitioners don’t have a lot of time to create their own teaching and learning materials, so we’re trying to create a way of engaging with practitioners and teachers in a way that doesn’t overburden them. For these reasons, there is more intervention that needs to come from our side into pre-packaging some of these curations, but we’re trying to do it in collaboration with them so that it’s not something that is solely produced by us academics. We want this to be something that is negotiated. As archaeologists and historians, we have an expertise on a particular part of African history that the communities that live in that space may not know about and cannot know because they were never told. They may have learned about the history of these spaces from their families and their communities, but they have learned only certain parts of the history of that land, whereas we can go much deeper into the past. So, the question becomes, how do you fill the gaps of knowledge, without imposing your own worldview? It needs to be negotiated but it’s a very difficult process to establish. There is a lot of trial and error, and we still don’t have an answer. 

Negotiating communities and funders 

LO: Have you ever had to navigate funders’ policies and stakeholder demands?  

SM: These kinds of projects need to be long and they need continuous funding, but they have outputs that are not always necessarily valued by funding bodies. This brings to the fore what funding bodies are interested in – is it solely data production, as it is called, and then the writing up of certain academic content? Or can we start to acknowledge that there are other ways of creating and sharing knowledge? As we know, there has been a drive, especially with UK funding bodies, to acknowledge that there are different ways in which information and knowledge is produced and shared. There are alternative ways of knowledge production from artistic ones to creative ones and everything in between, but it’s still so difficult to account for the types of knowledge production that these projects may have. When I’m reporting on projects, I still find it cumbersome and difficult to represent these types of knowledge production. There’s so much more that you need to do to justify the output of alternative knowledge compared to traditional outputs. I think there needs to be change so that it becomes easier, rather than more difficult than for mainstream outputs, for researchers that produce alternative forms of knowledge to justify it. 

One thing I would say is there’s a lot that we’ve learned with the (Mapping Africa’s Endangered Archaeological Sites and Monuments) project because there we engage directly with the custodians of the site and of the analog data. When they realise that the funders of the project expect to have this data openly accessible, then the questions come and the pushback comes, and it’s a pushback on a variety of different levels. The consequence is that basically we still haven’t been able to finalise our agreements with the custodians of the data. They trust us, so they have informed us that in the interim we can have the data as a project, but we haven’t been able to come to an agreement on what is going to happen to the data at the end of the project. In fact, the agreement at the moment is the data are not going to be going on a completely Open Access sphere. The negotiation now is about what they would be willing to make public, and what advantages they would have as a custodian of the data to make part, or all, of these data public.

This has created a disjuncture between what the funders thought they were doing. I’m sure they thought they were doing good by mandating that the data needs to be Open Access, but perhaps they didn’t consider that in other parts of the world, Open Access may not be desirable, or wanted, or acceptable, for a variety of very valid reasons. It’s a node that we still haven’t resolved and it makes me wonder: when funders are asking for Open Access, have they really thought about work outside of UK contexts with communities outside of the UK context? Have they considered these communities’ rights to data and their right to say, “we don’t want our data to be shared”? There’s a lot of work that has happened in North America in particular, because indigenous communities are the ones that put forward the concept of C.A.R.E., but in the UK we are still very much discussing F.A.I.R. and not C.A.R.E. I think the funders may have started thinking about it, but we’re not quite there. There is still this impression that Open Data and Open Access is a universal good without having considered that this may not be the case. It puts researchers that don’t work in the UK or the Global North in an awkward position. This is definitely something that we are still grappling with very heavily. My hope is that this work is going to help highlight that when it comes to Open Access, there are no universals. We should revisit these policies in light of the fact that we are interacting with communities globally, not only those in some countries of the world. Who is Open Access for? Who does it benefit? Who wants it and who doesn’t want it, and for what reasons? These are questions that we need to keep asking ourselves. 

LO: Have you been in a position where you had to push back on funders or Open Access requirements before? 

SM: Not necessarily a pushback, but our funders have funded a number of similar projects in South Asia, in Mongolia, in Nepal and the MENA region and we have come together as a collective to discuss issues around the ethics and the sustainability of the projects. We have engaged with representatives of our funders trying to explain that what they wanted initially, which is full Open Access, may not be practicable. In fact, there has already been a change in the terminology that is used by the funders. From Open Access, they changed the concept to Public Access, and they have come back to us to say that they can change their contractual terms to be more nuanced and acknowledge the fact that we are in negotiation with national stakeholders and other stakeholders about what should happen to the data. Some of this has been articulated in various meetings, but some of it was trial and error on our side. In other words, with our new proposal for renewal of funding, which was approved, we just included these nuances in the proposal and in our commitment and they were accepted. So in the course of the past four years, through lobbying of the funded projects, we have been able to bring nuance to the way in which the funders themselves think about Open Access. 


Stay tuned for part two of this conversation where Stefania will share some of the challenges of managing research data that are located in different countries!


Data Diversity Podcast #2 – Dr Alfredo Cortell-Nicolau

In our second instalment of the Data Diversity Podcast, we are joined by archaeologist Dr Alfredo Cortell-Nicolau, a Senior Teaching Associate in Quantitative and Computational Methods in Archaeology and Biological Anthropology at the McDonald Institute for Archaeological Research and Data Champion.

As is the theme of the podcast, we spoke to Alfredo about his relationship with data and learned from his experiences as a researcher. The conversation also touched on the different interpersonal, and even diplomatic, skills that an archaeologist must possess to carry out their research, and how one’s relationship with individuals such as landowners and government agents might impact their access to data. Alfredo also sheds light on some of the considerations that archaeologists must go through when storing physical data and discussed some ways that artificial intelligence is impacting the field. Below are some excerpts from the conversation, which can be listened to in full here.

I see data in a twofold way. This implies that there are different ways to liaise with the data. When you’re talking about the actual arrowhead or the actual pot, then you would need to liaise with all the different regional and national laws regarding heritage and how they want you to treat the data because it’s going to be different for every country and even for every region. Then, of course, when you’re using all this morphometric information, all the CSV files, the way to liaise with the data becomes different. You have to think of data in this twofold way.

Dr Alfredo Cortell-Nicolau

Lutfi Othman (LO): What is data to you?

Alfredo Cortell-Nicolau (ACN): In archaeology in general, there are two ways to see the data. In my case for example, one way to see it is that the data is the arrowhead and that’s the primary data. But then when I conduct my studies, I extract lots of morphometric measures and I produce a second level of data, which are CSV files with all of these measurements and different information about the arrowheads. So, what is the data? Is it the arrowhead or is it the file with information about the arrowhead? This raises some issues in terms of who owns the data and how you are going to treat the data because it’s not the same. In my case, I always share my data and make everything reproducible. But when I share my data, I’m sharing the data that I collected from the arrowheads. I’m not sharing the arrowheads because they are not mine to share.

This is kind of a second layer of thought when you’re working with Archaeology. When you’re studying, for example, pottery residues, then you’re sharing the information of the residues and not the pot that you used to obtain those residues. There are two levels of data. Which is the actual data itself? The data which can be reanalyzed in different ways by different people, or the data that you extracted only for your specific analysis? I see data in this twofold way. This implies that there are different ways to liaise with the data. When you’re talking about the actual arrowhead or the actual pot, then you would need to liaise with all the different regional and national laws regarding heritage and how they want you to treat the data because it’s going to be different for every country and even for every region. Then, of course, when you’re using all this morphometric information, all the CSV files, the way to liaise with the data becomes different. You have to think of data in this twofold way.

On some of the barriers to sharing of archaeological data

ACN: There are some issues in how you would acknowledge that the field archaeologist is the one who got the data. Say that you might have excavated a site in the 1970s and some other researcher comes later, and they may be doing many publications after that excavation, but you are not always giving the proper attribution to the field archaeologist because you cited the first excavation in the first publication, and you’re done. Sometimes, that makes field archaeologists reluctant to share the data because they don’t feel that their work is acknowledged enough. This is one issue which we need to try to solve. Take for example a huge radiocarbon database of 5000 dates: if I use that database, I will cite whoever produced that database, but I will not be citing everyone who actually contributed indirectly to that database. How do I include all of these citations? Maybe we can discuss something like meta-citations, but there must be some way in which everyone feels they are getting something out of sharing the data. Otherwise, there might be a reaction where they think “well, I just won’t share. There’s nothing in for me to share it so why should I share my data”, which would be understandable.

On dealing with local communities, archaeological site owners and government officials

ACN: When we have had to deal with private owners, local politicians and different heritage caretakers, not everyone feels the same way. Not everyone feels the same way about everything, and you do need a lot of diplomatic skills to navigate through this because to excavate the site you need all kinds of permits. You need the permit of the owner of the site, the municipality, the regional authorities, the museum where you’re going to store the material. You need all of these to work and you need the money, of course. Different levels of discussion with indigenous communities is another layer of complexity which you have to deal with. In some cases, like in the site where we’re excavating now, the owner is the sweetest person in the world, and we are so lucky to have him. I called him two days ago because we were going to go to the site, and I was just joking with him, saying I’ll try not to break anything in your cave, and he was like, “this is not my cave. This is heritage for everyone. This is not mine. This is for everyone to know and to share”. It is so nice to find people like that. That may happen also with some kinds of indigenous communities. The levels of politics and negotiation are probably different in every case.

On how archaeologists are perceived

LO: When you approach a field or people, how do they view the archaeologists and the work?

ACN: It really depends on the owner. The one that we’re working with now, he’s super happy because he didn’t know that he had archaeology in his cave. When we told him, he was happy because he’s able to bring something to the community and he wants his local community to be aware that there is something valuable in terms of heritage. This is one good example. But we have also had other examples, for instance, where the owner of the cave was a lawyer and the first thing he thought was “are there going to be legal problems for me? If something happens in the cave, who has the legal responsibility?” In another case there was another person that just didn’t care, she said “you want to come? Fine. The field is there, just do whatever you want.” So, there are different sensibilities to this. Some people are really happy about the heritage and don’t see it as a nuisance that they have to deal with.

LO: How about yourself as a researcher, archaeologist: do you see yourself as a custodian of sorts, or someone who’s trying to contribute to the local heritage of the place? Or is it almost scientific and you’re there to dig?

ACN: When I approach the different owners, I think the most important thing is to let them know that they have something valuable to the local community and they can be a part of that. They can be a part of being valuable to the local community. Also, you must make it clear that it’s not going to be a nuisance for them and they don’t have to do anything. I think the most important part is letting them know how it can be valuable for the community. I usually like them to be involved, and they can come and see the cave and see what we are doing. In the end it’s their land and if they see that we are producing something that is valuable to the community then it is good for them. In this case, the type of data that we produce is the primary type of data, that is, the actual different pottery sherds, the different arrowheads, etcetera. In this current excavation, we got an arrowhead that is probably some 4- or 5000 years old and you get (the land owners) to touch this arrowhead that no one in 5000 years has seen. If you can get the owners to think of it in this way, that they’re doing something valuable for your community, then they will be happier to participate in this whole thing and to just let us do whatever we want to do, which is science.

LO: How do you store physical data? Or do you let the landowner store it?

ACN: That depends on the national and regional laws, and different countries have different laws about this. The cave where I’m working right now is in Spain, so I’m going to talk about the Spanish law, which is the one that I follow, and it’s going to be different depending on every country. In our case, with the different assemblages that you find, you have a period of up to 10 years where you can store them yourself in your university and that period is for you to do your research with them. After that period, they go to whichever museum they are supposed to go to, which depends on the law that says that it has to be the museum that is the closest to the cave or site where they were excavated. Here, the objects can then be displayed and the museum is the one responsible for managing them, and storing them long term.

There is one additional thing: If you are excavating a site that has already been excavated, then there is a principle of keeping the objects and assemblages together. For example, there is this cave that was excavated in the 1950s and they store all the assemblages in the Museum of Prehistory of Valencia, which was the only museum in the whole region. Now, they excavated it again a few years ago and now there are museums that are closer to the cave but because the bulk of the assemblages are in Valencia and they don’t want to have it separated in two museums, they still have to go to Valencia. This is the principle of not having the assemblages separated and it is the most important one.


As always, we learn so much by engaging with our researchers about their relationship with data, and we thank Alfredo for joining us for this conversation. Please let us know how you think the podcast is going and if there are any questions relating to research data that you would like us to ask!

The September 2023 Data Champion Forum

The Cambridge Data Champions had a fantastic September Forum at the West Hub. The forum started with an introduction to the West Hub by Library Manager Daniele Campello and we welcomed Clair Castle as the new interim Research Data Manager with the Office of Scholarly Communication (University Library).

Dr Mandy Wigdorowitz kicked off the presentations by sharing with the Data Champions what she aims to achieve as the University’s Open Research Community Manager. This includes raising the profile of Open Research at the University and ensuring that scholarly and research outputs that are deemed to be open are indeed accessible and interoperable in accordance with FAIR principles. As Open Research Community Manager, Mandy advocates for Open Research among University researchers from both the STEMM and AHSS (Arts, Humanities and Social Sciences) disciplines. The latter proves to be more challenging as researchers in AHSS may often have valid reasons for refraining from making their research data open, such as working with sensitive data or working with interlocutors who object to their data being shared. Such issues will be addressed at the Cambridge Open Research Conference that she is organising, which takes place on 17th November 2023 at Downing College, Cambridge as well as online. To end, Mandy invited the Data Champions to join her Open Research initiative, a community of advocates for Open Research across the University.

Before lunch, Madeleine Taylor (Information Security Risk and Governance Manager with University Information Services, UIS) presented a follow-up to a webinar session on monitoring the Information and Cybersecurity (ICS) risks for research data across the university, which she conducted with the Data Champions a couple of weeks prior. After a brief introduction to what she has done so far to protect Cambridge’s research communities against ICS threats, she asked the Data Champions for help in her task of securing research data against ICS risks. They can do so by providing her with a sense of what data their own research communities are working with and how they are storing it. As the Data Champions ate the delicious lunch of sandwiches and cakes provided by the West Hub caterers, they provided feedback to Madeleine on two forms that she proposed as methods of gathering the information she needed: a 3-minute research data impact assessment form and a research data cyber security risk form. Maddy will continue to work with the Research Data Team and the Data Champions to refine these forms and to gather information through them.

Thank you to the West Hub and Daniele Campello for hosting the Data Champions Forum in your welcoming building!

If you are a member of the University of Cambridge and are interested in attending the Data Champions Forum, please join us as a Data Champion. If you are passionate about research data management and data sharing or you would like to find out more about what being a Data Champion entails, please visit the Data Champions webpage. We welcome applications from those working in all academic subjects across AHSS and STEMM disciplines. If you are unsure about how being a Data Champion would impact your research, please get in touch with the Research Data Team!

Cartoon by Clare Trowell CC-BY-NC-ND



Open Research in Cambridge: 2022 in review

2022 has been another fantastic year for Open Research in Cambridge and I’m so proud of what we have achieved together as a community of researchers, library staff, technicians, administrators, publishers and more. I’d like to highlight some of the key themes in our work this year and thank all who have contributed to this work in any way throughout the year (though I have limited myself to naming chairs of workstrands below). The following video by our Pro-Vice-Chancellor (Research), Prof Anne Ferguson-Smith, gives an indication of the importance that the university places on this work.

Understanding disciplinary differences

I know that I’m not alone in hearing that researchers in Arts, Humanities & Social Sciences disciplines often feel a disconnect between the language and priorities of “Open Science” and their experiences of how research is conducted – this is one of the reasons we choose to frame it as “Open Research” here in Cambridge. I see a strong desire from many to engage with open research practices, paired with frustration with the challenges of translating the terminology of open science to other areas. In order to better understand these issues, we established two working groups (Open Research in the Humanities and Open Qualitative Research), each of which was tasked with setting aside what they think they should do because of how open science is generally described, and instead describing what they see as the opportunities for open research within their disciplines.

The Open Research in the Humanities group was chaired by Prof Emma Gilby and supported by Dr Matthias Ammon. Their excellent report is already available on Apollo and through a series of blog posts here on Unlocking Research. The Open Qualitative Research group was chaired by Dr Meg Westbury and their report is due to come to the university’s Open Research Steering Committee in January. We will be sharing this more widely in early 2023 – it’s well worth watching out for! Both reports will inform how we talk about Open Research at Cambridge and will shape the transformative programme that we are in the process of developing.

Research data management

Our small but dedicated Research Data team, led by Dr Sacha Jones, has had another impressive year. Our Data Champions Network goes from strength to strength, and has expanded into departments that have not been represented in previous years. Other key projects have included a review of our research data services with recommendations for future development, a project on electronic research notebooks, and lots of work to support open research system developments, all while continuing to support researchers with data deposits and writing data management plans. This team is expanding next year which will enable even more work to meet the needs of different disciplines.

The future of scholarly publishing

We hosted a series of three strategic workshops on the future of scholarly communication earlier this year, developed in collaboration between Cambridge University Libraries and Cambridge University Press. Led by independent facilitator Mark Allin, participants across disciplines and career stages came together to discuss the problems of scholarly communication, potential long-term solutions to these problems and a strategy to help Cambridge get there. The proposals emerging from the meeting are currently being developed and include newly developed infrastructures for diamond open access publishing projects and a series of high-level meetings aimed at strategic improvements to equity in academic publishing. There are already diamond publishing initiatives within Cambridge, and projects will start in early 2023 to understand existing initiatives in greater detail and to provide the infrastructure to establish additional diamond journals.

The library’s annual Open Research Conference took a similar visionary approach in its focus on the future of open access. Titled Open Access: Where Next?, the conference featured expert speakers on how we can think beyond open access toward more innovative, sustainable and equitable open futures. We heard from researchers excluded by certain approaches to open access and about how other researchers are addressing issues through their own scholar-led approaches, alongside how openness fits into changing research cultures and can facilitate experimental publishing projects. A full round-up with videos of each session is available on the Unlocking Research blog. My thanks to Dr Bea Gini for her leadership in planning this conference.

Open Access now

While we are actively working towards a new future for scholarly publishing, we also need to ensure that our researchers have ways to make their work open access right now. We do this in a number of ways: engaging with the academic community, contributing expert open access advice on publishing agreements that are being negotiated across the sector, and administering the block grants that are provided by funders and the university to cover the costs of publishing in fully open access venues. All of this requires close reading and interpretation of funder requirements to ensure that we are able to support our researchers in what they are required to do as well as what they would like to do. I’d like to specifically thank Alexia Sutton, who leads our Open Access team, and Dr Samuel Moore, our Scholarly Communication Specialist, for their leadership in this area.

We are particularly pleased with the engagement from across the university with the ongoing Rights Retention Pilot, which provides a route to open access for articles that cannot be made immediately available through existing publishing deals, are not eligible for the block grants mentioned above or where the publisher simply does not provide any route to immediate open access. We are now consulting on the development of a Self-Archiving Policy which is built on what we have learned through the pilot and will sit within our Open Access Publications Policy Framework. Members of the university can find out more by reading this document (accessible to Raven users only). It has been an honour to lead a dedicated group of library and research staff on this project.

Open research systems

Everything we do requires that we have the right technical infrastructure in place. The Open Research Systems team is led by Dr Agustina Martinez-Garcia and based within Cambridge University Libraries’ Digital Initiatives directorate. This year has seen projects to upgrade links between Symplectic Elements and Apollo, technical changes to support the rights retention pilot, a review of the open research systems landscape, contributing to thinking around future publishing platforms, electronic research notebooks and data infrastructure, and planning ahead for the upgrade to DSpace 7, improvements in the thesis service, and building connections between DSpace repositories and Octopus. This is not a comprehensive list and we plan to showcase more of their work on the blog in 2023.

Research enquiries, briefings and training

I want to end with huge thanks to the library staff based both in the Office of Scholarly Communication and in the Faculty & Department Libraries who do so much throughout the year, answering frontline research support queries, signposting as required, providing tailored briefings and training on highly complex and constantly changing topics. We especially value the disciplinary insights we get through working closely with the Research Support Librarians that are based within the Schools.

Join our team!

Open Research is an incredibly rewarding area to work in and the scale of what we’re trying to achieve is really ambitious. I’m delighted that the importance of what we are doing is recognised by both Cambridge University Libraries and the wider university and as a result we are expanding our team!

We are currently recruiting for an Open Research Community Manager to establish and develop a Cambridge Open Research Community, bringing researchers across the university community together through regular online and in person events to enable exchange of expertise in open and rigorous research practices. In January, we plan to advertise for two Research Data Coordinators and an Open Research Administrator, with a Research Services Manager post following later in the year. All of these roles will be listed on the university’s jobs site as well as on LinkedIn, mailing lists etc. If you’re interested in our work and would like to find out more about these opportunities please get in touch at info@osc.cam.ac.uk!


Open Research in the Humanities: CORE Data

Authors: Emma Gilby, Matthias Ammon, Rachel Leow and Sam Moore

This is the third of a series of blog posts, presenting the reflections of the Working Group on Open Research in the Humanities. Read the opening post at this link. The working group aimed to reframe open research in a way that was more meaningful to humanities disciplines, and their work will inform the University of Cambridge approach to open research. This post reflects on the concept of FAIR data and proposes an alternative way of thinking about data in the humanities.

As a rule, data in the arts and humanities is collected, organised, recontextualised and explained. We are therefore putting forward this acronym as an alternative to LERU’s FAIR data (findable, accessible, interoperable, reusable). Our data is collected rather than generated; organised and recontextualised in order to further a cultural conversation about discoveries, methods and debates; and explained as part of the analytical process. Any view of scholarly comms as uniquely about the distribution of and access to FAIR data (‘from my bench to yours’) will seem less relevant to A&H academics. Similarly, the goal of reproducibility of data – in the sense in which this often appears in the sciences and social sciences, where it refers to the results of a study being perfectly replicable when the study is repeated – is, if anything, contrary to the aim of CORE data: i.e. the aim that this data should be built upon and thereby modified through the process of further recontextualization. Our CORE data, then, understood as information used for reference and analysis, is made up of texts, music, pictures, fabrics, objects, installations, performances, etc. Sometimes, this information does not belong to us, but is owned by another person or institution or community, in which case it is not ours to make public.

Opportunities

The A&H tend to bring information together in new ways to further discussion about socio-cultural developments across the globe. Available digital data is only the tip of the iceberg when it comes to the material that is worked with.[1] Arts and humanities scholars, who spend their lives thinking about the arrangement and communication of information, are acutely aware that archives (digital and otherwise) are not neutral spaces, but man-made and the product of human choices. This means that information available online, to a broadband-enabled public, is asymmetrical and distorted.

One of the main benefits of open research is that it is thought to make data globally accessible, especially to ‘the global south’ and to institutions with fewer available funds to ‘buy data in’. As we explore below (‘research integrity’), this unidirectional view of open access is problematic. In general, digital material tends to reproduce English-speaking structures and epistemologies. As FAIR data is redefined as CORE data, an attention to context will hopefully promote the diverse positions occupied by all those who make up the world and who produce research about it.

Support required

In order usefully to employ CORE data in the A&H, we need to bring to the surface and examine underlying assumptions about knowledge creation as well as knowledge dissemination.

The work of the digital humanities – rooted explicitly in digital technologies and the forms of communication that they enable – is obviously a vital part of these discussions about opening up the CORE data of the humanities. Digital work, in the same way as any other successful A&H research, needs to consider its own materiality and conditions of production, evaluate its own history, draw attention to its own limits, and navigate its trans-temporal relationships with data in other forms (the manuscript, the printed text, the painting, the piece of music). This is a developing field and one that still has an uneasy relationship with the existing tenure/promotions system.[2] Colleagues noted that training needs are evolving constantly. It is often hard to know where to turn for specific guidance in e.g. how to manage one’s own ‘born digital’ archives, how to deconstruct a twitter archive, and so on.

This issue also overlaps with the need, as part of the ‘rewards and incentives’ process outlined below, to evaluate the success of colleagues as they undertake this training and negotiate with these processes. DH is one of the most exciting and rapidly developing areas of research and needs to be widely resourced. But it would also be harmful to collapse all A&H research into ‘the digital humanities’. The work of colleagues whose CORE data is resistant, for whatever reason, to wide online dissemination in English also needs to be allocated the value it deserves: some publics are simply smaller than others.

Postscript: the group subsequently became aware of the CARE Principles of Indigenous Data Governance. These principles will also be considered when developing our services in support of data management and ethical sharing.


[1] Erzsébet Tóth-Czifra, ‘The Risk of Losing the Thick Description: Data Management Challenges Faced by the Arts and Humanities in the Evolving FAIR Data Ecosystem’, in Digital Technologies and the Practices of Humanities Research, edited by Jennifer Edmond (Open Book Publishers, 2020), https://doi.org/10.11647/OBP.0192.10

[2] See the excellent article by Cait Coker and Kate Ozment, ‘Building the Women in Book History Bibliography, or Digital Enumerative Bibliography as Preservation of Feminist Labor’, Digital Humanities Quarterly 13 (3), 2019, http://www.digitalhumanities.org/dhq/vol/13/3/000428/000428.html – where the authors of the ‘Women in Book History’ digital bibliography still see the tenure system as ‘monograph-driven’, and had to fund their research through selling merchandise.

Cambridge Data Week 2020 day 2: Who is reusing data? Successes and future trends?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.  

The rest of the blogs comprising this series are as follows:
Cambridge Data Week day 1 blog
Cambridge Data Week day 3 blog
Cambridge Data Week day 4 blog
Cambridge Data Week day 5 blog

Introduction

Reuse of data is the final element of the FAIR principles and has long been argued as a central benefit of data sharing, allowing others access to a wealth of research and making research funding more efficient by removing the need to duplicate work. Yet we are still in the process of evaluating success in this area. This webinar brought together speakers to discuss what we know about the current state of play around data reuse, what researchers can do to increase the reuse potential of their data, and possible future developments in data reuse.

Our speakers – Louise Corti (UK Data Archive) and Tiberius Ignat (Scientific Knowledge Services) – looked at data reuse from two different perspectives. Louise focused on the reuse of UK Data Service collections, sharing some examples of their most widely used data sets, discussing what makes them popular and sharing some principles that can be used both to make data more reusable and to promote it for reuse. Tiberius discussed the prevalence of data reuse by machines and the possibility of granting machines data reuse rights.

Louise’s presentation gave an overview of the portfolio of data sets hosted by the UK Data Service, looked at their top 20 most downloaded datasets and discussed the underlying principles that have led to them being widely reused. As well as demonstrating some commonalities between these datasets, Louise also outlined the principles used by the UK Data Service to promote their collections for reuse.

Tiberius’ presentation looked at data reuse from a different perspective, serving as a call to action to share research data responsibly and protect it against the reuse of machines designed to persuade humans. One of Tiberius’ main arguments was that no research data from public projects should be made available to feed and develop persuasive algorithms.

The presentations motivated an interesting discussion covering a broad range of topics. These included the reuse of qualitative data, how we can implement ethical safeguards for data reuse, the idea of data ethics as a continuum, whether we can accept positive cases of algorithmic persuasion such as to promote equality and diversity, and the possibility of creating specific licences prohibiting data reuse by persuasive algorithms. See below for a video and transcript of the session.

Audience composition

We had 341 registrations with just over 65% originating from the Higher Education sector. Researchers and PhD students accounted for nearly 37% of the registrations whilst research support staff accounted for an additional 33%. We also had registrations from at least 30 countries outside of the UK including significant attendance from Denmark, Holland, Germany and Canada. We were thrilled to see that on the actual day 187 people attended the webinar.

We held five online webinars during Cambridge Data Week and were pleased to see that nearly 25% of the participants attended more than one webinar. A total of 1364 people registered and more than 700 attended altogether, with the rest possibly watching the recordings at a later date. Most of all we were pleased to welcome participants from all over the world and see how important research data management topics are globally.

Where data was available, we identified the following countries apart from the UK:  Australia, Austria, Bangladesh, Brazil, Canada, Colombia, Croatia, Czech Republic, Denmark, France, Germany, Greece, Holland, Hungary, Iran, Luxembourg, Moldova, Norway, Poland, Romania, Singapore, Spain, Sweden, Switzerland, Turkey, Ukraine and the USA.

Recording, transcript and presentations

The video recording of the webinar can be found below and the recording, transcript and presentations are present in Apollo, the University of Cambridge repository.

Bonus material

After the session ended, we continued the discussion with Louise and Tiberius looking in particular at one question posed by an audience member:

AI can always be used either for good or bad. Instead of locking-in, how can we enhance technology through data and regulation? 

Tiberius Ignat: I think at this point we need regulation. I’m not a big fan of using regulations, to be honest. I think it’s much better to motivate people but, in this case, it’s quite a bit of control that has been lost, so I think we should have a regulation on how research data can be reused by others. This is how the internet has been made profitable during the last decade: through non-human persuasion. All these companies that are giving so much away for free are making billions of dollars when you look at the stock market. We were not clear how they were making this profit until recently when we realised that they are doing it by changing our behaviour and I think the rest of society – including research organisations – are behind them, so we need some regulation.

A good example is with GDPR. It has been introduced to protect our data, our digital footprint. On ResearchGate or Eurosport, or any other website, we used to be asked to agree to cookies or not. Recently, a new option called “Legitimate interest” has been slipped in and our digital data is again collected – less noticeably – by invoking questionable legitimate rights. The organisations whose model is based on persuading need cookie data, so they have moved the discussion away from remaining GDPR compliant to defending their legitimate interests. They are fighting to take data away from us. We can tackle this with regulation faster but in the long term we need to educate people to be more aware. We do have licenses such as Creative Commons but I’m not sure we have the right ones to protect us.

Louise Corti: There are a variety of licenses, but they are abused and it’s very hard to track along the way what has gone wrong. I quite like the UK Government’s approach with some of their statistical data that has to go through a legal gateway. Some data can be made available for research, but it has to be done for the public good. We also have the Ethics Self-Assessment Tool, which is a grid you go through provided by the Statistics Authority and it asks you to think along lots of different dimensions of ethics. This helps researchers get a better sense of what they are trying to do, but whether the people we are talking about would care about it is a very different matter. Having been in research ethics for a very long time, that is by far the best tool I’ve seen and I recommend everyone uses it. The UK Data Archive uses it to evaluate some of the projects we deal with because you often find university ethics approvals are not good enough for the Statistics Authority, as they often don’t understand quantitative secondary analysis, so the ethics scrutiny is not good enough. The Self-Assessment offers a much more nuanced way of thinking about the different dimensions of ethics and it helps researchers to be a bit more reflective about what’s good and what’s not.

Conclusion

Overall, the session provided a compelling blend of both the practical and conceptual elements of data reuse, each raising questions which could have easily been entire sessions in themselves. Louise’s presentation gave an excellent overview of the UK Data Service’s approach to making their datasets more reusable and promoting them to maximise their chances of being reused. Tiberius’ session raised some interesting questions surrounding data reuse and the ethics of using algorithms to persuade humans, as well as looking at some practical options for protecting research data from reuse for nefarious ends. At the end of the session, the audience were asked to participate in a poll on “What future developments are needed to increase the prevalence of data reuse?”.

Audience responses to poll held at the end of the event

The results were unsurprising to either speaker, with each touching on the idea that a change in research culture is necessary to ensure data reuse projects are seen as equal to data-generating projects. The need for cultural change is a theme that ran throughout each of the sessions in Data Week and is perhaps one of the current major challenges in scholarly communication.

Resources

Data Access and Research Transparency (DA-RT): A Joint Statement by Political Science Journal Editors

Robots appear more persuasive when pretending to be human

Behavioural evidence for a transparency–efficiency tradeoff in human–machine cooperation

The next-generation bots interfering with the US election

IBM’s AI Machine Makes A Convincing Case That It’s Mastering The Human Art Of Persuasion

AI Learns the Art of Debate

CSI-COP

Published on 25 January 2021

Written by Dominic Dixon

CCBY icon

Cambridge Data Week 2020 day 3: Is data management just a footnote to reproducibility?

Cambridge Data Week 2020 was an event run by the Office of Scholarly Communication at Cambridge University Libraries from 23–27 November 2020. In a series of talks, panel discussions and interactive Q&A sessions, researchers, funders, publishers and other stakeholders explored and debated different approaches to research data management. This blog is part of a series summarising each event.

The other blogs in this series are:
Cambridge Data Week day 1 blog
Cambridge Data Week day 2 blog
Cambridge Data Week day 4 blog
Cambridge Data Week day 5 blog

Introduction

The third day of Cambridge Data Week consisted of a panel discussion about the relationship between reproducibility and Research Data Management (RDM), looking for ways to advocate effectively to reach positive outcomes in both areas. Alexia Cardona (University of Cambridge), Lennart Stoy (European University Association), Florian Markowetz (University of Cambridge & UK Reproducibility Network), and René Schneider (Geneva School of Business Administration) offered their perspectives on whether RDM really is just a ‘footnote’ to the more popular concept of reproducibility.

The speakers agreed that we are still in need of cultural change towards better data management and reproducibility. The word ‘reproducibility’ is more likely to excite researchers and it is important to craft messages that work for each group, hence the emphasis on this term. In contrast to the Cambridge Data Week event on data peer review, the discussion here focused on engaging senior researchers, from PIs to Heads of Institutions, motivating them to be not just good data managers, but great data leaders.

Among the key elements needed to drive best practice in this area, two stood out. The first is communities. Whether these are reproducibility circles of peers, or networks like the Cambridge Data Champions, communities are key to creating and implementing guidelines for data management. The second element is a solid technological infrastructure. For instance, blockchains could be used to enable reproducibility in citations in the humanities, or Persistent Identifiers, used at a very granular level, could lead to better data reuse.

Recording, transcript and presentations

The recording, transcript and presentations from the webinar are available in Apollo, the University of Cambridge repository.

Bonus material

There were a few questions we did not have time to address during the live session, so we put them to the speakers afterwards. Here are their answers:

What are good practices regarding data deletion?

Florian Markowetz: It very much depends on what kind of data you have, it’s hard to give general directions. However, drives and other hardware are becoming cheaper and cheaper, so I would say ‘save everything’.

René Schneider: I would agree. I have spoken to researchers who keep all their data, because it would create too much work to sort what to keep and what to delete.

Alexia Cardona: We tend to talk more about data archiving than data deletion. I often hear about data deletion where it has created problems, for example an account has been deleted in bulk when a researcher left an institution, so unpublished data and scripts are lost due to lack of communication. There are also cases on the internet of PhD students losing their entire thesis when their laptop crashed, so this issue goes hand in hand with data storage and backup. Let’s focus on good practices and archiving of data; deletion is the very last thing to worry about.

Lennart Stoy: It’s worth mentioning that there is often a compulsory period that data should be kept for, perhaps 3 years or 5 years according to funders’ mandates, so data should be stored for some time. I suppose the expense could become an issue in the coming years; some Universities are already concerned about the cost of having to buy large amounts of cloud storage space. There are also discussions in the Open Science Cloud teams about what to preserve in the long term. We want to make sure we preserve the higher value datasets, but of course it’s hard to define which ones those are.

Couldn’t scholarly communities of practice or learned societies create guidelines for reproducibility and good data management?

Lennart Stoy: Absolutely, they must be involved as they are the ones with the specific knowledge. This is the idea behind the Research Data Alliance (RDA) and the National Research Data Infrastructure (NFDI) in Germany. In those cases, you have to prove a link to the community in that field to establish a consortium. It is great when communities structure their areas of infrastructure from the bottom up.

What roles could Early Career Researchers (ECRs) have? Could they act as code-checkers to assist reproducibility, or are we asking too much of them given their busy schedules? Would they receive credit for this?

Florian Markowetz: Senior academics have no excuses for not getting more involved in this once they have stable positions. It’s easy for people in my position to point to students, or to funders, saying they are not doing enough, but we should not be pointing away from ourselves, we should do the work. It could be coupled to pay rises: if you hold any role above grade 12 it’s your job now to sort this all out.

René Schneider: I have been thinking about the role of data custodians or similar. If we ask researchers to spend a lot of time just checking data, like ‘warehouse workers’, we could be undervaluing their role. I don’t think it’s necessarily the researchers who should do the work, especially not ECRs, there should be other roles dedicated to this.

Alexia Cardona: I second that, researchers are supposed to focus on the research, not necessarily the data checking and curation. But the unfortunate truth is that with short contracts and lack of resources the work is left to them. Another problem is the lack of rewards. For instance in my area, training, there’s no reward for people who take the time to make their training FAIR. We should embrace more openness and fairness, including rewarding those who do the work.

Lennart Stoy: This is something we’ve been working on but it’s a challenging system to change because there are so many elements to disentangle. It relates to intense competition for jobs, the culture in different disciplines, and the pressure to publish in certain journals. Some Universities are very serious about implementing DORA and I hope that in a few years these will be able to show high levels of satisfaction among PhD students and ECRs. A lot depends on the leadership at the institutional level to initiate change, for instance the rector at Ghent University in Belgium has been driving DORA-inspired reward mechanisms and the Netherlands is also moving ahead and away from journal-based factors. The University of Bath is an example in the UK that I’ve heard mentioned a lot. We’re following progress in all these examples and will write up DORA good practice case studies to inspire other organisations. But it is a hard problem: ECRs have a lot on the line, and it’s important not to jeopardise their careers.

Conclusion

This compelling discussion left us feeling that it does not matter too much which words we emphasise: reproducibility, data management, data leadership, or something else entirely. What matters is that we spark interest and commitment in the right groups of researchers to drive progress. Creating a culture where great research practices are routine will take effective advocacy, but also rewards that align with our aims and the right technical infrastructure to underpin them.

Resources

The UK Data Service is a data repository funded by the Economic and Social Research Council (ESRC), which also provides extensive resources on data practices.

The journal PLOS Computational Biology introduced a pilot in 2019 where all papers are checked for the reproducibility of models.

Is there a reproducibility crisis? Baker’s 2016 paper in Nature reporting the results of a survey that exposed the extent of the reproducibility crisis.

San Francisco Declaration on Research Assessment (DORA), a set of recommendations for institutions, funders, publishers, metrics companies and researchers, aiming for a fairer and more varied system of research quality assessment.

Published on 25 January 2021

Written by Beatrice Gini

CCBY icon