
The Research Data Sustainability Workshop – November 2024

The rapid advance of computing and data centres means that an increasing amount of research data is generated and stored worldwide, and there is an emerging awareness that this may have an impact on the environment. Wellcome have recently published their Environmental sustainability policy, which stipulates that Wellcome-funded research projects must be conducted in an environmentally sustainable way. Cancer Research UK have also updated their environmental sustainability in research policy, and it is anticipated that more funders will adopt similar policies in the near future.

In November we held our first Research Data Sustainability Workshop in collaboration with Cambridge University Press & Assessment (CUP&A). The aim was to address some of the areas common to researchers, with a focus on how research data can impact the environment. The workshop was attended by Cambridge Data Champions and other interested researchers at the University of Cambridge. This blog summarises some of the presentations and group activities that took place at the workshop, to help us better understand the impact that processing and storing data can have on the environment and to identify what steps researchers could take in their day-to-day research to minimise that impact.

The Invisible Cost of Storing Data

Our first speaker at the workshop was Dr Loïc Lannelongue, Research Associate at the University of Cambridge. Loïc leads the Green Algorithms initiative, which aims to promote more environmentally sustainable computational science and has developed an online calculator for estimating the carbon footprint of computation. Loïc suggested that the aim is not to stop having data, as we all use computing, but to be more aware of the work we do and its impact on the environment so that we can make informed choices. He emphasised that computing is not free, even though it might look that way to the end user. There is an invisible cost to storing data: while the exact figures are largely unknown, estimates suggest that data centres emit around 126 Mt of CO2e per year. Loïc further explained that the footprint involves much more than greenhouse gas emissions, including water use, land use, minerals and metals, and human toxicity. For example, a huge amount of water is consumed to cool data centres, and cheaper data centres often tend to use larger amounts of water.

Loïc went on to discuss the wide range of carbon footprints in research, with some datasets having a large footprint. The estimate for storing data is roughly 10 kg CO2e per TB per year, although many varying factors could affect this figure. Loïc's bottom line: don't store useless data! He suggested we shouldn't stop doing research, we just have to do it better. Incentivising and addressing sustainability in Data Management Plans from the outset of projects could help. Artificial Intelligence (AI) is predicted to help combat environmental impacts in the future, although, as AI itself comes at a large environmental cost, whether any benefit will outweigh the impact is still unknown. Loïc has written a paper, Ten simple rules to make your computing more sustainable, and he recommends looking at the Green DiSC Certification, a free, open-access roadmap for anyone working in (dry lab) research to learn how to be more sustainable.
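To make the storage figure concrete, here is a minimal sketch of how such an estimate could be computed. The constant and function below are illustrative assumptions based only on the ~10 kg CO2e per TB per year figure quoted above; this is not the Green Algorithms methodology itself, and real footprints vary widely with hardware, location and replication.

```python
# A minimal sketch of a storage-footprint estimate, using the ~10 kg CO2e
# per TB per year figure quoted in the talk. The function name and the
# "copies" parameter are illustrative assumptions for this example only.

KG_CO2E_PER_TB_YEAR = 10  # rough estimate quoted above; real values vary widely


def storage_footprint_kg(terabytes: float, years: float, copies: int = 1) -> float:
    """Estimate kg CO2e for keeping `terabytes` of data for `years`, with `copies` replicas."""
    return terabytes * years * copies * KG_CO2E_PER_TB_YEAR


if __name__ == "__main__":
    # e.g. a 5 TB dataset kept for 10 years in 3 replicated locations
    print(f"{storage_footprint_kg(5, 10, copies=3):.0f} kg CO2e")  # -> 1500 kg CO2e
```

Even with such crude numbers, the exercise makes the point above: data that no one will use again still accrues a footprint every year it is kept.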

The Shift to Digital Publishing 

Next to present was Andri Johnston, Digital Sustainability Lead at CUP&A. Andri explained that her role was newly created to address the carbon footprint of the digital publishing environment at CUP&A. Publishing has shifted from print to digital, but once content is published digitally, what can be done to make it more sustainable? CUP&A are committed to being carbon zero by 2048, aiming for a 72% reduction by 2030. Since an estimated 43% of digital emissions across the wider technology sector come from digital products such as software, CUP&A have been looking at how they can create their digital products more sustainably. They have been investigating methods to calculate digital emissions by looking at their hardware and cloud hosting, which is mostly Amazon Web Services (AWS), although they also use some Cambridge data centres. Andri explained that it has been hard to find information on the emissions of AWS data centres, and that it is hard to pinpoint whether users are on a fixed-line or cellular internet connection (some cellular network towers use backup diesel generators, which have a higher environmental impact). AWS doesn't supply accurate information on the emissions of using their services, and Andri is fully aware that they are using data to get data!

Andri introduced the DIMPACT (digital impact) project, in which the DIMPACT tool is used to report and better understand the carbon emissions of platforms serving digital media products. Carbon emissions of the academic publishing websites at CUP&A have fallen in the last year as the team has looked at where improvements can be made. CUP&A want to publish more and allow more people to access their content globally, but this needs to be done sustainably, without increasing the websites' carbon emissions. Page weight is also something to consider: heavy pages, for example those containing video, can be difficult to download for people in areas with low bandwidth, so this needs to be taken into account when designing them. The Sustainable web design guide for 2023, produced with Wholegrain Digital, can be downloaded for free. Andri mentioned that in future they need to be aware of the impact of AI, as it is becoming a significant part of publishing and academia and will increase energy consumption.
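As a rough illustration of why page weight matters for digital emissions, the sketch below estimates the transfer-related emissions of serving a page. The energy-per-GB and grid-intensity constants are assumptions invented for this example; they are not figures from DIMPACT or the Sustainable web design guide.

```python
# A rough, illustrative sketch linking page weight to transfer emissions.
# Both constants below are assumptions for the sake of the example.

ENERGY_KWH_PER_GB = 0.8     # assumed end-to-end energy per GB transferred
GRID_KG_CO2E_PER_KWH = 0.3  # assumed average grid carbon intensity


def page_view_emissions_g(page_weight_mb: float, monthly_views: int) -> float:
    """Estimate grams of CO2e per month for serving a page of a given weight."""
    gb_transferred = page_weight_mb / 1024 * monthly_views
    return gb_transferred * ENERGY_KWH_PER_GB * GRID_KG_CO2E_PER_KWH * 1000  # kg -> g


if __name__ == "__main__":
    print(f"{page_view_emissions_g(2.0, 100_000):,.0f} g CO2e/month")  # ~2 MB page
    print(f"{page_view_emissions_g(1.0, 100_000):,.0f} g CO2e/month")  # ~1 MB page
```

Under this simple model, halving the page weight halves the transfer-related emissions for the same traffic, which is one reason lighter pages benefit both low-bandwidth users and the environment.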

Andri concluded by summarising that in academic publishing, they will always be adding more content such as videos and articles for download. It is likely that researchers may need to report on the carbon impact of research in the future, but the question on how best to do this is still to be decided. The impact of downloaded papers is also a question that the industry is struggling with, for example how many of these papers are read and stored. 

Digital Preservation: Promising for the Environment and Scholarship  

Alicia Wise, Executive Director at CLOCKSS, gave us an overview of the infrastructure in place to preserve scholarship for the long term, which is vital if we are to be able to reliably refer back to past research. Alicia explained that there is growing awareness of the need to consider sustainability during preservation. When print publishing was the main research output, preservation was largely taken care of by librarians; in a digital world it is now undertaken by digital archives such as CLOCKSS. The plan is to archive research for future generations 200-500 years from now!

CLOCKSS was founded in the 1990s to solve the problem of digital preservation. There is now a growing collection of digitally archived books, articles, data, software and code. CLOCKSS consists of 12 mirror repository sites located across the world, all of which contain the same copies. The 12 sites are in constant communication, using a self-healing network to restore the original if a problem is detected. CLOCKSS currently store 57.5 million journal articles and 530,500 books.
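To give a flavour of what "self-healing" means in practice, here is a toy sketch of checksum-based fixity checking across mirrored copies. This is an illustration only, not the actual LOCKSS/CLOCKSS polling protocol: it simply shows how mirrors can detect a corrupted copy by comparing digests and restore it from the copies that agree.

```python
# A toy illustration of checksum-based fixity checking across mirrored copies.
# Not the CLOCKSS protocol, just the general idea of majority-based repair.
import hashlib
from collections import Counter


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def heal(copies: list[bytes]) -> list[bytes]:
    """Replace any copy whose checksum disagrees with the majority."""
    digests = [sha256(c) for c in copies]
    consensus_digest, _ = Counter(digests).most_common(1)[0]
    good = next(c for c, d in zip(copies, digests) if d == consensus_digest)
    return [c if d == consensus_digest else good for c, d in zip(copies, digests)]


if __name__ == "__main__":
    mirrors = [b"article v1"] * 11 + [b"article v1 (bit rot)"]  # one corrupted copy
    repaired = heal(mirrors)
    assert all(c == b"article v1" for c in repaired)
    print("all 12 copies consistent after healing")
```

Every such comparison has a computational cost, which is relevant to the point below that integrity checking accounts for most of CLOCKSS's carbon emissions.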

CLOCKSS are a dark archive: they don't provide access unless it is needed, such as when a publisher goes out of business or a repository goes down. If this happens, the lost material is made open access. CLOCKSS have been working with the DIMPACT project to map and calculate their carbon footprint, looking at the servers at all 12 repository sites to estimate the environmental impact. It became clear that not all the sites are equal. The best was the site at Stanford University, where the majority of the CLOCKSS machines are located. Stanford has a high renewable energy profile, largely thanks to the local climate, and even has its own solar power plant! It also has a renewable, recirculating, chilled underground water system for cooling the servers. The site at Indiana University was the worst performing, as its electricity supply is around 70% coal; it is estimated to emit 9 tonnes of carbon per month (equivalent to a fleet of 20 petrol cars).

Alicia explained that most of the carbon emissions come from integrity checking (the self-healing network). CLOCKSS's mission is to reduce these emissions, and they are investigating whether reducing the number of repository sites to six copies would still give confidence that content will remain preserved in 500 years' time. They are also reviewing what they need to keep, and informing publishers of their contribution so that publishers can take this impact into account.

Alicia summarised by saying that digital preservation appears to have a lower carbon footprint than print preservation. CLOCKSS are working with the Digital Preservation Coalition (and DIMPACT) to help other digital archives reduce their footprint too, and are finalising a general emissions-calculation tool that other archives can use. They don't want to discourage long-term preservation: currently, 25% of academic journals are not preserved anywhere, which puts future access to scholarship at risk. They want to encourage preservation, but in an environmentally friendly way.

Preserving for the future at the University of Cambridge 

There are many factors that could impact data remaining accessible now and over time. Digital Preservation maintains the integrity of digital files and ensures ongoing access to content for as long as necessary. Caylin Smith, Head of Digital Preservation at Cambridge University Libraries, gave an overview of the CUL Digital Preservation Programme that is changing how the Libraries manages its digital collection materials to ensure they can be accessed for teaching, learning, and research. These include the University’s research outputs in a wide range of content types and formats; born-digital special collections, including archives; and digitised versions of print and physical collection items.  

Preserving and providing access to data, as well as using cloud services and storing multiple copies of files and metadata, all have an environmental impact. Monitoring the usage of cloud services and appraising content are two ways of working towards more responsible digital preservation. Within the Programme, the team is delivering a Workbench, a web user interface for curatorial staff to undertake collection management activities, including appraising files and metadata deposited to archives. This work will help confirm whether deposited files, whether removed from a storage carrier or securely uploaded, need to be preserved long term. Curatorial staff will also be alerted to potential duplicate files, be able to export metadata for creating archival records, and create an audit trail of appraisal activities before files are moved to the preservation workflow and storage.
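One of those alerts, flagging potential duplicates, is commonly implemented by hashing file contents. The sketch below is a hypothetical illustration of that general approach, not the CUL Workbench implementation; the directory name and function are invented for the example.

```python
# A hypothetical sketch of flagging potential duplicates during appraisal by
# hashing file contents. Illustrative only; not the CUL Workbench code.
import hashlib
from collections import defaultdict
from pathlib import Path


def find_potential_duplicates(deposit_dir: str) -> dict[str, list[Path]]:
    """Group files in a deposit by SHA-256 digest; groups of more than one file are potential duplicates."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(deposit_dir).rglob("*"):
        if path.is_file():
            groups[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    for digest, paths in find_potential_duplicates("deposit/").items():
        print(digest[:12], [str(p) for p in paths])
```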

Within the University Library, where the Digital Preservation team is based, there may be additional carbon emissions from computers kept on overnight to run workflows, as well as e-waste (although some devices that become obsolete may still be useful for reading data from older carriers, e.g. floppy disk drives). Caylin explained that CUL pays for the cloud services and storage used by the Digital Preservation infrastructure, which can be scaled up and down as needed. They are considering whether an offline backup is needed, and weighing up whether the benefit of having one would outweigh the costs and energy consumption.

Caylin discussed what they and other researchers could do to reduce their environmental impact: use the available tools to estimate the personal carbon footprint and associated costs of research, and minimise access to data where appropriate in order to minimise the use of computing. Ideally, data centres and cloud computing suppliers should publish green credentials so that researchers can make informed choices. There is also a case for choosing second-hand equipment and repairing equipment where possible. At Cambridge we have the Research Cold Store, which is low energy as it uses tapes and robots to store dark data, although the question remains as to whether this is really more energy efficient in the long term.

What could help reduce the impact of research data on the environment? 

The afternoon session at the workshop involved group work to discuss two extreme hypothetical mandated scenarios for research data preservation. This allowed the pros and cons of each scenario to be explored, including how each could affect sustainability and what problems could arise. We will use the information gathered in this session to consider what is possible right now to help researchers at the University of Cambridge make informed choices about research data sustainability. Suggestions that could reduce research data storage (and its carbon footprint) included improving the documentation and metadata of files, regularly appraising files as part of weekly tasks, and making data open to prevent duplication of research. It could also help to address environmental sustainability at the start of projects, for example in a Data Management Plan.

We learned in this workshop that research data can have an environmental impact and that, as computing capabilities expand, this impact is likely to increase. There are now tools available to help estimate research carbon footprints. We also need stakeholders (e.g. publishers and funders) to work together to press relevant companies to provide transparent information, so that researchers can make informed choices about managing their research data more sustainably.


The art of software maintenance

When it comes to software management there are probably more questions than answers – that was the conclusion of a recent workshop hosted by the Office of Scholarly Communication (OSC) as part of a national series on software sustainability, sharing and management, funded by Jisc. The presentations and notes from the day are available, as is a Storify of the tweets.

The goal of these workshops was to flesh out the current problems in software management and sharing and try to identify possible solutions. The researcher-led nature of this event provided researchers, software engineers and support staff with a great opportunity to discuss the issues around creating and maintaining software collaboratively and to exchange good practice among peers.

Whilst this might seem like a niche issue, an increasing number of researchers are reliant on software to complete their research, and for them the paper at the end is merely an advert for the research it describes. Stephen Eglen described this in his talk as an ‘inverse problem’ – papers are published and widely shared but it is very hard to get to the raw data and code from this end product, and the data and code are what is required to ensure reproducibility.

These workshops were inspired by our previous event in 2015, where Neil Chue Hong and Shoaib Sufi spoke with researchers at Cambridge about software licensing and Open Access. Since then the OSC has had several conversations with Daniela Duca at Jisc and together we came up with an idea of organising researcher-led workshops across several institutions in the UK.

Opening up software in a ‘post-expert world’

We began the day with a keynote from Neil Chue Hong from the Software Sustainability Institute, who outlined the difficulties and opportunities of being an open researcher in a 'post-expert world' (the slides are available here). Reputation is crucial to a researcher's role, and researchers therefore seek to establish themselves as experts. On the other hand, this expert reputation can be tricky to maintain, since making mistakes is an inevitable part of research and discovery, which is poorly understood outside of academia. Neil introduced Croucher's Law to help us understand this: everyone will make mistakes, even an expert, but an expert will be aware of this and so will automate and share their work as much as possible.

Accepting that mistakes are inevitable in many ways makes sharing less intimidating. Papers are retracted regularly due to errors and Neil gave examples from a variety of disciplines and career stages where people were open about their errors so their communities were accepting of the mistakes. In fact, once you accept that we will all make mistakes then sharing becomes a good way to get feedback on your code and to help you fix bugs and errors.

This feeds into another major theme of the workshop which Neil introduced: that researchers need to stop aiming for perfect and adopt 'good enough' software practices for achievable reproducibility. This recognises that one of the biggest barriers to sharing is the time it takes to learn software skills and prepare data to the 'best' standards. Good enough practices mean accepting that your work may not be reproducible forever, but that it is more important to share your code now so that it is at least partially reproducible now. Stephen Eglen built on this with his paper 'Towards standard practices for sharing computer code and programs in neuroscience', which includes providing data, code and tests for your code, and using licences and DOIs.
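In the spirit of 'good enough' practice, even a single small test shipped alongside analysis code goes a long way. The example below is a hypothetical, minimal illustration (the function and test are invented for this post): a toy analysis step with a pytest-style check that documents its expected behaviour.

```python
# A hypothetical, minimal "good enough" example: a toy analysis step plus a
# pytest-style test that records what the function is supposed to do.
def normalise(values: list[float]) -> list[float]:
    """Scale values so they sum to 1 (a toy analysis step)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise an all-zero input")
    return [v / total for v in values]


def test_normalise_sums_to_one():
    result = normalise([2.0, 3.0, 5.0])
    assert abs(sum(result) - 1.0) < 1e-9
    assert result == [0.2, 0.3, 0.5]


if __name__ == "__main__":
    test_normalise_sums_to_one()
    print("tests passed")
```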

Both speakers and the focus groups in the afternoon highlighted that political work is needed, as well as cultural change, to normalise code sharing. Many journals now ask for evidence of the data which supports articles and the same standards should apply to software code. Similarly, if researchers ask for access to data when reviewing articles then it makes sense to ask for the code as well.

Automating your research: Managing software

Whilst sharing code can be seen as the end of the lifecycle of research software, writing code with the intention of sharing it was repeatedly highlighted as a good way to make sure it is well-written and documented. This was one of several ‘selfish’ reasons to share, where sharing also helps the management of software, through better collaboration, the ability to track your work and being able to use students’ work after they leave.

Croucher's Law demonstrates one of the main benefits of automating research through software: the ability to track mistakes, which improves reproducibility and makes fixing them easier. Lots of tools were mentioned throughout the day to assist with managing software, from the well-known version control and collaboration platform GitHub to more dynamic tools such as Jupyter notebooks and Docker. As well as these technical tools there was also discussion of more straightforward ways to maintain software, such as getting a code buddy who can test your code and creating appropriate documentation.

Despite all of these tools and methods to improve software management it was recognised by many participants that automating research through software is not a panacea; the difficulties of working with a mix of technical and non-technical people formed the basis of one of the focus groups.

Sustaining software

Managing software appropriately allows it to be shared, but re-using it in the long (or even medium) term means putting time into sustaining the code and making sure it is written in a way that others can understand. The main recommendations from our speakers and focus groups for ensuring sustainability were to use standards, create thorough documentation and embed extensive comments within your code.

As well as thinking about the technical aspects of sustaining software, there was also discussion of what is required to motivate people to make their code re-usable. Contributing to a community seemed to be a big driver for many participants, so finding appropriate collaborators is important. However, larger incentives are needed, and creating and maintaining software is not currently well rewarded as an academic endeavour. Suggestions to rectify this included more software-oriented funding streams, counting software as an output when assessing academics, and creating a community of software champions to mirror the Data Champions scheme we recently started in Cambridge.

Next steps

This workshop was part of a national discussion around research software, so we will be looking at the outcomes of other workshops and at wider actions the Office of Scholarly Communication can support to facilitate sharing and sustaining research software. Apart from Cambridge, five other institutions held similar workshops (Bristol, Birmingham, Leicester, Sheffield, and the British Library). As a next step, the organisers of these events want to meet to discuss the key issues raised by researchers, to see what national steps should be taken to better support the community of researchers and software engineers, and to consider whether any remaining problems with software could require a policy intervention.

However, following the maxim to ‘think global, act local’, Neil’s closing remarks urged everyone to consider the impact they can have by influencing those directly around them to make a huge difference to how software is managed, sustained and shared across the research community.

Published 29 January 2017
Written by Rosie Higman