Data Diversity Podcast (#4) – Dr Stefania Merlo (1/2) 

Welcome back to the fourth instalment of Data Diversity, the podcast where we speak to Cambridge University Data Champions about their relationship with research data and highlight their unique data experiences and idiosyncrasies in their journeys as researchers. In this edition, we speak to Data Champion Dr Stefania Merlo from the McDonald Institute for Archaeological Research, the Remote Sensing Digital Data Coordinator and project manager of the Mapping Africa’s Endangered Archaeological Sites and Monuments (MAEASaM) project, and coordinator of the Metsemegologolo project. This is the first of a two-part series and in this first post, Stefania shares with us her experiences of working with research data and outputs that are part of heritage collections, and how her thoughts about research data and the role of the academic researcher have changed throughout her projects. She also shares her thoughts about what funders can do to ensure that research participants, and the data that they provide to researchers, can speak for themselves.


I’ve been thinking for a while about the etymology of the word data. Datum in Latin means ‘given’. Whereas when we are collecting data, we always say we’re “taking measurements”. Upon reflection, it has made me come to a realisation that we should approach data more as something that is given to us and we hold responsibility for, and something that is not ours, both in terms of ownership, but also because data can speak for itself and tell a story without our intervention – Dr Stefania Merlo


Data stories (whose story is it, anyway?) 

LO: How do you use data to tell the story that you want to tell? To put it another way, as an archaeologist, what is the story you want to tell and how do you use data to tell that story?

SM: I am currently working on two quite different projects. One is Mapping Africa’s Endangered Archaeological Sites and Monuments (funded by Arcadia), which aims to create an Open Access database of information on endangered archaeological sites and monuments in Africa. In the project, we define “endangered” very broadly because, ultimately, all sites are endangered. We’re doing this with a number of collaborators, and the objective is to create a database that is mainly going to be used by national authorities for heritage management. There’s a little less storytelling there; it has more to do with intellectual property: who are the custodians of the sites and the custodians of the data? A lot of questions are asked about Open Access, which the funders of the project have requested but which our stakeholders have a lot of issues with. The issues surround where the digital data will be stored, because at the moment it is stored temporarily in Cambridge. Ideally, all our stakeholders would like to see it stored on a server in the African continent at the least, if not in their own country. There are a lot of questions around this. 

The other project stems out of the work I’ve been doing in Southern Africa for almost 20 years, and it asks: how do you articulate knowledge of the African past that is not represented in history textbooks? This is a history that is rarely taught at university and is rarely discussed. How do you avail knowledge to publics that are not academic publics? That’s where the idea came from of creating a multimedia archive and a platform where digital representations of archaeological, archival, historical and ethnographic data could be used to put together stories that are not the mainstream stories. It is a work in progress. The datasets that we deal with are very diverse, because that is what is required to tell a history in a place, and in periods, for which we don’t have written sources.  

It’s so mesmerising and so different from what we do in contexts where history is written. It gives us the opportunity to put together so many diverse types of sources: from oral histories to missionary accounts (with all the issues around colonial reports and representations of others as they were perceived at the time), to information on the past environment, combined with archaeological data. We have a collective of colleagues who work in universities and museums. Each performs different bits and pieces of research, and we are trying to see how we would put together these types of datasets. How much do we curate them to avail them to other audiences? We’ve used the concept of data curation very heavily, and we use it purposefully, because there is an impression of the objectivity of data, and we know, especially as social scientists, that this just doesn’t exist. 

I’ve been thinking for a while about the etymology of the word data. Datum in Latin means ‘given’. Whereas when we are collecting data, we always say we’re taking measurements. Upon reflection, it has made me come to a realisation that we should approach data more as something that is given to us and we hold responsibility for, and something that is not ours, both in terms of ownership, but also because data can speak for itself and tell a story without our intervention. That’s the kind of thinking surrounding data that we’ve been going through with the project. If data are given, our work is an act of restitution, and we should also acknowledge that we are curating it. We are picking and choosing what we’re putting together and in which format and framework. We are intervening a lot in the way these different records are represented so that they can be used by others to tell stories that are perhaps of more relevance to us. 

So there’s a lot of work in this project that we’re doing about representation. We are explaining – not justifying, but explaining – the choices that we have made in putting together information that we think could be useful to re-create histories and tell stories. The project will benefit us because we are telling our own stories using digital storytelling, and in particular story mapping, but it could also become useful for others as a resource they can use to tell their own stories. It’s still a work in progress because we also work in low-resourced environments. The way in which people can access digital repositories and then use online resources is very different in Botswana and in South Africa, which are the two countries I mainly work with in this project. We also dedicate time to thinking about how useful the digital platform will be for the audiences we would like to engage. 

The intended output is an archive that can be used in a digital storytelling platform. We have tried to narrow down our target audience to secondary school and early university students of history (and archaeology). We hope that the platform will eventually be used more widely, but we realised that we had to identify an audience to be able to prepare the materials. We have also realised that we need to give guidance on how to use such a platform, so in the past year we have worked with museums and learnt from museum education departments about using the museum as a space for teaching and learning, where some of these materials could become useful. Teachers and museum practitioners don’t have a lot of time to create their own teaching and learning materials, so we’re trying to engage with them in a way that doesn’t overburden them. For these reasons, more intervention needs to come from our side in pre-packaging some of these curations, but we’re trying to do it in collaboration with them so that it’s not something solely produced by us academics. We want this to be something that is negotiated. As archaeologists and historians, we have expertise on a particular part of African history that the communities that live in that space may not know about and cannot know, because they were never told. They may have learned about the history of these spaces from their families and their communities, but they have learned only certain parts of the history of that land, whereas we can go much deeper into the past. So, the question becomes: how do you fill the gaps in knowledge without imposing your own worldview? It needs to be negotiated, but it’s a very difficult process to establish. There is a lot of trial and error, and we still don’t have an answer. 

Negotiating communities and funders 

LO: Have you ever had to navigate funders’ policies and stakeholder demands?  

SM: These kinds of projects need to be long and they need continuous funding, but they have outputs that are not always valued by funding bodies. This brings to the fore what funding bodies are interested in – is it solely data production, as it is called, and then the writing up of certain academic content? Or can we start to acknowledge that there are other ways of creating and sharing knowledge? As we know, there has been a drive, especially with UK funding bodies, to acknowledge that there are different ways in which information and knowledge are produced and shared. There are alternative ways of knowledge production, from artistic to creative ones and everything in between, but it is still so difficult to account for the types of knowledge production that these projects may have. When I’m reporting on projects, I still find it cumbersome and difficult to represent these types of knowledge production. There’s so much more that you need to do to justify an alternative output compared to a traditional one. I think there needs to be change that makes it easier, not harder, for researchers who produce alternative forms of knowledge to account for them, compared with the mainstream. 

One thing I would say is that there’s a lot we’ve learned with the (Mapping Africa’s Endangered Archaeological Sites and Monuments) project, because there we engage directly with the custodians of the sites and of the analogue data. When they realise that the funders of the project expect to have these data openly accessible, the questions come and the pushback comes, and it’s a pushback on a variety of different levels. The consequence is that we still haven’t been able to finalise our agreements with the custodians of the data. They trust us, so they have informed us that in the interim we can have the data as a project, but we haven’t been able to come to an agreement on what is going to happen to the data at the end of the project. In fact, the agreement at the moment is that the data are not going to be made completely Open Access. The negotiation now is about what they would be willing to make public, and what advantages they, as custodians of the data, would have in making part, or all, of these data public.

This has created a disjuncture between what the funders thought they were doing and what is acceptable to our stakeholders. I’m sure they thought they were doing good by mandating that the data be Open Access, but perhaps they didn’t consider that in other parts of the world, Open Access may not be desirable, or wanted, or acceptable, for a variety of very valid reasons. It’s a knot that we still haven’t resolved and it makes me wonder: when funders are asking for Open Access, have they really thought about work outside of UK contexts, with communities outside of the UK context? Have they considered these communities’ rights to data and their right to say, “we don’t want our data to be shared”? There’s a lot of work that has happened in North America in particular, because indigenous communities are the ones that put forward the concept of CARE, but in the UK we are still very much discussing FAIR and not CARE. I think the funders may have started thinking about it, but we’re not quite there. There is still this impression that Open Data and Open Access are a universal good, without having considered that this may not be the case. It puts researchers who don’t work in the UK or the Global North in an awkward position. This is definitely something that we are still grappling with very heavily. My hope is that this work is going to help highlight that when it comes to Open Access, there are no universals. We should revisit these policies in light of the fact that we are interacting with communities globally, not only those in some countries of the world. Who is Open Access for? Who does it benefit? Who wants it and who doesn’t want it, and for what reasons? These are questions that we need to keep asking ourselves. 

LO: Have you been in a position where you had to push back on funders or Open Access requirements before? 

SM: Not necessarily a pushback, but our funders have funded a number of similar projects in South Asia, Mongolia, Nepal and the MENA region, and we have come together as a collective to discuss issues around the ethics and the sustainability of the projects. We have engaged with representatives of our funders to try to explain that what they wanted initially, which was full Open Access, may not be practicable. In fact, there has already been a change in the terminology used by the funders: from Open Access, they have moved to the concept of Public Access, and they have come back to us to say that they can change their contractual terms to be more nuanced and to acknowledge the fact that we are in negotiation with national stakeholders and other stakeholders about what should happen to the data. Some of this has been articulated in various meetings, but some of it was trial and error on our side. In other words, with our new proposal for renewal of funding, which was approved, we simply included these nuances in the proposal and in our commitments, and they were accepted. So in the course of the past four years, through lobbying by the funded projects, we have been able to bring nuance to the way in which the funders themselves think about Open Access. 


Stay tuned for part two of this conversation where Stefania will share some of the challenges of managing research data that are located in different countries!


Enriching the institutional scholarly record: Octopus outputs in repositories via Publications Router

Written by Dr Alexandra Freeman, Tim Fellows, Dr Agustina Martínez-García

Researchers at universities across the country are constantly reminded that when they publish work in journals (or elsewhere) they must also deposit the accepted version of their article in their university’s repository. This ensures that this version of the article can be made freely available to the world (so-called ‘Green Open Access’), but it also allows the university to keep track of its researchers’ outputs, including reporting to the all-important REF (Research Excellence Framework) exercise, which takes place every few years and helps determine future government funding for each institution. It’s important that researchers deposit their works, but time-consuming for everyone involved.

In order to help automate this process, a service called Publications Router takes information from journals and other traditional academic content providers and automatically passes it along to current research information systems (CRISs), repositories, and other relevant institutional systems based upon the affiliations of each article’s authors. But Publications Router did not serve the needs of more innovative alternative publishing platforms, meaning that universities were not automatically notified of works published on these platforms. Given that these tend to be Open Access publications, their absence from the REF assessment exercise (which counts the proportion of Open Access publications from each institution) is a significant loss.
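Although the exact content of each notification varies, a Router notification would typically be expected to carry standard bibliographic metadata – for example the title, author names and affiliations, identifiers such as the DOI, publication date and licence information – which the receiving repository or CRIS can then match against its own researchers and records.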

Thanks to the support of the Arcadia fund, the open research platform Octopus.ac is now the first alternative publishing platform integrated with Publications Router.

Octopus is designed to sit alongside journals as the place where researchers share their work in full detail. Instead of publishing a ‘paper’, Octopus allows researchers to share their work as a series of small, interconnected publications that each represent a different stage of the research process. This allows researchers to follow best practice in research, such as publishing a method before it is carried out, and to allow others to carry out linked research (such as an alternative analysis of data, or a replication of a method).

The institution that drove forward the integration was the University of Cambridge. When a Cambridge-affiliated researcher publishes their work on Octopus, a copy of the publication – along with key metadata – is now directed to Cambridge’s institutional repository, Apollo, via Publications Router.

As the structure of Octopus’ publications is different from that of a traditional research paper, one of the key challenges has been how Octopus can integrate with existing research systems designed around the traditional formats. With that issue now solved, Octopus publications can much more easily be formally recognised as part of a researcher’s academic record.

Whilst the integration has been developed by Octopus and Cambridge, it has been built so that other institutions using Publications Router can take advantage of it, meaning the feature is now available to numerous institutions across the UK. The integration reuses and adapts existing infrastructure (Publications Router), increases interoperability between innovative research publishing tools like Octopus and institutional systems, and removes barriers to depositing research outputs into research repositories. Moreover, it provides key benefits to researchers and institutions:

  • By depositing Octopus outputs into repositories, we add redundancy to ensure that additional copies will be accessible and preserved long-term.
  • The integration helps institutions track the impact of their research and keep an accurate scholarly record.
  • The integration facilitates funder compliance and institutional reporting as information about Octopus outputs is made available to institutional research information systems in an automated manner.

Find out more about this integration at https://research.jiscinvolve.org/wp/2025/01/16/octopus-is-now-delivering-records-via-publications-router/

5000 datasets now in Apollo

Written by Clair Castle, Dr Kim Clugston, Dr Lutfi Bin Othman, Dr Agustina Martínez-García. 

 How the ‘second life’ of datasets is impacting the research world. Researchers share their stories.

“Research data is the evidence that underpins all research findings. It’s important across disciplines: arts, humanities, social sciences, and STEMM. Preserving and sharing datasets, through Apollo, advances knowledge across research, not only in Cambridge, but across the world – furthering Cambridge’s mission for society and our mission as a national research library.”

Dr Jessica Gardner, University Librarian & Director of Library Services

The research data produced and collected during research takes many different forms: numerical data, digital images, sound recordings, films, interview transcripts, survey data, artworks, texts, musical scores, maps, and fieldwork observations. Apollo collects them all.  

Apollo is the University of Cambridge repository for research datasets. Managed by the Research Data team at Cambridge University Library, Apollo stores and preserves the University’s research outputs.  

The Research Data team guides researchers through all aspects of research data management – how to plan, create, organise, curate and share research materials, whatever form they take – and assists researchers in meeting funders’ expectations and good research practice.  

In this blog post, upon reaching our 5000 datasets milestone, we share researcher stories about the impact their datasets have had, and continue to have, across research – and explain how researchers at the University can benefit from depositing their datasets on Apollo.

“Sharing data propels research forward. It recognises the importance of the original datasets in their own right, and the researchers who worked on them. Many of the research funders, supporting work at the University of Cambridge, require that research data is made openly available with as few restrictions as possible. Our researchers are fully supported to do this with Apollo and the Research Data team. I’m really excited that Apollo has reached the 5000 dataset milestone.” 

Professor Sir John Aston, Pro Vice-Chancellor for Research at the University of Cambridge 

Why should researchers share their research outputs on a repository?  

Making research data openly available is recognised as an important aspect of research integrity and has, in recent years, garnered support from funders, publishers and researchers. Open data supports the FAIR principles, and many funders now include data sharing practices within their policies as part of the application process. Publishers and funders often require a data availability statement (DAS) to be included in publications. There are situations where data cannot be shared, particularly where the data contain personal or sensitive information or where there is no permission to share them; such restrictions are worth stating explicitly, including in the DAS. But a lot of data can be shared, and this movement towards open data promotes greater trust, both among researchers and in engagement with the general public.   
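For illustration, a simple DAS for data deposited in Apollo might read something like the following (the wording and the DOI suffix are placeholders, not a required template): “The data supporting the findings of this study are openly available in Apollo, the University of Cambridge repository, at https://doi.org/10.17863/CAM.xxxxx.”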

Illustration of why it is good to share research data. The illustration is explained in the text of the blog immediately below.

In the UK, funding bodies often mandate the open sharing of data supporting the research they fund. A large proportion of research funding comes from taxpayers’ money or charity donations, so making data openly available for reuse provides value for money. It also allows the data behind claims to be accessed for traceability, transparency and reproducibility. Open data increases efficiency, as it prevents work that may already have been done from being repeated; for this reason, researchers are encouraged to publish negative results too. Publishing data gives researchers credit for the work they have done, giving them more visibility in their field and increasing the discoverability of their research, which could lead to potential collaborations and increased citations. Open data also means that researchers have access to valuable datasets that could educate, enhance and further their research when applied by practitioners worldwide.  

The second life of data   

Apollo supports data from all disciplines, and this is reflected in the variety of formats that the repository holds in its collection – from movie files, images, audio recordings and code to the more common text and CSV files. The repository now also hosts methods. Researchers are encouraged to deposit these outputs in the repository to facilitate the impact and reuse of the data underlying their research, so that their research data can be cited as a form of scholarly output in their own right. In 2023, there were over 95,000 views of datasets, software and associated metadata items on Apollo, and over 37,000 files were downloaded (source: IRUS). This shows that datasets and software deposited on Apollo are easy to discover and are highly used.  

One example is a dataset deposited by Douglas Brion at the end of his PhD in the Department of Engineering. Brion’s dataset, titled Data set for “Generalisable 3D printing error detection and correction via multi-head neural networks”, has been downloaded 2,600 times. It has also been featured in 20 online news publications (including the University’s Research blog) and has an Altmetric attention score of 151. Brion’s dataset is also one of the larger outputs on Apollo, comprising over 1.2 million labelled images and over 900,000 pre-filtered images.   

The open availability of Brion’s data, which can be used to train AI (a significant current direction for research), is welcomed by researchers such as AI specialist Bill Marino, a PhD candidate and Data Champion from the Department of Computer Science and Technology: “It’s really important that AI researchers are able to reproduce each other’s findings. The opaque nature of some of today’s AI models means that access to data is a key ingredient of AI reproducibility. This effort really helps get us there.”   

Brion considers that sharing his data “has significantly enhanced the impact and reach” of his research and that “it has increased the visibility and credibility of my work, as other scientists can validate and build upon my findings.” On the benefits of depositing data on a repository, he says that sharing “ensures that the data is preserved and accessible for the long term, which is crucial for reproducibility and transparency in research”. He adds, “Repositories often provide metadata and tools that make it easier for other researchers to find and use the data”, which “promotes a culture of openness and collaboration, which can accelerate scientific discovery and innovation.”  

Photo of a researcher searching Apollo, the University of Cambridge repository, on a computer.

Research data supporting “Regime transitions and energetics of sustained stratified shear flows” is a dataset from another depositor, Adrien Lefauve, from the Department of Applied Mathematics and Theoretical Physics, and consists of MATLAB code and accompanying movie files. Lefauve is, in fact, a frequent depositor, with 10 datasets published in Apollo. He considers that data sharing gives his data “a second life” by allowing researchers to reuse it in pursuit of new projects, but admits that “there is also a selfish reason for doing it!”. He explains: “After several months or years without having worked on a dataset, I sometimes need to go back to it, either by myself or when I hand it over to a colleague or student to test new ideas. Having a well-structured, user-friendly and thoroughly documented dataset is invaluable and will save you a lot of time and frustration when you need to resurrect your own research.”  

Lefauve’s dataset has been cited in other publications and he encourages other researchers to look at his datasets and reuse them: “When people see that datasets can be cited in their own right and attract citations, it can encourage them to make the extra effort to deposit their data”. Lefauve is an advocate for sharing data on a repository, and in his view data sharing is “not only important for research integrity and reproducibility, but it also ensures that research funds are used efficiently. My datasets are usually from laboratory experiments which can take a lot of time and resources to perform. Hence, I feel there is a duty to ensure the data can be used to the fullest by the community. It also helps build a researcher’s profile and credentials as a valuable contributor to the community, beyond simply publication outputs, which often use only a small fraction of a dataset.”  

Lefauve describes his field (fluid mechanics) as one that has benefited from the explosion of open data that is made available to the research community, but he is also aware that for a dataset to be reused, it requires comprehensive documentation and curation. Lefauve hopes that sharing data in a repository “will become increasingly commonplace as the next generation is taught that this is an essential part of data-intensive research.”  

How to deposit data on Apollo, and why choose Apollo 

There are thousands of data repositories to which data can be submitted, so how do you choose the right one? Funders may specify a disciplinary or institutional repository (see re3data.org for a directory of repositories). Members of the University of Cambridge can deposit their data in the institutional repository, Apollo. Apollo has CoreTrustSeal certification, which means it has met the 16 requirements of a sustainable and trustworthy infrastructure. Research outputs can be deposited as several types, such as dataset, code or method.  

We have a step-by-step guide to uploading a dataset, which is submitted through Symplectic Elements, the University’s research information management system. There is also a helpful information guide about Symplectic Elements on the Research Information SharePoint site. The Research Data team are on hand to help researchers with any queries they might have during this process.    

The importance of good metadata  

Researchers may think that the files are the most important aspect of depositing a dataset, but we cannot emphasise enough the importance of providing good metadata (data about data) to go alongside the files. This is an area where we find researchers need some encouragement, but we hope that the experiences of the researchers featured above highlight how important good metadata is for their data. No one knows their data better than the person who generated it, so they are in the best position to describe it. A good description of a dataset enables users with no prior knowledge of the dataset to discover, understand and reuse the data correctly, avoiding misinterpretation, without having access to the paper it supports. Be aware that others may discover a dataset in isolation from the paper it supports: we recommend that researchers avoid referring to the paper or using the abstract of the paper to describe their dataset. An article abstract describes the contents of the article, not of the dataset. It can also be really useful for researchers to describe their methods and how their files are organised, for example by providing README files. These give the dataset context as to how the data were generated, collected and processed. Good metadata will also enhance a dataset’s discoverability.  
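As an illustrative sketch only (the exact contents will vary by discipline and dataset), a basic README might cover:

  • the dataset title, creators and contact details;
  • a short description of how the data were generated, collected and processed;
  • the folder structure and file naming conventions;
  • the software, code or instruments needed to open and reuse the files;
  • the licence under which the data are shared.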

Another benefit of sharing data on Apollo is that our datasets are indexed in Google’s Dataset Search, a search engine for datasets. It is best practice to cite any datasets used in research in the bibliography/reference list of the paper, thesis, etc. In fact, there is new guidance for Data Reuse on the Apollo website which describes how to use Apollo to discover research data and how to cite it. We advise that researchers start doing this now (if they don’t already) so that they get into a good habit: it will encourage others to do the same, make it a lot easier for others to reuse data, and help researchers receive recognition for it. Citation data for datasets are displayed on Apollo, and alongside this it is possible to track the attention that a dataset receives via an Altmetric Attention Score.   
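For illustration, a dataset citation generally includes the creators, year, title, repository and DOI. A hypothetical example (the names, title and DOI suffix are placeholders) might look like: Smith, J. and Jones, A. (2024). Research data supporting “Example article title” [Dataset]. Apollo - University of Cambridge Repository. https://doi.org/10.17863/CAM.xxxxx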

Apollo repository key milestones  

Illustration of Apollo repository key milestones represented as a timeline. The illustration is explained in the text of the blog immediately below.

Since its inception in 2016, when it started minting DOIs (Digital Object Identifiers), Apollo has continued to hit milestones and develop into the robust, safe and resilient repository infrastructure that it is today.  

Apollo has continued to support the FAIR principles by incorporating new and critical functionality to further enhance discovery, access and long-term preservation of the University research outputs it holds. Examples include integration with our CRIS (Current Research Information System), Symplectic Elements, to streamline the depositing process, and integration with Jisc Publications Router to automatically deposit metadata-rich records in Apollo (2016, 2019, 2021).  

2000 datasets had been deposited in Apollo by 2020. DOI versioning was enabled in 2023, and the repository began accepting more research output types than ever before, such as methods and preprints. A major milestone was reached in 2023 when Apollo achieved CoreTrustSeal certification and status as a trustworthy repository.  

The latest milestone will see research outputs published within Octopus, a novel approach to publication, preserved in Apollo together with associated publications and underpinning research datasets, to facilitate sharing and reuse (2024-25). In future, we want to develop our ability to collect and interpret data citation statistics for Apollo so we can better assess the impact of the research data generated at the University.  

How we can support researchers  

The Research Data team is here to help!   

We can be contacted by email at info@data.cam.ac.uk. Researchers can also request a consultation with us to discuss any aspect of their research data management (RDM) needs, including data management plans, data storage and backup, data organisation, data deposition and sharing, funder data policies, or to request bespoke training.   

Remember that there is also an amazing network of Data Champions that can be called upon for advice, particularly from a disciplinary perspective.  

We deliver regular RDM training as part of the Research Skills Programme.   

Finally, there is our Research Data website for comprehensive advice and information.