
Milestone – 1000 datasets in Cambridge’s repository

Last week, Cambridge celebrated a huge milestone – the deposit of the 1000th dataset to our repository Apollo since the launch of the Research Data Facility in early 2015. This is the culmination of a huge amount of work by the team in the Office of Scholarly Communication, in terms of developing systems, workflows, policies and through an extensive advocacy campaign. The Research Data team have run 118 events over the past couple of years and published 39 blogs.

In the past 12 months alone there have been 26000 downloads of the data in Apollo. In some cases the dataset has been downloaded many times – 170 – and the data has featured in news, blogs and Twitter.

An event was held at Cambridge University Library last week to celebrate this milestone.

   

Opening remarks

The Director of Library Services, Dr Jess Gardner, opened proceedings with a speech in which she noted “the Research Data Services and all who sail in her are at the core of our mission in our research library”.

Dr Gardner referred to the library’s long and proud history of collecting and managing research data that “began on vellum, paper, stone and bone”. The research data of luminaries such as Isaac Newton and Charles Darwin was on paper and, she noted “we have preserved that with great care and share it openly on line through our digital library.”

Turning to the future, Dr Gardner observed: “But our responsibility now is today’s researcher and today’s scientists and people working across all disciplines across our great university. Our preservation stewardship of that research data from the digital humanities across the biomedical is a core part of what we now do.”

“In the 21st century our support and our overriding philosophy is all about supporting open research and opening data as widely as possible,” she noted.  “It is about sharing freely wherever it is appropriate to do so”. [Dr Gardner’s speech is in full at the end of this post.]

Perspectives from a researcher

The second speaker was Zoe Adams, a PhD student at Cambridge who talked about the work she has done with Professor Simon Deakin on the Labour Regulation Index in association with the Centre for Business Research.

Ms Adams noted it was only in retrospect she could “appreciate the benefit of working in a collaborative project and open research generally”. She discussed how helpful it had been as an early career researcher to be “associated with something that was freely available”. She observed that few of her peers had many citations, and the reason she did was because “the dataset is online, people use the data, they cite the data, and cite me”.

Working openly has also improved the way she works, she explained, saying “It has given me a new perspective on what research should be about. …  It gives me a sense that people are relying on this data to be accurate and that does change the way you approach it.”

View from the team

The final speaker was Dr Lauren Cadwallader, Joint Deputy Head of the OSC with responsibility for the Research Data Facility, who discussed the “showcase dataset of the data that we can produce in the OSC” which is  taken from usage of our Request a Copy service.

Dr Cadwallader noted there has been an increase in the requests for theses over time. “This is a really exciting observation because the Board of Graduate studies have agreed that all students should deposit a digital copy of their thesis in our repository,” she said. “So it is really nice evidence that we can show our PhD students that by putting a copy in the repository people can read it and people do want to read theses in our repository.”

One observation was that several of the theses that were requested were written 60 years ago, so the repository is sharing older research as well. The topics of these theses covered algebra, Yorkshire evangelists and one of the oldest requested theses was written in 1927 about the Falkland Islands. “So there is a longevity in research and we have a duty to provide access to that research, ” she said.

Thanks go to…

The dataset itself is one created by the OSC team looking at the usage of our Request a Copy service. The analysis was undertaken by Peter Sutton Long, and we recently published a blog post about the findings.

The music played at the event was compiled by Tony Malone and covers almost 1000 years of music, from Laura Cannell’s reworking of Hildegard of Bingen, to Jane Weaver’s Modern Cosmology. There are acknowledgments to Apollo, and Cambridge too. The soundtrack is available for those interested in listening.

This achievement is entirely due to the incredible work of the team in the Research Data Facility and their ability to engage with colleagues across the institution, the nation and the world. In particular the vision and dedication of Dr Marta Teperek cannot be overstated.

In the words of Dr Gardner: “They have made our mission different, they have made our mission better, through the work they have achieved and the commitment they have.”

The event was supported by the Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin.

 

 

Published 21 September 2017
Written by Dr Danny Kingsley
Creative Commons License

Speech by Dr Jess Gardner

First let us begin with some headline numbers. One thousand datasets. This is hugely significant and a very high level when looking at research repositories around the country. There is every reason to be proud of that achievement and what it means for open research.

There have been 26000 downloads of that data in the past 12 months alone – that is about use and reuse of our research data and is changing the face of how we do research. Some of these datasets have been downloaded 117 times and used in news, blogs and Twitter. The Research Data team have written 39 blogs about research data and have run 118 events, most of these have been with researchers.

While the headline numbers give us a sense of volume, perhaps let’s talk about the underlying rationale and philosophy behind this, which is core.

Cambridge University Library has a 600 year old history we are very proud of. In that time we have had an abiding responsibility to collect, care for and make available for use and reuse, information and research objects that form part of the intrinsic international scholarly record of which Cambridge has been such a strong part. And the ability for those ideas to inspire new ideas. The collection began on vellum, paper, and stone and bone.

And today much of that of course is digital. You can’t see that in the same way you can see the manuscripts and collections. It is sometimes hard to grasp when we are in this grand old dame of a building that I dare you not to love. It is home to the physical papers of such greats as Isaac Newton and Charles Darwin. Their research data was on paper and we have preserved that with great care and share it openly on line through our digital library. But our responsibility now is today’s researcher and today’s scientists and people working across all disciplines across our great university. Our preservation stewardship of that research data from the digital humanities across the biomedical is a core part of what we now do.

And the people in this room have changed that. They have made our mission different, they have made our mission better through the work they have achieved and the commitment they have.

Philosophically this is a very natural extension of what we have done in the Library and the open library and its great research community for which this very building is designed. Some of you may know there is a philosophy behind this building and the famous ‘open library Cambridge’. In the 19th century and 20th century that was mostly about our open stack of books and we have quite a few of them, we are a little weighed down by them.

Our research data weighs less but it is just as significant and in the 21st century our support and our overriding philosophy is all about supporting open research and opening data as widely as possible. It is about sharing freely wherever it is appropriate to do so and there are many reasons why data isn’t open sometimes, and that is fine. What we are looking for is managing so we can make those choices appropriately, just as we have with the archive for many, many years.

So whilst there is a fantastic achievement to mark tonight with those 1000 datasets, and it really is significant, we are really celebrating a deeper milestone with our research partners, our data champions, our colleagues in the research office and in the libraries across Cambridge, and that is about the changing role in research support and library research support in the digital age, and I think that is something we should be very proud of in terms of what we have achieved at Cambridge. I certainly am.

I am relatively new here at Cambridge. One of the things that was said to me when I was first appointed to the job was how lucky I was to be working at this University but also with the Office of Scholarly Communication in particular and that has proved to be absolutely true. I would like to take this opportunity to note that achievement of 1000 datasets and to state very publicly that the Research Data Services and all who sail in her are at the core of our mission in our research library. But also to thank you and the teams involved for your superb achievements. It really is something to be very proud of and I thank you.

 

Researchers championing data – what works?

Here we follow up on our earlier piece “Creating a research data community”, where Rosie Higman and Hardy Schwamm discussed innovative ways of researcher engagement with research data management.

This blog discusses the outcome from a dedicated Birds of a Feather session at the 9th Research Data Alliance Plenary meeting in Barcelona in April 2017. The session discussed three different programmes for engaging researchers with data management and sharing: University of Cambridge Data Champions programme, TU Delft’s Data Stewardship and SPARC Europe’s Open Data Champions. The purpose of this session was to exchange practice, discuss the difference between the programmes and talk about possible next steps. All presentations from the sessions are available.

Cambridge’s Data Champions

Cambridge’s Data Champions programme was started in Autumn 2016 and is a programme in which researchers volunteered to become a local community expert and advocate on research data management and sharing. The main expectation of those appointed as Data Champions was to run at least one workshop on a topic related to research data management for their research community and to act as the local expert connecting researchers and central data services. In return Champions were offered new networking opportunities, training in research data management and sharing and also a boost to their CVs. Detailed information about the expectations, benefits of becoming a Champion, as well as the support from central services are publicly available.

The Data Champions programme is coordinated through bi-monthly meetings at which Champions exchange practice, talk to each other about their interactions with other researchers and provide each other with advice on tackling some of the data-related challenges. Over time Champions formed a community of practice and the central Research Data Team started to act more as hands-off facilitators of these activities and discussions rather than prescribing what Champions should do and how best to engage with researchers locally. The rationale behind this was that Data Champions would know their own research communities best and would be best positioned to decide what types of training and engagement methods would work for them.

And in fact the Champions delivered a quite unexpected and diverse range of outputs. The initial requirement was to deliver training on research data management to their local communities. The Research Data Management workshop template was shared with the Champions and they were all trained in the content and the methods of workshop delivery. However, Champions were given discretion over what training they provided and how they wished to deliver it. And in fact they developed all sorts of materials and strategies for engaging their local communities: from highly successful regular research data ‘tips’ emails sent to everyone in a department, through data sharing FAQs for chemists and ORCiD drop-in sessions, to organising Electronic Lab Notebooks trials. While certainly interesting and valuable, this also raised questions as to whether the messages about data management and sharing are still consistent and aligned with the central data services, and whether the high quality of training is maintained.

TU Delft’s Data Stewardship programme

Madeleine de Smaele from TU Delft spoke about their Data Stewardship programme. The goal of the programme is to create mature working practices and policies for research data management across each of the eight faculties at TU Delft, so that any project can make sure its data is managed well. The programme is part of the broader Open Science agenda at TU Delft, which aims to make research more accessible and more re-usable. In contrast to the hands-off and decentralised Data Champions programme at Cambridge, TU Delft’s Data Stewardship programme has a solid framework at its core: a team of eight Data Stewards (a dedicated Data Steward for each of TU Delft’s eight faculties), led centrally by the Data Stewardship Coordinator.

Data Stewards are disciplinary experts, who are embedded within faculties, and are able to understand and address the specific data management needs of their research communities. However, thanks to working as a team, which is centrally coordinated, the work of Data Stewards is coherent and aligned. This is reflected for example in research data policy development. TU Delft will have a central policy framework for research data management; however, it is Data Stewards working with their faculties who will develop research data policies, tailored to specific needs of individual faculties.

SPARC Europe’s Open Data Champions

SPARC Europe’s Open Data Champions initiative took yet a different approach from Cambridge and TU Delft and it aims to help promote the use of ambassadors or champions in the scientific community to help unlock more scientific data. The focus of the Open Data Champions Initiative is to achieve cultural change needed to see more research data shared and re-used.

As with SPARC Europe’s previous Open Access Champions initiative, the rationale behind the Open Data Champions is that activists who stimulate cultural change need to be promoted and supported to effect greater, speedier, more motivated research-driven change to help make Open the default in Europe. SPARC Europe wants to identify Champions at different career levels (from PhD students to vice chancellors), from a range of disciplines and from a variety of European countries to inspire a broad range of stakeholders.

Are the programmes really effective?

After short presentations about the three programmes, the attendees started discussing different aspects of all programmes: their different aims, audiences, reward systems and sustainability of these activities. Perhaps the most interesting discussion was around measuring the effectiveness of these initiatives. All three programmes aim to ultimately achieve cultural change towards better data management and greater openness. Are the programmes all equally effective at achieving cultural change? Or are perhaps different modes of engagement bringing different results? How to measure cultural change?

And, finally, what are the costs and benefits of each programme? TU Delft’s Data Stewardship programme, with discipline-specific Data Stewards, is more resource-intensive than Cambridge’s Data Champions relying on researchers volunteering their time; both programmes are however more costly than SPARC Europe’s Open Data Champions.

Need for international collaboration and practice exchange

Our discussions brought more questions than answers but we all agreed that the exchange of ideas and practice was productive and useful. Many attendees expressed their interest in starting dedicated researcher engagement programmes at their institutions. Therefore, one of the main conclusions of the session was that it would be valuable to create a forum where those running programmes for researcher engagement could regularly discuss their programmes, exchange ideas and problem-solve jointly. This is particularly important for difficult questions, which the community struggles to address, such as metrics for assessing cultural change in data management and sharing. Working collaboratively can prove incredibly efficient, as was recently demonstrated by a team effort which led to the development of metrics for assessment of data management training programmes.

Next steps

As a next step to extend our conversations and start identifying solutions to common problems, the University of Cambridge, SPARC Europe and Jisc are co-organising a dedicated event “Engaging Researchers in Good Data Management” on 15 November 2017 in Cambridge, United Kingdom. The event intends to bring together those working to support and engage researchers with open research and Research Data Management (RDM), including librarians, scholarly communication specialists and researchers from both the sciences and humanities. So if you are reading this blog post and would like to be part of these discussions, do come and join!

Published 15 September 2017
Written by Dr Marta Teperek
Creative Commons License

Sustaining long-term access to open research resources – a university library perspective

In the third in a series of three blog posts, Dave Gerrard, a Technical Specialist Fellow from the Polonsky-Foundation-funded Digital Preservation at Oxford and Cambridge project, describes how he thinks university libraries might contribute to ensuring access to Open Research for the longer-term.  The series began with Open Resources, who should pay, and continued with Sustaining open research resources – a funder perspective.

Blog post in a nutshell

This blog post works from the position that the user-bases for Open Research repositories in specific scientific domains are often very different to those of institutional repositories managed by university libraries.

It discusses how in the digital era we could deal with the differences between those user-bases more effectively. The upshot might be an approach to the management of Open Research that requires both types of repository to work alongside each other, with differing responsibilities, at least while the Open Research in question is still active.

And, while this proposed method of working together wouldn’t clarify ‘who is going to pay’ entirely, it at least clarifies who might be responsible for finding funding for each aspect of the task of maintaining access in the long-term.

Designating a repository’s user community for the long-term

Let’s start with some definitions. One of the core models in Digital Preservation, the International Standard Open Archival Information System Reference Model (or OAIS) defines ‘the long term’ as: 

“A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.”

This leads us to two further important concepts defined by the OAIS:

“Designated Communities” are “an identified group of potential Consumers who should be able to understand a particular set of information”, i.e. the set of information collected by the ‘archival information system’.

A “Representation Information Network” is the tool that allows the communities to explore the metadata which describes the core information collected. This metadata will consist of:

  • descriptions of the data contained in the repository,
  • metadata about the software used to work with that data,
  • the formats in which the data are stored and related to each other, and so forth.  
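As a rough illustration of the kinds of metadata listed above, a repository might hold one record per dataset along the following lines. This is only a sketch: the class name, field names and example values are hypothetical and are not taken from the OAIS Reference Model or from any real repository.

```python
from dataclasses import dataclass, field


@dataclass
class RepresentationInfo:
    """Hypothetical record of what a Designated Community needs to use a dataset."""

    data_description: str                                          # what the data is
    software: list[str] = field(default_factory=list)              # tools used to work with it
    formats: list[str] = field(default_factory=list)               # how the data is stored
    relationships: dict[str, str] = field(default_factory=dict)    # how items relate to each other

    def missing(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only.
record = RepresentationInfo(
    data_description="Illustrative description of an image dataset",
    formats=["TIFF"],
)
print(record.missing())
```

A check such as `missing()` would let repository staff see at a glance which parts of the description still need to be collected from the depositing researchers.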

In the example of the Virtual Fly Brain Platform repository discussed in the first post in this series, the Designated Community appears to be: “… neurobiologists [who want] to explore the detailed neuroanatomy, neuron connectivity and gene expression of Drosophila melanogaster.” And one of the key pieces of Representation Information, namely “how everything in the repository relates to everything else”, is based upon a complex ontology of fly anatomy.

It is easy to conclude, therefore, that you really do need to be a neurobiologist to use the repository: it is fundamentally, deeply and unashamedly confusing to anyone else that might try to use it.

Tending towards a general audience

The concept of Designated Communities is one that, in my opinion, the OAIS Reference Model never adequately gets to grips with. For instance, the OAIS Model suggests including explanatory information in specialist repositories to make the content understandable to the general community.

Long term access within this definition thus implies designing repositories for Designated Communities consisting of what my co-Polonsky-Fellow Lee Pretlove describes as: “all of humanity, plus robots”. The deluge of additional information that would need to be added to support this totally general resource would render it unusable; to aim at everybody is effectively aiming at nobody. And, crucially, “nobody” is precisely who is most likely to fund a “specialist repository for everyone”, too.

History provides a solution

One way out of this impasse is to think about currently existing repositories of scientific information from more than 100 years ago. We maintain a fine example at Cambridge: The Darwin Correspondence Project, though it can’t be compared directly to Virtual Fly Brain. The former doesn’t contain specialist scientific information like that held by the latter – it holds letters, notebooks, diary entries etc – ‘personal papers’ in other words. These types of materials are what university archives tend to collect.

Repositories like Darwin Correspondence don’t have “all of humanity, plus robots” Designated Communities, either. They’re aimed at historians of science, and those researching the time period when the science was conducted. Such communities tend more towards the general than ‘neurobiologists’, but are still specialised enough to enable production and management of workable, usable, logical archives.

We don’t have to wait for the professor to die any more

So we have two quite different types of repository. There’s the ‘ultra-specialised’ Open Research repository for the Designated Community of researchers in the related domain, and then there’s the more general institutional ‘special collection’ repository containing materials that provide context to the science, such as correspondence between scientists, notebooks (which are becoming fully electronic), and rough ‘back of the envelope’ ideas. Sitting somewhere between the two are publications – the specialist repository might host early drafts and work in progress, while the institutional repository contains finished, published work. And the institutional repository might also collect enough data to support these publications, too, like our own Apollo Repository does.

The way digital disrupts this relationship is quite simple: a scientist needs access to her ‘personal papers’ while she’s still working, so, in the old days (i.e. more than 25 years ago) the archive couldn’t take these while she was still active, and would often have to wait for the professor to retire, or even die, before such items could be donated. However, now everything is digital, the prof can both keep her “papers” locally and deposit them at the same time. The library special collection doesn’t need to wait for the professor to die to get their hands on the context of her work. Or indeed, wait for her to become a professor.

Key issues this disruption raises

If we accept that specialist Open Research repositories are where researchers carry out their work, and that the institutional repository’s role is to collect contextual material to help us understand that work further down the line, then what questions does this raise about how those managing these repositories might work together?

How will the relationship between archivists and researchers change?

The move to digital methods of working will change the relationships between scientists and archivists.  Institutional repository staff will become increasingly obliged to forge relationships with scientists earlier in their careers. Of course, the archivists will need to work out which current research activity is likely to resonate most in future. Collection policies might have to be more closely in step with funding trends, for instance? Perhaps the university archivist of the digital future might spend a little more time hanging round the research office?

How will scientists’ behaviour have to change?

A further outcome of being able to donate digitally is that scientists become more responsible for managing their personal digital materials well, so that it’s easier to donate them as they go along. This has been well highlighted by another of the Polonsky Fellows, Sarah Mason at the Bodleian Libraries, who has delivered personal digital archiving training to staff at Oxford, in part based on advice from the Digital Preservation Coalition. The good news here is that such behaviour actually helps people keep their ongoing work neat and tidy, too.

How can we tell when the switch between Designated Communities occurs?

Is it the case that there is a ‘switch-over’ between the two types of Designated Community described above? Does the ‘research lifecycle’ actually include a phase where the active science in a particular domain starts to die down, but the historical interest in that domain starts to increase? I expect that this might be the case, even though it’s not in any of the lifecycle models I’ve seen, which mostly seem to model research as either continuing on a level perpetually, or stopping instantly. But such a phase is likely to vary greatly even between quite closely-related scientific domains. Variables such as the methods and technologies used to conduct the science, what impact the particular scientific domain has upon the public, to what degree theories within the domain conflict, indeed a plethora of factors, are likely to influence the answer.

How might two archives working side-by-side help manage digital obsolescence?

Not having access to the kit needed to work with scientific data in future is one of the biggest threats to genuine ‘long-term’ access to Open Research, but one that I think really does fall to the university to mitigate. Active scientists using a dedicated, domain specific repository are by default going to be able to deal with the material in that repository: if one team deposits some material that others don’t have the technology to use, then they will as a matter of course sort that out amongst themselves at the time, and they shouldn’t have to concern themselves with what people will do 100 years later.

However, university repositories do have more of a responsibility to history, and a daunting responsibility it is. There is some good news here, though… For a start, universities have a good deal of purchasing power they can bring to bear upon equipment vendors, in order to insist, for example, that they produce hardware and software that creates data in formats that can be preserved easily, and to grant software licenses in perpetuity for preservation purposes.

What’s more fundamental, though, is that the very contextual materials I’ve argued that university special collections should be collecting from scientists ‘as they go along’ are the precise materials science historians of the future will use to work out how to use such “ancient” technology.

Who pays?

The final, but perhaps most pressing question, is ‘who pays for all this’? Well – I believe that managing long-term access to Open Research in two active repositories working together, with two distinct Designated Communities, might at least make things a little clearer. Funding specialist Open Research repositories should be the responsibility of funders in that domain, but they shouldn’t have to worry about long-term access to those resources. As long as the science is active enough that it’s getting funded, then a proportion of that funding should go to the repositories that science needs to support it. The exact proportion should depend upon the value the repository brings, which might be calculated using factors such as how much the repository is used, how much time using it saves, what researchers’ time is worth, how many Research Excellence Framework brownie points (or similar) come about as a result of collaborations enabled by that repository, etc etc.
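To make the idea of a value-based proportion a little more concrete, here is a minimal sketch of how such a calculation might look. Everything in it is an assumption made for illustration: the function names, the simple multiplication of usage, time saved and hourly cost, the 10% cap, and the example figures are all hypothetical, and harder-to-quantify factors such as Research Excellence Framework outcomes are left out.

```python
def repository_value(uses_per_year: int, hours_saved_per_use: float, hourly_cost: float) -> float:
    """Estimate yearly value as researcher time saved, priced at an hourly cost (assumed model)."""
    return uses_per_year * hours_saved_per_use * hourly_cost


def funding_share(value: float, domain_funding: float, cap: float = 0.1) -> float:
    """Fraction of domain funding to direct to the repository, limited by an assumed cap."""
    if domain_funding <= 0:
        raise ValueError("domain_funding must be positive")
    return min(value / domain_funding, cap)


# Illustrative figures only: 1000 uses a year, half an hour saved each, 40 per hour.
value = repository_value(uses_per_year=1000, hours_saved_per_use=0.5, hourly_cost=40.0)
share = funding_share(value, domain_funding=1_000_000.0)
print(f"Estimated value: {value:.0f}, suggested share of funding: {share:.1%}")
```

The point of the sketch is only that the inputs are things a repository could plausibly measure; how they should be weighted is a policy question rather than a technical one.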

On the other hand, I believe that university / institutional repositories need to find quite separate funding for their archivists to start building relationships with those same scientists, and working with them to both collect the context surrounding their science as they go along, and prepare for the time when the specialist repository needs to be mothballed. With such contextual materials in place, there don’t seem to be too many insurmountable technical reasons why, when it’s acknowledged that the “switch from one Designated Community to another” has reached the requisite tipping point, the university / institutional repository couldn’t archive the whole of the specialist research repository, describe it sensibly using the contextual material they have collected from the relevant scientists as they’ve gone along, and then store it cheaply on a low-energy medium (i.e. tape, currently). It would then be “available” to those science historians that really wanted to have a go at understanding it in future, based on what they could piece together about it from all the contextual information held by the university in a more immediately accessible state.

Hence the earlier the institutional repository can start forging relationships with researchers, the better. But it’s something for the institutional archive to worry about, and get the funding for, not the researcher.

Published 11 September 2017
Written by Dave Gerrard

Creative Commons License