
Sustaining long-term access to open research resources – a university library perspective

In the third of a series of three blog posts, Dave Gerrard, a Technical Specialist Fellow from the Polonsky-Foundation-funded Digital Preservation at Oxford and Cambridge project, describes how he thinks university libraries might contribute to ensuring access to Open Research for the longer term. The series began with Open Resources, who should pay, and continued with Sustaining open research resources – a funder perspective.

Blog post in a nutshell

This blog post works from the position that the user-bases for Open Research repositories in specific scientific domains are often very different to those of institutional repositories managed by university libraries.

It discusses how in the digital era we could deal with the differences between those user-bases more effectively. The upshot might be an approach to the management of Open Research that requires both types of repository to work alongside each other, with differing responsibilities, at least while the Open Research in question is still active.

And, while this proposed method of working together wouldn’t clarify ‘who is going to pay’ entirely, it at least clarifies who might be responsible for finding funding for each aspect of the task of maintaining access in the long-term.

Designating a repository’s user community for the long-term

Let’s start with some definitions. One of the core models in Digital Preservation, the International Standard Open Archival Information System Reference Model (OAIS), defines ‘the long term’ as:

“A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.”

This leads us to two further important concepts defined by the OAIS:

“Designated Communities” are “an identified group of potential Consumers who should be able to understand a particular set of information”, i.e. the set of information collected by the ‘archival information system’.

A “Representation Information Network” is the tool that allows those communities to explore the metadata that describes the core information collected. This metadata will consist of the following (a minimal sketch appears after the list):

  • descriptions of the data contained in the repository,
  • metadata about the software used to work with that data,
  • the formats in which the data are stored and related to each other, and so forth.
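To make the idea of Representation Information a little more concrete, here is a minimal, hypothetical sketch (in Python) of the kind of record such a network might hold for a single dataset. The field names and values are illustrative assumptions, not part of the OAIS standard or of any particular repository’s schema.

```python
# A minimal, hypothetical Representation Information record for one dataset.
# Field names and values are illustrative only; OAIS does not mandate this layout.
representation_information = {
    "dataset": {
        "title": "Example confocal image stack",
        "description": "Microscopy data for a set of identified structures.",
    },
    "software": {
        "name": "ExampleViewer",        # hypothetical tool needed to open the data
        "version": "2.3",
        "documentation": "https://example.org/exampleviewer/docs",
    },
    "formats": [
        {"extension": ".nrrd", "role": "image stack"},
        {"extension": ".obo", "role": "ontology used to label anatomical structures"},
    ],
    "relationships": [
        # how items in the repository relate to each other and to the ontology
        {"subject": "image_0001", "predicate": "depicts", "object": "structure_42"},
    ],
}

# A Designated Community member (or a tool built for them) could then ask,
# for example, which formats they need to be able to read:
for fmt in representation_information["formats"]:
    print(f"{fmt['extension']}: {fmt['role']}")
```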

In the example of the Virtual Fly Brain Platform repository discussed in the first post in this series, the Designated Community appears to be: “… neurobiologists [who want] to explore the detailed neuroanatomy, neuron connectivity and gene expression of Drosophila melanogaster.” And one of the key pieces of Representation Information, namely “how everything in the repository relates to everything else”, is based upon a complex ontology of fly anatomy.

It is easy to conclude, therefore, that you really do need to be a neurobiologist to use the repository: it is fundamentally, deeply and unashamedly confusing to anyone else that might try to use it.

Tending towards a general audience

The concept of Designated Communities is one that, in my opinion, the OAIS Reference Model never adequately gets to grips with. For instance, the OAIS Model suggests including explanatory information in specialist repositories to make the content understandable to the general community.

Long-term access within this definition thus implies designing repositories for Designated Communities consisting of what my co-Polonsky-Fellow Lee Pretlove describes as “all of humanity, plus robots”. The deluge of additional information that would need to be added to support such a totally general resource would render it unusable; to aim at everybody is effectively to aim at nobody. And, crucially, “nobody” is precisely who is most likely to fund a “specialist repository for everyone”, too.

History provides a solution

One way out of this impasse is to think about currently existing repositories of scientific information from more than 100 years ago. We maintain a fine example at Cambridge: the Darwin Correspondence Project, though it can’t be compared directly to Virtual Fly Brain. The former doesn’t contain specialist scientific information like that held by the latter – it holds letters, notebooks, diary entries etc. – ‘personal papers’, in other words. These types of materials are what university archives tend to collect.

Repositories like Darwin Correspondence don’t have “all of humanity, plus robots” Designated Communities, either. They’re aimed at historians of science, and those researching the time period when the science was conducted. Such communities tend more towards the general than ‘neurobiologists’, but are still specialised enough to enable production and management of workable, usable, logical archives.

We don’t have to wait for the professor to die any more

So we have two quite different types of repository. There’s the ‘ultra-specialised’ Open Research repository for the Designated Community of researchers in the related domain, and then there’s the more general institutional ‘special collection’ repository containing materials that provide context to the science, such as correspondence between scientists, notebooks (which are becoming fully electronic) and rough ‘back of the envelope’ ideas. Sitting somewhere between the two are publications – the specialist repository might host early drafts and work in progress, while the institutional repository contains finished, published work. The institutional repository might also collect enough data to support those publications, as our own Apollo Repository does.

The way digital disrupts this relationship is quite simple: a scientist needs access to her ‘personal papers’ while she’s still working, so, in the old days (i.e. more than 25 years ago), the archive couldn’t take these while she was still active, and would often have to wait for the professor to retire, or even die, before such items could be donated. However, now that everything is digital, the professor can both keep her “papers” locally and deposit them at the same time. The library special collection doesn’t need to wait for the professor to die to get its hands on the context of her work – or, indeed, wait for her to become a professor.

Key issues this disruption raises

If we accept that specialist Open Research repositories are where researchers carry out their work, and that the institutional repository’s role is to collect contextual material to help us understand that work further down the line, then what questions does this raise about how those managing these repositories might work together?

How will the relationship between archivists and researchers change?

The move to digital methods of working will change the relationships between scientists and archivists. Institutional repository staff will become increasingly obliged to forge relationships with scientists earlier in their careers. Of course, the archivists will need to work out which current research activity is likely to resonate most in future. Might collection policies have to be more closely in step with funding trends, for instance? Perhaps the university archivist of the digital future will spend a little more time hanging around the research office.

How will scientists’ behaviour have to change?

A further outcome of being able to donate digitally is that scientists become more responsible for managing their personal digital materials well, so that it’s easier to donate them as they go along. This has been well highlighted by another of the Polonsky Fellows, Sarah Mason at the Bodleian Libraries, who has delivered personal digital archiving training to staff at Oxford, in part based on advice from the Digital Preservation Coalition. The good news here is that such behaviour actually helps people keep their ongoing work neat and tidy, too.

How can we tell when the switch between Designated Communities occurs?

Is it the case that there is a ‘switch-over’ between the two types of Designated Community described above? Does the ‘research lifecycle’ actually include a phase where the active science in a particular domain starts to die down, but the historical interest in that domain starts to increase? I expect that this might be the case, even though it doesn’t appear in any of the lifecycle models I’ve seen, which mostly seem to model research as either continuing on a level perpetually or stopping instantly. But such a phase is likely to vary greatly even between quite closely related scientific domains. Variables such as the methods and technologies used to conduct the science, the impact the domain has upon the public and the degree to which theories within the domain conflict – indeed, a plethora of factors – are likely to influence the answer.

How might two archives working side-by-side help manage digital obsolescence?

Not having access to the kit needed to work with scientific data in future is one of the biggest threats to genuine ‘long-term’ access to Open Research, but one that I think really does fall to the university to mitigate. Active scientists using a dedicated, domain-specific repository are by default going to be able to deal with the material in that repository: if one team deposits some material that others don’t have the technology to use, they will as a matter of course sort that out amongst themselves at the time, and they shouldn’t have to concern themselves with what people will do 100 years later.

However, university repositories do have more of a responsibility to history, and a daunting responsibility it is. There is some good news here, though. For a start, universities have a good deal of purchasing power they can bring to bear upon equipment vendors, in order to insist, for example, that they produce hardware and software that creates data in formats that can be preserved easily, and that they grant software licences in perpetuity for preservation purposes.

What’s more fundamental, though, is that the very contextual materials I’ve argued that university special collections should be collecting from scientists ‘as they go along’ are the precise materials science historians of the future will use to work out how to use such “ancient” technology.

Who pays?

The final, but perhaps most pressing, question is ‘who pays for all this?’ Well – I believe that managing long-term access to Open Research in two active repositories working together, with two distinct Designated Communities, might at least make things a little clearer. Funding specialist Open Research repositories should be the responsibility of funders in that domain, but they shouldn’t have to worry about long-term access to those resources. As long as the science is active enough that it’s getting funded, then a proportion of that funding should go to the repositories that science needs to support it. The exact proportion should depend upon the value the repository brings, which might be calculated using factors such as how much the repository is used, how much time using it saves, what researchers’ time is worth, how many Research Excellence Framework brownie points (or similar) come about as a result of collaborations enabled by that repository, and so on.
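As a purely illustrative exercise, the kind of calculation hinted at above might look something like the sketch below. Every figure and weighting here is invented for the sake of the example; it is not based on any real repository, funder policy or REF formula.

```python
# Toy estimate of the share of a domain's research funding that might flow to
# the specialist repository supporting it. All numbers are invented.

active_users = 400                # researchers using the repository each year
hours_saved_per_user = 30         # estimated time saved per researcher per year
cost_per_researcher_hour = 40.0   # notional loaded cost of an hour of research time (GBP)
collaboration_value = 50_000.0    # notional value of collaborations/assessment credit enabled

annual_value = (active_users * hours_saved_per_user * cost_per_researcher_hour
                + collaboration_value)

domain_funding = 20_000_000.0     # total annual funding for the (still-active) domain
suggested_share = annual_value / domain_funding

print(f"Estimated annual value of the repository: £{annual_value:,.0f}")
print(f"Suggested share of domain funding:        {suggested_share:.1%}")
```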

On the other hand, I believe that university / institutional repositories need to find quite separate funding for their archivists to start building relationships with those same scientists, and working with them to both collect the context surrounding their science as they go along, and prepare for the time when the specialist repository needs to be mothballed. With such contextual materials in place, there don’t seem to be too many insurmountable technical reasons why, when it’s acknowledged that the “switch from one Designated Community to another” has reached the requisite tipping point, the university / institutional repository couldn’t archive the whole of the specialist research repository, describe it sensibly using the contextual material they have collected from the relevant scientists as they’ve gone along, and then store it cheaply on a low-energy medium (i.e. tape, currently). It would then be “available” to those science historians that really wanted to have a go at understanding it in future, based on what they could piece together about it from all the contextual information held by the university in a more immediately accessible state.

Hence the earlier the institutional repository can start forging relationships with researchers, the better. But it’s something for the institutional archive to worry about, and get the funding for, not the researcher.

Published 11 September 2017
Written by Dave Gerrard

Creative Commons License

Continuing the conversation: a CRUK workshop on RDM

In May 2017 the Office of Scholarly Communication organised a workshop with Paola Quattroni from Cancer Research UK (CRUK) focusing on data sharing policy and practices. It was a great opportunity for the funder to outline its policies and current initiatives on data sharing, and for Cambridge researchers to discuss the issues, suggest further solutions and give feedback to the funder about the changes they would like to see implemented. This blog post highlights the main points of the workshop.

This session continued the conversation from February last year, when CRUK and the Wellcome Trust came to Cambridge to speak to our research community.

CRUK’s grand ambition

In her presentation “Data sharing in policy and practice with Cancer Research UK”, Paola Quattroni began with CRUK’s grand ambition: “To bring forward the day all cancers are cured” and to “see three quarters of people surviving cancer within the next 20 years”.

One of the key elements in realising this ambition and maximising public benefit is data sharing. CRUK firmly believes that transparency, research integrity, swift dissemination and reproducibility of research results are key ingredients of that success.

“Our goal is to improve how research is carried out,” explained Paola, who is the Research Funding Manager – Data at CRUK. “We fund the best science and expect researchers to follow best practices… Improving patient benefit and health is our ambition.”

She emphasised the need to have ongoing discussions with the research community and to work together on how to overcome barriers to data sharing. Appropriate sharing and dissemination of research data are particularly important for CRUK, and good data management is the first step towards getting the most from the data and facilitating sharing and re-use. In this context, CRUK is actively working to increase and improve data sharing by being instructive, but not necessarily demanding, in its requirements.

The audience

The majority of the attendees came from the fields of Biological Sciences and Clinical Medicine. When asked why they came to the workshop, the consensus was that they wanted to be informed about the CRUK policy and what actions they needed to take. Examples of individual responses included:

  • To learn how to fulfil funders’ requirements.
  • To learn more about processing data.
  • To know the policy on sharing code and data.
  • To learn the difference between data sharing and open data.
  • To discuss the costs of storing data and how to forecast costs for periods of more than 10 years.
  • To learn more about contractual agreements.
  • To learn what the funder expects regarding data sharing.
  • To learn and inform other colleagues about it.

The structure

The workshop started with an icebreaker: the audience was asked to pinpoint why they came to the workshop and what they hoped to gain from it. Following that, Paola Quattroni presented CRUK’s policy on the management and sharing of data, explained why data sharing is important and what the barriers are, and outlined current initiatives to improve data sharing among researchers.

Paola highlighted some of the work CRUK is doing to increase data sharing, such as the recent signing of the San Francisco Declaration on Research Assessment (DORA) and the fact that CRUK is continuing to work with others to put it into practice. Other future activities include:

  • Encouraging grant applicants to explain the significance and impact of their discoveries, publications and a broad range of other outputs (e.g. policy influence).
  • Being more explicit about evaluating grant applicants’ publications according to their scientific content, rather than simply considering where they are published.
  • Working with reviewers and committee members to evaluate the impact of all research outputs.
  • Measuring the re-use of research.
  • Encouraging replication studies.
  • Recognising and rewarding researchers who share their data.

After the presentation, everybody split into groups and identified various challenges of data sharing which were then analysed by the teams and the trainers. The last part of the workshop concentrated on group feedback and suggestions from the audience on what funders could do to further enhance collaboration with the research community.

Challenges

The workshop continued by splitting into groups. Each group identified challenges and problems of data sharing with regard to Publishing, Skills and Training, Rewards and Data Infrastructure:

Publishing

A recurring item among all groups was the fear of being scooped and the loss of publication opportunities, along with a sense that the impact factor is still the be-all and end-all. Other challenges included:

  • Accepting citations of preprints as a metric of achievement can be dangerous, as groups can release non-peer-reviewed data online to discourage competitors from innovating.
  • Range of requirements across different journals/publishers.
  • Need to take care not to kill analytical innovation.
  • The larger the collaboration, the greater the importance of standardised data formats and analysis.

Skills & Training

The Skills and Training section concentrated on how to write data management plans and standardise laboratory notes as well as the necessary training to catch up with technology. Other points included:

  • Lack of computer skills/knowledge to physically upload data.
  • Formatting data.
  • Version control.

Rewards

It was apparent in most of the groups that time, cost and re-usability problems were significant inhibitors regarding rewards and incentives:

  • There is a need to overcome the ‘time burden’ aspect of sharing.
  • Cost and Time – solution: Electronic Laboratory Notebooks (ELN) – one or many? Public or private?
  • New PI (Persistent Identifier) for metrics.
  • Re-usability – how do you measure it?
  • DMPs are required at the time of grant submission. However, the researcher needs to report after one year because various parameters can change and might need to be re-adjusted.

Data Infrastructure

The need for standardisation in data acquisition, storage and analysis methods and how ‘big data’ is handled by the funders were common themes in this category. In addition, it was pinpointed that individual Institutes should have the infrastructure to support data sharing and DMP writing.

Other data infrastructure challenges included:

  • Data formats – for example there are so many different scanners for imaging, which all have different formats.
  • An EU project testing an imaging modality across 20 sites finds that integrating the data is a challenge; the analogy is a clinical trial, where protocols and practices have to produce comparable data.
  • Cost of software: open-source imaging software is available, but you may need different imaging analysis tools.

Solutions

Although there was not enough time to concentrate on all challenges, the ongoing discussions turned into ideas that provided the seeds for possible solutions or change of strategies regarding how data is being valued and shared.

For example, what if you are scooped? Would citations help? One solution: if your dataset has a DOI, its registration timestamp can be evidence that you were first.
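As a rough illustration of how that evidence could be retrieved, the sketch below looks up a DOI’s registration date. It assumes DataCite’s public REST API (https://api.datacite.org/dois/<doi>) and the ‘registered’/‘created’ attribute names; the DOI shown is hypothetical, and the current API documentation should be checked before relying on any of this.

```python
# Sketch: fetch the registration date recorded for a DOI, as possible evidence
# of priority. Assumes DataCite's public REST API and its response shape
# ('data' -> 'attributes' -> 'registered'/'created'); verify against the
# current documentation before relying on it.
import requests


def doi_registration_date(doi: str) -> str | None:
    """Return the registration timestamp recorded for a DOI, if one is found."""
    response = requests.get(f"https://api.datacite.org/dois/{doi}", timeout=10)
    if response.status_code != 200:
        return None
    attributes = response.json().get("data", {}).get("attributes", {})
    return attributes.get("registered") or attributes.get("created")


if __name__ == "__main__":
    # Hypothetical DOI, used purely for illustration.
    print(doi_registration_date("10.12345/example.67890"))
```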

Currently, publications are considered to be the sole reward, so there is a widespread fear of losing publication opportunities. However, if your data is more valuable than the paper, then the dataset becomes the incentive and is highly valued. How can this be achieved? Micropublishing? If you can build a career on data publishing instead of papers, it would change the incentive structure. Instead of relying on the old system where there is one big story, what about writing a small story, or even data papers? Data in conjunction with data notes is a type of article. These kinds of outputs are valuable, and publishers should consider this.

Although staff working for funders have often been researchers themselves, they could visit researchers from different disciplines to get an idea of what is needed, especially with respect to discipline-specific DMPs. Some participants suggested that DMPs should be discipline-specific and standardised. As an example, if preclinical and clinical data had the same format, such data could easily be compared.

Another solution proposed by the participants to the financial challenges associated with data sharing was an open access fund for data, similar to COAF, that would support the cost of infrastructure and reward openness.

Conclusions

As already mentioned, the discussions evolved to the point that there was no time left to analyse all challenges and talk about practical issues.

For example, there was a clear need from the participants’ point of view for practical guidance on data plans, and for distinct approaches per field (STEM/HASS). Questions arose about the use and cost of ELNs and any implications for the future, and, similarly, about what happens if data needs to be deposited somewhere else, or partway through the plan – what would the rules be for additional funding midway in such instances? Lastly, the long-term preservation and infrastructure costs associated with projects were another big topic, as were funders’ future strategies regarding ‘big data’. (See this blog for a discussion of the cost issue.)

This workshop brought together researchers from different disciplines interested in learning more about data management and sharing at CRUK. From the funder’s perspective, it was a great opportunity to discuss policies and initiatives in data sharing and to hear directly from researchers about the main barriers to data sharing. CRUK strives to help researchers overcome these barriers and is actively working to facilitate the way research is carried out and ultimately shared.

It was agreed that this workshop was only the beginning, and it highlighted that collaboration is key to overcoming some of these challenges.

The main outcomes, however, were clear from the onset:

  • There is a recognised need for ongoing collaboration between funders, researchers and institutions.
  • A global view is required – all funders should have the same vision and aims regarding data sharing.
  • Reporting and disseminating all data is key.
  • Data needs to be available and reusable.
  • We need to overcome the technical and infrastructure challenges of how to measure the “journey” of the data and its re-usability.

Published 07 September 2017
Written by Maria Angelaki

Creative Commons License

What I wish I’d known at the start – setting up an RDM service

In August, Dr Marta Teperek began her new role at Delft University in the Netherlands. In her usual style of doing things properly and thoroughly, she has contributed this blog reflecting on the lessons learned in the process of setting up Cambridge University’s highly successful Research Data Facility.

On 27-28 June 2017 I attended Jisc’s Research Data Network meeting at the University of York. I was one of several people invited to talk about experiences of setting up RDM services, in a workshop organised by Stephen Grace from London South Bank University and Sarah Jones from the Digital Curation Centre. The purpose of the workshop was to share lessons learned and to help those who were just starting to set up research data services within their institutions. Each of the presenters prepared three slides: 1. What went well; 2. What didn’t go so well; 3. What they would do differently. All slides from the session are now publicly available.

For me the session was extremely useful not only because of the exchange of practices and learning opportunity, but also because the whole exercise prompted me to critically reflect on Cambridge Research Data Management (RDM) services. This blog post is a recollection of my thoughts on what went well, what didn’t go so well and what could have been done differently, as inspired by the original workshop’s questions.

What went well

RDM services at Cambridge started in January 2015 – quite late compared to other UK institutions. The late start meant, however, that we were able to learn from others and to avoid some common mistakes when developing our RDM support. Jisc’s Research Data Management mailing list was particularly helpful, as it is a place where professionals working with research data look for help, ask questions, and share reflections and advice. In addition, the Research Data Management Fora organised by the Digital Curation Centre proved to be an excellent vehicle not only for exchanging knowledge and good practice, but also for building networks with colleagues in similar roles. Cambridge also joined the Jisc Research Data Shared Service (RDSS) pilot, which aimed to create a joint research repository and related infrastructure. Being part of the RDSS pilot not only helped us to engage further with the community, but also allowed us to better understand the RDM needs at the University of Cambridge by undertaking the Data Asset Framework exercise.

In exchange for all the useful advice received from others, we aimed to be transparent about our own work as well. We therefore regularly published blog posts about research data management at Cambridge on the Unlocking Research blog. There were several additional advantages to this transparent approach: it allowed us to reflect on our activities, it provided an archival record of what was done and the rationale for it, and it facilitated further networking and exchanges of comments with the wider RDM community.

Engaging Cambridge community with RDM

Our initial attempts to engage the research community at Cambridge with RDM were compliance-based: we were telling our researchers that they must manage and share their research data because this was what their funders required. Unsurprisingly, however, this approach was rather unsuccessful – researchers were not prepared to devote time to RDM if they did not see the benefits of doing so. We therefore quickly revised the approach and changed the focus of our outreach to the (selfish) benefits of good data management and effective data sharing. This allowed us to build an engaged RDM community, in particular among early career researchers. As a result, we were able to launch two dedicated programmes that further strengthened our community involvement in RDM: the Data Champions programme and the Open Research Pilot Project. Data Champions are (mostly) researchers who volunteered their time to act as local experts on research data management and sharing, providing advice and specialised training within their departments. The Open Research Pilot Project is looking at the benefits of, and barriers to, conducting Open Research.

In addition, ensuring that a wide range of stakeholders from across the University were part of the RDM Project Group, and had oversight of the development and delivery of RDM services, allowed us to develop our services quite quickly. As a result, the services developed were endorsed by a wide range of stakeholders at Cambridge, and they were developed in a relatively coherent fashion. As an example, effective collaboration between the Office of Scholarly Communication, the Library, the Research Office and the University Information Services allowed integration between the Cambridge research repository, Apollo, and the research information system, Symplectic Elements.

What didn’t go so well

One of the aspects of our RDM service development that did not go so well was the business case development. We started developing the RDM business case in early 2015. The business case went through numerous iterations, and at the time of writing of this blog post (August 2017), financial sustainability for the RDM services has not yet been achieved.

One of the strongest factors contributing to the lack of success in business case development was insufficient engagement of senior leadership with RDM. We invested a substantial amount of time and effort in engaging researchers with RDM by moving away from compliance arguments, to the extent that we seem to have forgotten that compliance- and research-integrity-based advocacy is necessary to ensure the buy-in of senior leadership.

In addition, while trying to move quickly with service development, and at the same time trying to gain trust in and engagement with RDM service development from the various stakeholder groups at Cambridge, we ended up taking part in various projects and undertakings that were sometimes only loosely connected to RDM. As a result, some of the activities lacked strategic focus, and a lot of time was needed to re-define what the RDM service is and what it is not, in order to ensure that the expectations of the various stakeholder groups could be properly managed.

What could have been done differently

There are a number of things which could have been done differently and more effectively. Firstly, and to address the main problem of insufficient engagement with senior leadership, one could have introduced dedicated, short sessions for principal investigators on ensuring effective research data management and research reproducibility across their research teams. Senior researchers are ultimately those who make decisions at research-intensive institutions, and therefore their buy-in and their awareness of the value of good RDM practice is necessary for achieving financial sustainability of RDM services.

In addition, it would have been valuable to set aside time for strategic thinking and for defining (and re-defining, as necessary) the scope of RDM services. This is also related to the overall branding of the service. In Cambridge, a lot of initial harm was done by the negative association between Open Access to publications and RDM. Due to overarching funder and government requirements for Open Access to publications, many researchers had come to perceive Open Access merely as a necessary compliance condition. The advocacy for RDM at Cambridge started from ‘Open Data’ requirements, which led many researchers to believe that RDM was yet another requirement to comply with, and that it was only about open sharing of research data. It took us a long time to change the message and to rebrand the service as one that supports researchers in their day-to-day research practice, emphasising that proper management of research data leads to efficiency savings. Finally, only research data which are managed properly from the very start of the research process can then be easily shared at the end of the project.

Finally – and this is also related to focusing and defining the service – it would have been useful to decide on a benchmarking strategy from the very beginning of the service’s creation. What are the goals of the service? Is it to increase the number of shared datasets? Is it to improve day-to-day data management practice? Is it to ensure that researchers know how to use novel tools for data analysis? And, once the goals are decided, design a strategy to benchmark progress towards achieving them. Otherwise it can be challenging to decide which projects and undertakings are worth continuing and which ones are less successful and should be revised or discontinued. To address one aspect of benchmarking, Cambridge led the creation of an international group developing a benchmarking strategy for RDM training programmes, with the aim of creating tools for improving RDM training provision.

Final reflections

My final reflection is to re-iterate that the questions asked of me by the workshop leaders at the Jisc RDN meeting really inspired me to think more holistically about the work done towards the development of RDM services at Cambridge. Looking forward, I think asking oneself the very same three questions – what went well, what did not go so well, and what one would do differently – might become a useful regular exercise for ensuring that RDM service development is well balanced and on track towards its intended goals.


Published 24 August 2017
Written by Dr Marta Teperek

Creative Commons License