
Show me the money – the path to a sustainable Research Data Facility

Like many institutions in the UK, Cambridge University has responded to research funders’ requirements for data management and sharing with a concerted effort to support our research community in good data management and sharing practice through our Research Data Facility. We have written a few times on this blog, and presented elsewhere, to describe our services. This post describes the process we have undertaken to support these services in the long term.

Funders expect researchers to make the data underpinning their research available and to provide a link to this data in the paper itself. The EPSRC started checking compliance with its data sharing requirement on 1 May 2015. When we first created the Research Data Facility we spoke to many researchers across the institution, and two things became very clear. One was that there was considerable confusion about what actually counts as data; the second was that sharing data on publication is not something that can easily be done as an afterthought if the data was not properly managed in the first place.

We have approached these issues separately. To try and determine what is actually required from funders beyond the written policies we have invited representatives from our funders to come to discussions and forums with our researchers to work out the details. So far we have hosted Ben Ryan from the EPSRC, Michael Ball from the BBSRC and most recently David Carr and Jamie Enoch from the Wellcome Trust and CRUK respectively.

Dealing with the need for awareness of research data management has been more complex. To raise awareness of good practice in data management and sharing we embarked on an intense advocacy programme and in the past 15 months have organised 71 information sessions about data sharing (speaking with over 1,700 researchers). But we also needed to ensure the research community was managing its data from the beginning of the research process. To assist this we have developed workshops on various aspects of data management (hosting 32 workshops in the past year), a comprehensive website, a service to support researchers with their development of their research data management plans and a data management consultancy service.

So far, so good. We have had a huge response to our work, and while we encourage researchers to use the data repository that best suits their material, we do offer our institutional repository Apollo as an option. As of today, we are hosting 499 datasets in the repository. The message is clearly getting through.

Sustainability

The word sustainability (particularly in the scholarly communication world) is code for ‘money’. And money has become quite a sticking point in the area of data management. The way Cambridge started the Research Data Facility was by employing a single person, Dr Marta Teperek, for one year, supported by the remnants of the RCUK Transition Fund. It quickly became obvious that we needed more staff to manage the workload, and the Facility now employs a half-time Events and Outreach Coordinator, a half-time Repository Manager, and a Research Data Adviser who looks after the bulk of the uploading of datasets into the repository.

Clearly there was a need to work out longer term support for staffing the Facility – a service for which demand shows no signs of slowing. Early last year we started scouting around for options. In April 2013 the RCUK released guidance saying it was permissible to recover costs from grants through direct charges or overheads – but noting that institutions could not charge twice. This guidance also mentioned that it was permissible for institutions to recover the costs of RDM Facilities as they would other Small Research Facilities, “provided that such facilities are transparently charged to all projects that use them”.

Transparency

On the basis of that advice we established the Research Data Facility as a Small Research Facility according to the Transparent Approach to Costing (TRAC) methodology. Our proposal was that the Facility’s costs would be recovered from grants as directly allocated costs. We chose this option rather than overheads because of the advantage of being transparent to the funder about our activities. Charging grants this way meant a bigger advocacy and education role for the Facility. But the advantage is that it would make researchers aware that they need to take research data management seriously, that it involves both time and money, and that it is an integral part of a grant proposal.

Dr Danny Kingsley has argued before (for example in a paper ‘Paying for publication: issues and challenges for research support services‘) that by centralising payments for article processing charges, the researchers remain ignorant of the true economics of the open access system in the way that they are generally unaware of the amounts spent on subscriptions. If we charged the costs of the Facility into overheads, it becomes yet another hidden cost and another service that ‘magically’ happens behind the scenes from the researcher’s point of view.

In terms of the actual numbers, the direct costs of the Research Data Facility included salaries for 3.2 FTEs (a Research Data Facility Manager, a Research Data Adviser, a 0.5 FTE Outreach and Engagement Coordinator, a 0.5 FTE Repository Manager and 0.2 FTE of Senior Management time), hardware and hardware maintenance costs, software licences, the costs of organising events, and the costs of staff training and conference attendance. The total direct annual cost of our Facility was less than £200,000. These are the people costs of the Facility and are not to be confused with the repository costs (for which we do charge directly).

Determining how much to charge

Throughout this process we have explored many options for trying to assess a way of graduating the costing in relation to what support might be required. Ideally, we would want to ensure that the Facility costs can be accurately measured based on what the applicant indicated in their data management plan. However, not all funders require data management plans. Additionally, while data management plans provide some indication of the quantity of data (storage) to be generated, they do not allow a direct estimate of the amount of data management assistance required during the lifetime of the grant. Because we could not assess the level of support required for a particular research project from a data management plan, we looked at an alternative charging strategy.

We investigated charging according to the number of people on a team, given that the training component of the Facility is measurable by attendance at workshops. However, after investigation we were unable to easily extract that type of information about grants, and this also created a problem for charging collaborative grants. We then looked at levying a small flat charge on every grant requiring the assistance of the Facility, and at charging proportionally to the size (percentage of value) of the grant. Since we did not have any compelling evidence that bigger grants require more Facility assistance, we proposed a model of a flat charge on all grants that require Facility assistance. This model was also the most cost-effective from an administrative point of view.
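For illustration only, the difference between the two models can be sketched with some invented numbers – neither the grant values nor the recovery figures below are actual Facility rates:

```python
# Hypothetical comparison of the two charging models considered:
# a flat charge on every grant that uses the Facility, versus a charge
# proportional to grant value. All figures are invented for illustration.

grants = [50_000, 120_000, 800_000, 2_000_000]  # grant values in GBP

facility_cost = 200_000                    # annual direct cost to recover
flat_charge = facility_cost / len(grants)  # same charge for every grant
rate = facility_cost / sum(grants)         # proportional model: share of value

for value in grants:
    print(f"£{value:>9,}: flat £{flat_charge:,.0f} "
          f"vs proportional £{value * rate:,.0f}")
```

Both models recover the same total, but the flat model charges a small grant and a large grant identically, which matches the observation that Facility effort does not scale with grant size.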

As an indicator of the amount of work involved in the development of the Business Case, and the level of work and input that we have received relating to it, the document is now up to version 18 – each version representing a recalculation of the costings.

Collaborative process

A proposal such as we were suggesting – that we charge the costs of the Facility as a direct charge against grants – is reasonably radical. It was important that we ensure the charges would be seen as fair and reasonable by the research community and the funders. To that end we have spent the best part of a year in conversation with both communities.

Within the University we had useful feedback from the Open Access Project Board (OAPB) when we first discussed the option in July last year. We are also grateful to the members of our community who subsequently met with us one-on-one to discuss the merits of the Facility and the options for supporting it. At the November 2015 OAPB meeting we presented a mature Business Case. We also had to clear the Business Case through meetings of the Resource Management Committee (RMC).

Clearly we needed to ensure that our funders were prepared to support our proposal. Once we were in a position to share a Business Case with the funders we started a series of meetings and conversations with them.

The Wellcome Trust was immediate in its response – it would not allow direct charging to grants, as it considers this an overhead cost, which it does not pay. We met with Cancer Research UK (CRUK) in January 2016 and received a positive response to our transparent approach to costing and to the comprehensiveness of the services the Facility provides to researchers at Cambridge. These issues are being discussed with senior management at CRUK, and discussions were still ongoing at the time of writing this report (May 2016). [Update 24 May: CRUK agreed to consider research data management costs as direct costs on grant applications on a case-by-case basis, if justified appropriately in the context of the proposed research].

We encourage open dialogue with the RCUK funders about data management. In May 2015 we invited Ben Ryan to come to the University to talk about the EPSRC expectations on data management and how Cambridge meets these requirements. In August 2015 Michael Ball from the BBSRC came to talk to our community. We had an indication from the RCUK that our proposal was reasonable in principle. Once we were in a position to show our Business Case to the RCUK, we invited Mark Thorley to discuss the issue, and he has been in discussion with the individual councils to gather their input before giving us a final answer.

Administrative issues

Timing in a decision like this is challenging because of the large number of systems within the institution that would be affected by a change. In anticipation of a positive response we started ensuring our management and financial systems were prepared to manage the costing of grants – so that if a green light were given we would be ready. To that end we held many discussions with the Research Office on the practicalities of building the costing into our systems, to make sure the charge is easy to add in our grant costing tool. We also had numerous discussions on how to embed in their workflows procedures for validating whether the Facility services are needed, and what to do if researchers forget to add them. The development has now been done.

A second consideration is the need to ensure that all of the administrative staff involved in managing research grants (at Cambridge, a group of over 100 people) are aware of the change and can manage both the change to the grant management system and the questions from their research community. Simultaneously we were involved in numerous discussions with our invaluable TRAC team in the University’s Finance Division, who helped us validate all the Facility costs (to ensure that none of the costs are charged twice) and establish cost centres and workflows for recovering money from grants.

Meanwhile we have had to keep our Facility staff on temporary contracts until we are in a position to advertise the roles. There is a huge opportunity cost in training people up in this area.

Conclusion

As it happened, the RCUK came back to us to say that we can charge this cost to grants, but as an overhead rather than a direct cost. Having this decision means we can advertise the positions and secure our staffing situation. But we will not be needing the administrative amendments to the system, nor the advocacy programme.

It has been a long process given we began preparing the Business Case in March 2015. The consultation throughout the University and the engagement of our community (both research and funder) has given us an opportunity to discuss the issues of research data management more widely. It is a shame – from our perspective – that we will not be able to be transparent about the costs of managing data effectively.

The funders and the University are all working towards a shared goal – a culture change towards more open research, including the sharing of research data. To achieve this we need a research community that is more aware of and engaged with these matters. There is much advocacy still to do.

Published 8 May 2016
Written by Dr Danny Kingsley and Dr Marta Teperek
Creative Commons License

Forget compliance. Consider the bigger RDM picture

The Office of Scholarly Communication sent Dr Marta Teperek, our Research Data Facility Manager, to the International Digital Curation Conference held in Amsterdam on 22-25 February 2016. This is her report from the event.

Fantastic! This was my first IDCC meeting and already I can’t wait for next year. There was not only amazing content in high quality workshops and conference papers, but also a great opportunity to network with data professionals from across the globe. And it was so refreshing to set aside our UK problem of compliance with data sharing policies, to instead really focus on the bigger picture: why it is so important to manage and share research data and how to do it best.

Three useful workshops

The first day started really intensely – the plan was for one full-day or two half-day workshops, but I managed to squeeze three workshops into one day.

Context is key when it comes to data sharing

The morning workshop was entitled “A Context-driven Approach to Data Curation for Reuse”, given by Ixchel Faniel (OCLC), Elizabeth Yakel (University of Michigan), Kathleen Fear (University of Rochester) and Eric Kansa (Open Context). We were split into small groups and asked to decide what was the most important information about datasets from the re-user’s point of view. Would the re-user care about the objects themselves? Would they want hints about how to use the data?

We all had difficulties arranging the necessary information in order of usefulness. Subsequently, we were asked to re-order the information according to its importance from the point of view of repository managers. The take-home message was that, for all of the groups, the information about datasets required by the re-user was not the same as that required by the repository.

In addition, the presenters provided discipline-specific context based on interviews with researchers – depending on the research discipline, different information about datasets was considered the most important. For example, for zoologists the information about specimens was very important, but it was of negligible importance to social scientists. So context is crucial for the collection of appropriate metadata. Without sufficient contextual information, data is not useful.

So what can institutional repositories do to address these issues? If research carried out within a given institution only covers certain disciplines, then institutional repositories could relatively easily contextualise the metadata being collected and presented for discovery. However, repositories hosting research from many different disciplines will find this much more difficult to address. For example, the Cambridge repository has to host research spanning particle physics, engineering, economics, archaeology, zoology, clinical medicine and many, many other disciplines. This makes it much more difficult (if not impossible) to contextualise the metadata.

It is not surprising that the information most important from the repository’s point of view is different from the information most important to data re-users. In order to ensure that research data can be effectively shared and preserved in the long term, repositories need to collect a certain amount of administrative metadata: who deposited the data, what the file formats are, what the data access conditions are, and so on. However, repositories should collect as much of this administrative metadata as possible in an automated way. For example, if a user logs in to deposit data, all the relevant information about the user should be automatically harvested via feeds from human resources systems.
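A rough sketch of that kind of automation might look like the following – the HR directory, its field names and the user ID are all hypothetical stand-ins, not real institutional systems:

```python
# Sketch of automated administrative-metadata capture at deposit time.
# The idea: when a logged-in user deposits data, look them up in an
# institutional HR/identity feed rather than asking them to retype
# details the institution already holds.

def harvest_depositor_metadata(user_id, hr_directory):
    """Build administrative metadata for a deposit from an HR feed."""
    person = hr_directory[user_id]
    return {
        "depositor": person["name"],
        "department": person["department"],
        "orcid": person.get("orcid"),  # may be absent from the feed
    }

# Example with a made-up HR record:
hr_directory = {
    "ab123": {"name": "A. Researcher", "department": "Zoology"},
}
print(harvest_depositor_metadata("ab123", hr_directory))
```

The point of the sketch is simply that the depositor supplies one identifier, and the repository fills in the rest of the administrative record automatically.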

EUDAT – Pan-European infrastructure for research data

The next workshop was about EUDAT – the collaborative Pan-European infrastructure providing research data services, training and consultancy for researchers. EUDAT is an impressive project funded by a Horizon 2020 grant, and it offers five different services to researchers:

  • B2DROP – a secure and trusted data exchange service to keep research data synchronized, up-to-date and easy to exchange with other researchers;
  • B2SHARE – service for storing and sharing small-scale research data from diverse contexts;
  • B2SAFE – service to safely store research data by replicating it and depositing at multiple trusted repositories (additional data backups);
  • B2STAGE – service to transfer datasets between EUDAT storage resources and high-performance computing (HPC) workspaces;
  • B2FIND – discovery service harvesting metadata from research data collections from EUDAT data centres and other repositories.

The project has a wide range of services on offer and is currently looking for institutions to pilot these services with. I personally think these services (if successfully implemented) would be of great value to the Pan-European research community.

However, I have two reservations about the project:

  • Researchers are being encouraged to use EUDAT’s platforms to collaborate on their research projects and to share their research data. However, the funding for the project runs out in 2018. The EUDAT team is now investigating options to ensure the sustainability and future funding of the project, but what will happen to researchers’ data if the funding is not secured?
  • Perhaps if the funding is limited it would be more useful to focus the offering on the most useful services not provided elsewhere. For example, another EC-funded project, Zenodo, already offers a user-friendly repository for research data, and the Open Science Framework offers a platform for collaboration and easy exchange of research data. By contrast, a Pan-European service harvesting metadata from various data repositories and enabling data discovery is clearly much needed and would be extremely useful to have.

Jisc Shared RDM Services for UK institutions

I then attended the second half of the Jisc workshop on shared Research Data Management services for UK institutions. The University of York and the University of Cambridge are two of the 13 institutions participating in the pilot. Jenny Mitcham from York and I gave presentations on our institutional perspectives on the project: where we are at the moment and what our key expectations from the pilot are. Jenny gave an overview of the impressive work by her and her colleagues on addressing data preservation gaps at the University of York. Data preservation is one of the areas in which Cambridge hopes to get help from the Jisc RDM shared services project. Additionally, as we have described before, Cambridge would greatly benefit from solutions for big data and for personal/sensitive data. My presentation from the session is available here.

Presentations were followed by breakout group discussions. Participants were asked to identify priority areas for the Jisc RDM pilot. The top priority identified by all the groups seemed to be solutions for personal/sensitive data and for effective data access management. This was very interesting to me, as at similar workshops held by Jisc in the UK, breakout groups prioritised interoperability with their existing institutional systems and cost-effectiveness. This could be one of the unforeseen effects of the strict funders’ research data policies in the UK, which required institutions to provide local repositories for sharing research data.

As a result of these policies, many UK institutions were tasked with creating institutional data repositories from scratch in a very short time. Most UK universities now have institutional repositories which allow research data to be uploaded and shared. However, very few universities have their repositories well integrated with other institutional systems. The absence of this policy pressure perhaps allowed institutions in other countries to think more strategically about developing their RDM services and to ensure those services are well embedded within existing institutional infrastructure.

Conference papers and posters

The following two days were full of excellent talks. My main problem was deciding which sessions to attend: from talking with other attendees I know that the papers presented in parallel sessions were also extremely useful. If the budget allows, it would certainly be useful for more participants from each institution to attend the meeting and cover more of the parallel sessions.

Below are my main reflections from keynote talks.

Barend Mons – Open Science as a Social Machine

This was a truly inspirational talk, prompting a lot of thought-provoking discussion. Barend started with the reflection that more and more brilliant brains, with more and more powerful computers and billions of smartphones, have created a single, interconnected social super-machine. This machine generates data – vast amounts of data – which is difficult to comprehend and work with unless proper tools are used.

Barend mentioned that with the current speed at which new knowledge is generated and papers are published, it is simply impossible for human brains to assimilate the constantly expanding amount of new knowledge. Brilliant brains need powerful computers to process the growing amount of information. But in order for science to be accessible to computers, we need to move away from PDFs. Our research needs to be machine-readable. And perhaps if publishers do not want to support machine-readability, we need to move away from the current publishing model.

Barend also stressed that if data is to be useful and correctly interpretable, it needs to be accessible not only to machines but also to humans, and that effort is needed to describe data well. Barend said that research data without proper metadata description is useless (if not harmful). And how do we make research data meaningful? Barend proposed a very compelling solution: no more research grants should be awarded without 5% of the money dedicated to data stewardship.

I could not agree more with everything that Barend said. I hope that research funders will also support Barend’s statement.

Andrew Sallans – nudging people to improve their RDM practice

Andrew started his talk with the reflection that in order to improve our researchers’ RDM practice we need to do better than talking about compliance and about making data open. How is a researcher supposed to make data accessible if the data was not properly managed in the first place? The Open Science Framework was created with three mission statements:

  • Technology to enable change;
  • Training to enact change;
  • Incentives to embrace change.

So what is the Open Science Framework (OSF)? It is an open source platform to support researchers during the entire research lifecycle: from the start of the project, through data creation, editing and sharing with collaborators, and concluding with data publication. What I find most compelling about the OSF is that it allows one to easily connect various storage and collaboration platforms in one place: researchers can plug in their resources stored on Dropbox, Google Drive, GitHub and many others.

To incentivise behavioural change among researchers, the OSF team came up with two other initiatives:

Personally, I couldn’t agree more with Andrew that enabling good data management practice should be the starting point. We can’t expect researchers to share their research data if we have not helped them with providing tools and support for good data management. However, I am not so sure about the idea of cash rewards.

In the end, researchers become researchers because they want to share the outcomes of their research with the community. This is the principle behind academic research – the only way of moving ideas forward is to exchange findings with colleagues. Do researchers need to be paid extra to do the right thing? I personally do not think so; I believe that whoever decides to pursue an academic career is prepared to share. Our task is to make data management and sharing as easy as possible, and the OSF will certainly be of great aid to the community.

Susan Halford – the challenge of big data and social research

The last keynote was from Susan Halford. Susan’s talk was again very inspirational and thought-provoking. She talked about the growing excitement around big data and how trendy it has become – almost perceived as a solution to every problem. However, Susan also pointed out the problems with big data. Simply increasing computational power without fully comprehending the questions and the methodology used can lead to serious misinterpretation of results. Susan concluded that when doing big data research one has to be extremely careful about choosing a proper methodology for data analysis, reflecting on both the type of data being collected and (inter)disciplinary norms.

Again – I could not agree more. Asking the right question and choosing the right methodology are key to drawing the right conclusions. But are these problems new to big data research? I personally think we are all quite familiar with these challenges. Questions about the right experimental design and the right methodology have been with us for as long as the scientific method has been in use.

Researchers have always needed to design studies carefully before commencing experiments: what will the methodology be, what are the necessary controls, what should the sample size be, what needs to happen for the study to be conclusive? To me this is not a problem of big data; it is a problem that needs to be addressed by every researcher from the very start of the project, regardless of the amount of data the project generates or analyses.

Birds of a Feather discussions

I had not experienced Birds of a Feather (BoF) discussions at a conference before, and I am absolutely amazed by the idea. Before the conference started, attendees were invited to propose ideas for discussions, keeping in mind that BoF sessions might have the following scope:

  • Bringing together a niche community of interest;
  • Exploring an idea for a project, a standard, a piece of software, a book, an event or anything similar.

I proposed a session about the sharing of personal/sensitive data. Luckily, the topic was selected for discussion, and I co-chaired it together with Fiona Nielsen from Repositive. We both thought the discussion was great, and our blog post from the session is available here.

And again, I was very sorry to be the only attendee from Cambridge at the conference. There were four parallel discussions, and since I was chairing one of them, I was unable to take part in the others. I would have liked to participate in the discussions on ‘Data visualisation’ and ‘Metadata Schemas’ as well.

Workshops: Appraisal, Quality Assurance and Risk Assessment

The last day was again devoted to workshops. I attended an excellent workshop from the Pericles project on appraisal, quality assurance and risk assessment in research data management. The project looked at how an institutional repository should conduct data audits when accepting deposits, and at how to measure the risk of datasets becoming obsolete.

These are extremely difficult questions and, due to their complexity, very difficult to address. Still, the project leaders realised the importance of addressing them systematically, and ideally in a (semi-)automated way, by using specialised software to help repository managers make the right preservation decisions.

In a way I felt sorry for the presenters – their progress and ambitions were so high that probably none of us attendees was able to contribute critically to the project. We were all deeply impressed by the level of the questions asked, but our own experience with data preservation and policy automation was nowhere near the level demonstrated by the workshop leaders.

My take-home message from the workshop is that a proper audit of ingested data is of crucial importance. Even if no automation of risk assessment is possible, repository managers should at least collect information about the files being deposited, so they can assess the likelihood of their obsolescence in the future – or at least identify key file formats and software types as preservation targets, ensuring that key datasets do not become obsolete. For me the workshop was a real highlight of the conference.
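A minimal sketch of such a format audit, with invented filenames, might look like this:

```python
# Minimal sketch of the kind of format audit suggested above: record the
# file extensions in each deposit so that, later, the most common formats
# can be flagged as preservation targets. The filenames are invented.

from collections import Counter
from pathlib import PurePath

def format_inventory(filenames):
    """Count file extensions across deposited files (lower-cased);
    files with no extension are grouped under '(none)'."""
    return Counter(PurePath(name).suffix.lower() or "(none)"
                   for name in filenames)

deposit = ["results.XLSX", "analysis.R", "images.tar.gz", "README"]
inventory = format_inventory(deposit)
print(inventory.most_common())
```

Even a simple tally like this gives a repository manager a ranked list of formats to monitor for obsolescence, without requiring any sophisticated risk-assessment software.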

Networking and the positive energy

Lots of useful workshops, plenty of thought-provoking talks. But for me one of the most important parts of the conference was meeting great colleagues and having fascinating discussions about data management practices. I never thought I could spend an evening (night?) with people willing to talk about research data without the slightest sign of boredom. And the most joyful and refreshing part of the conference was that, because we were from across the globe, our discussions moved away from the compliance aspect of data policies. Free from policy, we were able to address how best to support research data management: how to best help researchers, what our priority needs are, and what data managers should do first with our limited resources.

I am looking forward to catching up next year with all the colleagues I met in Amsterdam, to see what progress we have all made with our projects and what our collective next moves should be.

To summarise, I came back with lots of new ideas, full of energy and ready to advocate for the bigger picture and the greater good. I came back exhausted, but I cannot imagine spending four days more productively and fruitfully than at IDCC.

Thanks so much to the organisers and to all the participants!

Published 8 March 2016
Written by Dr Marta Teperek

Creative Commons License