Itamar Shatz has written a guest blog post for the Office of Scholarly Communication about how public trust in the scientific community increases when researchers make their data openly available to all. He also emphasizes that science communicators (e.g. press offices, journalists, publishers) have a responsibility to point attention directly at the primary source of the data. Itamar is a PhD candidate in the Department of Theoretical and Applied Linguistics at the University of Cambridge. He is also a member of the Cambridge Data Champion programme, having joined at the start of this year. He writes about science and philosophy that have practical applications at Effectiviology.com.
It’s no secret that the public’s view of the
scientific community is far from ideal.
For example, a global survey published by the Wellcome Trust in 2019 showed that, on average, only 18% of people indicate that they have a high level of trust in scientists. Furthermore, the survey showed that there are stark differences between people living in different areas of the world; for instance, this rate was more than twice as high in Northern Europe (33%) and Central Asia (32%) than in Eastern Europe (15%), South America (13%), and Central Africa (12%).
Things do appear to be improving, to some degree, especially in light of the recent pandemic. For example, a recent survey in the UK, conducted by the Open Knowledge Foundation, has found that, following the COVID-19 pandemic, 64% of people are now “more likely to listen expert advice from qualified scientists and researchers”. Similar increases in public confidence have been found in other countries, such as Germany and the USA. However, despite these recent increases, there is still much room for improvement.
Open data can help increase the public’s confidence in
scientists
The public’s lack of confidence in
scientists is a complex, multifaceted issue, that is unlikely to be resolved by
a single, neat solution. Nevertheless, one thing that can help alleviate this
issue to some degree is open data, which is the practice of making data
from scientific studies publicly accessible.
Research on the topic shows just how powerful this tool can be. For example, the recent survey by the Open Knowledge Foundation, conducted in the UK in response to the COVID-19 pandemic, found that 97% of those polled believed that it’s important for COVID-19 data to be openly available for people to check, and 67% believed that all COVID-19 related research and data should be openly available for anyone to use freely. Similarly, a 2019 US survey conducted before the pandemic found that 57% of Americans say that they trust the outcomes of scientific studies more if the data from the studies is openly available to the public.
Overall, such surveys strongly suggest that
open data can help increase the public’s trust in scientists. However, it’s not
enough for studies to just have open data for it to increase the
public’s trust; if people don’t know about the open data, or if don’t fully understand
what it means, then open data is unlikely to be as beneficial as it could be.
As such, in the following section we will see some guidelines on how to properly
incorporate open data into science communication, in order to utilize this tool
as effectively as possible.
How to incorporate open data into science communication
To properly incorporate open data into science
communication, there are several key things that people who engage in science
communication—such as journalists and scientists—should generally do:
Say that the study has open data. That
is, you should explicitly mention that the researchers have made the data from
their research openly available. Do not assume that people will go to
the original study and then learn there about the data being open.
Explain what open data is. That is, you should briefly explain what it means for the data to
be openly available, and potentially also mention the benefits of making the
data available, for example in terms of making research more transparent, and
in terms of helping other researchers reproduce the results.
Describe what sort of data
has been made openly available. For example, you
can include descriptions of the type of data involved (surveys, clinical
reports, brain scans, etc.), together with some concrete examples that help the
audience understand the data.
Explain where the data can
be found. For example, this can be in the article’s
“supplementary information” section, though data should preferably be available
in a repository where the dataset has its own persistent identifier, such as a
DOI. This ensures that the audience can find and access the data, which may
otherwise be hidden behind a paywall, and offers other benefits, such as
allowing researchers to directly access and cite the dataset, without navigating
through the article.
These practices can help people better
understand the concept of open data, particularly as it pertains to the study
in question, and can help increase their trust in the openness of the data,
especially if it is placed somewhere that they can access themselves.
For one example of how open data might be
communicated effectively in a press release, consider the following:
“The researchers have made all the data from this study openly available; this means that all the results from their experiments can be freely accessed by anyone through a repository available at: https://www.doi.org/10.xxxxx/xxxxxxx. This can help other scientists verify and reproduce their results, and will aid future research on the topic.”
Open data in different types of scientific communications
It’s important to note that there’s no
single right way to incorporate open data into scientific communications. This
can be attributed to various factors, such as:
Differences between fields
(e.g. biology, economics, or psychology)
Differences between types
of studies (e.g. computational or experimental)
Differences between media
(e.g. press release or social media post).
Nevertheless, the guidelines outlined
earlier can be beneficial as initial considerations to take into account when
deciding how to incorporate open data into science communication. It is up to
communicators to make the final modifications, in order to use open data as
effectively as possible in their particular situation.
Summarizing what we’ve learned
Though the public’s trust in science is currently growing, there is much room for improvement. One powerful tool that can aid the academic community is open data—the practice of making data from research studies openly available. However, to benefit as much as possible from the presence of open data, it’s not sufficient for a study to merely make its data open. Rather, the accessibility of the data needs to be promoted and explained in scientific communication, and the dataset needs to be cited appropriately (see the Joint Declaration of Data Citation Principles for guidelines regarding this latter point).
What is currently being done
It is important to note that much work is already being done to promote the concept of open data. For example, organizations such as the Research Data Alliance promote discussion of the topic and publish relevant material, as in the case of their recent guidelines and recommendations regarding COVID-19 data.
In addition, at the University of Cambridge, in particular, we can already see a substantial push for open data practices, where appropriate, and from many angles as outlined in the University’s Open Research position statement. Many funding bodies mandate that data be made available, and the University facilitates the process of sharing the data via Apollo, the institutional repository. Furthermore, there are the various training courses and publications—including this very blog—led by bodies such as the Office of Scholarly Communication (OSC), which help to promote Open Research practices at the University. Most notably, there is the OSC’s Data Champion programme, which deals, among other things, with supporting researchers with open data practices.
Moving forward
Promoting the use of open data in scientific
communication is something that different stakeholders can do in different
ways.
For example, those engaging in science
communication—such as journalists and universities’ communication offices—can
mention and explain open data when covering studies. Similarly, scientists can
ask relevant communicators to cite their open data, and can also mention this
information themselves when they engage in science communication directly. In
addition, consumers of scientific communication and other relevant stakeholders—such
as the general public, politicians, regulators, and funding bodies—can ask,
whenever they hear about new research findings, whether the data was made
openly available, and if not, then why.
Overall, such actions will lead to increased and more effective use of open data over time, which will help increase the trust people have in scientists. Furthermore, this will help promote the adoption of open data practices in the scientific community, by making more scientists aware of the concept, and by increasing their incentives for engaging in it.
The Cambridge Data Champions are an example of a community of volunteers engaged in promoting open research and good research data management (RDM). Currently entering its third year, the programme has attracted a total of 127 volunteers (86 current, 41 alumni) from diverse disciplinary backgrounds and positions. It continues to grow and has inspired similar initiatives at other universities within and outside the UK (Madsen, 2019). Dr Sacha Jones, Research Data Coordinator at the Office of Scholarly Communication, recently shared information about the programme at ‘FAIR Science: tricky problems and creative solutions’, an Open Science event held on 4th June 2019 at The Queen’s Medical Research Institute in Edinburgh, and organised by a previous Cambridge Data Champion – Dr Ralitsa Madsen. The aim of this event was to disseminate information about Open Science and promote the subsequent set-up of a network of Edinburgh Open Research Champions, with inspiration from the Cambridge Data Champion programme. Running a Data Champion programme, however, is not free of challenges. In this blog, Sacha highlights some of these alongside potential solutions in the hope that this information may be helpful to others. In this vein, Ralitsa adds her insights from ‘FAIR Science’ in Edinburgh and discusses how similar local events may spearhead the development of additional Open Science programmes/networks, thus broadening the local reach of this movement in the UK and beyond.
#FAIRscienceEDI
On 4 June 2019, the University of Edinburgh hosted ‘FAIR Science: tricky problems and creative solutions’ – a one-day event that brought together local life scientists and research support staff to discuss systemic flaws within current academic culture as well as potential solutions. Funded by the Institute for Academic Development and the UK Biochemical Society, the event was popular – with around 100 attendees – featuring both students, postdocs, principal investigators (PIs) and administrative staff. The programme featured talks by a range of local researchers – Dr Ralitsa Madsen (postdoctoral fellow and event organiser), Dr William Cawthorn (junior PI), Prof Robert Semple (Dean of Postgraduate Research and senior PI), Prof Malcolm Macleod (senior PI and member of the UK Reproducibility Network steering group), Prof Andrew Millar (senior PI and Chief Scientific Advisor on Environment, Natural Resources and Agriculture, for Scottish Government), Aki MacFarlene (Wellcome Trust Open Research Programme Officer), Dr Naomi Penfold (Associate Director, ASAPbio), Dr Nigel Goddard and Rory Macneil (RSpace developers) and Robin Rice (Research Data Service, University of Edinburgh), and Dr Sacha Jones (University of Cambridge). All slides have been made available via the Open Science Framework, and “live” tweets can be found via #FAIRScienceEDI.
Shifting the balance of research culture for the better. Image source: Presentation by Ralitsa Madsen, ‘Why FAIR Science and why now?’
Why is open science important? What is the extent of the reproducibility problem in science, and what are the responsibilities of individual stakeholders? Do all researchers need to engage with open research? Are the right metrics used when assessing researchers for appointment, promotion and funding? What are the barriers to widespread change, and can they be overcome through collective efforts? These were some of the ‘tricky’ problems that were addressed during the first half of the ‘Fair Science’ event, with the second half focussing on ‘creative solutions’, including: abandoning the journal impact factor in favour of alternative and fairer assessment criteria such as those proposed in DORA; preprinting of scientific articles and pre-registration of individual studies; new incentives introduced by funders like the Wellcome Trust who seek to promote Open Science; and data management tools such as electronic lab notebooks. Finally, the event sought to inspire local efforts in Edinburgh to establish a volunteer-driven network of Open Research Champions by providing insight into the maturing Data Champion programme at the University of Cambridge. This was a popular ‘creative solution’, with more than 20 attendees providing their contact details to receive additional information about Open Science and the set-up of a local network.
Overall, community engagement was a recurring theme during the ‘FAIR Science’ event, recognised as a catalyst required for research culture to change direction toward open practices and better science. Robert Semple discussed this in the greatest detail, suggesting that early stage researchers – PhDs and post-docs – are the building blocks of such a community, supported also by senior academics who have a responsibility to use their positions (e.g. as group leaders, editors) to promote open science. “Open Science is a responsibility also of individual groups and scientists, and grass roots efforts will be key to culture shift” (Robert Semple’s presentation). On a larger scale, Aki MacFarlene aptly stated that a supportive research ecosystem is needed to support open research; for example, where institutions as well as funders recognise and reward open practices.
Insights from the Cambridge Data Champion programme
The Data Champions at the University of Cambridge are an example of a community and a source of support for others in the research ecosystem. Promoting good RDM and the FAIR principles are two fundamental goals that Data Champions commit to when they join the programme. For some, endorsing open research practices is a fortuitous by-product of being part of the programme, yet for others, this is a key motivation for joining.
This word cloud depicts the reasons why the Cambridge Data Champions applied to become a Data Champion (the larger the text size, the more common the response). It is based on data from 105 applicants responding to the following: “What is your main motivation for becoming a Data Champion?”
Now that the Data Champion programme has been running for three years, what challenges does it face, and might disclosing these here – alongside ongoing efforts to solve them – help others to establish and maintain similar initiatives elsewhere?
Four main challenges are outlined that the programme either has or continues to experience. These are discussed in increasing scale of difficulty to overcome.
Support
Retention
Disciplinary coverage
Measuring effectiveness
(See also a recent article about the Data Champion programme by James Savage and Lauren Cadwallader.)
What challenges does the Cambridge Data Champion programme face and how may these be overcome? (image: CC0)
Support
At a basic level, an initiative like the Data Champion programme needs both financial and institutional support. The Data Champions commit their time on a voluntary basis, yet the management of the programme, its regular events and occasional ad hoc projects all require funds. Currently, the programme is secure, but we continue to seek funding opportunities to support a community that is both expanding and deserving of reward (e.g. small grants awarded to Data Champions to support their ‘championing’ activities). Institutional support is already in place and hopefully this will continue to consolidate and grow now that the University has publicly committed to supporting open research.
Retention
Not all Data Champions who join will remain Data Champions. In fact, there is a growing community of alumni Data Champions. There are currently 41 alumni Data Champions. From the feedback provided by just over half of these, 68% left the programme because they left the University of Cambridge (as expected given that the majority of Data Champions are either post-docs or PhD students), and 32% left because of a lack of time to commit to the role. Of course, there might be other reasons that we are not aware of, and we cannot speculate here in the absence of data. Feedback from Data Champions is actively sought and is an essential part of sustaining and developing this type of community.
We are exploring various methods to enhance retention. To combat the pressures of individuals’ workloads, we are being transparent about the time that certain activities will involve – a task or process may be less overwhelming when a time estimate is provided (cf ‘this survey should take approximately ten minutes to complete’). We also initiated peer-mentoring amongst Data Champions this year, in part to encourage a stronger community. We are attempting to enhance networking within the community in other ways, during group discussion sessions in the bimonthly forums, and via a virtual space where Data Champions can view each other’s data-related specialisms – with mutual support and collaboration as intended by-products. These are just a few examples, and given that Data Champions are volunteers, retention is one of several aspects of the programme that requires frequent assessment.
Disciplinary coverage
Cambridge has six Schools – Arts and Humanities, Humanities and Social Sciences, Biological Sciences, Physical Sciences, Clinical Medicine, and Technology – with faculties, departments, centres, units, institutes nested within these. The ideal situation would be for each research community (e.g. a department) to be supported by at least one Data Champion. Currently this is not the case, and the distribution of Data Champions across the different disciplinary areas is patchy. Biological Sciences is relatively well-represented by Data Champions (there are 22 Data Champions to represent around 1742 researchers in the School, i.e. 1.3%) (see bar chart below). There is a clear bias towards STEM (science, technology, engineering and maths) disciplines, yet representation in the social sciences is fair. At the more extreme end is an absence of Data Champions in the Arts and Humanities. We are looking to resolve this via a more targeted approach, guided in part by insights gained into researcher needs via the OSC’s training programme for arts, humanities and social sciences researchers.
The bars depict the number of Data Champions within each School. Percentage values give the number of Data Champions as a proportion of the total number of researchers within each School. For example, within the School of Clinical Medicine, the ratio of Data Champions to researchers is around 1:100 (researchers include contract and established researchers, and PhD students).
Measuring effectiveness
Determining how well the Data Champion programme is working is a sizeable challenge, as discussed previously. In those research communities represented by Data Champions, do we see improvements in data management, do we see a greater awareness of the FAIR principles, is there a change in research culture toward open research? These aspects are extremely difficult to measure and to assign to cause and effect, with multiple confounding factors to consider. We are working on how best to do this without overloading Data Champions and researchers with too many administrative tasks (e.g. surveys, questionnaires, etc.). Yet, the crux is for there to exist good communication and exchange of information between us (as a unit that is centrally managing the Data Champion programme) and the Data Champions, and between the Data Champions and the researchers who they are reaching out to and working with. We need to be the recipients of this information so that we can characterise the programme’s effectiveness and make improvements. As a start, the bimonthly Data Champion forums are used as an ideal venue to exchange and sound out ideas about best approaches, so that decisions on how to measure the programme’s impact lie also with the Data Champions.
A fifth challenge – recognition and reward
At the ‘FAIR Science’ event, two speakers (Naomi Penfold and Robert Semple) made a plea for those researchers who practise open science to be recognised for this – a change in reward culture is required. In a presentation centred on the misuse of metrics, Will Cawthorn referred to poor mental health in researchers as a result of the pressures of intrinsic but flawed methods of assessment. Understandably, DORA was mentioned multiple times at ‘FAIR Science’, and hopefully, with multiple universities including the University of Cambridge and University of Edinburgh as recent signatories of DORA, this marks the first steps toward a healthier and fairer researcher ecosystem. This may seem rather tangential to the Data Champions, but it is not: 66% of Data Champions, current and alumni, are or have been researchers (e.g. PhDs, post-docs, PIs). Despite the pressures of ‘publish or perish’, they have given precious time voluntarily to be a Data Champion and require recognition for this.
This raises a fifth challenge faced by the programme – how best to reward Data Champions for their contributions? Effectively addressing this may also help, via incentivisation, toward meeting three of the four challenges above – retention, coverage and measurement. While there is no official reward structure in place (see Higman et al. 2017), the benefits of being part of the programme are emphasised (networking opportunities, skills development, online presence as an expert, etc.), and we write to Heads of Departments so that Data Champions are recognised officially for their contributions. Is this enough? Perhaps not. We will address this issue via discussions at the September forum – how would those who are PhD students, post-docs, PIs, librarians, IT managers, data professionals (to name a few of the roles of Data Champions) like to be rewarded? In sharing these thoughts, we can then see what can be done.
Towards growing communities of volunteers
The Cambridge Data Champion programme is one among several UK- and Europe-wide initiatives that seek to promote good RDM and, more generally, Open Science. Their emergence speaks to a wider community interest and engagement in identifying solutions to some of the key issues haunting today’s academic culture (Madsen 2019). While the foundations of a network of Edinburgh Open Research Champions are still being laid, TU Delft in the Netherlands has already got their Data Champion programme up and running with inspiration from Cambridge. Independently, several Universities in the UK have also established their own Open Research groups, many of which are joined together through the recently established UK Reproducibility Network (UKRN) and the associated UK Network of Open Research Working Groups (UK-ORWG). Such integration fosters network crosstalk and is a step in the right direction, giving volunteers a stronger sense of ‘belonging’ while also actively working towards their formal recognition. Network crosstalk allows for beneficial resource sharing through centralised platforms such as the Open Science Framework or through direct knowledge exchange among neighbouring institutions. Following ‘FAIR Science’ in Edinburgh, for example, a meeting to discuss its outcome(s) involved members from Glasgow University’s Library Services (Valerie McCutcheon, Research Information Manager) and the UKRN’s local lead at Aberdeen University (Dr Jessica Butler, Research Fellow, Institute of Applied Health Science). Thus, similar to plans in Aberdeen, the ‘FAIR Science’ organisers are currently working with Edinburgh University’s Research Data Support team to adapt an Open Science survey developed and used at Cardiff University to guide the development of a specific Open Science strategy. This reflects the critical requirements for such strategies to be successful – active peer-to-peer engagement and community involvement to ensure that any initiatives match the needs of those who ought to benefit from them.
The long-term success of Open Science strategies – and any associated networks – will also hinge upon incorporation of formal recognition, as alluded to in the context of the Cambridge Data Champion programme. The importance of formal recognition of Open Science volunteers is also exemplified in SPARC Europe’s recent initiative – Europe’s Open Data Champions – which aims to showcase Open Data leaders who help ‘to change the hearts and minds of their peers towards more Openness’.
For formal recognition to gain traction, it will be critical to work towards recruitment of several prominent senior academics on board the Open Science wagon. By virtue of their academic status, such individuals will be able to put Open Science credentials high on the agenda of funding and academic institutions. Indeed, the establishment of the UKRN can be ascribed to a handful of senior researchers who have been able to secure financial support for this initiative, in addition to inspiring and nucleating local engagement across several UK universities. The ‘FAIR Science’ experience in Edinburgh supports this view. While difficult to prove, its impact would likely have been minimal without the involvement of prominent senior academics, including Professor Robert Semple (Dean of Postgraduate Research), Professor Malcolm Macleod (UKRN steering group member) and Professor Andrew Millar (Chief Scientific Advisor on Environment, Natural Resources and Agriculture, for Scottish Government). Thus, in addition to targeted and continuous communication by the ‘FAIR Science’ organisers before and after the event, ongoing efforts to establish a network of Edinburgh Open Research Champions has been dependent on these senior academics and their ability to mobilise essential forces throughout the University of Edinburgh.
Amongst several other factors, community engagement is central to making improvements toward reproducibility, Open Science and Open Research in general. There are multiple stakeholders involved with their own responsibilities, and senior academics are a notable part of this. Image source: Robert Semple’s presentation at #FAIRscienceEdi, ‘The “Reproducibility Crisis”: lessons learnt on the job’.
Top-down or bottom-up?
Establishing and maintaining a champions initiative need not be conceived of as succeeding via either a top-down or bottom-up approach. Instead, a combination of the best of both of these approaches is optimal, as hopefully comes across here. The emphasis on such initiatives being community driven is essential, yet structure is also required so as to ensure their maintenance and longevity. Hierarchies have little place in such communities – there are enough of these already in the ‘researcher ecosystem’ – and the beauty of such initiatives is that they bring together people from various contexts (e.g. in terms of role, discipline, institution). In this sense, the Cambridge Data Champions community is especially robust because of its diversity, being comprised of individuals who derive from highly varied roles and disciplinary backgrounds. Every champion brings their own individual strengths; collectively, this is a powerful resource in terms of knowledge and skills. Through acting on these strengths and acknowledging their responsibilities (e.g. to influence, teach, engage others), and by being part of a community like those described here, champions have the opportunity to make perhaps a wider contribution to research than ever anticipated, and certainly one that enhances its overall integrity.
References
Higman, R., Teperek, M. & Kingsley, D. (2017). Creating a community of Data Champions. International Journal of Digital Curation 12 (2): 96–106. DOI: https://doi.org/10.2218/ijdc.v12i2.562
Savage, J. & Cadwallader, L. (2019). Establishing, Developing, and Sustaining a Community of Data Champions. Data Science Journal 18 (23): 1–8. DOI: https://doi.org/10.5334/dsj-2019-023
The Cambridge Data Champions (DCs) advocate good Research Data Management (RDM) and Open Data practices to researchers locally in their departments, within Cambridge University in general, and sometimes further afield. They network with one another, exchange good methods of RDM, share ideas and, as a collective, reflect on current issues surrounding RDM, Open Data and researcher engagement, where a major shared goal is to establish best practices when it comes to research data. By attending bi-monthly forums facilitated by the Research Data Team, the DCs convene as a community, hear speakers presenting on relevant topics, and engage in workshops that will help them in their ‘championing’ activities. Following up from our latest blog which summarised how a workshop led to the creation of cartoon postcards as a new tool to add to the DCs’ resource kit for RDM advocacy, we are now reflecting on initiatives that sprung from workshops during the past year and are considering the challenges and opportunities that this programme brings as it approaches the end of its third year.
Growing
The programme started in Autumn 2016, comprising researchers who volunteered to become local community experts and advocate on research data management and sharing. Our first call welcomed 43 DCs (September 2016), our second call 20 DCs (March 2018) and the third call 40 DCs (January 2019). For simplicity, this year we also added to our statistics the “affiliate” DCs, who are colleagues who contribute to the DC community in other ways (as interested members of Cambridge’s RDM Project Group) and not necessarily through channelling their RDM efforts for the benefit of a specific department.
We are now a community comprised of 87 active DCs.
Total number of Data Champions who joined in each year (orange column indicates Champions who are still active; blue column indicates Champions who are now alumni).
Communities within a community
Over the last year we caught ourselves using words such as the ‘old DCs’ and the ‘new DCs’ and what we really meant was ‘established DCs’ and ‘new DCs’, with the latter group being those joining the programme each year. In September we celebrate the programme’s third birthday and it is reasonable to expect that there will be more experienced DCs who have already built their networks and have, more or less, a stable offering of RDM support and an enhanced understanding of the needs of their department. On the other hand, there are those who are being welcomed into the group who seek, to differing degrees, initial support from both the RDM team and their fellow colleagues in order to become successful DCs. It is easy to imagine that different layers are being developed with different needs, both in terms of support and engagement.
Through various activities and feedback from DCs, we now have a good quantity of raw data to analyse their needs for being, as we called it, ‘a good Data Champion’. We have brainstormed ideas which we are putting into action to respond to the challenges of an ever-growing Data Champions group.
Planning
DC Welcome Pack
Every year we circulate the Data Champions Welcome Pack to coincide with the inductions we organise to welcome new DCs into the group. This year we included in the pack what it is expected from a DC when s/he joins the programme so that expectations are clearly communicated from the beginning and are the same for everybody.
Page from the Cambridge Data Champions Welcome Pack
Bi-monthly forums
Lightning talks have been introduced as a standard item in each forum. These have provided DCs with the opportunity to discuss aspects of RDM they are working on (e.g. new tools and techniques), or to feed back to the group on DC activities undertaken in their departments and data-related events they have attended so that the whole group can benefit. Importantly, the lightning talks have been used by DCs to problem solve, where the collective knowledge and experience of DCs attending a forum has been harnessed to address particular challenges faced by individual DCs. This is where the community aspect of the programme truly shines.
It is always a priority for us to invite speakers to forums who are external to the programme, reflecting the needs of both the new and established DCs. For example, Hannah Clements from Cambridge University’s Researcher Development Programme (RDP) spoke to the DCs at the January forum about mentoring, providing guidance on how support can be best delivered within the DC community. In the May forum, we had talks and discussions from a panel of experts working on different aspects of data archiving. The panellists came from across the University bringing a diversity of experience, grounded in clinical governance, computing, and more traditional archiving. These examples are just a couple of the themes that we have covered so far in the forums, which have been derived predominantly from information provided by (and the needs of) the DCs themselves. Additional topics that we plan to cover in future forums include issues surrounding reproducibility, IP and commercialisation, publishing and the impact of research data.
Key aims of these forums are to not only facilitate networking between DCs but to also act as an arena for the transfer of knowledge along the ‘researcher pipeline’, from forum to DCs and from DCs to researchers in their departments.
DCspecialisation group
As a community, we need to be able to map expertise internally and understand the make-up of such an organic group at any given moment. This makes it is easier to support each other and create collaborations, but also improves how we promote the programme externally.
Areas of expertise amongst our Data Champions
This led to the formation of the DC specialisation group, consisting of one of us and six of the DCs, which determined how to categorise expertise within the group. As a result, a spreadsheet was created where all DCs can chart their specialist areas and update or amend when necessary (and at least annually). We have top level categories for simple statistical analysis and second level categories that offer more specific details for the benefit of the DC community.
The next stage is to include the wider research community and improve how various stakeholders can reach the appropriate Data Champions for initial advice and support in RDM issues. One way to do this is by presenting more coherent and consistent specialisations on the Data Champions’ website, using the categories which we have already created for internal use within the group. This stage is due to begin this month and we hope to report on our efforts next year.
Branding group
A growing community is inevitably going to bring to the forefront various identity discussions. With this in mind, we formed a branding group to examine if a DC logo should be created to enhance the Data Champions’ visibility and raise their profile amongst their peers when advocating for RDM. A logo has been created and is going through various stages of approval before it will be released later this year.
Pilot programme – Mentoring
In February 2019, we initiated a pilot mentoring project as part of the induction process for the new DCs. The mentors are established DCs who have volunteered to support those new DCs wishing to take part in this pilot exercise. This followed on from our January forum where the benefits of mentoring for both mentees and mentors were outlined by Hannah Clements of RDP. At this forum, which preceded the University-wide call for new DCs, we also held a workshop where DCs were divided into three groups and asked three questions: what do you wish you knew when you first became a DC that you know now; what could you offer as mentors to the new DCs; how do you think the mentor-mentee system could work? The responses from DCs in the three groups informed the implementation, structure and aims of the mentoring pilot.
Our aim is to learn from this project in close consultation with both mentors and mentees. We want to see if this process helps new DCs to establish themselves within their departments/institutes. Will it be effective? The findings will inform our steps for the following year. Watch this space!
Fostering clusters within departments
We have excellent examples of departments that promote their DCs within their institutions. A good example is the Chemistry department, which has a cluster of five DCs who work together in their advocacy. During this year’s call for new DCs, and with help from the Department Librarian, we used a targeted approach at advertising the DC Programme within the Department of Engineering. This was highly successful, resulting in ten new Data Champions from Engineering from various roles and Academic Divisions. They represent a hub with the local knowledge, experience and skills to assess their department’s needs and explore best approaches to support good RDM practices and Open Research, ones that are tailored to the discipline.
Alumni community
Heading toward the programme’s third birthday means that we are growing bigger but also that we are developing an alumni community as well. This is a different kettle of fish but it is on our radar to investigate how we can foster this distinct group and build a network that is not only Cambridge based but has a more national and even international outlook.
Funding
Let’s not forget that the DC programme consists of volunteers. We are in the process of seeking more funds to support this ever increasing community, to run expanding bimonthly forums, and to be able to offer grants to assist DCs in their endeavours. As an example, we supported one of the DCs, James Savage, to bring the programme to the international stage in November at the SCIDataCon 2018 in Botswana. He talked about the programme as well as his experience of being a DC. This resulted in James writing a paper together with Lauren Cadwallader, to be published soon in Data Science Journal (the accepted manuscript and associated data available now in Apollo, the Cambridge University institutional repository).
An exciting year so far!
During this third year of the DC programme the number of active DCs across the University of Cambridge has doubled. We can only anticipate it growing further each year, yet balanced by an expanding community of alumni DCs as, for example, DCs leave Cambridge. The DC community is inherently dynamic, as is the programme. Because of this, we always seek to respond and adapt to changing conditions in novel and beneficial ways while maintaining the programme’s core structure to provide strong foundations. This has been a period of reflection, organisation and anticipation, all required to drive the Data Champion programme forward and tackle current challenges effectively, as well as those that lie ahead – more on this to come soon!
We need to recognise good practice, engage researchers early in their career with research data management and use peers to talk to those who are not ‘onboard’. These were the messages five attendees at the Engaging Researchers in Good Data Management conference held on the 15th of November.
The Data Champions and Research Support Ambassadors programmes are designed to increase confidence in providing support to researchers in issues around data management and all of scholarly communications respectively. Thanks to the generous support of the Arcadia Foundation, five places were made available to attend this event. In this blog post the three Data Champions and two Research Support Ambassadors who were awarded the places give us the low-down on what they got out of the conference and how they might put what they heard into practise.
Recordings of the talks from the event can be found on the Cambridge University Library YouTube channel.
Financial recognition is the key
Dr Laurent Gatto, Senior Research Associate, Department of Biochemistry, University of Cambridge and Data Champion
As a researcher who cherishes good and reproducible data analysis, I naturally view good data management as essential. I have been involved in research data management activities for a long time, acting as a local data champion and participating in open research and open data events. I was interested in participating in this conference because it gathered data champions, stewards and alike from various British and European institutions (Cambridge, Lancaster, Delft), and I was curious to see what approaches were implemented and issues were addressed across institutions. Another aspect of data championship/stewardship I am interested in is the recognition these efforts offer (this post touches on this a bit).
Focusing on the presentations from Lancaster, Cambridge and Delft, it is clear that direct engagement from active researchers is essential to promote healthy data management. There needs to be an enthusiastic researcher, or somebody that has some experience in research, to engage with the research community about open data, reproducibility, transparency, security; a blunt top-down approach lead to limited engagement. This is also important due to the plurality of what researchers across disciplines consider to be data. An informal setting, ideally driven by researchers and, or in collaboration with librarians, focusing on conversations, use-cases, interviews, … (I am just quoting some successful activities cited during the conference) have been the most successful, and have sometime also lead to new collaborations.
Despite the apparent relative success of these various data championing efforts and the support that the data champions get from their local libraries, these activities remain voluntary and come with little academic reward. Being a data champion is certainly an enriching activity for young researchers that value data, but is comes with relatively little credit and without any reward or recognition, suggesting that there is probably room for a professional approach to data stewardship.
With this in mind, I was very interested to hear the approach that is currently in place at TU Delft, where data stewards hold a joint position at the Centre for Research Data and at their respective faculty. This defines research data stewardship as an established and official activity, allows the stewards to pursue a research activity, and, explicitly, links research data to research and researchers.
I am wondering if this would be implemented more broadly to provide financial recognition to data stewards/champions, offer incentives (in particular for early-career researchers) to approach research data management professionally and seriously, make data management a more explicit activity that is part of research itself, and move towards a professionalisation of data management posts.
Inspiration and ideas
Angela Talbot, Research Governance Officer, MRC Biostatistics Unit and Data Champion
Tasked with improving and updating best practice in the MRC Biostatistics Unit, I went along to this workshop not really knowing what to expect but hopeful and eager to learn.
Good data management can meet with resistance as while it’s viewed as an altruistic and noble thing to do many researchers worry that to make their research open and reproducible opens them to criticism and the theft of ideas and future plans. What I wanted to know are ways to overcome this.
And boy did this workshop live up to my expectations! From the insightful opening comments to the though provoking closing remarks I was hooked. All of the audience were engaged in a common purpose, to share their successes and strategies for overcoming the barriers that ensure this becomes best practice.
Further excellent examples were provided by the lightning talks and for me, it was certainly helpful to hear of successes in engaging researchers on a departmental level.
The highlight for me were the focus groups – I was involved in Laurent Gatto’s group discussing how to encourage more good data management by highlighting what was in to for researchers who participate but I really wish I could have been in them all as the feedback indicated they had given useful insights and tips.
All in all I came away from the day buzzing with ideas. I spent the next morning jotting down ideas of events and schemes that could work within my own unique department and eager to share what I had learnt. Who knows, maybe next time I’ll be up there sharing my successes!!
We need to speak to the non-converted
Dr Stephen Eglen, Reader in Computational Neuroscience, Department of Applied Mathematics & Theoretical Physics, University of Cambridge and Data Champion
The one-day meeting on Engaging Researchers in Good Data Management served as a good chance to remind all of us about the benefits, but also the responsibilities we have to manage, and share, data. On the positive side, I was impressed to see the diversity of approaches lead by groups around the UK and beyond. It is heartening to see many universities now with teams to help manage and share data.
However, and more critically, I am concerned that meetings like this tend to focus on showcasing good examples to an audience that is already mostly convinced of the benefits of sharing. Although it is important to build the community and make new contacts with like-minded souls, I think we need to spend as much time engaging with the wider academic community. In particular, it is only when our efforts can be aligned with those of funding agencies and scholarly publishing that we can start to build a system that will give due credit to those who do a good job of managing, and then sharing, their data. I look forward to future meetings where we can have a broader engagement of data managers, researchers, funders and publishers.
I am grateful to the organisers to have given me the opportunity to speak about our code review pilot in Neuroscience. I particularly enjoyed the questions. Perhaps the most intriguing question to report came in the break when Dr Petra ten Hoopen asked me what happens if during code review a mistake is found that invalidates the findings in the paper? To which I answered (a) the code review is supposed to verify that the code can regenerate a particular finding; (b) that this is an interesting question and it would probably depend on the severity of the problem unearthed; (c) we will cross that bridge when we come to it. Dr ten Hoopen noted that this was similar to finding errors in data that were being published alongside papers. These are indeed difficult questions, but I hope in the relatively early days of data and code sharing, we err on the side of rewarding researchers who share.
Teach RDM early and often
Kirsten Elliott, Library Assistant, Sidney Sussex College, University of Cambridge and Research Support Ambassador
Prior to this conference, my experience with Research Data Management (RDM) was limited to some training through the Office of Scholarly Communication and Research Support Ambassadors programme. This however really sparked my interest and so I leapt at the opportunity to learn more about RDM by attending this event. Although at times I felt slightly out of my depth, it was fascinating to be surrounded by such experts on the topic.
The introductory remarks from Nicole Janz were a fascinating overview of the reproducibility crisis, and how this relates to RDM, including strategies for what could be done, for example setting reproducing studies as assignments when teaching statistics. This clarified for me the relationship between RDM and open data, and transparency in research.
There were many examples throughout the day of best practice in promoting good RDM, from the “Data Conversations” held at Lancaster University, international efforts from SPARC Europe and even some from Cambridge itself! Common ground across all of them included the necessity of utilising engaged researchers themselves to spread messages to other researchers, the importance of understanding discipline specific issues with data, and an expansive conception of what counts as “data”.
I am based in a college library and predominantly work supporting undergraduate students, particularly first years. In a way this makes it quite a challenge to present RDM practices as many of the issues are most obviously relevant to those undertaking research. However, I think there’s a strong argument for teaching about RDM from very early in the academic career to ingrain good habits, and I will be thinking about how to incorporate RDM into our information literacy training, and signposting students to existing RDM projects in Cambridge.
Use peers to spread the RDM message
Laura Jeffrey, Information Skills Librarian, Wolfson College, University of Cambridge and Research Support Ambassador
This inspirational conference was organised and presented by people who are passionate about communicating the value of open data and replicability in research processes. It was valuable to hear from a number of speakers (including Rosie Higman from the University of Manchester, Marta Busse-Wicher from the University of Cambridge and Marta Teperek from TU Delft) about the changing role of support staff, away from delivering training to one of coordination. Peers are seen to be far more effective in encouraging deeper engagement, communicating personal rather than prescriptive messages (evidenced by Data Conversations at Lancaster University). A member of the audience commented that where attendance is low for their courses, the institution creates video of researcher-led activities to be delivered at point of need.
I was struck by two key areas of activity that I could act on with immediate effect:
Inclusivity – Beth Montagu Hellen (Bishop Grosseteste) highlighted the pressing need for open data to be made relevant to all disciplines. Cambridge promotes a deliberately broad definition of data for this reason. Yet more could be done to facilitate this; I’ll be following @OpenHumSocSci to monitor developments. We’re fortunate to have a Data Science Group at Wolfson promoting examples of best practice. However, I’m keen to meet with them to discuss how their activities and the language they use could be made more attractive to all disciplines.
Communication – Significant evidence was presented by Nicole Janz, Stephen Eglen and others, that persuading researchers of the benefits of open data leads to higher levels of engagement than compulsion on the grounds of funder requirements. This will have a direct impact on the tone and content of our support. A complimentary approach was proposed: targeted campaigns to coincide with international events in conjunction with frequent, small-scale messages. We’ll be tapping into Love Data Week in 2018 with more regular exposure in email communication and @WolfsonLibrary.
As result of attending this conference, I’ll be blogging about open data on the Wolfson Information Skills blog and providing pointers to resources on our college LibGuide. I’ll also be working closely with colleagues across the college to timetable face-to-face training sessions.
Published 15 December 2017 Written by Dr Laurent Gatto, Angela Talbot, Dr Stephen Eglen, Kirsten Elliott and Laura Jeffrey
During International Data Week 2016, the Office of Scholarly Communication is celebrating with a series of blog posts about data. The first post was a summary of an event we held in July. This post looks at the challenges associated with financially supporting RDM training.
The problem
There is a desperate need for training in research data management. Our significant engagement with researchers at the University of Cambridge over the past 18 months has indicated to us that research data cannot be effectively shared if it has not been properly managed during the research lifecycle. Researchers cannot be expected to share their data at the end of their research project if they are unable to locate their data, if the data is not correctly labelled or if it lacks metadata to make the data re-usable. We have stated that on several occasions, including our response to the draft UK Concordat on Open Research Data which was released in its final form on the 28th July this year.
To test whether our beliefs were in-line with researchers’ needs, last year we conducted a short survey on research data management needs among our academic community. Of those responding, 94% of researchers indicated that it would be ‘useful’ or ‘very useful’ to have workshops on research data management (see our earlier blog post for the full discussion of the survey results). In response to this we developed a 1.5 hour introductory workshop to research data management and started delivering this to researchers in Cambridge in July 2015.
Train early, train often
Our workshops have evolved substantially since the initial sessions, through responses to the feedback we collect. The workshops are now considerably improved, not only more interactive, but also have been extended to 2.5 hours to allow time to provide more practical examples of good data management.
The feedback from our workshops was overwhelmingly positive. As a result, many departments identified research data management skills as core competencies needed by every PhD student and asked us to deliver our workshops as part of their compulsory training for PhD students. This is fantastic news for both our team and the awareness of RDM at Cambridge, and we have therefore accepted all these individual requests.
But sometimes you can be a little too successful. We recently received another request from the Graduate School of Life Sciences to make our workshop compulsory for all their PhD students. While this is great news, there are 400 new PhD students every year at the Graduate School of Life Sciences. The maximum capacity at our workshop is 20 attendees, which means we would need to deliver 20 workshops throughout the year to cover this cohort. And when we consider the Graduate School of Life Sciences is one of five schools accepting PhD students in Cambridge, we need to look at how we would respond if other schools approached us with similar requests.
Another issue is that while our improved workshop on research data management covered the basic research data management needs, researchers have told us they need more in-depth discipline-specific training. We recognise this, and do try to provide training that is directed at the audience – a challenge when we need to serve all researchers across the University – from arts and humanities through social and life sciences, to medical research and particle physics (and many, many other disciplines).
So we have identified a broad need for RDM training, a specific need for training for PhD students as they begin their research, as well as discipline-specific support for all researchers. But there are only two staff members in the Research Data Team who can deliver RDM training, and delivering training is only one of many tasks which we need to undertake for the Research Data Facility to function. We simply do not have the capacity to meet this obvious need. After discussion we have agreed to deliver four workshops out of the requested 20 for the Graduate School of Life Sciences and to reconsider the situation in the summer break next year.
The plan – Data Champions
We have begun to think both about how we can meet RDM training needs given current staff capacity and how we can link the experts in different aspects of data management that we know exist around the University of Cambridge. There is already an active OpenCon Cambridge group which promotes the benefits of Open Access, Open Data and Open Education; but we wanted to focus on all aspects of RDM.
We have started to develop the idea of having Data Champions in each department, institute or college who can act as the local experts as well as delivering some discipline-specific training to their community. This has the advantage of increasing the number of trainers across the University, making the workshops tailored to the relevant audience, and building a community of experts.
The University of Cambridge is not alone in facing this problem and the several other universities are pursing, or already have, Data Champions in some form. The idea was recently discussed on both Jisc RDM mailing lists and at the Research Data Network meeting which was held at Corpus Christi College, Cambridge last week. At this meeting several different models were proposed, including a national network of champions who could advocate within disciplines at a senior level.
Whilst a national network of champions would be great in the long term we still have an immediate problem within the University of Cambridge and so we have launched a Call for Data Champions to help us raise awareness and increase the amount of training available. The call is open until 17th October and we would welcome any research or support staff or research student with an interest in RDM. There will be support available in learning how to deliver RDM training, a template workshop provided, the opportunity to influence the future of RDM services and a website built to showcase our Data Champions.
We hope to bring all the Data Champions together towards the end of the Michaelmas term so they can start delivering workshops in 2017. We hope to foster a community of experts who can share their knowledge with each other and the Research Data Team so we are in a better position to support our researchers.
The future
The community of Data Champions we hope to bring together at Cambridge will begin to ameliorate some of the problems we are facing with regards to RDM training. However, we will still struggle with staff capacity and at some point researchers will need to be appropriately rewarded for sharing data and supporting others to do the same, a theme of our recent Open Research discussion event.
So we have a temporary solution, and one which we hope will significantly improve the RDM training available at Cambridge, but the issue will not be solved until the underlying incentives for sharing data, publications and all the other research outputs have been addressed.
Published 14 September 2016 Written by Rosie Higman and Dr Marta Teperek
On Friday 22 January Cambridge University invited our two main charity funders to discuss their views on data management and sharing with Cambridge researchers. David Carr from the Wellcome Trust and Jamie Enoch from Cancer Research UK came to the University to talk to our researchers.
It is not recommended that researchers simply share a link and release the data when requested. Research data should be available, accessible and discoverable.
The first responsibility is to protect the study participants. The funders provide guidance documents on sharing of patient data. Ethics committees also provide advice and guidance on what data can be shared. In principle, patient data should be safeguarded, but this should not preclude sharing. There are models for managed access to data that allow personal/sensitive data to be shared for legitimate purposes in a safe and secure manner.
The funders do not want to prevent new collaborations. When sharing data they recommend data generators provide a statement in the description of the data that they are willing to collaborate
It is recognised that it is often appropriate for researchers to have a defined period of exclusive access to the data they generate, but this should be determined by disciplinary norms. Any exemptions or delays have to be justified on a case by case basis, ideally at the outset of the project.
The funders expect research data that supports publications to be made accessible and publications should have a clear statement explaining how to access the underlying research data.
However researchers need to decide what is useful to be shared considering the effort of preparing the data for deposit and of sharing the data. If nobody is going to use the data, sharing is not a good use of researcher’s time.
Discipline-specific data repositories, where these exist, are recommended preferentially over general purpose or institutional repositories
Biosharing is an excellent resource with references to discipline-specific metadata schemas.
Staff members whose role is to manage data is an eligible cost on a grant
There are no funds for sharing data from old projects, although there are exceptions on a case by case basis
The funders are considering monitoring data management plans but their current primary goal is to encourage people to think about data management and sharing from the very start of the project
Access to research data
Q: Are funders benefiting from the expertise of organisations such as UK Data Service when providing advice on data access? UK Data Service has been managing controlled access to research data for a long time and it would be advantageous to benefit from their expertise.
A: Yes, we are in discussion with the UK Data Service. We are also working with the UK Data Service to consider whether it might be appropriate for hosting data from other disciplines beyond social science. We also believe there is significant scope to share lessons and best practices for data sharing between the social and biomedical sciences.
Q: Could we just share research data only when asked for it?
A: This is not a recommended solution: research data should be available, accessible and discoverable. Data access controls and criteria for what needs to happen for the access to be granted have to be made clear in metadata description.
Q:I have patient data which has to be stored in a secure space. I always say in my data management plan that I cannot share my data. I would like to get ethical guidance which will explain to me how to share these data. It is very easy to say that data cannot be shared. I would like to share my data, but I would like to do it properly. With patient data it is extremely difficult, especially with genomics data, where there is a risk that patients can be identified.
A: Sharing of clinical data is not easy. Both Wellcome Trust and Cancer Research UK are helping to drive a great deal of work which is considering access and governance models through which sensitive patient data can be made available for research in a safe, secure and trusted manner. They provide guidance documents on sharing of patient data. Safety of patients and patients’ data is important. Ethics committees also provide advice and guidance on what data can be shared.
Q: What about sharing of physical materials? I have received a request to share a culture derived from a patient material, but the Ethics Committee did not approve sharing of this material. What shall I do?
A (Peter Hedges, Head of Research Office): If your ethical approval says that you cannot share that material, you cannot share it. Your first responsibility is to protect your study participants.
Q:If I share my data via a repository and people can simply download my data, I can no longer collaborate with them to work on the data and I have lost the possibility of getting credit for my data.
A: Nobody wants to prevent new collaborations from happening. A solution might be to add a statement that you are willing to collaborate in the description of your data. Your data requestor might be interested in collaborating, simply because you know your data the best. Funders also expect that the data re-used by others is appropriately acknowledged/cited, and they want to ensure that due credit results from the secondary use of data.
Quality control of research data
Q: If researchers start sharing unpublished research data via data repositories there is a risk that these data will not be of good quality as they will not be peer-reviewed.
A: Authors of unpublished data can simply state in the data description that the item was not peer-reviewed. If applicable, funders also encourage reciprocal links between publications and supporting research data.
What data needs to be shared and when?
Q: If researchers start to share everything there will be a lot of useless data available in data repositories. How to prevent a flood of useless data on the internet?
A: We would like researchers to decide what data is useful to be shared. If nobody is likely to use the data, sharing is not a good use of researcher’s time. Repositories also need to make decisions over what is worth keeping over time.
Comment (Peter Hedges, Head of Research Office): The Research Council UK focuses on research data supporting publications and this is what we recommend to researchers: share research data which underpins publications.
Q:Are we expected to share large datasets resulting from bigger projects (databases, long-term datasets) or data supporting individual publications?
A: We expect research data that supports individual publications to be made available with a hyperlink to the data. We also want researchers to consider and plan more broadly how they can make data assets of value resulting from our funded research available to others in a timely and appropriate manner.
Q:What about images? Is it useful to share them? It involves a lot of time to organise images. Besides, a single confocal picture with multiple layers is 1GB. In theory it is possible to share all raw data and all raw images, but who would want to look at them? 10 figures of 10 images is already 100 GB of data. Where would I store all these images, who is going to use these data and how am I going to pay for this?
A: The effort of preparing the data for deposit and of sharing the data should be proportionate to the potential benefits of data sharing. Researchers need to decide what is useful to be shared, following disciplinary best practices and norms (recognising that disciplines are in very different places in terms of defining these).
Q:Is there a set amount of time for exclusive use of research data?
A: Researchers should adhere to disciplinary norms. For example, in genomics research data is frequently shared before publication (sometimes under a publication moratorium which protects the data generator’s right to first publication). Any exemptions or delays have to be justified on a case by case basis.
Comment (Peter Hedges, Head of Research Office): Research is competitive. Sometimes it might be useful for researchers to know who wants to get the access to data and what do they need them for.
Cost of data sharing
Q:Can I ask in my grant for a staff member to help me with data management?
A: Yes, this is an eligible cost on grant applications: you can request a salary to support a research data manager for your research project, as long as it is justified.
Q:According to CRUK policy, costs for data sharing can be budgeted in grant applications only from August 2015. What about research data from older projects, when these costs were not eligible in grant applications? Is there any transition fund available to pay for this?
A: Unfortunately, there are no additional funds to pay for these costs. Researchers who have older datasets that might be of significant value to the community should contact CRUK – all requests for support will be considered on a case by case basis.
Q: Wellcome Trust encourages data sharing and data re-use, but does not allow for costs of long-term data preservation to be budgeted in grant applications. This does not make sense to me.
A: We are still reviewing our policy on costs of data management and sharing and we might be revisiting this issue – however, it is problematic for us to consider estimated costs for preservation that extend before the life-time of the grant. Our understanding is that costs of long-term data preservation are often less significant than costs of initial data ingestion by the repository (and we will cover ingestion costs).
Q: Who is then going to pay for the long-term data storage?
A: Wellcome Trust funds some discipline-specific repositories, but this is done jointly with other funders. We support bigger undertakings and we are also working with partners to develop platforms for data sharing and discoverability in some priority areas (notably clinical trials). Cancer Research UK pays for some long-term storage options, if these are justified for particular needs of the project. These decisions are made on a case by case basis, depending on how the costs are justified and whether these are directly related to the scientific value of the project.
Metadata standards
Q:At the moment there are many general purpose and institutional repositories, which are not well structured. To support efficient re-use of data it is important to use structured data repositories and adhere to metadata standards. What are funders’ opinions about this?
A: Wherever possible, discipline-specific data repositories should be used preferentially over general purpose or institutional repositories. Adherence to discipline-specific metadata standards is also encouraged. It has to be acknowledged that development of well-structured data repositories is very resource-intensive and not all disciplines have good quality repositories to support them. For example, it took over 30 years to adapt unified metadata standards at Cambridge Crystallographic Data Centre. The time need to properly solve problems should never be underestimated.
Q:Are funders planning to provide researchers with a list of recommended schemas for metadata?
A: Biosharing is an excellent resource with references to discipline-specific metadata schemas. It is a useful suggestion to include a reference to Biosharing on our website.
Policy implementation
Q:Are you planning to monitor researchers’ adherence to data management plans? For example, the BBSRC does not have the manpower to check all data management plans manually, but they are planning to create a system to check if data has been uploaded automatically.
A: We are considering this. At the moment we require data management plans with the primary goal to encourage people to think about data management and sharing from the very start of the project.
Published 5 February 2016 Written by Dr Marta Teperek, verified by David Carr and Jamie Enoch
In 2015 the Cambridge Research Data Team organised several discussions between funders and researchers. In May 2015 we hosted Ben Ryan from EPSRC, which was followed by a discussion with Michael Ball from BBSRC in August. Now we have invited our two main charity funders to discuss their views on data management and sharing with Cambridge researchers.
David Carr from the Wellcome Trust and Jamie Enoch from Cancer Research UK (CRUK) met with our academics on Friday 22 January at the Gurdon Institute. The Gurdon Institute was founded jointly by the Wellcome Trust and CRUK to promote research in the areas of developmental biology and cancer biology, and to foster a collaborative environment for independent research groups with diverse but complementary interests.
This blog summarises the presentations and discusses the data sharing expectations from Wellcome Trust and CRUK. A second related blog ‘In conversation with Wellcome Trust and CRUK‘ summarises the question and answer session that was held with a group of researchers on the same day.
Wellcome Trust’s requirements for data management and sharing
Sharing research data is key for Wellcome’s goal of improving health
David Carr started his presentation explaining that the Wellcome Trust’s mission is to support research with the goal of improving health. Therefore, the Trust is committed to ensuring research outputs (including research data) can be accessed and used in ways that will maximise health and societal benefits. David reminded the audience of benefits of data sharing. Data which is shared has the potential to:
Enable validity and reproducibility of research findings to be assessed
Increase the visibility and use of research findings
Enable research outputs to be used to answer new questions
Reduce duplication and waste
Enable access to data to other key communities – public, policymakers, healthcare professionals etc.
Data sharing goes mainstream
David gave on overview of data sharing expectations from various angles. He started by referring to the Royal Society’s report from 2012: Science as an open enterprise, which sets sharing as the standard for doing science. He then also mentioned other initiatives like the G8 Science Ministers’ statement, the joint report from the Academy of Medical Sciences, BBSRC, MRC and Wellcome Trust on reproducibility and reliability of biomedical research and the UK Concordat on Open Research Data with a take-home message that sharing data and other research outputs is increasingly becoming a global expectation, and a core element of good research practice.
Wellcome Trust’s policy for open data
The next aspect of David’s presentation was Wellcome Trust’s policy on data management and sharing. The policy was first published almost a decade ago (2007) with subsequent modifications in 2010. The principle of the policy is simple: research data should be shared and preserved in a manner which maximises its value to advance research and improve health. Wellcome Trust also requires data management plans as a compulsory part of grant applications, where the proposed research is likely to generate a dataset that will have significant value to researchers and other users. This is to ensure that researchers understand the importance of data management and sharing and to plan for it from the start their projects.
Cost of data sharing
Planning for data management and sharing involves costing for these activities in the grant proposal. The Wellcome Trust’s FAQ guidance on data sharing policy says that: “The Trust considers that timely and appropriate data management and sharing should represent an integral component of the research process. Applicants may therefore include any costs associated with their proposed approach as part of their proposal.” David then outlined the types of costs that can be included in grant applications (including for dedicated staff, hardware and software, and data access costs). He noted that in the current draft guidance on costing for data management estimated costs for long-term preservation that extend beyond the lifetime of the grant are not eligible, although costs associated with the deposition of data in recognised data repositories can be requested.
Key priorities and emerging areas in data management and sharing
Infrastructure
The Wellcome Trust also identified key priorities and emerging areas where work needs to be done to better support of data management and sharing. The first one was to provide resources and platforms for data sharing and access. David pointed out that wherever available, discipline-specific data repositories are the best home for research data, as they provide rich metadata standards, community curation and better discoverability of datasets.
However, the sustainability of discipline-specific repositories is sometimes uncertain. Discipline-specific resources are often perceived as ‘free’. However, research data submitted to ‘free’ data repositories has to be stored somewhere and the amount of data produced and shared is growing exponentially – someone has to pay for the cost of storage and long-term curation in discipline-specific data repositories. An additional point for consideration is that many disciplines do not have their own repositories and therefore need to heavily rely on institutional support.
Access
Wellcome Trust funds a large number of projects in clinical areas. Dealing with patient data requires careful ethical considerations and planning from the very start of the project to ensure that data can be successfully shared at the end of the project. To support researchers in dealing with patient data The Expert Advisory Group on Data Access (a cross-funder advisory body established by MRC, ESRC, Cancer Research UK and the Wellcome Trust) has developed guidance documents and practice papers about handling of sensitive data: how to ask for informed consent, how to anonymise data and the procedures that need to be in place when granting access to data. David stressed that balance needs to be struck between maximising the use of data and the need to safeguard research participants.
Incentives for sharing
Finally, if sharing is to become the normal thing to do, researchers need incentives to do so. Wellcome Trust is keen to work with others to ensure that researchers who generate and share datasets of value receive appropriate recognition for their efforts. A recent report from the Expert Advisory Group on Data Access proposed several recommendations to incentivise data sharing, with specific roles for funders, research leaders, institutions and publishers. Additionally, in order to promote data re-use, the Wellcome Trust joined forces with the National Institutes of Health and the Howard Hughes Medical Institute and launched the Open Science Prize competition to encourage prototyping and development of services, tools or platforms that enable open content.
Cancer Research UK’s views on data sharing
The next talk was by Jamie Enoch from Cancer Research UK. Jamie started by saying that because Cancer Research UK (CRUK) is a charity funded by the public, it needs to ensure it makes the most of its funded research: sharing research data is elemental to this. Making the most of the data generated through CRUK grants could help accelerate progress towards the charity’s aim in its research strategy, to see three quarters of people surviving cancer by 2034. Jamie explained that his post – Research Funding Manager (Data) – has been created as a reflection of data sharing being increasingly important for CRUK.
The policy
Jamie started talking about the key principles of CRUK data sharing policy by presenting the main issues around research data sharing and explaining the CRUK’s position in relation to them:
What needs to be shared?All research data, including unpublished data, source code, databases etc, if it is feasible and safe to do so. CRUK is especially keen to ensure that data underpinning publications is made available for sharing.
Metadata:Researchers should adhere to community standards/minimum information guidelines where these exist.
Discoverability: Groups should be proactive in communicating the contents of their datasets and showcasing the data available for sharing
Jamie explained that CRUK really wants to increase the discoverability of data. For example, clinical trials units should ideally provide information on their websites about the data they generate and clear information about how it can be accessed.
Modes of sharing:Via community or generalist repositories, under the auspices of the PI or a combination of methods
Jamie explained that not all data can be/should be made openly available. Due to ethical considerations sometimes access to data will have to be restricted. Jamie explained that as long as restrictions are justified, it is entirely appropriate to use them. However, if access to data is restricted, the conditions on which access will be granted should be considered at the project outset, and these conditions will have to be clearly outlined in metadata descriptions to ensure fair governance of access.
Timeframes: Limited period of exclusive use permitted where justified
Jamie suggested adhering to community standards when thinking about any periods of exclusive use of generated research data. In some communities research data is made accessible at the time of publication. Other communities will expect data release at the time of generation (especially in collaborative genomics projects). Jamie further explained that particularly in cases where new data can affect policy development, it is key that research data is released as soon as possible.
Preservation:Data to be retained for at least 5 years after grant end
Acknowledgement:Secondary users of data should credit original researcher and CRUK
Costs:Appropriately justified costs can be included in grant proposals
As of late 2015, financial support for data management and sharing can be requested as a running cost in grant applications. Jamie explained that there are no particular guidelines in place explaining eligible and non-eligible costs and that the most important aspect is whether the costs are well justified or not, and reasonable in the context of the research envisaged.
Jamie stressed that the key point of the CRUK policy is to facilitate data sharing and to engage with the research community, recognising the challenges of data sharing for different projects and the need to work through these collaboratively, rather than enforce the policy in a top-down fashion.
Policy implementation
Subsequently, the presentation discussed ways in which CRUK policy is implemented. Jamie explained that the main tool for the policy implementation is the new requirement for data management plans as compulsory part of grant applications.
Two of the three main response mode committees: Science Committee and Clinical Research Committee have a two-step process of writing a data management plan. During the grant application stage researchers need to write a short, free-form description about how they plan to adhere to CRUK’s policy on data sharing. Only if the grant is accepted, the beneficiary will be asked to write a more detailed data management plan, in consultation with CRUK representatives.
This approach serves two purposes as it:
ensures that all applicants are aware of CRUK’s expectations on data sharing (they all need to write a short paragraph about data sharing)
saves researchers’ time: only those applicants who were successful will have to provide a detailed data management plan, and it allows the CRUK office to engage with successful applicants on data sharing challenges and opportunities
In contrast, applicants for the other main CRUK response mode committee, the Population Research Committee, all fill out a detailed data management and sharing plan at application stage because of the critical importance of sharing data from cohort and epidemiological studies.
Outlooks for the future
Similarly to the Wellcome Trust, CRUK realised that cultural change is needed for sharing to become the normality. CRUK have initiated many national and international partnerships to help the reward of data sharing.
One of them is a collaboration with the YODA (Yale Open Data Access) project aiming to develop metrics to monitor and evaluate data sharing. Other areas of collaborative work include collaboration with other funders on development of guidelines on ethics of data management and sharing, platforms for data preservation and discoverability, procedures for working with population and clinical data. Jamie stressed that the key thing for CRUK is to work closely with researchers and research managers – to understand the challenges and work through these collaboratively, and consider exciting new initiatives to move the data sharing field forwards.
On the 4 November the Research Data Facility at Cambridge University invited some inspirational leaders in the area of research data management and asked them to address the question: “is open data moving science forward or a waste of money & time?”. Below are Dr Marta Teperek’s impressions from the event.
Great discussion
Want to initiate a thought-provoking discussion on a controversial subject? The recipe is simple: invite inspirational leaders, bright people with curious minds and have an excellent chair. The outcome is guaranteed.
We asked some truly inspirational leaders in data management and sharing to come to Cambridge to talk to the community about the pros and cons of data sharing. We were honoured to have with us:
Rafael Carazo-Salas, Group Leader, Department of Genetics, University of Cambridge
@RafaCarazoSalas
Sarah Jones, Senior Institutional Support Officer from the Digital Curation Centre; @sjDCC
Frances Rawle, Head of Corporate Governance and Policy, Medical Research Council; @The_MRC
Tim Smith, Group Leader, Collaboration and Information Services, CERN/Zenodo; @TimSmithCH
Peter Murray-Rust, Molecular Informatics, Dept. of Chemistry, University of Cambridge, ContentMine; @petermurrayrust
The discussion was chaired by Dr Danny Kingsley, the Head of Scholarly Communication at the University of Cambridge (@dannykay68).
What is the definition of Open Data?
The discussion started off with a request for a definition of what “open” meant. Both Peter and Sarah explained that ‘open’ in science was not simply a piece of paper saying ‘this is open’. Peter said that ‘open’ meant free to use, free to re-use, and free to re-distribute without permission. Open data needs to be usable, it needs to be described, and to be interpretable. Finally, if data is not discoverable, it is of no use to anyone. Sarah added that sharing is about making data useful. Making it useful also involves the use of open formats, and implies describing the data. Context is necessary for the data to be of any value to others.
What are the benefits of Open Data?
Next came a quick question from Danny: “What are the benefits of Open Data”? followed by an immediate riposte from Rafael: “What aren’t the benefits of Open Data?”. Rafael explained that open data led to transparency in research, re-usability of data, benchmarking, integration, new discoveries and, most importantly, sharing data kept it alive. If data was not shared and instead simply kept on the computer’s hard drive, no one would remember it months after the initial publication. Sharing is the only way in which data can be used, cited, and built upon years after the publication. Frances added that research data originating from publicly funded research was funded by tax payers. Therefore, the value of research data should be maximised. Data sharing is important for research integrity and reproducibility and for ensuring better quality of science. Sarah said that the biggest benefit of sharing data was the wealth of re-uses of research data, which often could not be imagined at the time of creation.
Finally, Tim concluded that sharing of research is what made the wheels of science turn. He inspired further discussions by strong statements: “Sharing is not an if, it is a must – science is about sharing, science is about collectively coming to truths that you can then build on. If you don’t share enough information so that people can validate and build up on your findings, then it basically isn’t science – it’s just beliefs and opinions.”
Tim also stressed that if open science became institutionalised, and mandated through policies and rules, it would take a very long time before individual researchers would fully embrace it and start sharing their research as the default position.
I personally strongly agree with Tim’s statement. Mandating sharing without providing the support for it will lead to a perception that sharing is yet another administrative burden, and researchers will adopt the ‘minimal compliance’ approach towards sharing. We often observe this attitude amongst EPSRC-funded researchers (EPSRC is one of the UK funders with the strictest policy for sharing of research data). Instead, institutions should provide infrastructure, services, support and encouragement for sharing.
Big data
Data sharing is not without problems. One of the biggest issues nowadays it the problem of sharing of big data. Rafael stressed that with big data, it was extremely expensive not only to share, but even to store the data long-term. He stated that the biggest bottleneck in progress was to bridge the gap between the capacity to generate the data, and the capacity to make it useful. Tim admitted that sharing of big data was indeed difficult at the moment, but that the need would certainly drive innovation. He recalled that in the past people did not think that one day it would be possible just to stream videos instead of buying DVDs. Nowadays technologies exist which allow millions of people to watch the webcast of a live match at the same time – the need developed the tools. More and more people are looking at new ways of chunking and parallelisation of data downloads. Additionally, there is a change in the way in which the analysis is done – more and more of it is done remotely on central servers, and this eliminates the technical barriers of access to data.
Personal/sensitive data
Frances mentioned that in the case of personal and sensitive data, sharing was not as simple as in basic sciences disciplines. Especially in medical research, it often required provision of controlled access to data. It was not only important who would get the data, but also what they would do with it. Frances agreed with Tim that perhaps what was needed is a paradigm shift – that questions should be sent to the data, and not the data sent to the questions.
Shades of grey: in-between “open” and “closed”
Both the audience and the panellists agreed that almost no data was completely “open” and almost no data was completely “shut”. Tim explained that anything that gets research data off the laptop to a shared environment, even if it was shared only with a certain group, was already a massive step forward. Tim said: “Open Data does not mean immediately open to the entire world – anything that makes it off from where it is now is an important step forward and people should not be discouraged from doing so, just because it does not tick all the other checkboxes.” And this is yet another point where I personally agreed with Tim that institutionalising data sharing and policing the process is not the way forward. To the contrary, researchers should be encouraged to make small steps at a time, with the hope that the collective move forward will help achieving a cultural change embraced by the community.
Open Data and the future of publishing
Another interesting topic of the discussion was the future of publishing. Rafael started explaining that the way traditional publishing works had to change, as data was not two-dimensional anymore and in the digital era it could no longer be shared on a piece of paper. Ideally, researchers should be allowed to continue re-analysing data underpinning figures in publications. Research data underpinning figures should be clickable, re-formattable and interoperable – alive.
Danny mentioned that the traditional way of rewarding researchers was based on publishing and on journal impact factors. She asked whether publishing data could help to start rewarding the process of generating data and making it available. Sarah suggested that rather than having the formal peer review of data, it would be better to have an evaluation structure based on the re-use of data – for example, valuing data which was downloadable, well-labelled, re-usable.
Incentives for sharing research data
The final discussion was around incentives for data sharing. Sarah was the first one to suggest that the most persuasive incentive for data sharing is seeing the data being re-used and getting credit for it. She also stated that there was also an important role for funders and institutions to incentivise data sharing. If funders/institutions wished to mandate sharing, they also needed to reward it. Funders could do so when assessing grant proposals; institutions could do it when looking at academic promotions.
Conclusions and outlooks on the future
This was an extremely thought-provoking and well-coordinated discussion. And maybe due to the fact that many of the questions asked remained unanswered, both the panellists and the attendees enjoyed a long networking session with wine and nibbles after the discussion.
From my personal perspective, as an ex-researcher in life sciences, the greatest benefit of open data is the potential to drive a cultural change in academia. The current academic career progression is almost solely based on the impact factor of publications. The ‘prestige’ of your publications determines whether you will get funding, whether you will get a position, whether you will be able to continue your career as a researcher. This, connected with a frequently broken peer-review process, leads to a lot of frustration among researchers. What if you are not from the world’s top university or from a famous research group? Will you be able to still publish your work in a high impact factor journal? What if somebody scooped you when you were about to publish results of your five years’ long study? Will you be able to find a new position? As Danny suggested during the discussion, if researchers start publishing their data in the ‘open”’ there is a chance that the whole process of doing valuable research, making it useful and available to others will be rewarded and recognised. This fits well with Sarah’s ideas about evaluation structure based on the re-use of research data. In fact, more and more researchers go to the ‘open’ and use blog posts and social media to talk about their research and to discuss the work of their peers. With the use of persistent links research data can be now easily cited, and impact can be built directly on data citation and re-use, but one could also imagine some sort of badges for sharing good research data, awarded directly by the users. Perhaps in 10 or 20 years’ time the whole evaluation process will be done online, directly by peers, and researchers will be valued for their true contributions to science.
And perhaps the most important message for me, this time as a person who supports research data management services at the University of Cambridge, is to help researchers to really embrace the open data agenda. At the moment, open data is too frequently perceived as a burden, which, as Tim suggested, is most likely due to imposed policies and institutionalisation of the agenda. Instead of a stick, which results in the minimal compliance attitude, researchers need to see the opportunities and benefits of open data to sign up for the agenda. Therefore, the Institution needs to provide support services to make data sharing easy, but it is the community itself that needs to drive the change to “open”. And the community needs to be willing and convinced to do so.
Further resources
Click here to see the full recording of the Open Data Panel Discussion.
And here you can find a storified version of the event prepared by Kennedy Ikpe from the Open Data Team.
Thank you
We also wanted to express a special ‘thank you’ note to Dan Crane from the Library at the Department of Engineering, who helped us with all the logistics for the event and who made it happen.
Published 27 November 2015 Written by Dr Marta Teperek
As part of the Office of Scholarly Communication Open Access Week celebrations, we are uploading a blog a day written by members of the team. Wednesday is a piece by Dr Marta Teperek reporting on the Software Licensing Workshop held on 14 September 2015 at Cambridge.
Uncertainties about sharing and licensing of software
If the questions that the Research Data Service Team have been asked during data sharing information sessions with over 1000 researchers at the University of Cambridge are any indicator, then there is a great deal of confusion about sharing source code.
There have been a wide range of questions during the discussions in these sessions, and the Research Data Service Team has recorded these. We are systematically ensuring that the information we are providing to our research community is valid and accurate. To address the questions about source code we decided to call in expert help. Shoaib Sufi and Neil Chue Hong* from the Software Sustainability Institute agreed to lead a workshop on Software Licensing in September, at the Computer Lab in Cambridge. Shoaib’s slides are here, and Neil’s slides on Open Access policies and software sharing are here.
We had over 50 researchers and several research data managers from other UK universities attending the Software Licensing workshop. The main questions we were trying to resolve was: Are researchers expected to share source code they used in their projects? And if so, under what conditions?
Is software considered as ‘research data’ and does it need to be shared?
The starting question in the discussion was whether software needed to be shared. Most public funders now require that research data underpinning publications is made available. What is the definition of research data? According to the EPSRC research data “is defined as recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings”. Therefore, if software is needed to validate findings described in a publication, researchers are expected to make it available as widely as possible. There are some exceptions to this rule. For example, if there is an intention to commercialise the software there might not be a need to share it, but the default assumption is that the software should be shared.
The importance of putting a licence on software
It is important that before any software is shared, the creator considers what they would like others to be able to do with it. The way to indicate the intended reuse of the software is to place a licence on it. This governs the permission being granted to others with regards to source code by the copyright holder(s). A licence determines whether the person who wants to get hold of software is allowed to use, copy, resell, change, or distribute it. Additionally, a licence should also determine who is liable if something goes wrong with the software.
Therefore, a licence not only protects the intellectual property, but also helps others to use the software effectively. If people who are potentially interested in a given piece of software do not know what they are allowed to do with it, it is possible they will search for alternative solutions. As a consequence, researchers could lose important collaborators, buyers, or simply decrease the citation rate that could have been gained from people using and citing software in their publications.
Who owns the copyright?
The most difficult question when it comes to software licensing is determining who owns the copyright – who is allowed to license the software used in research? If this is software created by a particular researcher then it is likely that s/he will be the copyright owner. At the University of Cambridge researchers are the primary owners of intellectual property. This is however a very generous right – typically employers do not allow their employees to retain copyright ownership. Therefore, the issue of copyright ownership might get very complicated for researchers involved in multi-institutional collaborations. Additionally, sometimes funders of research will retain copyright ownership of research outputs.
Consequences of licensing
An additional complication with licensing software is that most licences cannot be revoked. Once something has been licensed to someone under a certain licence, it is not possible to take it back and change the licence. Moreover, if there is one licence for a set of software, it might not be possible to license a patch to the software under a different licence. The issue of licence compatibility sparked a lot of questions during the workshop, with no easy answers available. The overall conclusion was that whenever possible, mixing of licences should be avoided. If use of various licences is necessary, researchers are recommended to get advice from the Legal Services Office.
Good practice for software management
So what are the key recommendations for good practice for software management? Before the start of a research project, researchers should think about who the collaborators and funders are, and what the employer’s expectations are with regards to intellectual property. This will help to determine who will own the copyright over the software. Funders’ and institutional policies for research data sharing should be consulted for expectations about software sharing With this information it is possible to prepare a data management plan for the grant application.
During the project researchers need to ensure that their software is hosted in an appropriate code repository – for example, GitHub or Bitbucket. It is important to create (and keep updating!) metadata describing any generated data and software.
Finally, when writing a paper, researchers need to deposit all releases of data/software relevant to the publication in a suitable repository. It is best to choose a repository which provides persistent links e.g. Zenodo (which has a GitHub integration), or the University of Cambridge data repository (Apollo). It is important to ensure that software is licensed under an appropriate licence – in line with what others should be allowed to do with the software, and in agreement with any obligations there might be with any other third parties (for example, funders of the research). If there is a need to restrict the access to the software, metadata description should give reasons for this restriction and conditions that need to be met for the access to be granted.
Valuable resources to help make right decisions
Both Neil and Shoaib agreed that proper management and licensing of software might be sometimes complicated. Therefore, they recommended various resources and tools to provide guidance for researchers:
The workshop was organised in collaboration with Stephen Eglen from the Department of Applied Mathematics and Theoretical Physics (University of Cambridge) who chaired the meeting, and with Andrea Kells from the Computer Lab (University of Cambridge) who hosted the workshop.
The Research Data Service is also providing various other opportunities for our research community to pose questions directly of the funding bodies. We invited Ben Ryan from the EPSRC to come to speak to a group of researchers in May and the resulting validated FAQs are now published on our research data management website. Similarly, researchers met with Michael Ball from the BBSRC in August.
These opportunities are being embraced by our research community.
*About the speakers
Shoaib Sufi – Community Lead at the Software Sustainability Institute
Shoaib leads the Institute’s community engagement activities and strategies. Graduating in Computer Science from the University of Manchester in 1997, he has worked in the commercial sector as a systems programmer and then as software developer, metadata architect and eventually a project manager at the Science and Facilities Technologies Council (STFC).
Neil Chue Hong – Director at the Software Sustainability Institute
Neil is the founding Director of the Software Sustainability Institute. Graduating with an MPhys in Computational Physics from the University of Edinburgh, he began his career at EPCC, becoming Project Manager there in 2003. During this time he led the Data Access and Integration projects (OGSA-DAI and DAIT), and collaborated in many e-Science projects, including the EU FP6 NextGRID project.
Published 21 October 2015 Written by Dr Marta Teperek
As part of the Office of Scholarly Communication Open Access Week celebrations, we are uploading a blog a day written by members of the team. Tuesday is a piece by Dr Danny Kingsley reflecting the talk she gave this morning to the Royal Society of Chemistry, Chemical Information and Computer Applications Group conference – Measurement, Information and Innovation: Digital Disruption in the Chemical Sciences.
The data policy landscape
The policy position on data management in the UK is driven on many levels. Many institutions now have policies – an example is the Cambridge University Research Data Management Policy Framework. Increasingly publishers such as PLOS are requiring that research published in their journals is accompanied by the data underpinning the research. Some journals, such as Nature’s Scientific Data are specifically data-only journals
There has been a country-wide movement towards opening up data. Consultation on the Draft Concordat on Open Research Data released by the RCUK ended on 28 September. Cambridge coordinated a joint response to the Concordat with several other universities.
However the real driver for action this year has been funder policies – specifically, the Engineering and Physical Sciences Research Council (EPSRC) which announced it was going to (and has begun) checking compliance as of 1 May 2015.
The devil is in the detail
While the Research Councils UK have RCUK Common Principles on Data Policy stating “Publicly funded research data are a public good (…), which should be made openly available with as few restrictions as possible”, these common principles are idiosyncratic when looked at from the individual council perspective, as the graphic on the second page of this document demonstrates.
There are variations on whether a data management plan is required, where the data should be stored, the level of support offered and even whether this can be funded through the grant (in most cases it can, but not all).
Places to share data
Open (cross-disciplinary) repositories
These include commercial options such as figshare which is owned by Digital Science who also produce Symplectic Elements research management systems and are is an offshoot company to Macmillian/Springer.
Open source solutions such as Zenodo, an open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science, developed by CERN.
Disciplinary repositories
There are a significant number of disciplinary specific data repositories. In many ways these are the most natural place for data as disciplinary experts can curate the data. For example the Natural Environment Research Council (NERC) runs seven repositories.
The first repository ever created (in 1991) was arXiv, holding e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. The public functional genomics data repository is Gene Expression Omnibus . Social science data can be deposited in the UK Data Service. The Oxford Text Archive holds literary and linguistic texts for higher education.
Institutional options
Cambridge University is using its DSpace repository Apollo to store and share data. To date the largest dataset we have received is 28 GB – huge datasets need to sit externally to this repository. Not all institutions provide a data repository service. There are significant overheads associated with this activity both for the technology and the people to upload and curate the data.
Journals
Journals are increasingly requiring researchers to publish (or at least provide links to) their supporting data alongside their research articles. PLOS brought in their data sharing policy in December 2013.
The Australian National Data Service has built Research Data Australia (RDA) which collates and displays information about datasets held all over the country, both open and closed. As of the 20 October, the RDA contains 115,823 datasets, of which 23,322 are open.
There appears to be little appetite yet for this sort of service in the UK, or at least for providing funds to create one.
What are we actually trying to achieve ?
Cambridge University has taken a proactive approach to the RCUK data sharing policies by inviting funders to come and discuss issues with the research community. We have written up and published these discussions as an ‘In conversation with..’ series on our Unlocking Research blog.
In addition to clarification on some aspects of the policies, one of the questions we are trying to answer is what is the actual goal of these policies? Ben Ryan the Senior Manager, Research Outcomes of the EPSRC clarified that researchers needed to share:
the data that underpins publications
the data that validates research findings
the data that is worth keeping
To summarise the philosophical goals of the EPSRC policy:
The default position is ‘data should be open’
Published research findings should be testable
Maximise the impact of publicly funded research
Maintain public trust in science and research
They are trying to create a new research culture
Researcher responses
While these might seem like lofty and even admirable goals, it does not mean that the academic response to being informed of their grant requirement for sharing data have been met with welcoming arms. Far from it in some cases.
But it will take a huge effort to get the data in a useable form
No-one will look at it
What a waste of time
This has prompted us to play a game we call ‘data excuse bingo’ at some of our Research Data Management Workshops – see slide 16.
We are trying to start at the end
The problem might be that everyone is fixated on sharing data at the end of the research process. However this is one of the lesser data management activities if data management begins at the beginning.
The second of our “In conversation…” series was with Michael Ball the Strategy and Policy Manager at the Biotechnical and Biological Sciences Research Council (BBSRC). Amongst a long discussion about what exactly constitutes ‘data’ in the biological sciences, Michael emphasised that disciplines themselves must establish ways of dealing with data. This is the beginning of an ongoing process.
He noted that researchers need to consider how to deal with data from the beginning of a research project. Researchers can ask for money to manage data in the grant application (something which is currently quite rare).
The practice of sharing data requires the data to be: Accessible, Intelligible, Assessable and Reusable. So how do we achieve that?
Identifying any data that might be politically, personally or commercially sensitive
Determining who owns what data
Ensuring data that is being used for research across collaborations is shared in safe, secure and legal shared facilities, bearing in mind Export Control Legislation.
Having good metadata protocols
Using a reputable and reliable storage/sharing facility that offers persistent identifiers (DOIs)
Who owns the data?
This is an interesting question. EPSRC policies state that researchers should ensure that collaborators are aware of the sharing requirement before they embark on research. Then there are questions about who within the team might own the data – with again a suggestion to come to some sort of agreement before work starts.
Are all collaborators equal ‘owners’? Or does the Principal Investigator have a 50% stake, with the remaining split amongst the PhD students making up the remainder of the team? You might want to talk to your legal advisors and/or your research office about this issue.
There is also the related issue here of developing author-contribution transparency. Do you include author contribution statements in your articles?.
Staffing issues
If Michael Ball is correct and very few researchers are asking for funds associated with managing data, it is reasonable to assume that data is being managed in an ad hoc way – with reliance on the computer savvy postdoc the project hired …
Required skillsets for managing and curating data
There is a considerable range of skill sets associated with managing data, and these have been described by the Digital Curation Centre as data creator, data scientist, data librarian, data manager.
Data Creator: Researchers with domain expertise who produce data. These people may have a high level of expertise in handling, manipulating and using data
Data Scientist: People who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology
Data Manager: Computer scientists, information technologists or information scientists and who take responsibility for computing facilities, storage, continuing access and preservation of data
Data Librarian: People originating from the library community, trained and specialising in the curation, preservation and archiving of data
There is a simple graphic that clearly shows how these roles relate to one another.
Certainly an increasing number of data scientist jobs are being advertised. A search for ‘Data scientist’ + ‘London’ on the job site Indeed on 18 October produced 1,405 results. So where are all of these people coming from?
Training available
The Swan and Brown study in 2008 noted that ‘People in data science roles face a big, continuing challenge in remaining properly skilled up’ and this remain the situation today, although there are now many more opportunities for training – a few are listed here:
MANTRA is a free online course from the Data Library staff in Information Services, University of Edinburgh. It is designed for those who manage digital data as part of a research project. It was crafted for the use of post-graduate students, early career researchers, and also information professionals. It is freely available on the web for anyone to explore on their own.
Data Scientist training for Librarians is a collection of notes and discussions about data work done by librarians. They ran an experimental course in August to teach “the latest tools for extracting, wrangling, storing, analysing, and visualising data”.
In addition, the professionalisation of these skills is beginning. For example, Data Science London is a community of data scientists that meets regularly to discuss data science ideas, concepts, and tools, methods and technologies used by many startups to analyse large scale data (big data), extract predictive insight, and exploit business opportunities from data products. Their website offers a list of free data science courses.
Cambridge University is one of the five founding university partners of the about to be launched Alan Turing Institute, which is intended to be the UK’s national institute for data science. The Institute will be addressing some of the issues with the data management skill gap.
Issues with sharing data
For all of the ‘feel good’ ideals about sharing data, and the processes being put in place to support this, there are some serious issues raised by researchers about the requirement to share data.
To start, there is a very real concern that the UK will become unattractive for collaborations. Why would a commercial collaborator choose a UK partner when a partner from elsewhere is not under obligation to share their data?
There have been some discussions at information sessions about the possible need to change the type of research being done to reduce the amount of data being produced because of the cost of long term storage of this data.
Indeed, there is discussion in some circles about whether applying for EPSRC funding is worth the hassle. It is a fair bet that none of these were intended outcomes of the RCUK policies.
Consequences of not sharing data
However, this does need to be a balanced strategy. There is a considerable argument that openness is related to academic integrity as it allows work to be verified and validated.
Here are examples in three disciplines where sharing data had a dramatic effect:
Medicine – having the data publicly available in two trials of deworming pills demonstrated that a population wide deworming program did not improve school performance.
Economics – A study widely cited to justify budget cutting in the US had a mistake in the calculations which was only revealed when the Excel file was released
Physics – It took 12.5 years to withdraw Jan Hendrik Schon’s work on ‘organic semiconductors’ because the reviewers were unable to replicate the results without access to the original data or lab books.
Conclusion
Sharing data offers great challenges to the research community, not least because it is less than clear what ‘data’ means in different disciplines. It will take some time for the research community to change its philosophy and practice. But the positives outweigh the negatives and with hope we will look back at this time as a short transition period.
Note the slides from the talk are available in Slideshare.
Published 20 October 2015 Written by Dr Danny Kingsley