Tag Archives: open data

Open Data – moving science forward or a waste of money & time?

Posted on November 27, 2015 | Category: Uncategorized | Tags: Blog posts, CERN, compliance, data, funders, Medical Research Council, open data, policy, publishing, research, research data management, reward, social media | Author: Office of Scholarly Communication

On the 4 November the Research Data Facility at Cambridge University invited some inspirational leaders in the area of research data management and asked them to address the question: “is open data moving science forward or a waste of money & time?”. Below are Dr Marta Teperek’s impressions from the event.

Great discussion

Want to initiate a thought-provoking discussion on a controversial subject? The recipe is simple: invite inspirational leaders, bright people with curious minds and have an excellent chair. The outcome is guaranteed.

We asked some truly inspirational leaders in data management and sharing to come to Cambridge to talk to the community about the pros and cons of data sharing. We were honoured to have with us:

Rafael Carazo-Salas, Group Leader, Department of Genetics, University of Cambridge; @RafaCarazoSalas
Sarah Jones, Senior Institutional Support Officer from the Digital Curation Centre; @sjDCC
Frances Rawle, Head of Corporate Governance and Policy, Medical Research Council; @The_MRC
Tim Smith, Group Leader, Collaboration and Information Services, CERN/Zenodo; @TimSmithCH
Peter Murray-Rust, Molecular Informatics, Dept. of Chemistry, University of Cambridge, ContentMine; @petermurrayrust

The discussion was chaired by Dr Danny Kingsley, the Head of Scholarly Communication at the University of Cambridge (@dannykay68).

What is the definition of Open Data?

The discussion started off with a request for a definition of what “open” meant. Both Peter and Sarah explained that ‘open’ in science was not simply a piece of paper saying ‘this is open’. Peter said that ‘open’ meant free to use, free to re-use, and free to re-distribute without permission. Open data needs to be usable, it needs to be described, and to be interpretable. Finally, if data is not discoverable, it is of no use to anyone. Sarah added that sharing is about making data useful. Making it useful also involves the use of open formats, and implies describing the data. Context is necessary for the data to be of any value to others.

What are the benefits of Open Data?

Next came a quick question from Danny: “What are the benefits of Open Data?”, followed by an immediate riposte from Rafael: “What aren’t the benefits of Open Data?”. Rafael explained that open data led to transparency in research, re-usability of data, benchmarking, integration and new discoveries and, most importantly, that sharing data kept it alive. If data was not shared and instead simply kept on the computer’s hard drive, no one would remember it months after the initial publication. Sharing is the only way in which data can be used, cited, and built upon years after the publication. Frances added that research data originating from publicly funded research was funded by taxpayers. Therefore, the value of research data should be maximised. Data sharing is important for research integrity and reproducibility and for ensuring better quality of science. Sarah said that the biggest benefit of sharing data was the wealth of re-uses of research data, which often could not be imagined at the time of creation.

Finally, Tim concluded that sharing of research is what made the wheels of science turn. He inspired further discussion with a strong statement: “Sharing is not an if, it is a must – science is about sharing, science is about collectively coming to truths that you can then build on. If you don’t share enough information so that people can validate and build up on your findings, then it basically isn’t science – it’s just beliefs and opinions.”

Tim also stressed that if open science became institutionalised, and mandated through policies and rules, it would take a very long time before individual researchers would fully embrace it and start sharing their research as the default position.

I personally strongly agree with Tim’s statement. Mandating sharing without providing the support for it will lead to a perception that sharing is yet another administrative burden, and researchers will adopt the ‘minimal compliance’ approach towards sharing. We often observe this attitude amongst EPSRC-funded researchers (EPSRC is one of the UK funders with the strictest policy for sharing of research data). Instead, institutions should provide infrastructure, services, support and encouragement for sharing.

Big data

Data sharing is not without problems. One of the biggest issues nowadays is the problem of sharing big data. Rafael stressed that with big data, it was extremely expensive not only to share, but even to store the data long-term. He stated that the biggest bottleneck in progress was to bridge the gap between the capacity to generate the data, and the capacity to make it useful. Tim admitted that sharing of big data was indeed difficult at the moment, but that the need would certainly drive innovation. He recalled that in the past people did not think that one day it would be possible just to stream videos instead of buying DVDs. Nowadays technologies exist which allow millions of people to watch the webcast of a live match at the same time – the need developed the tools. More and more people are looking at new ways of chunking and parallelisation of data downloads. Additionally, there is a change in the way in which the analysis is done – more and more of it is done remotely on central servers, and this eliminates the technical barriers of access to data.

Personal/sensitive data

Frances mentioned that in the case of personal and sensitive data, sharing was not as simple as in basic science disciplines. Especially in medical research, it often required provision of controlled access to data. It was not only important who would get the data, but also what they would do with it. Frances agreed with Tim that perhaps what was needed was a paradigm shift – that questions should be sent to the data, and not the data sent to the questions.

Shades of grey: in-between “open” and “closed”

Both the audience and the panellists agreed that almost no data was completely “open” and almost no data was completely “shut”. Tim explained that anything that gets research data off the laptop to a shared environment, even if it was shared only with a certain group, was already a massive step forward. Tim said: “Open Data does not mean immediately open to the entire world – anything that makes it off from where it is now is an important step forward and people should not be discouraged from doing so, just because it does not tick all the other checkboxes.” And this is yet another point where I personally agreed with Tim that institutionalising data sharing and policing the process is not the way forward. On the contrary, researchers should be encouraged to make small steps at a time, with the hope that the collective move forward will help achieve a cultural change embraced by the community.

Open Data and the future of publishing

Another interesting topic of the discussion was the future of publishing. Rafael started explaining that the way traditional publishing works had to change, as data was not two-dimensional anymore and in the digital era it could no longer be shared on a piece of paper. Ideally, researchers should be allowed to continue re-analysing data underpinning figures in publications. Research data underpinning figures should be clickable, re-formattable and interoperable – alive.

Danny mentioned that the traditional way of rewarding researchers was based on publishing and on journal impact factors. She asked whether publishing data could help to start rewarding the process of generating data and making it available. Sarah suggested that rather than having the formal peer review of data, it would be better to have an evaluation structure based on the re-use of data – for example, valuing data which was downloadable, well-labelled, re-usable.

Incentives for sharing research data

The final discussion was around incentives for data sharing. Sarah was the first one to suggest that the most persuasive incentive for data sharing is seeing the data being re-used and getting credit for it. She also stated that there was an important role for funders and institutions to incentivise data sharing. If funders/institutions wished to mandate sharing, they also needed to reward it. Funders could do so when assessing grant proposals; institutions could do it when looking at academic promotions.

Conclusions and outlooks on the future

This was an extremely thought-provoking and well-coordinated discussion. And maybe due to the fact that many of the questions asked remained unanswered, both the panellists and the attendees enjoyed a long networking session with wine and nibbles after the discussion.

From my personal perspective, as an ex-researcher in life sciences, the greatest benefit of open data is the potential to drive a cultural change in academia. The current academic career progression is almost solely based on the impact factor of publications. The ‘prestige’ of your publications determines whether you will get funding, whether you will get a position, whether you will be able to continue your career as a researcher. This, connected with a frequently broken peer-review process, leads to a lot of frustration among researchers. What if you are not from the world’s top university or from a famous research group? Will you be able to still publish your work in a high impact factor journal? What if somebody scooped you when you were about to publish results of your five-year-long study? Will you be able to find a new position? As Danny suggested during the discussion, if researchers start publishing their data in the ‘open’, there is a chance that the whole process of doing valuable research, making it useful and available to others will be rewarded and recognised. This fits well with Sarah’s ideas about evaluation structure based on the re-use of research data. In fact, more and more researchers go to the ‘open’ and use blog posts and social media to talk about their research and to discuss the work of their peers. With the use of persistent links research data can now be easily cited, and impact can be built directly on data citation and re-use, but one could also imagine some sort of badges for sharing good research data, awarded directly by the users. Perhaps in 10 or 20 years’ time the whole evaluation process will be done online, directly by peers, and researchers will be valued for their true contributions to science.

And perhaps the most important message for me, this time as a person who supports research data management services at the University of Cambridge, is to help researchers to really embrace the open data agenda. At the moment, open data is too frequently perceived as a burden, which, as Tim suggested, is most likely due to imposed policies and institutionalisation of the agenda. Instead of a stick, which results in the minimal compliance attitude, researchers need to see the opportunities and benefits of open data to sign up for the agenda. Therefore, the Institution needs to provide support services to make data sharing easy, but it is the community itself that needs to drive the change to “open”. And the community needs to be willing and convinced to do so.

Further resources

Click here to see the full recording of the Open Data Panel Discussion.
And here you can find a storified version of the event prepared by Kennedy Ikpe from the Open Data Team.

Thank you

We also wanted to express a special ‘thank you’ note to Dan Crane from the Library at the Department of Engineering, who helped us with all the logistics for the event and who made it happen.

Published 27 November 2015
Written by Dr Marta Teperek

Software Licensing and Open Access

Posted on October 21, 2015 | Category: Uncategorized | Tags: computer software, data sharing, EPSRC, GitHub, licencing, licensing, open data, reproducibility, software, software repository, Software Sustainability Institute | Author: Office of Scholarly Communication

As part of the Office of Scholarly Communication Open Access Week celebrations, we are uploading a blog a day written by members of the team. Wednesday is a piece by Dr Marta Teperek reporting on the Software Licensing Workshop held on 14 September 2015 at Cambridge.

Uncertainties about sharing and licensing of software

If the questions that the Research Data Service Team have been asked during data sharing information sessions with over 1000 researchers at the University of Cambridge are any indicator, then there is a great deal of confusion about sharing source code.

There have been a wide range of questions during the discussions in these sessions, and the Research Data Service Team has recorded these. We are systematically ensuring that the information we are providing to our research community is valid and accurate. To address the questions about source code we decided to call in expert help. Shoaib Sufi and Neil Chue Hong* from the Software Sustainability Institute agreed to lead a workshop on Software Licensing in September, at the Computer Lab in Cambridge. Shoaib’s slides are here, and Neil’s slides on Open Access policies and software sharing are here.

Malcolm Grimshaw and Chris Arnot from Cambridge Enterprise also came to the workshop to answer questions about Cambridge-specific guidance on software commercialisation.

We had over 50 researchers and several research data managers from other UK universities attending the Software Licensing workshop. The main questions we were trying to resolve were: Are researchers expected to share source code they used in their projects? And if so, under what conditions?

Is software considered as ‘research data’ and does it need to be shared?

The starting question in the discussion was whether software needed to be shared. Most public funders now require that research data underpinning publications is made available. What is the definition of research data? According to the EPSRC, research data “is defined as recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings”. Therefore, if software is needed to validate findings described in a publication, researchers are expected to make it available as widely as possible. There are some exceptions to this rule. For example, if there is an intention to commercialise the software there might not be a need to share it, but the default assumption is that the software should be shared.

The importance of putting a licence on software

It is important that before any software is shared, the creator considers what they would like others to be able to do with it. The way to indicate the intended reuse of the software is to place a licence on it. This governs the permission being granted to others with regards to source code by the copyright holder(s). A licence determines whether the person who wants to get hold of software is allowed to use, copy, resell, change, or distribute it. Additionally, a licence should also determine who is liable if something goes wrong with the software.

Therefore, a licence not only protects the intellectual property, but also helps others to use the software effectively. If people who are potentially interested in a given piece of software do not know what they are allowed to do with it, it is possible they will search for alternative solutions. As a consequence, researchers could lose important collaborators, buyers, or simply decrease the citation rate that could have been gained from people using and citing software in their publications.

Who owns the copyright?

The most difficult question when it comes to software licensing is determining who owns the copyright – who is allowed to license the software used in research? If this is software created by a particular researcher then it is likely that s/he will be the copyright owner. At the University of Cambridge researchers are the primary owners of intellectual property. This is however a very generous right – typically employers do not allow their employees to retain copyright ownership. Therefore, the issue of copyright ownership might get very complicated for researchers involved in multi-institutional collaborations. Additionally, sometimes funders of research will retain copyright ownership of research outputs.

Consequences of licensing

An additional complication with licensing software is that most licences cannot be revoked. Once something has been licensed to someone under a certain licence, it is not possible to take it back and change the licence. Moreover, if there is one licence for a set of software, it might not be possible to license a patch to the software under a different licence. The issue of licence compatibility sparked a lot of questions during the workshop, with no easy answers available. The overall conclusion was that whenever possible, mixing of licences should be avoided. If use of various licences is necessary, researchers are recommended to get advice from the Legal Services Office.

Good practice for software management

So what are the key recommendations for good practice for software management? Before the start of a research project, researchers should think about who the collaborators and funders are, and what the employer’s expectations are with regards to intellectual property. This will help to determine who will own the copyright over the software. Funders’ and institutional policies for research data sharing should be consulted for expectations about software sharing. With this information it is possible to prepare a data management plan for the grant application.

During the project researchers need to ensure that their software is hosted in an appropriate code repository – for example, GitHub or Bitbucket. It is important to create (and keep updating!) metadata describing any generated data and software.

Finally, when writing a paper, researchers need to deposit all releases of data/software relevant to the publication in a suitable repository. It is best to choose a repository which provides persistent links e.g. Zenodo (which has a GitHub integration), or the University of Cambridge data repository (Apollo). It is important to ensure that software is licensed under an appropriate licence – in line with what others should be allowed to do with the software, and in agreement with any obligations there might be with any other third parties (for example, funders of the research). If there is a need to restrict the access to the software, metadata description should give reasons for this restriction and conditions that need to be met for the access to be granted.
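As a rough illustration of the deposit advice above, the sketch below builds a small metadata record for a software release and checks that a licence is stated and that any access restriction comes with a reason and conditions. The field names and the `SoftwareRecord` and `problems` helpers are hypothetical, chosen for this example only; they are not the schema used by Zenodo, Apollo or any other repository, and the record contents are invented placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SoftwareRecord:
    """Hypothetical metadata record for one software release (illustrative fields only)."""
    title: str
    creators: List[str]
    version: str
    licence: str                   # e.g. an SPDX identifier such as "MIT"
    access: str = "open"           # "open" or "restricted"
    restriction_reason: str = ""   # why access is restricted, if it is
    access_conditions: str = ""    # what must be met for access to be granted


def problems(record: SoftwareRecord) -> List[str]:
    """Return a list of human-readable issues; an empty list means the record looks complete."""
    issues = []
    if not record.licence.strip():
        issues.append("no licence stated")
    if record.access not in ("open", "restricted"):
        issues.append("access must be 'open' or 'restricted'")
    if record.access == "restricted":
        if not record.restriction_reason.strip():
            issues.append("restricted access needs a stated reason")
        if not record.access_conditions.strip():
            issues.append("restricted access needs conditions for granting access")
    return issues


# Placeholder example: an openly shared analysis script.
record = SoftwareRecord(
    title="Example analysis scripts",
    creators=["A. Researcher"],
    version="1.0.0",
    licence="MIT",
)
print(json.dumps(asdict(record), indent=2))
print(problems(record))
```

A check like this could run before each deposit, so that a missing licence or an unexplained restriction is caught while the paper is still being written.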

Valuable resources to help make right decisions

Both Neil and Shoaib agreed that proper management and licensing of software might be sometimes complicated. Therefore, they recommended various resources and tools to provide guidance for researchers:

Case studies guidance on what should be shared in research projects using various forms of software – prepared by the Software Sustainability Institute;
Attorney-verified interpretations of what the most popular licence terms mean – prepared by tldrLegal;
Frequently Asked Questions about software sharing – with answers validated by Ben Ryan from the EPSRC.

The workshop was organised in collaboration with Stephen Eglen from the Department of Applied Mathematics and Theoretical Physics (University of Cambridge) who chaired the meeting, and with Andrea Kells from the Computer Lab (University of Cambridge) who hosted the workshop.

The Research Data Service is also providing various other opportunities for our research community to pose questions directly of the funding bodies. We invited Ben Ryan from the EPSRC to come to speak to a group of researchers in May and the resulting validated FAQs are now published on our research data management website. Similarly, researchers met with Michael Ball from the BBSRC in August.

These opportunities are being embraced by our research community.

*About the speakers

Shoaib Sufi – Community Lead at the Software Sustainability Institute

Shoaib leads the Institute’s community engagement activities and strategies. Graduating in Computer Science from the University of Manchester in 1997, he has worked in the commercial sector as a systems programmer and then as a software developer, metadata architect and eventually a project manager at the Science and Technology Facilities Council (STFC).

Neil Chue Hong – Director at the Software Sustainability Institute

Neil is the founding Director of the Software Sustainability Institute. Graduating with an MPhys in Computational Physics from the University of Edinburgh, he began his career at EPCC, becoming Project Manager there in 2003. During this time he led the Data Access and Integration projects (OGSA-DAI and DAIT), and collaborated in many e-Science projects, including the EU FP6 NextGRID project.

Published 21 October 2015
Written by Dr Marta Teperek

The purpose, practicalities, pitfalls and policies of managing and sharing data in the UK

Posted on October 20, 2015 | Category: Uncategorized | Tags: BBSRC, data journals, EPSRC, figshare, file naming protocol, funders, metadata, open data, policy, RCUK, repository, researcher | Author: Office of Scholarly Communication

As part of the Office of Scholarly Communication Open Access Week celebrations, we are uploading a blog a day written by members of the team. Tuesday is a piece by Dr Danny Kingsley reflecting the talk she gave this morning to the Royal Society of Chemistry, Chemical Information and Computer Applications Group conference – Measurement, Information and Innovation: Digital Disruption in the Chemical Sciences.

The data policy landscape

The policy position on data management in the UK is driven on many levels. Many institutions now have policies – an example is the Cambridge University Research Data Management Policy Framework. Increasingly publishers such as PLOS are requiring that research published in their journals is accompanied by the data underpinning the research. Some journals, such as Nature’s Scientific Data, are specifically data-only journals.

There has been a country-wide movement towards opening up data. Consultation on the Draft Concordat on Open Research Data released by the RCUK ended on 28 September. Cambridge coordinated a joint response to the Concordat with several other universities.

However, the real driver for action this year has been funder policies – specifically, that of the Engineering and Physical Sciences Research Council (EPSRC), which announced that it would start checking compliance from 1 May 2015 (and has begun doing so).

The devil is in the detail

While the Research Councils UK have RCUK Common Principles on Data Policy stating “Publicly funded research data are a public good (…), which should be made openly available with as few restrictions as possible”, these common principles are idiosyncratic when looked at from the individual council perspective, as the graphic on the second page of this document demonstrates.

There are variations on whether a data management plan is required, where the data should be stored, the level of support offered and even whether this can be funded through the grant (in most cases it can, but not all).

Places to share data

Open (cross-disciplinary) repositories

These include commercial options such as figshare, which is owned by Digital Science, which also produces the Symplectic Elements research management system and is an offshoot company of Macmillan/Springer.

There are also open source solutions such as Zenodo, developed by CERN: an open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.

Disciplinary repositories

There are a significant number of discipline-specific data repositories. In many ways these are the most natural place for data as disciplinary experts can curate the data. For example the Natural Environment Research Council (NERC) runs seven repositories.

The first repository ever created (in 1991) was arXiv, holding e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. The public functional genomics data repository is Gene Expression Omnibus. Social science data can be deposited in the UK Data Service. The Oxford Text Archive holds literary and linguistic texts for higher education.

Institutional options

Cambridge University is using its DSpace repository Apollo to store and share data. To date the largest dataset we have received is 28 GB – huge datasets need to sit externally to this repository. Not all institutions provide a data repository service. There are significant overheads associated with this activity both for the technology and the people to upload and curate the data.

Journals

Journals are increasingly requiring researchers to publish (or at least provide links to) their supporting data alongside their research articles. PLOS brought in their data sharing policy in December 2013.

There is also a growing selection of data-only journals on the market, for example Nature’s Scientific Data. Others include Journal of Open Archaeology Data, Open Health Data, Journal of Open Psychology Data, GigaScience, Biodiversity Data Journal and Earth System Science Data.

Aggregating services

The Australian National Data Service has built Research Data Australia (RDA) which collates and displays information about datasets held all over the country, both open and closed. As of 20 October, the RDA contains 115,823 datasets, of which 23,322 (about 20%) are open.

There appears to be little appetite yet for this sort of service in the UK, or at least for providing funds to create one.

What are we actually trying to achieve?

Cambridge University has taken a proactive approach to the RCUK data sharing policies by inviting funders to come and discuss issues with the research community. We have written up and published these discussions as an ‘In conversation with..’ series on our Unlocking Research blog.

In addition to clarification on some aspects of the policies, one of the questions we are trying to answer is what the actual goal of these policies is. Ben Ryan, the Senior Manager, Research Outcomes at the EPSRC, clarified that researchers needed to share:

the data that underpins publications
the data that validates research findings
the data that is worth keeping

To summarise the philosophical goals of the EPSRC policy:

The default position is ‘data should be open’
Published research findings should be testable
Maximise the impact of publicly funded research
Maintain public trust in science and research
They are trying to create a new research culture

Researcher responses

While these might seem like lofty and even admirable goals, that does not mean that academics have welcomed the news of their grant requirement to share data with open arms. Far from it in some cases.

A small selection of the responses we have received in our meetings with over 1500 researchers this year at Cambridge include:

What’s the minimum we can get away with?
This is crap
‘They’ are just doing this because ‘they’ can
But it will take a huge effort to get the data in a useable form
No-one will look at it
What a waste of time

This has prompted us to play a game we call ‘data excuse bingo’ at some of our Research Data Management Workshops – see slide 16.

We are trying to start at the end

The problem might be that everyone is fixated on sharing data at the end of the research process. However this is one of the lesser data management activities if data management begins at the beginning.

The second of our “In conversation…” series was with Michael Ball, the Strategy and Policy Manager at the Biotechnology and Biological Sciences Research Council (BBSRC). Amongst a long discussion about what exactly constitutes ‘data’ in the biological sciences, Michael emphasised that disciplines themselves must establish ways of dealing with data. This is the beginning of an ongoing process.

He noted that researchers need to consider how to deal with data from the beginning of a research project. Researchers can ask for money to manage data in the grant application (something which is currently quite rare).

The practice of sharing data requires the data to be: Accessible, Intelligible, Assessable and Reusable. So how do we achieve that?

Basics of Research Data Management

Good data management includes very basic practices such as:

Writing a research data management plan at the beginning of the research process – identifying all of these issues
Using a file naming protocol (including version control)
Backing up work in several places
Identifying any data that might be politically, personally or commercially sensitive
Determining who owns what data
Ensuring data that is being used for research across collaborations is shared in safe, secure and legal shared facilities, bearing in mind Export Control Legislation.
Having good metadata protocols
Using a reputable and reliable storage/sharing facility that offers persistent identifiers (DOIs)
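To make the file naming item in the list above concrete, here is a minimal sketch of one possible protocol with a built-in version number. The convention itself (project, description, date, two-digit version) and the helper names `build_name` and `next_version` are assumptions made up for this example, not a scheme recommended by any funder or by the University; any consistent convention agreed within a research group would serve the same purpose.

```python
import re
from datetime import date

# Hypothetical convention, assumed for illustration only:
#   <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def slug(text: str) -> str:
    """Lower-case text and collapse anything but a-z and 0-9 into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_name(project: str, description: str, created: date, version: int, ext: str) -> str:
    """Build a file name that follows the convention above, or raise ValueError."""
    if not 1 <= version <= 99:
        raise ValueError("version must be between 1 and 99")
    name = (
        f"{slug(project).replace('-', '')}_{slug(description)}_"
        f"{created:%Y%m%d}_v{version:02d}.{ext.lower().lstrip('.')}"
    )
    if NAME_PATTERN.match(name) is None:
        raise ValueError(f"cannot build a valid name from the given parts: {name!r}")
    return name


def next_version(name: str) -> str:
    """Return the same file name with its version number increased by one."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    version = int(match["version"]) + 1
    if version > 99:
        raise ValueError("version would exceed 99")
    return (
        f"{match['project']}_{match['description']}_"
        f"{match['date']}_v{version:02d}.{match['ext']}"
    )


first = build_name("Yeast Imaging", "Raw counts (plate 3)", date(2015, 10, 20), 1, "CSV")
print(first)                # yeastimaging_raw-counts-plate-3_20151020_v01.csv
print(next_version(first))  # yeastimaging_raw-counts-plate-3_20151020_v02.csv
```

Keeping the date and version in fixed positions means files sort sensibly in a directory listing, and a name that does not match the pattern can be flagged before it reaches a shared drive or repository.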

Who owns the data?

This is an interesting question. EPSRC policies state that researchers should ensure that collaborators are aware of the sharing requirement before they embark on research. Then there are questions about who within the team might own the data – with again a suggestion to come to some sort of agreement before work starts.

Are all collaborators equal ‘owners’? Or does the Principal Investigator have a 50% stake, with the remainder split amongst the PhD students making up the rest of the team? You might want to talk to your legal advisors and/or your research office about this issue.

There is also the related issue here of developing author-contribution transparency. Do you include author contribution statements in your articles?

Staffing issues

If Michael Ball is correct and very few researchers are asking for funds associated with managing data, it is reasonable to assume that data is being managed in an ad hoc way – with reliance on the computer savvy postdoc the project hired …

Required skillsets for managing and curating data

There is a considerable range of skill sets associated with managing data, and these have been described by the Digital Curation Centre as data creator, data scientist, data librarian, data manager.

Alma Swan and Sheridan Brown’s 2008 report to Jisc on ‘The skills, role and career structure of data scientists and curators: an assessment of current practice and future needs’ described these as:

Data Creator: Researchers with domain expertise who produce data. These people may have a high level of expertise in handling, manipulating and using data
Data Scientist: People who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology
Data Manager: Computer scientists, information technologists or information scientists and who take responsibility for computing facilities, storage, continuing access and preservation of data
Data Librarian: People originating from the library community, trained and specialising in the curation, preservation and archiving of data

There is a simple graphic that clearly shows how these roles relate to one another.

Certainly an increasing number of data scientist jobs are being advertised. A search for ‘Data scientist’ + ‘London’ on the job site Indeed on 18 October produced 1,405 results. So where are all of these people coming from?

Training available

The Swan and Brown study in 2008 noted that ‘People in data science roles face a big, continuing challenge in remaining properly skilled up’ and this remains the situation today, although there are now many more opportunities for training – a few are listed here:

The Digital Curation Centre offers data management and curation education and training.
MANTRA is a free online course from the Data Library staff in Information Services, University of Edinburgh. It is designed for those who manage digital data as part of a research project. It was crafted for the use of post-graduate students, early career researchers, and also information professionals. It is freely available on the web for anyone to explore on their own.
Data Scientist training for Librarians is a collection of notes and discussions about data work done by librarians. They ran an experimental course in August to teach “the latest tools for extracting, wrangling, storing, analysing, and visualising data”.

In addition, the professionalisation of these skills is beginning. For example, Data Science London is a community of data scientists that meets regularly to discuss data science ideas, concepts, and tools, methods and technologies used by many startups to analyse large scale data (big data), extract predictive insight, and exploit business opportunities from data products. Their website offers a list of free data science courses.

Cambridge University is one of the five founding university partners of the about to be launched Alan Turing Institute, which is intended to be the UK’s national institute for data science. The Institute will be addressing some of the issues with the data management skill gap.

Issues with sharing data

For all of the ‘feel good’ ideals about sharing data, and the processes being put in place to support this, there are some serious issues raised by researchers about the requirement to share data.

To start, there is a very real concern that the UK will become unattractive for collaborations. Why would a commercial collaborator choose a UK partner when a partner from elsewhere is not under obligation to share their data?

There have been some discussions at information sessions about the possible need to change the type of research being done to reduce the amount of data being produced because of the cost of long term storage of this data.

Indeed, there is discussion in some circles about whether applying for EPSRC funding is worth the hassle. It is a fair bet that none of these were intended outcomes of the RCUK policies.

Consequences of not sharing data

However, this does need to be a balanced strategy. There is a considerable argument that openness is related to academic integrity as it allows work to be verified and validated.

Here are examples in three disciplines where sharing data had a dramatic effect:

Medicine – having the data publicly available in two trials of deworming pills demonstrated that a population-wide deworming program did not improve school performance.
Economics – A study widely cited to justify budget cutting in the US had a mistake in the calculations which was only revealed when the Excel file was released.
Physics – It took 12.5 years to withdraw Jan Hendrik Schön’s work on ‘organic semiconductors’ because the reviewers were unable to replicate the results without access to the original data or lab books.

Conclusion

Sharing data offers great challenges to the research community, not least because it is less than clear what ‘data’ means in different disciplines. It will take some time for the research community to change its philosophy and practice. But the positives outweigh the negatives and with hope we will look back at this time as a short transition period.

Note: the slides from the talk are available on Slideshare.

Published 20 October 2015
Written by Dr Danny Kingsley