Tag Archives: open data

Good news stories about data sharing?

We have been speaking to researchers around the University recently about their funders' expectations for data management. This has raised the question of how best to convince people that this is a process that benefits society, rather than a waste of time or just yet another thing they are being 'forced to do' – the perspective of some of those we have spoken with.

Policy requirements

In general, most funders require a Research Data Management Plan to be developed at the beginning of a project – and then adhered to. But the Engineering and Physical Sciences Research Council (EPSRC) have upped the ante by introducing a policy requiring that papers published from May 2015 onwards resulting from funded research include a statement about where the supporting research data may be accessed. The data must be held in a secure storage facility with a persistent URL, and must remain available for 10 years from the date it was last accessed.

Carrot or stick?

While a funder policy does make researchers sit up and listen, there is a perception in the UK research community that this is yet another imposition on time-poor researchers. This is not surprising: there has recently been an acceleration of new rules about sharing and assessing research.

The Research Excellence Framework (REF) occurred last year, and many researchers are still 'recuperating'. Now the Higher Education Funding Council for England (HEFCE) is introducing a policy, effective April 2016, that any peer-reviewed article or conference paper to be included in the post-2014 REF must have been deposited in the institution's repository within three months of acceptance, or it cannot be counted. This is a 'green' open access policy.
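As a rough sketch of how such a deposit deadline might be checked, here is the three-month window approximated as 92 days – HEFCE's actual counting convention may differ, and the function name is purely illustrative:

```python
from datetime import date, timedelta

# ~3 months, approximated as 92 days; HEFCE's exact counting may differ
REF_WINDOW = timedelta(days=92)

def ref_eligible(accepted: date, deposited: date) -> bool:
    """True if the output was deposited within the window after acceptance."""
    return accepted <= deposited <= accepted + REF_WINDOW

# Deposited one month after acceptance: within the window
print(ref_eligible(date(2016, 5, 1), date(2016, 6, 1)))
# Deposited four months after acceptance: outside the window
print(ref_eligible(date(2016, 5, 1), date(2016, 9, 1)))
```

The point of a check like this is that eligibility is mechanical – a repository system could flag late deposits automatically at ingest.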

The Research Councils UK (RCUK) have had an open access policy in place for two years, introduced on 1 April 2013 as a result of the 2012 Finch Report. The RCUK policy states that funded research outputs must be made available open access, and deposit into a repository is permitted. At first glance this seems to align with the HEFCE policy; however, restrictions on the allowed embargo periods mean that in practice most articles must be made available gold open access – usually with the payment of an accompanying article processing charge. While these charges are supported by a block grant fund, managing them places a considerable burden on institutions.

There is also considerable confusion amongst researchers about what all these policies mean and how they relate to each other.

Data as a system

We are trying to find examples of how making research data available can help research and society. It is unrealistic to hope for something along the lines of Jack Andraka's breakthrough diagnostic test for pancreatic cancer, developed using only open access research.

That’s why I was pleased when Nicholas Gruen pointed me to a report he co-authored: Open for Business: How Open Data Can Help Achieve the G20 Growth Target – A Lateral Economics report commissioned by Omidyar Network – published in June 2014.

The report looks primarily at government data, but does consider access to data generated in publicly funded research. It makes some interesting observations about what can happen when data is made available. In particular, data can have properties at the system level, not just at the level of an individual data set.

The point is that if data does behave in this way, once a collection of data becomes sufficiently large then the addition of one more set of data could cause the “entire network to jump to a new state in which the connections and the payoffs change dramatically, perhaps by several orders of magnitude”.

Benefits of sharing data

The report also refers to a 2014 study, The Value and Impact of Data Sharing and Curation: A synthesis of three recent studies of UK research data centres. This work explored the value and impact of curating and sharing research data through three well-established UK research data centres – the Archaeology Data Service, the Economic and Social Data Service, and the British Atmospheric Data Centre.

In summarising the results, Beagrie and Houghton noted that their economic analysis indicated that:

  • Very significant increases in research, teaching and studying efficiency were realised by the users as a result of their use of the data centres;
  • The value to users exceeds the investment made in data sharing and curation via the centres in all three cases; and
  • By facilitating additional use, the data centres significantly increase the measurable returns on investment in the creation/collection of the data hosted.

So clearly there are good stories out there.

If you know of any good news stories that have arisen from sharing UK research output data we would love to hear them. Email us or leave a comment!

Interview with Nigel Shadbolt on The Life Scientific

Sir Nigel Shadbolt was interviewed about open data on 'The Life Scientific' this morning on BBC Radio 4.

The discussion ranged over his background and what got him interested in this area. The data discussed was mostly government public data (such as medical information or cyclist black spots) rather than data generated in research projects, but it was an interesting conversation nonetheless. A couple of items jumped out at me:

16:50 – When we talk about data, really we are talking about information... Data, information and knowledge are subtly different, and mostly when we talk about open data we are talking about information. Data (such as a number) only becomes information when it is placed in context. If you can do something with the information it becomes knowledge – 'actionable information'. These are different strains of stuff that the computer holds. We need open information to build knowledge: the semantic web.

16:00 – Do the risks of making data available outweigh the benefits? And do we ask the general public’s opinion or just tell them that this is what we do? They want some sort of empowerment in this but often there is no empowerment.

29:00 – We are barely scratching the surface in terms of the insights available as we analyse and look for patterns in the information. We are living in a world that is increasingly emitting data – people are increasingly able to collect data on and off their phones (or supercomputers, depending on how you look at it). This data richness demands applications we haven't yet thought of and new ways of analysing the information.

Listen to the half-hour interview here.

Blurb from the BBC webpage:

Sir Nigel Shadbolt, Professor of Artificial Intelligence at Southampton University, believes in the power of open data. With Sir Tim Berners-Lee he persuaded two UK Prime Ministers of the importance of letting us all get our hands on information that's been collected about us by the government and other organisations. But this has brought him into conflict with people who think there's money to be made from this data. And open data raises issues of privacy.

Nigel Shadbolt talks to Jim al-Khalili about how a degree in psychology and philosophy led to a career researching artificial intelligence and a passion for open data.

Published 14 April 2015
Written by Dr Danny Kingsley
Creative Commons License

FORCE2015 observations & notes

This blog first appeared on the FORCE2015 website on the 14 January 2015

First, a disclaimer. This blog is not an attempt to summarise everything that happened at FORCE2015 – I'll leave that to others. The Twitter feed under #FORCE2015 contains an interesting side discussion, and the event was livestreamed, with recordings of the individual sessions available here in two weeks – so you can always check bits out for yourself.

So this is a blog about the things that I, as a researcher in scholarly communication working in university administration (with a nod to my previous life as a science communicator), found interesting. It is a small representative sample of the whole.

This was my first FORCE event. The conference has occurred annually since the first event, FORCE11, which took place in August 2011 after a "Beyond the PDF" workshop in January that year. It was nice to have such a broad group of attendees: researchers and innovators (often people were both), research funders, publishers, geeks and administrators, all sharing their ideas. Interestingly, there were only a few librarians – this in itself makes the conference stand out. Sarah Thomas, Vice President of the Harvard Library, observed this, noting that she is shocked that there are usually only librarians at the table at these sorts of events.

To give an idea of the group: when the question was asked of who had received a grant from the various funding bodies, I was in a small minority in not putting up my hand. These are actively engaged researchers.

I am going to explore some of the themes of the conference here, including:

  • Library issues
  • The data challenge
  • New types of publishing
  • Wider scholarly communication issues, and
  • The impenetrability of scientific literature

Bigger questions about effecting change

Responsibility

Whose responsibility is it to effect change in the scholarly communication space? Funders say they are looking to the community for direction. Publishers say they are looking to authors and editors for direction. Authors say they are looking to find out what they are supposed to do. We are all responsible: funding is not the domain of the funders alone; it is interdependent.

What is old is still old

The Journal Incubator team asked the editorial board members of the new journal "Culture of Community" to identify what they thought would attract people to the journal. None mentioned the modern, open technology of its publishing practices. All the points they identified were traditional: peer review, high indexing, PDF formatting, etc. Take-home message: authors are not interested in the back-end technology of a journal; they just want the thing to work. This underlines the need to ENABLE, not ENGAGE.

The way forward

The way forward is threefold, incorporating Community, Policy and Infrastructure. Moving forward will require initiatives focused on sustainability, collaboration and training.

Library issues

Future library

Sarah Thomas, the Vice President of the Harvard Library, spoke about "Libraries at Scale or Dinosaurs Disrupted". She had some very interesting things to say about the role of the library in the future:

  • Traditional libraries are not sustainable. Acquisition, cataloguing and storage of publications doesn’t scale.
  • We need to operate at scale, and focus on this century's challenges, not last century's, by developing new priorities and reallocating resources to them, using approaches that dramatically increase outputs.
  • There is very little outreach from libraries into the community – we are not engaging broadly, except in the mode of "we are the experts; you come to us and we will tell you what to do".
  • We must let go of our outdated systems – such as insufficiently automated practices, redundant effort, 'just in case' coverage.
  • We must let go of our outdated practices – a competitive, proprietary approach. We need to engage collaborators to advance goals.
  • Open up hidden collections and maximise access to what we have.
  • Start doing research into what we have and illuminate the contents in ways we never could in a manual world, using visualisation and digital tools.

Future library workforce

There was also some discussion about the skills a future library workforce needs to have:

  • We need an agile workforce, with skills training in areas such as data science and social media, and with these skills reflected in performance goals.
  • We need to invest in 21st-century skillsets. The workforce we should be hiring includes:
    • Metadata librarian
    • Online learning librarians
    • Bioinformatics librarians
    • GIS specialist
    • Visualization librarian
    • Copyright advisor
    • Senior data research specialist
    • Data curation experts
    • Scholarly communications librarian
    • Quantitative data specialist
    • Faculty technology specialist
    • Subject specialist
Possible solution?

The Council on Library and Information Resources offers postdoctoral fellowships: CLIR Postdoctoral Fellows work on projects that forge and strengthen connections among library collections, educational technologies and current research. The program offers recent PhD graduates the chance to help develop research tools, resources, and services while exploring new career opportunities.

Possible opportunity to observe change?

In summing up the conference, Phil Bourne said there is a major opportunity approaching: both the European Bioinformatics Institute in the EU and the National Library of Medicine in the US will soon have new leadership, and both are receiving recommendations on what the institution of the future should look like.

The library has a tradition of supporting the community, serving as infrastructure to maintain knowledge and, in the case of the National Library of Medicine, setting policy. If they are going to reinvent this institution, we need to watch what it will look like in the future.

The future library (or whatever it will be called) should curate, catalogue, preserve and disseminate the complete digital research lifecycle. This is something we need to move towards. The fact that there is an institution that might move towards this is very exciting.

The data challenge

Data was discussed at many points during the conference, with some data solutions/innovations showcased:

  • Harvard has the Harvard Dataverse Network, a repository for sharing data, and through its "Data Management at Harvard" guidelines and policies is cranking up its investment in managing data.
  • The Resource Identification Initiative is designed to help researchers sufficiently cite the key resources used to produce the scientific findings reported in the biomedical literature.
  • bioCADDIE is trying to do for data what PubMed Central has done for publications, using a Data Discovery Index. The goal of the project is to engage a broad community of stakeholders in the development of a biomedical and healthCAre Data Discovery and Indexing Ecosystem (bioCADDIE).

The National Science Foundation data policy

Amy Friedlander spoke about "The Long View". She posed some questions:

  • Must managing data be co-located with storing the data?
  • Who gets access to what, and when?
  • Who and what can I trust?
  • What do we store it in? Where do we put things, where do they need to be?

The NSF doesn't make a policy for each institution; it makes one NSF Data Sharing Policy that works more or less well across all disciplines. There is a diversity of sciences with heterogeneous research results, a range of institutions, professional societies, stewardship organisations and publishers, and multiple funding streams.

There are two contact points: when a grant is awarded, and when investigators report. If we focus on publications, we can develop an architecture that extends to other kinds of research products, integrating internal systems within the enterprise architecture to minimise the burden on investigators and program staff.

Take-home message: the great future utopia (my word) is that we upload once and use many times – an environment in which all publications are linked to the underlying evidence (data), analytical tools and software.

New types of publishing

There were several publishing innovations showcased.

Journal Incubator

The University of Lethbridge has a 'journal incubator', developed with the goal of sustaining scholarly communication and open, digital access. The incubator trains graduate students in the tasks of journal editorship, so journals can be provided completely free of charge.

F1000 Research Ltd – ‘living figures’

Research is an ongoing activity, but you would not think so from the way we publish, which is still very much built around the static print object. F1000's idea is that data is embedded in the article: they have developed a tool that lets you see the data behind the article.

Many figures don't need to exist in themselves – what you need is the underlying data. With 'living figures' in the paper, research labs can submit data directly on top of a figure, to show whether the result is reproducible or not. This offers interesting opportunities: bringing static research figures to life as a "living collection", with articles from different labs gathered around the data. The tools and technologies are already out there.

Collabra – giving back to the community

Collabra, a new University of California Press open access journal, will share a proportion of its APC with researchers and reviewers. Of the $875 APC, $250 goes into a fund. Editors' and reviewers' earnings accrue in the fund, and there is a payout to the research community, which can decide what to do with it. The choices are to:

  • Receive it electronically
  • Pay it forward to pay APCs in future
  • Pay it forward to institution’s OA fund.

This is a journal where reviewers get paid – or can elect to pay the money forward. The aim is to see whether everyone can benefit from the journal, with no lock-in: benefit through partnerships.
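The APC split described above can be sketched in a few lines. The $875 APC and the $250 fund share come from the post; the function names and the 100-article scenario are purely illustrative:

```python
APC = 875          # Collabra article processing charge in USD (from the post)
FUND_SHARE = 250   # portion of each APC paid into the community fund

def fund_contribution(articles: int) -> int:
    """Total paid into the research-community fund for a number of articles."""
    return articles * FUND_SHARE

def retained_revenue(articles: int) -> int:
    """APC income remaining after the fund share is set aside."""
    return articles * (APC - FUND_SHARE)

# For 100 published articles:
print(fund_contribution(100))   # 25000
print(retained_revenue(100))    # 62500
```

At this rate, just under 30% of each APC flows back to the community; how the fund is then divided among editors, reviewers and institutional OA funds is the part the journal leaves to the recipients.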

Future publishing – a single XML file

Rather than replicating old publishing processes electronically, the dream is to have one single XML file in the cloud. There is role-based access to modify the work (by editors, reviewers, etc.), and at the end that version is the one that gets published. Everything is in the XML, with automatic conversion at the end. References become fully structured XML at the click of a button, tags are coloured, and changes can be tracked. The journal editor gets a deep link to anything needing attention and can accept or reject changes. The XML can be converted to a PDF with high-quality typography, fully automatically.
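The single-file idea can be illustrated with a minimal sketch: one canonical XML document from which output formats are generated automatically. The element names here are hypothetical, loosely modelled on JATS-style markup rather than any particular publisher's schema:

```python
import xml.etree.ElementTree as ET

# The single canonical article: all content lives in one XML file.
ARTICLE_XML = """
<article>
  <front><title>Open Data and Growth</title></front>
  <body><p>Data can have system-level properties.</p></body>
  <back>
    <ref id="r1">Beagrie and Houghton 2014</ref>
  </back>
</article>
"""

def to_html(xml_text: str) -> str:
    """Generate one output format (simple HTML) from the canonical XML."""
    root = ET.fromstring(xml_text)
    title = root.findtext("front/title")
    paras = "".join(f"<p>{p.text}</p>" for p in root.iterfind("body/p"))
    refs = "".join(f"<li>{r.text}</li>" for r in root.iterfind("back/ref"))
    return f"<h1>{title}</h1>{paras}<ol>{refs}</ol>"

print(to_html(ARTICLE_XML))
```

A real pipeline would add further renderers (PDF, EPUB) over the same source, plus role-based access control on edits; the point of the sketch is that every format is derived, so there is only ever one version of record to edit.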

Wider scholarly communication issues

This year is the 350th anniversary of the first scientific journal*, Philosophical Transactions of the Royal Society. Oxford holds a couple of copies of the journal, and there was an excursion for those interested in seeing it.

It is a good time to look at the future.

Does reproducibility matter?

Something widely discussed was the question of whether research should be reproducible, which raised the following points:

  • The idea of a single well defined scientific method and thus an incremental, cumulative, scientific process is debatable.
  • Reproducibility and robustness are slightly different. Robustness of the data may be key.
  • There are no standards with a computational result that can ensure we have comparable experiments.

Possible solution?

Later in the conference a new service that tries to address the diversity of existing lab software was showcased: Riffyn, a cloud-based software platform – a CAD for experiments. Researchers have a unified experimental view of all their processes and data, and can update it themselves, without relying on IT staff.

Credit where credit is due

I found the reproducibility discussion very interesting, as I did the discussion about authorship and attribution, which posed the following:

  • If it is an acknowledgement system everyone should be on it
  • Authorship is a proxy for scientific responsibility. We are using the wrong word.
  • When crediting research we don’t make distinctions between contributions. Citation is not the problem, contribution is.
  • Which building blocks of a research project do we not give credit for? And which ones only get indirect credit? How many skills would we expect one person to have?
  • The problem with software credit is we are not acknowledging the contributors, so we are breaking the reward mechanism
  • Of researchers in research-intensive universities, 92% use software; of those, 69% say their work would be impossible without it. Yet 71% of researchers have no formal software development training. We need standard research computing training.

Possible solutions
  • The Open Science Framework provides documentation for the whole research process, and this can determine how credit should be apportioned.
  • Project CRediT has come up with a taxonomy of contribution terms, proposing to take advantage of infrastructure that already exists: Mozilla Open Badges. If you hear or see the word 'badges', think 'digital credit'.

The impenetrability of scientific literature

Astrophysicist Chris Lintott discussed citizen science, specifically the phenomenally successful programme Galaxy Zoo, which taps into a massive group of interested amateur astronomers to help classify galaxies by their shape – something humans do better than machines.

What was interesting was the challenge Chris identified: amateur astronomers quickly become 'expert' amateurs, and the system has built ways for them to communicate with each other and with scientists. The problem is that the astronomical literature is simply impenetrable to these (informed) citizens.

The scientific literature is the 'threshold fear' for the public. This raises some interesting questions about access – and the need for some form of moderator. One suggestion is some form of lay summary of the research paper: PLOS Medicine have an editor's summary for papers (Nature do this for some papers, and the BMJ are pretty strong on this too).

Take-home message: by designing a set of scholarly communication tools for citizen scientists, we improve communication for all scientists. This is an interesting way to think about how we, as researchers, want to browse scholarly papers ourselves.

*Yes, I know that the French Journal des scavans was published before this, but it was broader in focus – hence the qualifier 'first scientific journal'.
Published 18 March 2015
Written by Dr Danny Kingsley
Creative Commons License