All posts by Office of Scholarly Communication

The case for Open Research: reproducibility, retractions & retrospective hypotheses

This is the third instalment of ‘The case for Open Research’ series of blogs exploring the problems with Scholarly Communication caused by having a single value point in research – publication in a high impact journal. The first post explored the mis-measurement of researchers and the second looked at issues with authorship.

This blog will explore the accuracy of the research record, including the ability (or otherwise) to reproduce research that has been published, what happens if research is retracted, and a concerning trend towards altering hypotheses in light of the data that is produced.

Science is thought to progress by building knowledge through questioning, testing and checking work. The idea of ‘standing on the shoulders of giants’ summarises this – we discover truth by building on previous discoveries. But scientists are very rarely rewarded for being right; they are rewarded for publishing in certain journals and for getting grants. This can result in distortion of the science.

How does this manifest? The Nine Circles of Scientific Hell describes questionable research practices that occur, ranging from Overselling, Post-Hoc storytelling, p-value Fishing, Creative use of Outliers to Non or Partial Publication of Data. We will explore some of these below. (Note this article appears in a special issue of Perspectives on Psychological Science on the Replicability in Psychological Science – which contains many other interesting articles).

Much as we like to think of science as an objective activity, it is not. Scientists are supposed to be impartial observers, but in reality they need to get grants, and publish papers to get promoted to more ‘glamorous institutions’. This was the observation of Professor Marcus Munafo in his presentation ‘Scientific Ecosystems and Research Reproducibility’ at the Research Libraries UK conference held earlier this year (the link will take you to videos of the presentations). Munafo observed that scientists are rarely rewarded for being right, so the scientific record is being distorted by the scientific ecosystem.

Munafo, a Biological Psychologist at Bristol University, noted that research, particularly in the biomedical sciences, ‘might not be as robust as we might have hoped’.

The reproducibility crisis

A recent survey of over 1500 scientists by Nature tried to answer the question “Is there a reproducibility crisis?” The answer is yes, but whether that matters appears to be debatable: “Although 52% of those surveyed agree that there is a significant ‘crisis’ of reproducibility, less than 31% think that failure to reproduce published results means that the result is probably wrong, and most say that they still trust the published literature.”

There are certainly plenty of examples of the inability to reproduce findings. Pharmaceutical research can be fraught. Some research into potential drug targets found that in almost two-thirds of the projects looked at, there were inconsistencies between published data and the data resulting from attempts to reproduce the findings. 

There are implications for medical research as well. A study published last month looked at functional MRI (fMRI), noting that when analysing data using different experimental designs they should in theory find a false-positive rate of 5% (matching the significance threshold of a p-value of less than 0.05, which is conventionally described as statistically significant). However they found “the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.”

A 2013 survey of cancer researchers found that approximately half of respondents had experienced at least one episode of the inability to reproduce published data. Of those people who followed this up with the original authors, most were unable to determine why the work was not reproducible. Some of those original authors were (politely) described as ‘less than “collegial”’.

So what factors are at play here? Partly it is due to the personal investment in a particular field. A 2012 study of authors of significant medical studies concluded that: “Researchers are influenced by their own investment in the field, when interpreting a meta-analysis that includes their own study. Authors who published significant results are more likely to believe that a strong association exists compared with methodologists.”

This was also a factor in a study Why Most Published Research Findings Are False that considered the way research studies are constructed. This work found that “for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias.”

Psychology is a discipline where there is a strong emphasis on novelty, discovery and finding something that has a p-value of less than 0.05. There is such an issue with reproducibility in psychology that there are large efforts to try and reproduce psychological studies to estimate the reproducibility of the research. The Association for Psychological Science has launched a new article type of Registered Replication Reports which consists of “multi-lab, high-quality replications of important experiments in psychological science along with comments by the authors of the original studies”.

This is a good initiative, although there might be some resistance to this type of scrutiny. Something that was interesting from the Nature survey on reproducibility was the question of what happened when researchers attempted to publish a replication study. Note that only a few respondents had done this, possibly because incentives to publish positive replications are low and journals can be reluctant to publish negative findings. The study found that “several respondents who had published a failed replication said that editors and reviewers demanded that they play down comparisons with the original study”.

What is causing this distortion of the research? It is the emphasis on publication of novel results in high impact journals. There is no reward for publishing null results or negative findings.

HARKing problem

The p-value came up again in a discussion about HARKing at this year’s FORCE2016 conference (HARK stands for Hypothesising After the Results are Known – a term coined in 1998).

In his presentation at FORCE2016 Eric Turner, Associate Professor at OHSU, spoke about HARKing (see this video 37 minutes onward). The process is that the researcher conceives the study, writes the protocol up for their eyes only, with a hypothesis, and then collects lots of other data – ‘the more the merrier’ according to Turner. Then the researcher runs the study and analyses the data. If there is enough data, the researcher can try alternative methods and can play with statistics. ‘You can torture the data and it will confess to anything’ noted Turner. At some point the p-value will come out below 0.05. Only then does the research get written up.
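To see why ‘at some point the p-value will come out below 0.05’, here is a minimal sketch of the arithmetic. It is an illustration only, not something from Turner’s talk: the function name is hypothetical, and it assumes every analysis is an independent test of an effect that does not really exist. Repeated analyses of one dataset are usually correlated, so real numbers will differ.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` analyses looks 'significant'.

    Assumption (for illustration only): each analysis is an independent test
    of a true null hypothesis, so each has probability `alpha` of producing
    a p-value below the threshold purely by chance.
    """
    if num_tests < 0:
        raise ValueError("num_tests must not be negative")
    # Probability that every test stays above the threshold is (1 - alpha)^n,
    # so the chance of at least one spurious 'hit' is the complement.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20, 60):
        print(f"{k:>2} analyses: {chance_of_false_positive(k):.0%} chance of at least one p < 0.05")
```

Under these assumptions a single pre-specified analysis has a 5% chance of a spurious result, while twenty alternative analyses push that to roughly 64%.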

Turner noted that he was talking about the kind of research where the work is trying to confirm a hypothesis (like clinical trials). This is different to hypothesis-generating research.

In the US, clinical trials with human participants must be registered with the Food and Drug Administration (FDA) so it is possible to see the results of all trials. Turner talked about his 2008 study looking at antidepressant trials, where the journal version of the results supported the general view that antidepressants always beat placebo. However when they looked at the FDA version of all of the studies of the same drugs it happened that half of the studies were positive and half were not positive. The published record does not reflect the reality.

The majority of the negative studies were simply not published, but 11 of the papers had been ‘spun’ from negative to positive. These papers had a median impact factor of 5 and median citations of 68 – these were highly influential articles. As Turner noted ‘HARKing is deceptively easy’.

This perspective is supported by the finding that a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. Indeed Munafo noted that over 90% of the psychological literature finds what it purports to set out to do. Either the research being undertaken is extraordinarily mundane, or something is wrong.

Increase in retractions

So what happens when it is discovered that something that is published is incorrect? Journals do have a system which allows for the retraction of papers, and this is a practice which has been increasing over the past few years. Research looking at why the number of retractions has increased found that it was partly due to lower barriers to publication of flawed articles. In addition papers are now being retracted for issues like plagiarism and retractions are now happening more quickly.

Retraction Watch is a service which tracks retractions ‘as a window into the scientific process’. It is enlightening reading with several stories published every day.

An analysis of correction rates in the chemical literature found that the correction rate averaged about 1.4 percent for the journals examined. While there were numerous types of corrections, chemical structures, omission of relevant references, and data errors were some of the most frequent types of published corrections. Corrections are not the same as retractions, but they are significant.

There is some evidence to show that the higher the impact factor of the journal a work is published in, the higher the chance it will be retracted. A 2011 study showed a direct correlation between impact factor and the number of retractions, with New England Journal of Medicine topping the list. This situation has led to claims that the top ranking journals publish the least reliable science.

A study conducted earlier this year demonstrated that there are no commonly agreed definitions of academic integrity and malpractice. (I should note that amongst other findings the study found 17.9% (± 6.1%) of respondents reported having fabricated research data. This is almost 1 in 5 researchers. However there have been some strong criticisms of the methodology.)

There are questions about how retractions should be managed. In the print era it was not unheard of for library staff to put stickers into printed journals notifying a retraction. But in the ‘electronic age’, when the record can be erased, is this the right thing to do, asked one author in 2002, given that erasing the article entirely is amending history? The Committee on Publication Ethics (COPE) does have some guidelines for managing retractions which suggest the retraction be linked to the retracted article wherever possible.

However, from a reader’s perspective, even if an article is retracted this might not be obvious. In 2003* a survey of 43 online journals found 17 had no links between the original articles and later corrections. When present, hyperlinks between articles and errata showed patterns in presentation style, but lacked consistency. There are some good examples – such as Science Citation Index but there was a lack of indexing in INSPEC, and a lack of retrieval with SciFinder Scholar.

[*Note this originally said 2013, amended 2 September 2016]

Conclusion

All of this paints a pretty bleak picture. In some disciplines the pressure to publish novel results in high impact journals results in the academic record being ‘selectively curated’ at best. At worst it results in deliberate manipulation of results. And if mistakes are picked up there is no guarantee that this will be made obvious to the reader.

This all stems from the need to publish novel results in high impact journals for career progression. And when those high impact journals can be shown to be publishing a significant amount of subsequently debunked work, then the value of them as a goal for publication comes into serious question.

The next instalment in this series will look at gatekeeping in research – peer review.

Published 14 July 2016
Written by Dr Danny Kingsley
Creative Commons License

The case for Open Research: the authorship problem

This is the second in a blog series about why we need to move towards Open Research. The first post about the mis-measurement problem considered issues with assessment. We now turn our attention to problems with authorship. Note that as before this is a topic of research in itself – and there is a rich vein of literature to be mined here for the interested observer.

Hyperauthorship

In May last year a high energy physics paper was published with over 5,000 authors. Of the 33 pages in this article, the paper occupied nine with the remainder listing the authors. This paper caused something of a storm of protest about ‘hyperauthorship’ (a term coined in 2001 by Blaise Cronin).

Nature published a news story on it, which was followed a week later by similar stories decrying the problem. The Independent published a story with the angle that many people are just coasting along without contributing. The Conversation’s take on the story looked at the challenge of effectively rewarding researchers. The Times Higher Education was a bit slower off the mark, in August publishing a story questioning whether mass authorship was destroying the credibility of papers.

This paper was featured in a keynote talk given at this year’s FORCE2016 conference. Associate Professor Cassidy Sugimoto from the School of Informatics and Computing, Indiana University Bloomington spoke about ‘Structural Disruptions in the Reward System of Science’ (video here). She noted that authorship is the coin of the realm, the pivot point of the whole of the scientific system, and this has resulted in growth in the number of authors listed on a paper.

Sugimoto asked: What does ‘authorship’ mean when there are more authors than words in a document? This type of mass authorship raises concerns about fraud and attribution. Who is responsible if something goes wrong?

The authorship ‘proxy for credit’ problem

Of course not all of those 5,000 people actually contributed to the writing of the article – the activity we would normally associate with the word ‘authorship’. Scientific authorship does not follow the logic of literary authorship because of the nature of what is being written about.

In 1998 Biagioli (who has literally written the book on Scientific Authorship or at least edited it) in a paper called ‘The Instability of Authorship: Credit and Responsibility in Contemporary Biomedicine’ said that “the kind of credit held by a scientific author cannot be exchanged for money because nature (or claims about it) cannot be a form of private property, but belongs in the public domain”.

Facts cannot be copyrighted. The inability to write for direct financial remuneration in academia has implications for responsibility (addressed further down), but first let’s look at the issue of academic credit.

When we say ‘author’ what do we mean in this context? Often people are named as ‘authors’ on a paper because their inclusion will help to have the paper accepted, or it is a token thanks for providing the grant funding for the work. These practices are referred to as ‘gift authorship’, where co-authorship is awarded to a person who has not contributed significantly to the study.

In an attempt to stop some of the more questionable practices above, the International Committee of Medical Journal Editors (ICMJE) has defined what it means to be an author, saying that authorship should be based on:

  • a substantial contribution
  • drafting the work
  • giving final approval and
  • agreeing to be accountable for the integrity of the work.

The problem, as we keep seeing, is that authorship on a publication is the only thing that counts for reward. This means that ‘authorship’ is used as a proxy for crediting people’s contribution to the study.

Identifying contributions

Listing all of the people who had something to do with a research project as ‘authors’ on the final publication fails to credit different aspects of the labour involved in the research. In an attempt to address this, PLOS asks for the different contributions by those named on a paper to be defined on articles, with their guidelines suggesting categories such as Data Curation, Methodology, Software, Formal Analysis and Supervision (amongst many).

Sugimoto has conducted some research to find what this reveals about what people are contributing to scientific labour. In an analysis of PLOS data on contributorship, her team showed that in most disciplines the labour was distributed. This means that often the person doing the experiment is not the person who is writing up the work. (I should note that I was rather taken aback by this when it arose in interviews I conducted for my PhD).

It is not particularly surprising that in the Arts, Humanities and Social Sciences the listed ‘author’ is most often the person who wrote the paper. However in Clinical Medicine, Biomedicine or Biology very few authors are associated with the task of writing. (As an aside, the analysis found women are disproportionately likely to be doing the experimentation, and men are more likely to be authoring, conceiving experimentation or obtaining resources.)

So, would it not be better if rather than placing the only emphasis on authorship of journal articles in high impact journals, we were able to reward people for different contributions to the research?

And while everyone takes credit, not all people take responsibility.

Authorship – taking responsibility

It is not just the issue of the inability to copyright ‘facts of nature’ that makes copyright unusual in academia. The academic reward system works on the ‘academic gift principle’ – academics provide the writing, the editing and the peer review for free and do not expect payment. The ‘reward’ is academic esteem.

This arrangement can seem very odd to an outsider who is used to the idea of work for hire. But there are broader implications than what is perceived to be ‘fair’ – and these relate to accountability. It is much more difficult to sue a researcher for making incorrect statements than it is to sue a person who writes for money (like a journalist).

Let us take a short meander into the world of academic fraud. Possibly the biggest and certainly highly contentious case was Andrew Wakefield and the discredited (and retracted) claim that the MMR vaccine was associated with autism in children. This has been discussed at great length elsewhere – the latest study debunking the claim was published last year. Partly because of the way science is credited and copyright is handled, there were minimal repercussions for Wakefield. He is barred from practising medicine in the UK, but enjoys a career on the talkback circuit in the US. Recently a film about the MMR claims, directed by Wakefield, was briefly shown at the Tribeca film festival before protests saw it removed from the programme.

Another high profile case is Diederik Stapel, a Dutch social psychologist who entirely fabricated his data over many years. Despite several doctoral students’ work being based on this data and over 70 articles having to be retracted, there were no charges laid. The only consequence he faced was having his professorship stripped.

Sometimes the consequences of fraud are tragic. A Japanese stem cell researcher, Haruko Obokata, who fabricated her results had her PhD stripped from her. There were no criminal charges laid but her supervisor committed suicide and the funding for the centre she was working in was cut. The work had been published in Nature, which then retracted the work and wrote an editorial about the situation.

The question of scientific accountability is so urgent that there was a call last year to criminalise scientific misconduct in this paper. Indeed things do seem to be changing slowly and there have been some high profile cases where scientific fraud has resulted in criminal charges being laid. A former University of Queensland academic is currently facing fraud related charges over his fabricated results from a study into Parkinson’s disease and multiple sclerosis. This time last year, Dong-Pyou Han, a former biomedical scientist at Iowa State University in Ames, was sentenced to 57 months for fabricating and falsifying data in HIV vaccine trials. Han has also been fined US$7.2 million. In both cases the issue is the misuse of grant funding rather than publication of false results.

The combination of great ‘reward’ from publication in high profile journals and little repercussion (other than having that ‘esteem’ taken away) has proven to be too great a temptation for some.

Conclusion

The need to publish in high impact journals has caused serious authorship issues – resulting in huge numbers of authors on some papers because it is the only way to allocate credit. And there is very little in the way we reward researchers that adequately allows for calling researchers to take responsibility when something goes wrong, in some cases resulting in serious fraud.

The next instalment in this series will look at ‘reproducibility, retractions and retrospective hypotheses’.

Published 12 July 2016
Written by Dr Danny Kingsley

The case for Open Research: the mis-measurement problem

Let’s face it. The biggest blockage we have to widespread Open Access is not researcher apathy, a lack of interoperable systems, or an unwillingness of publishers to engage (although these do each play some part) – it is the problem that the only thing that counts in academia is publication in a high impact journal.

This situation is causing multiple problems, from huge numbers of authors on papers, researchers cherry picking results and retrospectively applying hypotheses, to the reproducibility crisis and a surge in retractions.

This blog was intended to be an exploration of some solutions prefaced by a short overview of the issues. Rather depressingly, there was so much material the blog has had to be split up, with several parts describing the problem(s) before getting to the solutions.

Prepare yourself, this will be a bumpy ride. This first instalment looks at the reward system. The second instalment will consider authorship and credit. The third will look at reproducibility, retractions and retrospective hypotheses. The fourth asks if peer review is working. And the final blog will discuss some options for solving at least part of the problem.

I should note that this is not a comprehensive literature review. Every subheading of this blog series is a topic of considerable research on its own and there are many further examples available to the interested reader. I welcome debate, suggestions and links in the comments section of the blog(s).

Measurement for reward

The Journal Impact Factor

Let’s start with how researchers are measured. For decades academia has lived with the ‘Publish or Perish’ mantra which has spawned problems with poor publication practices. Today the pressure to be published in a high impact journal is stronger than ever.

A journal’s Impact Factor (JIF) is the average number of citations received in a given year by the articles the journal published in the previous two years. In other words, a journal’s JIF for a given year is calculated by taking the number of citations made that year to the articles published in the journal in the previous two years and then dividing by the total number of ‘citable’ items (such as articles and reviews) published in that journal in those two years.
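As a rough worked example of that calculation, the sketch below uses made-up numbers for a hypothetical journal. The function and variable names are mine, and the sketch ignores the question of which items actually get counted, which is one of the contested parts of the real calculation.

```python
def journal_impact_factor(citations_this_year: int, items_previous_two_years: int) -> float:
    """Citations made this year to the journal's output from the previous
    two years, divided by the number of items counted for those two years."""
    if items_previous_two_years <= 0:
        raise ValueError("the journal must have published at least one counted item")
    return citations_this_year / items_previous_two_years


# Hypothetical journal: 200 items published across 2014 and 2015,
# which together were cited 500 times during 2016.
jif_2016 = journal_impact_factor(citations_this_year=500, items_previous_two_years=200)
print(jif_2016)  # 2.5
```

Note that the result is a single average for the whole journal. It says nothing about how those 500 citations are spread across the 200 individual items.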

The JIF is compiled by Journal Citation Reports – which is owned by a commercial company, Thomson Reuters. The company announced its sale for $3.5 billion today.

This blog will not dig in any depth into the issues with the way the JIF is calculated, although there are some serious ones (see a 2006 paper I coauthored on this topic). Neither will it explore the problem of how much the JIF is gamed – from self-citations to journals insisting on a certain number of citations to publications within the same journal. Sufficient to say that each year a number of journals are removed from the index due to this type of behaviour. The record to date was in 2013, a year which saw 66 journals struck from the list. By comparison only 18 were suppressed in the most recent report.

There have been many, many criticisms of the Journal Impact Factor and its effects on scholarship. But the criticisms put forward a decade ago to the month by PLOS still ring true. One of the issues, PLOS argued, was that because Thomson Reuters does not make public the process for choosing ‘citable’ article types, this means “science is currently rated by a process that is itself unscientific, subjective, and secretive”.

Indeed last week a news article in Science and a related news article in Nature put forward exactly the same criticism. The stories referred to a paper: “A simple proposal for the publication of journal citation distributions” posted on BioRxiv. This described some comparative research undertaken to look at whether a reanalysis of the data would provide the same results as Thomson Reuters. It didn’t. The work found the citation distributions were “so skewed that up to 75% of the articles in any given journal had lower citation counts than the journal’s average number”. The authors likened using the JIF to determine the impact of a given article to ‘guesswork’.
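A small sketch shows how a skewed distribution produces this effect. The citation counts below are invented for illustration and are not taken from the BioRxiv paper; they simply describe a hypothetical journal where one article attracts most of the citations.

```python
from statistics import mean, median

# Hypothetical citation counts for ten articles in one journal.
# One highly cited article dominates the total (these are not real data).
citations = [0, 0, 1, 1, 2, 2, 3, 4, 5, 82]

average = mean(citations)    # the journal-level figure a JIF-style metric reports
typical = median(citations)  # what a 'middle' article actually receives

# Count how many articles fall below the journal's own average.
below_average = sum(1 for count in citations if count < average)
share_below = below_average / len(citations)

print(f"average citations per article: {average}")
print(f"median citations per article:  {typical}")
print(f"share of articles below the average: {share_below:.0%}")
```

In this made-up journal the average is 10 citations per article, yet nine of the ten articles sit below it and the median article has only two. The journal-level average is a poor guide to any individual article.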

Jon Tennant, in a 2015 blog stated that “The impact factor is one of the most mis-used metrics in the history of academia” and proposed an Open Letter template for researchers to “send to people in positions of power at different institutions, co-signed by as many academics as possible who believe in fairer and evidence-based assessment”. Tennant in turn references Stephen Curry’s 2012 blog which opened with the statement “The impact factor might have started out as a good idea, but its time has come and gone”.

There are many more, but I am sure you get the idea.

This is recognised as such a big problem that in 2012 the San Francisco Declaration on Research Assessment (DORA) was conceived with the intent to: ‘Put science into the assessment of research’. Over 12,000 individuals and over 700 organisations have signed the declaration to date supporting the call for a “need to assess research on its own merits rather than on the basis of the journal in which the research is published”.

If nothing else, there is clearly a problem with measuring the worth of something by considering the packaging and not the item itself. But the academy continues to use the JIF and criticisms continue to come thick and fast.

Clearly something is rotten in the state of Denmark.

Ditching the JIF

In his seminal book The Mismeasure of Man, where he debunks the science behind biological determinism, Stephen Jay Gould criticises “the myth that science itself is an objective enterprise, done properly only when scientists can shuck the constraints of their culture and view the world as it really is”. This observation is true of any metrics we apply to the valuing of research outputs. They are not objective, and not an accurate view. Any measurement tool causes its own problems.

An example of a non-JIF type of measurement is the increased emphasis on ‘excellence’ by funders and governments (the Research Excellence Framework in the UK and Excellence in Research for Australia being two examples). But ‘excellence rhetoric’ is counterproductive to good research, according to one argument which concludes that ‘excellence’ is a “pernicious and dangerous rhetoric that undermines the very foundations of good research and scholarship”.

The insistence on excellence, it can be argued, has spawned problems with reproducibility and fraud. In other words, the same problems that the JIF has caused.

There have been many other suggestions for ways to measure researchers, such as the h-index which has its own set of issues, and the Eigenfactor Score – these are only two of a myriad of options. But as the system changes, so does researcher behaviour. A clear example was in Australia when the funding mechanism moved to a simple count of research papers rather than any assessment of the value of those papers. This resulted in a marked increase in the number of papers being produced and a concurrent decrease in the overall quality as described in ‘Modifying publication practices in response to funding formulas’.

Clifford Lynch, the Executive Director of CNI noted in his welcome talk at the JISC-CNI event held at Wadham College, Oxford last week that using alternative metrics means we start running into issues about vendor lock-in and data confidentiality.

While alternative metrics might solve the ‘valuing the article rather than journal’ issue, they bring up problems of their own. HEFCE’s 2015 report on the future use of metrics in assessment noted that some indicators can be misused or ‘gamed’ – with journal impact factors, university rankings and citation counts put forward as three prominent examples. The report recommended that metrics should be updated in response to their potential effects. In deciding what metrics to use, the report recommended using the best possible data in terms of accuracy and scope, and that the data collection and analytical processes should be open and transparent to allow verification. It also suggested using a range of indicators.

Financial implications

What does this emphasis on particular publication outlets have to do with Open Access? Well a great deal as it happens. It is the big blocker to widespread change. As long as we continue with this emphasis we will not get any real traction with Open Access because it locks us into an old print paradigm of academia.

Much ink has been spilt over the cost of publication and the added cost of open access (some of it mine) which includes not just the cost of the article processing charges but the burden of administering multiple micropayments.

As I have said on numerous occasions (see here and here) funders paying for hybrid open access is expensive and has not resulted in journals flipping to gold (as a transition to a fully Open Access environment) despite this being a stated aim of the process. It makes sense from a publisher’s perspective not to flip journals – why, when researchers are under pressure to publish in high impact journals, and there is a new revenue stream associated with that publishing, would you kill the proverbial goose?

Indeed, a paper earlier this year argued that “Open Access has the potential to become unsustainable for research communities if high-cost options are allowed to continue to prevail in a widely unregulated scholarly publishing market.”

The problem, it can be argued, is that the infrastructure underpinning open access is ‘path dependent’ – a concept proposed in 1985 which explains how the set of decisions in the present is limited by the decisions one has made in the past, even though the contextual factors shaping the past decision no longer apply. Scholarly publishing is path-dependent, some authors argue, “because it still heavily depends on a few players that occupy crucial nodes in the scientific information infrastructure. In the past, these players were scientific associations, but now these players are commercial publishing companies”.

As long as the current reward system remains, the crucial nodes will not change and we are stuck.

Conclusion

So that covers some of the problems with the way we measure our researchers, and some of the financial implications of this. The next blog in this series will cover some of the issues with authorship.

Published 11 July 2016
Written by Dr Danny Kingsley
Creative Commons License

Show me the money – the path to a sustainable Research Data Facility

Like many institutions in the UK, Cambridge University has responded to research funders’ requirements for data management and sharing with a concerted effort to support our research community in good data management and sharing practice through our Research Data Facility. We have written on this blog a few times, and given presentations, to describe our services. This blog is a description of the process we have undertaken to support these services in the long term.

Funders expect researchers to make the data underpinning their research available and to provide a link to this data in the paper itself. The EPSRC started checking compliance with their data sharing requirement on 1 May 2015. When we first created the Research Data Facility we spoke to many researchers across the institution and two things became very clear. One was that there was considerable confusion about what actually counts as data, and the second was that sharing data on publication is not something that can be easily done as an afterthought if the data was not properly managed in the first place.

We have approached these issues separately. To try and determine what is actually required from funders beyond the written policies we have invited representatives from our funders to come to discussions and forums with our researchers to work out the details. So far we have hosted Ben Ryan from the EPSRC, Michael Ball from the BBSRC and most recently David Carr and Jamie Enoch from the Wellcome Trust and CRUK respectively.

Dealing with the need for awareness of research data management has been more complex. To raise awareness of good practice in data management and sharing we embarked on an intense advocacy programme and in the past 15 months have organised 71 information sessions about data sharing (speaking with over 1,700 researchers). But we also needed to ensure the research community was managing its data from the beginning of the research process. To assist this we have developed workshops on various aspects of data management (hosting 32 workshops in the past year), a comprehensive website, a service to support researchers with their development of their research data management plans and a data management consultancy service.

So far, so good. We have had a huge response to our work, and while we encourage researchers to use the data repository that best suits their material, we do offer our institutional repository, Apollo, as an option. As of today, we are hosting 499 datasets in the repository. The message is clearly getting through.

Sustainability

The word sustainability (particularly in the scholarly communication world) is code for ‘money’. And money has become quite a sticking point in the area of data management. The way Cambridge started the Research Data Facility was by employing a single person, Dr Marta Teperek, for one year, supported by the remnants of the RCUK Transition Fund. It quickly became obvious that we needed more staff to manage the workload, and now the Facility employs half an Events and Outreach Coordinator and half a Repository Manager plus a Research Data Adviser who looks after the bulk of the uploading of data sets into the repository.

Clearly there was a need to work out the longer term support for staffing the Facility – a service for which there are no signs of demand slowing. Early last year we started scouting around for options.  In April 2013 the RCUK released some guidance that said it was permissible to recover costs from grants through direct charges or overheads – but noted institutions could not charge twice. This guidance also mentioned that it was permissible for institutions to recover costs of RDM Facilities as other Small Research Facilities, “provided that such facilities are transparently charged to all projects that use them”.

Transparency

On the basis of that advice we established a Research Data Facility as a Small Research Facility according to the Transparent Approach to Costing (TRAC) methodology. Our proposal was that the Facility’s costs would be recovered from grants as directly allocated costs. We chose this option rather than overheads because of the advantage of transparency to the funder of our activities. Charging grants this way meant a bigger advocacy and education role for the Facility. But the advantage is that it would make researchers aware that they need to consider research data management seriously, that this involves both time and money, and that it is an integral part of a grant proposal.

Dr Danny Kingsley has argued before (for example in a paper ‘Paying for publication: issues and challenges for research support services‘) that by centralising payments for article processing charges, the researchers remain ignorant of the true economics of the open access system in the way that they are generally unaware of the amounts spent on subscriptions. If we charged the costs of the Facility into overheads, it becomes yet another hidden cost and another service that ‘magically’ happens behind the scenes from the researcher’s point of view.

In terms of the actual numbers, direct costs of the Research Data Facility included salaries for 3.2 FTEs (a Research Data Facility Manager, Research Data Adviser, 0.5 Outreach and Engagement Coordinator, 0.5 Repository Manager, 0.2 Senior Management time), hardware and hardware maintenance costs, software licences, costs of organising events as well as the costs of staff training and conference attendance. The total direct annual cost of our Facility was less than £200,000. These are the people costs of the Facility and are not to be confused with the repository costs (for which we do charge directly).

Determining how much to charge

Throughout this process we have explored many options for trying to assess a way of graduating the costing in relation to what support might be required. Ideally, we would want to ensure that the Facility costs can be accurately measured based on what the applicant indicated in their data management plan. However, not all funders require data management plans. Additionally, while data management plans provide some indication of the quantity of data (storage) to be generated, they do not allow a direct estimate of the amount of data management assistance required during the lifetime of the grant. Because we could not assess the level of support required for a particular research project from a data management plan, we looked at an alternative charging strategy.

We investigated charging according to the number of people on a team, given that the training component of the Facility is measurable by attendance at workshops. However, after investigation we were unable to easily extract that type of information about grants, and this also created a problem for charging for collaborative grants. We then looked at charging a small flat charge on every grant requiring the assistance of the Facility and at charging proportionally to the size (percentage of value) of the grant. Since we did not have any compelling evidence that bigger grants require more Facility assistance, we proposed a model of flat charging on all grants that require Facility assistance. This model was also the most cost-effective from an administrative point of view.
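The difference between the two charging models can be sketched in a few lines of Python. This is an illustration only: the function names, the number of grants and the grant values are invented for the example, and the £200,000 figure is simply the upper bound on annual direct costs quoted earlier, not the amount used in the Business Case.

```python
# Illustrative sketch only. All inputs other than the 200,000 upper bound
# are hypothetical and do not come from the Business Case.

def flat_charge(annual_cost, n_grants):
    """Same charge on every grant that requires Facility assistance."""
    return annual_cost / n_grants


def proportional_charges(annual_cost, grant_values):
    """Charge each grant in proportion to its share of total grant value."""
    total_value = sum(grant_values)
    return [annual_cost * value / total_value for value in grant_values]


if __name__ == "__main__":
    annual_cost = 200_000  # upper bound quoted in the post (GBP)
    hypothetical_grants = [100_000, 300_000, 600_000]  # invented values (GBP)

    per_grant = flat_charge(annual_cost, len(hypothetical_grants))
    by_value = proportional_charges(annual_cost, hypothetical_grants)

    print(f"Flat charge per grant: {per_grant:,.2f}")
    for value, charge in zip(hypothetical_grants, by_value):
        print(f"Grant of {value:,}: proportional charge {charge:,.2f}")
```

Both models recover the same total. They differ only in how that total is spread across grants, which is why the lack of evidence that bigger grants need more assistance pointed us towards the flat model.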

As an indicator of the amount of work involved in the development of the Business Case, and the level of work and input that we have received relating to it, the document is now up to version 18 – each version representing a recalculation of the costings.

Collaborative process

A proposal such as we were suggesting – that we charge the costs of the Facility as a direct charge against grants – is reasonably radical. It was important that we ensure the charges would be seen as fair and reasonable by the research community and the funders. To that end we have spent the best part of a year in conversation with both communities.

Within the University we had useful feedback from the Open Access Project Board (OAPB) when we first discussed the option in July last year. We are also grateful to the members of our community who subsequently met with us in one-on-one meetings to discuss the merits of the Facility and the options for supporting it. At the November 2015 OAPB meeting, we presented a mature Business Case. We have also had to clear the Business Case through meetings of the Resource Management Committee (RMC).

Clearly we needed to ensure that our funders were prepared to support our proposal. Once we were in a position to share a Business Case with the funders we started a series of meetings and conversations with them.

The Wellcome Trust was immediate in its response – they would not allow direct charging to grants as they consider this to be an overhead cost, which they do not pay. We met with Cancer Research UK (CRUK) in January 2016 and there was a positive response about our transparent approach to costing and the comprehensiveness of services that the Facility provides to researchers at Cambridge. These issues are now being discussed with senior management at CRUK, and discussions are still ongoing at the time of writing this report (May 2016). [Update 24 May: CRUK agreed to consider research data management costs as direct costs on grant applications on a case by case basis, if justified appropriately in the context of the proposed research].

We encourage open dialogue with the RCUK funders about data management. In May 2015 we invited Ben Ryan to come to the University to talk about the EPSRC expectations on data management and how Cambridge meets these requirements. In August 2015 Michael Ball from the BBSRC came to talk to our community. We had an indication from the RCUK that our proposal was reasonable in principle. Once we were in a position to show our Business Case to the RCUK we invited Mark Thorley to discuss the issue and he has been in discussion with the individual councils for their input to give us a final answer.

Administrative issues

Timing in a decision like this is challenging because of the large number of systems within the institution that would be affected if a change were to occur. In anticipation of a positive response we started the process of ensuring our management and financial systems were prepared and able to manage the costing into grants – to ensure that if a green light were given we would be prepared.  To that end we have held many discussions with the Research Office on the practicalities of building the costing into our systems to make sure the charge is easy to add in our grant costing tool. We also had numerous discussions on how to embed these procedures in their workflows for validating whether the Facility services are needed and what to do if researchers forget to add them. The development has now been done.

A second consideration is the necessity to ensure all of the administrative staff involved in managing research grants (at Cambridge this is a group of over 100 people) are aware of the change and how to manage both the change to the grant management system and the questions from their research community. Simultaneously we were also involved in numerous discussions with our invaluable TRAC team at the Finance Division at the University, who helped us validate all the Facility costs (to ensure that none of the costs are charged twice) and establish cost centres and workflows for recovering money from grants.

Meanwhile we have had to keep our Facility staff on temporary contracts until we are in a position to advertise the roles. There is a huge opportunity cost in training people up in this area.

Conclusion

As it happened, the RCUK has come back to us to say that we can charge this cost to grants but as an overhead rather than direct cost. Having this decision means we can advertise the positions and secure our staffing situation. But we won’t be needing the administrative amendments to the system, nor the advocacy programme.

It has been a long process given we began preparing the Business Case in March 2015. The consultation throughout the University and the engagement of our community (both research and funder) has given us an opportunity to discuss the issues of research data management more widely. It is a shame – from our perspective – that we will not be able to be transparent about the costs of managing data effectively.

The funders and the University are all working towards a shared goal – we want a culture change towards more open research, including the sharing of research data. To achieve this we need a more aware and engaged research community on these matters. There is much advocacy to do.

Published 8 May 2016
Written by Dr Danny Kingsley and Dr Marta Teperek

Watch this space – the first OSI workshop

It was always an ambitious project – trying to gather 250 high level delegates from all aspects of the scholarly communication process with the goal of better communication and idea sharing between sectors of the ecosystem. The first meeting of the Open Scholarship Initiative (OSI) happened in Fairfax, Virginia last week. Kudos to the National Science Communication Institute for managing the astonishing logistics of an exercise like this – and basically pulling it off.

This was billed as a ‘meeting between global, high-level stakeholders in research’ with a goal to ‘lay the groundwork for creating a global collaborative framework to manage the future of scholarly publishing and everything these practices impact’. The OSI is being supported by UNESCO who have committed to the full 10 year life of the project. As things currently stand, the plan is to repeat the meeting annually for a decade.

Structure of the event

The process began in July last year with emailed invitations from Glenn Hampson, the project director. For those who accepted the invitation, a series of emails from Glenn started with tutorials attached to try and ensure the delegates were prepared and up to speed. The emails gathered momentum with online discussions between participants. Indeed much was made of the (many) hundreds of emails the event had generated.

The overall areas the Open Scholarship Initiative hopes to cover include research funding policies, interdisciplinary collaboration efforts, library budgets, tenure evaluation criteria, global institutional repository efforts, open access plans, peer review practices, postdoc workload, public policy formulation, global research access and participation, information visibility, and others. Before arriving delegates had chosen their workgroup topic from the following list:

  • Embargos
  • Evolving open solutions (1)
  • Evolving open solutions (2)
  • Information overload & underload
  • Open impacts
  • Peer review
  • Usage dimensions of open
  • What is publishing? (1)
  • What is publishing? (2)
  • Impact factors
  • Moral dimensions of open
  • Participation in the current system
  • Repositories & preservation
  • What is open?
  • Who decides?

The 190+ delegates from 180+ institutions, 11 countries and 15 stakeholder groups gathered together at George Mason University (GMU), and after preliminary introductions and welcomes the work began immediately with everyone splitting into their workgroups. We spent the first day and a half working through our topics and preparing a short presentation for feedback on the second afternoon. There was then another working session to finalise the presentations before the live-streamed final presentations on the Friday morning. These presentations are all available in Figshare (thanks to Micah Vandegrift).

The event is trying to address some heady and complex questions and it was clear from the first set of presentations that in some instances it had been difficult to come to a consensus, let alone a plan for action. My group had the relative luxury of a topic that is fairly well defined – embargoes. It might be useful for the next event to focus on specific topics and move from the esoteric to the practical.

In addition the meeting had a team of ‘at large’ people who floated between groups to try and identify themes. Unsurprisingly, the ‘Primacy of Promotion and Tenure’ was a recurring theme throughout many of the presentations. It has been clear for some time that until we can achieve some reform of the promotion and tenure process, many of the ideas and innovations in scholarly communication won’t take hold. I would suggest that the different aspects of the reward/incentive system would be a rich vein to mine at OSI2017.

Closed versus open

In terms of outcomes there was some disquiet beforehand, by people who were not attending, about the workshop effectively being ‘closed’. This was because there was a Chatham House Rule for the workgroups to allow people to speak freely about their own experiences.

There was also some disquiet among those people who were attending about a request that the workgroups remain device-free. This was to try and discourage people checking emails and not participating. However people revert to type – in our group we all used our devices to collaborate on our documents. In the end we didn’t have much of a choice: the incredibly high tech room we were using in the modern GMU library flummoxed us and we were unable to get the projector to work.

That all said, there is every intention to disseminate the findings of the workshops widely and openly. During the feedback and presentations sessions there was considerable Twitter discussion at #OSI2016 – there is a downloadable list of all tweets in Figshare – note there were enough to make the conference trend on Twitter at one point. This networked graphic shows the interrelationships across Twitter (thanks to Micah and his colleague). In addition there will be a report published by George Mason University Press incorporating the summary reports from each of the groups.

Team Embargo

Our workgroup, like all of them, represented a wide mix of interest groups. We were:

  • Ann Riley, President, Association of College and Research Libraries
  • Audrey McCulloch, Chief Executive, Association of Learned and Professional Society Publishers
  • Danny Kingsley, Head of Scholarly Communication, Cambridge University
  • Eric Massant, Senior Director of Government and Industry Affairs, RELX Group
  • Gail McMillan, Director of Scholarly Communication, Virginia Tech
  • Glenorchy Campbell, Managing Director, British Medical Journal North America
  • Gregg Gordon, President, Social Science Research Network
  • Keith Webster, Dean of Libraries, Carnegie Mellon University
  • Laura Helmuth, incoming president, National Association of Science Writers
  • Tony Peatfield, Director of Corporate Affairs, Medical Research Council, Research Councils UK
  • Will Schweitzer, Director of Product Development, AAAS/Science

It might be worth noting here that our workgroup was naughty and did not agree beforehand on who would facilitate, so therefore no-one had attended the facilitation pre-workshop webinar. This meant our group was gloriously facilitator and post-it note free – we just got on with it.

Banishing ghosts

We began with some definitions about what embargoes are, noting that press embargoes, publication embargoes and what we called ‘security’ embargoes (like classified documents) all serve different purposes.

Embargoes are not ‘all bad’. In the instance of press embargoes they allow journalists early access to the publication in order for them to be able to investigate and write/present informed pieces in the media. This benefits society because it allows for stronger press coverage. In terms of security embargoes they protect information that is not meant to be in the public domain. However embargoes on Author’s Accepted Manuscripts in repositories are more contentious, with qualified acceptance that these are a transitional mechanism in a shift to full open access.

The causal link of green open access resulting in subscription loss is not yet proven. The September 2013 UK Business, Innovation and Skills Committee Fifth Report: Open Access stated “There is no available evidence base to indicate that short or even zero embargoes cause cancellation of subscriptions”. In 2012 the Committee for Economic Development Digital Connections Council in The Future of Taxpayer-Funded Research: Who Will Control Access to the Results? concluded that “No persuasive evidence exists that greater public access as provided by the NIH policy has substantially harmed subscription-supported STM publishers over the last four years or threatens the sustainability of their journals”.

However there is no argument that traffic on websites for journals that rely on advertising dollars (such as medical journals) suffers when the attention is pulled to another place. This can clearly affect advertising revenue, which in turn can impact on the financial model of those publications.

During our discussions about the differences between press embargoes and publication embargoes I mentioned some recent experiences in Cambridge. The HEFCE Open Access Policy requires us to collect Author’s Accepted Manuscripts at the time of acceptance and make the metadata about them available, ideally before publication. We respect publishers’ embargoes and keep the document itself locked down until these have passed post-publication. However we have been managing calls from sometimes distressed members of our research community who are worried that making the metadata available prior to publication will result in the paper being ‘pulled’ by the journal. Whether this has ever actually happened I do not know – and indeed would be happy to hear from anyone who has a concrete example so we can start managing reality instead of rumour. The problem in these instances is the researchers are confusing the press embargo with the publication embargo.

And that is what this whole embargo discussion comes down to. Much of the discourse and many of the arguments about embargoes are not evidence based. There is precious little evidence to support the tenet that sits behind embargoes – which is that if publishers allow researchers to make copies of their work available open access then they will lose subscriptions. However, the lack of evidence does not rule out the possibility that it is true – and that is why we need to settle the situation once and for all. If there is a sustainability issue for journals because of wider green open access then we need to put some longer term management in place and work towards full open access.

It is possible the problem is not repositories, institutional or subject-based. Many authors are making the final version of their published work available in contravention of their Copyright Transfer Agreement in ResearchGate or Academia.edu. It might be that this availability of work is having an impact on researchers’ usage of work on the publishers’ sites. Given that in institutional repositories repository managers make huge efforts to comply with complicated embargoes it is quite possible that repositories are not the problem. Indeed, only a small proportion of work is made available through repositories according to the August 2015 Monitoring the Transition to Open Access report (look at ‘Figure 9. Location of online postings (including illicit postings)’ on page 38). If this is the case, requiring institutions to embargo the Author’s Accepted Manuscripts they hold in their repositories for long periods will not make any difference. They are not the solution.

Our conclusion from our preliminary discussions was that there needs to be some concrete, rigorous research into the rationale behind embargoes to inform publishers, researchers and funders.

Our proposal – research questions

In response to this the Embargo workgroup decided that the most effective solution was to collaborate on an agreed research process that will have the buy-in of all stakeholders. The overarching question that we want to try and answer is ‘What are the impacts of embargoes on scholarly communication?’ with the goal of creating an evidence base for informed discussion on embargoes.

In order to answer that question we have broken the big issue into a series of smaller questions:

  • How are embargoes determined?
  • How do researchers/students find research articles?
  • Who needs access?
  • Impact of embargoes on researchers/students?
  • Effect of embargoes on other stakeholders?

We decided that if the research found there was a case for publication embargoes then agreement on the metrics that should be used to determine the length of an embargo would be helpful. We are hoping that this research will allow standards to be introduced in the area of embargoes.

Discoverability and the issue of searching behaviour are extremely relevant in this space. Our hypothesis is that if people are following publishers’ journal pages to find material, then the fact that some of the same information is dispersed amongst lots of repositories means that the publisher argument that green open access without embargoes threatens their finances is weakened. However if people are primarily using centralised search engines such as Google Scholar (which favours open versions of articles over paid ones) then that strengthens the publisher argument that they need embargoes to protect revenue.

The other question is whether access really is an issue for researchers. The March 2015 STM Report looked at the research in this area, which indicates that well over 90% of researchers surveyed in separate studies said research papers were easy or fairly easy to access. On the face of it, this appears to suggest little problem in the way of access (look for the ‘Researchers’ access to journals’ section starting p83). Rather than repeating these surveys, indicators for how much embargoes restrict access to researchers could include:

  • The usage of Request a Copy buttons in repositories
  • The number of ‘turn-aways’ from publishers’ platforms
  • The take-up level of Pay Per View options on publisher sites
  • The level of usage of ‘Get it Now’ – where the library obtains a copy through interlibrary loan or document delivery and absorbs the cost.

Our proposal – Research structure

The project will begin with a Literature Review and an investigation into the feasibility of running some Case Studies.

Two clear Case Studies could provide direct evidence if the publishers were willing to share what they have learned. In both cases, there has been a move from an embargo period for green OA to removing embargoes completely. In the first instance, Taylor & Francis began a trial in 2011 to allow immediate green OA for their library and information science journals, meaning that authors published in 35 library and information science journals have the right to deposit their Accepted Manuscript into their institutional repository and make it immediately available. Authors who choose to publish in these journals are no longer asked to assign copyright. They now sign a license to publish, which allows Taylor & Francis to publish the Version of Record. Additionally, authors can choose to make their work green open access with no embargoes applied. In 2014 the pilot was extended for ‘at least a further year’.

As part of the pilot, Taylor & Francis say a survey was conducted by Routledge to canvass opinions on the Library & Information Science Author Rights initiative and to investigate author and researcher behaviour and views on author rights policies, embargoes and posting work to repositories. The survey elicited over 500 responses, including: “Having the option to upload their work to a repository directly after publication is very important to these authors: more than 2/3 of respondents rated the ability to upload their work to repositories at 8, 9, or 10 out of 10, with the vast majority saying they feel strongly that authors should have this right”. There are no links to this survey that I have been able to uncover. It would be useful to include this survey in the Literature Review and possibly build on it for other stakeholders.

The second Case Study is Sage, which in 2013 decided to move to an immediate green policy. Both examples would have enough data by now to indicate if these decisions have resulted in subscription cancellations. I have proposed this type of study before, to no avail. Hopefully we might now have more traction.

The Literature Review and Case Studies will then inform the development of a Survey of different stakeholders – which may have to be slightly altered depending on the audience being surveyed.  This is an ambitious goal – because the intention is to have at least preliminary findings available for discussion at the next OSI in 2017.

There was some lively Twitter discussion in the room about our proposal to do the study. Some were saying that the issue is resolved. I would argue that anyone who is negotiating the embargo landscape at the moment (such as repository managers) would strongly disagree with that position. Others referred to research already done in this space, for example the Publishing and the Ecology of European Research (PEER) project. This study does discuss embargoes but approached the question from the position that embargoes are valid. The study we are proposing is asking specifically if there is any evidence base for embargoes.

Next steps

We will be preparing a project brief and our report for the OSI publication over the next couple of weeks.

The biggest issue for the project will be for us to gather funding. We have done a preliminary assessment of the time required to do the work so we could work out a ballpark figure for the fundraising goal. Note that our estimation of the number of workdays required for the project was deemed ‘ludicrously low’ by a consultant in discussion later.

It was noted by a funder in casual discussions that because publishers have a vested interest in embargoes they should fund research that investigates their validity. Indeed Elsevier have already offered to assist financially for which we are grateful, but for this work to be considered robust and for it to be widely accepted it will need to be funded from a variety of sources. To that end we intend to ‘crowd fund’ the research in batches of $5000. The number of those batches will depend on the level of our underestimation of the time required to undertake the work (!).

In terms of governance, Team Embargo (perhaps we might need a better name…) will be working together as the steering committee to develop the brief, organise funding and choose the research team to do the work. We will need to engage an independent researcher or research group to ensure impartiality.

Wrap up summary of the workshop

There were a few issues relating to the organisation of the workshop. Much was made of the many hundreds of emails that were sent both from the organising group and also amongst the delegates beforehand. This level of preliminary discussion was beneficial but using another tool might help. It was noted that the volume of email was potentially the reason why some of the delegates who were invited did not attend.

There was a logistical issue in having 190+ delegates staying in a hotel situated in the middle of a set of highways that was a 30 minute bus ride away from the conference location at George Mason University (also situated in an isolated location). The solution was a series of buses to ferry us each way each day, and to and from the airport. We ate breakfast, lunch and dinner together at the workshop location. This, combined with the lack of alcohol because we were at an undergraduate American campus (where the legal drinking age is 21), gave the experience something of a school camp feel. Coming from another planned capital city (Canberra, Australia) I am sure that Washington is a beautiful and interesting place. This was not the visit to find that out.

These minor gripes aside, as is often the case, the opportunity to meet people face to face was fantastic. Because there was a heavy American flavour to the attendees, I have now met in person many of the people I ‘know’ well through virtual exchanges. It was also a very good process to work directly with a group of experienced and knowledgeable people who all contributed to a tangible outcome.

OSI is an ambitious project, with plans for annual meetings over the next decade. It will be interesting to see if we really can achieve change.

Published 24 April 2016
Written by Dr Danny Kingsley
Creative Commons License

Consider yourself disrupted – notes from RLUK2016

The 2016 Research Libraries UK conference was held at the British Library from 9-11 March on the theme of disruptive innovation. This blog pulls out some of my personal highlights from the conference:

  • If librarians are to be considered important – we as a community need to have a strong grasp of scholarly communication issues
  • We need to know the facts about our subscriptions to, usage of and contributions to scholarly publishing
  • We need high level support in institutions to back libraries in advocacy and negotiation with publishers
  • Scientists are rarely rewarded for being right, so the scientific record is being distorted by the scientific ecosystem
  • Society needs more open research to ensure reproducibility and robust research
  • The library of the future will have to be exponentially more customisable than the current offering
  • The information seeking behaviour of researchers is iterative and messy and does not match library search services
  • Libraries need to ‘create change to triumph’ – to be inventors rather than imitators
  • Management of open access issues needs to be shared across institutions, with positive outcomes when research offices and libraries collaborate.

I should note this is not a comprehensive overview of the conference, and I have blogged separately about my own contribution ‘The value of embracing unknown unknowns’. Some talks were looking at the broader picture, others specifically at library practice.

Stand your ground – tips for successful publisher negotiations

The opening keynote presentation was by Professor Gerard Meijer, President of Radboud University who conducted the recent Dutch negotiations with Elsevier.

The Dutch position has been articulated by Sander Dekker, the State Secretary of Education, who said that while the way forward was gold Open Access, the government would not provide any extra money. Meijer noted this was sensible because every extra cent going into the system goes into the pocket of publishers – something that has been amply demonstrated in the UK.

All universities in the Netherlands are in the top 200 universities in the world. This means all research is good quality – so even if it is only 2% of the world output, the Netherlands has some clout.

Meijer gave some salient advice about these types of negotiations. He said this work needs to be undertaken at the highest level at the universities. There are several reasons for this. He noted that 1.5 to 2 percent of university budget goes to subscriptions – and this is growing as budgets are being cut – so senior leadership in institutions should take an active position.

In addition, if you are not willing to completely opt out of licensing their material then you can’t negotiate, and if you are going to opt out you will need the support of the researchers. To that end communication is crucial – during their negotiations, they would send a regular newsletter to researchers letting them know how things were going.

Meijer also stressed the importance of knowing the facts, and the need to communicate and inform the researchers about these facts and the numbers. He noted that most researchers don’t know how much subscriptions cost. They do know however about article processing charges – creating a misconception that Open Access is more expensive.

Institutions in the Netherlands spent €9.2 million on Elsevier publications in 2009, which rose to €11 million* in 2014. Meijer noted that he was ‘not allowed’ to tell us this information due to confidentiality clauses. He drolly observed “It will be an interesting court case to be sued for telling the taxpayers how their money is being spent”. He also noted that because Elsevier is a public company their finances are available, and while their revenue goes up, their costs stay the same.

Apparently Wiley and Springer are willing to go into agreements. However Elsevier are saying that a global business model doesn’t match with a local business requirement. The Netherlands has not yet signed the contract with Elsevier as they are working out the detail.

Broadly the deal is for three years, from 2016 to 2018. The plan is to grow Open Access output from nothing to 10% in 2016, 20% in 2017 and 30% in 2018, and to do that without having to pay APCs. To achieve this they have to identify the journals that will be made Open Access, by defining domains in which all journals will be made open access.

Meijer concluded this was a big struggle – he would have liked to have seen more – but what we have is good for science. Dutch research will be open in fields where most Open Access is happening and researchers are paying APCs. Researchers can look at the long list of journals that are OA and then publish there.

*CORRECTION: Apologies for my mistyping. Thanks to @WvSchaik for pointing out this error on Twitter. The slide is captured in this tweet.

The future of the research library

Nancy Fried Foster from Ithaka S+R and Kornelia Tancheva from Cornell University Library spoke about research practices and the disruption of the research library. They started by noting that researchers work differently now, using different tools. The objective of their ‘A day in the life of a serious researcher’ work was exploring research practices to inform the vision of the library of the future and identify improvements we could make now.

They developed a very fine-grained method that focuses on what people really do in the workplace. This used a participatory design approach. Participants (who were mainly postgraduates) were asked to map or log their movements in one single day where at least some of their time was engaged in research. The team then sat with the person the following day to ask them to narrate their day – and talk about seeking, finding and using information. There was no distinction between academic and non-academic activity.

The team looked at the things that people were doing and the things that the library could and will be. The analysis took a lot of time, organising into several big categories:

  • Seeking information
  • Academic activities
  • Library resources
  • Space and self management
  • Circum-academic activities – activities allied to the researcher’s academic line but not central.

They also coded for ‘obstacles’ and ‘brainwork’.

The participants described their information seeking as fluid and constant – ‘you can just assume I am kind of checking my email all the time’. They also distinguished between search and research. One quote was ‘I know the library science is very systematic and organised and human behaviour is not like that’.

Information seeking is an iterative process; it is constant and not systematic. The search process is highly idiosyncratic – the subjects have developed ways of searching for information that work for them. It doesn’t matter if it is efficient or not. They are self-conscious that it is messy. ‘I feel like the librarians must be like “this is the worst thing I have ever heard”’.

Information evaluation is multi-tiered – eg: ‘If an article is talking about people I have heard of it is worth reading’. Researchers often use a mash up of systems that will work for that project. For example email is used as an information management tool.

Connectivity is important to researchers, it means you can work anywhere and switch rapidly between tasks. It has a big impact on collaboration – working with others was continuously mentioned in the context of writing. However sometimes researchers need to eliminate technology to focus.

Libraries have traditionally focused too much on search and not enough on brain work – this is a potential role for libraries. References to the library occurred throughout the process. Libraries are often thought of as a place of refuge – especially for the much needed brain work. There is also a need for self management: researchers need to be able to manage their time and prioritise the demands on their attention. Strategies depended on a complicated relationship with technology.

The major themes emerging from the work are that search is idiosyncratic and not important, research has no closure, experts rule and research is collaboration. The implication is that the future library is a hub, not just focusing on a discovery system but connecting people with knowledge and technologies.

If we were building a library from scratch today what would it look like? There will need to be a huge amount of customisation to adjust tools to suit researchers’ personal preferences. The library of the future will have to be exponentially more customisable than the current offering. Libraries will have to make available their resources on customisable platforms. We need to shift from non-interoperable tools to customisation.

So if the future were here today we would think of the future library as an academic hub (improving current library services) and an application store. We should take on even more of a social media aspect. Think of a virtual ‘app store’ – on an open source platform that provides the option for people to suggest short cuts – and employ developers to develop these modules quickly. Libraries should take a leadership role in ensuring vendor platforms can be integrated. All library resources will speak easily to the systems our users are using. We need to provide individualised services rather than one size fits all.

Scientific Ecosystems and Research Reproducibility

The scientific reward structure determines the behaviour of researchers and this has spawned the reproducibility crisis, according to Marcus Munafo from the University of Bristol.

Marcus started by talking about the p-value, where the conventional threshold for statistical significance is p < 0.05 (the 95% level) – that is, there is less than a five in 100 chance of seeing a result at least this extreme if there were no real effect. Generally, studies need to cross this threshold to get published, so there is evidence to show that original studies often suggest a large effect – however when replication is attempted, these effects often cannot be reproduced.
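
The selection effect described above can be illustrated with a small simulation. This is a minimal sketch and not something presented in the talk: the true effect size, the sample size, the number of studies and the function name are all assumed values chosen for illustration. It simulates many small two-group studies of a modest real effect, then compares the average estimate across all studies with the average across only those that cross the p < 0.05 threshold.

```python
import random
import statistics
from math import sqrt


def simulate_studies(true_effect=0.2, n_per_group=20, n_studies=5000, seed=1):
    """Simulate two-group studies (outcomes with SD 1) of a small real effect.

    Returns (all_estimates, significant_estimates). All parameter values
    are illustrative assumptions, not figures from the talk.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when each group has SD 1.
    se = sqrt(2 / n_per_group)
    all_estimates, significant = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(diff)
        # Two-sided z-test at the 0.05 level: only these get "published".
        if abs(diff / se) > 1.96:
            significant.append(diff)
    return all_estimates, significant


all_estimates, significant = simulate_studies()
mean_all = statistics.fmean(all_estimates)
mean_published = statistics.fmean(significant)

print(f"Studies reaching significance: {len(significant)} of {len(all_estimates)}")
print(f"Mean effect, all studies:        {mean_all:.2f}")
print(f"Mean effect, significant only:   {mean_published:.2f}")
```

Under these assumed settings only a minority of the simulated studies reach significance, and those that do report an average effect well above the true value. That is one reason a faithful replication tends to find a smaller effect than the original paper.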

Scientists are supposed to be impartial observers, but in reality they need to get grants, and publish papers to get promoted to more ‘glamorous institutions’ (Marcus’ words). Scientists are rarely rewarded for being right, so the scientific record is being distorted by the scientific ecosystem.

Marcus noted it is common to overstate your data or error check your data if your first analysis doesn’t tell you what you are looking for. This ‘flexible analysis’ is quite commonplace, if we look at literature as a whole. Often there is not enough detail in the paper to allow the reproducibility of the work. There are nearly as many unique analysis pipelines as there were studies in the sample – so this flexibility in the joint analysis tool gets leveraged to get the result you want.

There is also evidence that journal impact factors are a very poor indicator of quality; indeed they are a stronger indicator of retraction than of quality. The idea is that science as a whole will self correct. But science won’t sort itself out in a reasonable timeframe. If you look at the literature you see that replication is the exception rather than the norm.

One study showed among 83 articles recommending effective interventions, 40 had not been replicated, and of those that had been replicated many showed the works had stronger findings in the first paper than in the replication, and some were contradicted in the replication.

Your personal investment in the field shapes your position – unconscious biases affect all of us. If you come in as an early career scientist you get an impression that the field is more robust than it is in reality. There is hidden literature that is not citable – only by looking at this do you get a balanced sense of how robust the literature is. There are many studies that make a claim in the abstract that is not supported by a more impartial reading. Others are ‘optimistic’ in the abstract. The articles that describe bad news receive far fewer citations than would be expected. People don’t want to cite bad news. So is science self correcting?

We can introduce measures to help science self correct. In 2000 the requirement to register the outcome of clinical trials began. Once they had to pre-specify what the outcome would be then most of the findings were null. That is why it is a scientific ecosystem – the way we are incentivised has become distorted over the years.

Researchers are incentivised to produce a small number of papers that are eye catching. It is understandable why you would want to focus on quality over quantity. We can give more weight to confirmatory studies and try to move away from the focus on publishing certain types of studies. We shouldn’t be putting all our effort into high risk, high return.

What do we do about this? There can be top down measures, but individual groups can work in ways to improve the ways we work, such as adopting the open science way of working. This is not trivial – for example we can’t make data available without the consent of participants. Possible solutions include pre-registering all the plans, setting up studies so the data can be made open, and ensuring publications are gold OA. These measures serve as a quality control method because everything gets checked when people know it is going to be made available. We come down hard on academics who make conscious mistakes – but we should be encouraging people to identify their own errors.

We need to introduce quality control methods implicitly into our daily practice. Open data is a very good step in that direction. There is evidence that researchers who know their data is going to be made open are more thorough in their checking of it. Maybe it is time for an update in the way we do science – we have statistical software that can run hundreds of analyses, and we can do text and data mining of lots of papers. We need to build in new processes and systems that refine science and think about new ways of rewarding science.

Marcus noted that these are not new problems, quoting from Reflections on the Decline of Science in England written by Babbage in 1830.

Marcus referred to many different studies and articles in his talk, some of which I have linked out to here:

Creating change to triumph: A view from Australia

The idea of creating change to triumph was the message of Jill Benn, the Librarian at the University of Western Australia. She discussed Cambietics, the science of managing change. This was a theory developed in 1985 by Barrett, with three stages:

  • Coping with change to survive
  • Capitalising on change
  • Creating change to triumph.

This last is the true challenge – to be an inventor rather than an imitator. Jill gave the Australian context. The country is 32 times bigger than the UK, but has a third of the population, with 40 universities around the country. She noted that one of the reasons libraries in Australia have collaborated is the isolation.

Research from Australia counts for 4% of the world’s research output, it is the third largest export after energy, and out-performs tourism. The political landscape really affects higher education. There has been a series of five prime ministers in five years.

Australia has invested heavily in research infrastructure – mostly telescopes and boats. The Australian National Data Service was created and this has built the Research Data Australia interface – an amazing system full of data. The libraries have worked with researchers to populate the repository. There has been a large amount of capacity building. ANDS worked with libraries to build the capacities – the ‘23 things’ training programme. You self register – on 1 March, 840 people had signed up for the programme.

The most recent element of the government’s agenda has been innovation. Prime Minister Turnbull has said he wanted to end the ‘publish or perish’ culture of research to increase the impact on community. There is a national innovation and science agenda and the government would no longer take into account publications for research. It is likely the next ERA (Australia’s equivalent of the REF) will involve impact in the community. The latest call is “innovation is the new black”.

There is financial pressure on the University sector – which pays in US dollars which is a problem. The emphasis on efficiency means the libraries have to show value and impact to the research sector.

Many well-developed services exist in university libraries to support research. Australian institutional repositories now have over 650K full text items, which are downloaded over 1 million times annually. There are data librarians and scholarly communication librarians. One of the ways in which libraries have been asked to deliver capacity is through CAUL and its Research Advisory Committee – to engage in the government’s agenda. There are three pillars – capacity building, engagement and advocacy, to promote the work of our libraries to bodies like Universities Australia.

Jill also mentioned the Australasian Open Access Strategy Group which has had a green rather than a gold approach. Australians are interested in open access. It is not yet clear what the role of institutional repositories will be into the future, in an environment where the government wants us to share our research.

How can we benchmark the Australian context? It is difficult. We can look at our associations and at what data we might be able to share. Jill quoted Ross Wilkinson – yes there are individuals, but because of the collective way Australia has managed data we are better able to engage internationally. Despite the investment into repositories in Australia – the UK outperforms Australia.

Australian libraries see themselves as genuine partners for research and we have a healthy self confidence (!). Libraries must demonstrate value and impact and provide leadership. Australian libraries have created change to triumph.

Open access mega-journals and the future of scholarly communication

This talk was given by Professor Stephen Pinfield from Sheffield University. He talked about the Open Access Mega Journal project he is working on with potentially disruptive open access journals (the Twitter handle is @oamj_project).

He began where it all began – with PLOS ONE, which is now the biggest journal in the world. Stephen noted that mega journals are full of controversy, listing comments ranging from them being the future of academic publishing, a disruptive innovation to the best possible future system.

However critics see them variously as a dumping ground, career suicide for early career researchers publishing in them and a cynical money making venture. Pinfield noted, though, that acknowledging what ‘people say’ is different from being able to provide attributed negative statements about mega-journals, which he could not find despite considerable searching.

The open access and wide scope nature of mega-journals reverses the trend over the past few years where journals have been further specialising. They are identifiable by their approach to quality control, with an emphasis on scientific soundness only rather than subjective assessments of novelty, and also by their post publication metrics.

Pinfield noted that there are economies of scale for mega journals – this means that we have a single set of processes and technologies. This enables a tiered scholarly publishing system. Mega-journals potentially allow highly selective journals to go open access (who often argue that they reject so much they couldn’t afford to go OA). Pinfield hypothesised that a business model could be where a layer of highly selective titles sits above a layer of moderately selective mega journals. The moderately selective journals provide the financial subsidy but the highly selective ones provide the reputational subsidy. PLOS is a good example of this symbiotic relationship.

The emphasis on ‘soundness’ in the quality control process reduces the subjectivity of judgements of novelty and importance and potentially shifts the role and the power of the gatekeepers. Traditionally the editors and editorial board members have been the arbiters of what is novel.

However this opens up some questions. If it is only a ‘soundness’ judgement then the question is whether power is shifted for good or ill? Also does the idea of ‘soundness’ translate to the Humanities? There is also the problem of an overreliance on metrics. Are the citation values of journals driven by the credibility or the visibility of the journals?

Pinfield emphasised the need for librarians to be informed and credible about their understanding of these topics. If librarians are to be considered important – we as a community need to have a strong grasp of these issues. There is an ongoing need to keep up to date and remain credible.

Working together to encourage researcher engagement and support

There were several talks about how institutions have been engaging researchers, and many of them emphasised the need to federate the workload across the institution. Chris Awre from the University of Hull discussed some work he was doing with Valerie McCutcheon, surveying the current interaction between the library and other parts of the institution in supporting OA, to understand how OA is and could be embedded.

The survey revealed a desire for the management of Open Access to be more spread across the institution into the future. Libraries should be more involved in the management of the research information system and managing the REF. However Library involvement in getting Open Access into grant applications is lower – this is a research role, but it is worth asking how much this underpins subsequent activity.

As an aside Chris noted a way of demonstrating the value of something is to call it an ‘office’ – this is something the Americans do. (Indeed it is something Cambridge has done with the Office of Scholarly Communication).

Chris noted that if researchers don’t think about open access as part of the scholarly communications workflow then they won’t do it. Libraries play a key role in advocating and managing OA – so how can they work with other institutional stakeholders in supporting research?

Valerie later spoke about blurring and blending the borders between the Library and the Research Office. She noted that when she was working for Research and Enterprise (RSEO) she thought library people were nice, but she was not sure what the people do there. When she transferred to working in the Library, the perception back the other way was the same.

But the Research Office and the Library need to cooperate on shared strategic priorities. They are both looking out for changes in the policy landscape, so they need to share information and collaborate on policy development and dissemination. They need better data quality in the research process to find solutions to create agile systems to support researchers.

At Glasgow the Library & RSEO were a good match because they had similar end uses and the same data. So this began a close collaboration between the two offices, which worked together on the REF and used Enlighten. They also linked their systems (Enlighten and Research Systems) in 2010 so that users can browse in the repository by the funder name. Glasgow has had a publications policy rather than an open access policy since 2008.

Valerie also noted that it was crucial to have high-level support and showed a video of Glasgow’s PVC-R singing the praises of the work the Library was doing.

The Glasgow Open Access model has been ‘Act on acceptance’ since 2013 – a simple message with minimal bureaucracy. It is a centralised service with ‘no fancy meetings’. Valerie also noted that when they put events on they don’t say it is a Library event, and the sessions are subject based not department based.

Torsten Reimer and Ruth Harrison discussed the support offered at Imperial College, where Torsten said he was originally employed for developing the College’s OA mandate but then the RCUK and the HEFCE policy came into place and changed everything. At Imperial, scholarly communications is seen as an overall concern for the College rather than specifically a Library issue.

Torsten noted the Library already had a good relationship with the departments. The Research Office is seen by researchers as a distraction from their research, but the Library is seen as helping research. However because the two areas have been able to approach everything with one single aim, this has allowed open access and scholarly support to happen across the institution and allowed the library to expand.

Imperial have one workflow and one system for open access which is all managed through Symplectic (there had been separate systems before). They have a simple workflow and form to fill in, then have a ticketing type customer workflow system plugged into Symplectic to pull information out at the back end. This system has replaced four workflows, lots of spreadsheets and much cut and pasting.

Sally Rumsey talked about how Oxford have successfully managed to engage their research community with their recently launched ‘Act on Acceptance’ communication programme.

Summary

This is a rundown of a few of the presentations that spoke to me. There were also excellent speed presentations and a talk from Lord David Willetts, the former Minister for Universities and Science. We also split up into workshops, and there was a panel of library organisations from around the world who discussed working together.

The personal outcomes from the conference include:

  • An invitation to give a talk at Cornell University
  • An invitation to collaborate with some people at CILIP about ensuring scholarly communication is included in some of the training offered
  • Discussion about forming some kind of learned society for Scholarly Communication
  • Discussion about setting up a couple of webinars – ‘how to start up an office of scholarly communication’ and ‘successful library training programmes’
  • Also lots of ideas about what to do next – the issue of language and the challenges we are facing in scholarly communication because of language deserves some investigation.

I look forward to next year.

Published 14 March 2016
Written by Dr Danny Kingsley
Creative Commons License

 

The value of embracing unknown unknowns

This blog accompanies a talk Danny Kingsley gave to the RLUK Conference held at the British Library on 9-11 March 2016. The slides are available and the Twitter hashtag from the event was #rluk16

The talk centred around a debate piece written with my long standing collaborator, Dr Mary Anne Kennan, published in August 2015: Open Access: The Whipping Boy for Problems in Scholarly Publishing. This original 10,000 word article was the starting point for a debate where five people provided rebuttals to our position and we were then given the opportunity to write a rejoinder to these. All the articles were published together.

I have included a précis of the article below as Annex 1, but that is not what the talk was about – what I wanted to discuss was the unexpected progression of the piece and what that revealed to us as authors working in Scholarly Communication.

After we submitted the original piece we sent through several suggestions (including names and contact details) to the Editor for people who might want to contribute. These primarily included practitioners in the Open Access space:

  • Funders
  • Library staff
  • Research managers
  • Editors
  • Publishers
  • Policy makers

There was considerable difficulty in locating people who were prepared to contribute. We are still unsure why this was the case – it may have been a time issue, the fact that this was an academic publication and we were asking administrative professionals, or that it was potentially politically sensitive. On the Editor’s suggestion we sent some personal requests to contacts to ask them to participate. However, in the end four of the five people who wrote rebuttals were researchers in the Information Systems field.

This process made the whole production very protracted. There was a two-year period between the first approach from the journal and publication. The production process from the start of the writing period was 18 months – the actual dates are listed as Annex 2 below.

Same old, same old – the responses

Reading the rebuttals from the four Information Systems researchers, two things become obvious. First, none of them actually addressed the posits we had presented in our original debate piece – which, after all, was the point of the exercise.

Second, a theme began to emerge, demonstrated by these snippets:

  • “Before discussing that in detail we need to know what the current situation is regarding OA publishing in IS”
  • “We now discuss four fundamental points regarding scholarly communication. We begin by asking what constitutes the main building blocks of the scholarly communication system”
  • “Before examining the current state of scholarly publication, let us set some parameters for this discussion”
  • “I think the argument would benefit from more systematically analyzing the current system of scholarly publishing…”

In each case the authors chose to undertake their own analysis of scholarly publishing – sometimes apparently unaware that this is a long established area of research.

So what does this tell us?

Lesson 1 – ‘Engagement’ is not working

One thing that was striking about this process was that each contributor came to their own conclusion that Open Access is something we should aim towards. While this is a ‘good thing’ for Open Access advocacy, it is not scalable. If we wait for every researcher to come to their own personal epiphany about Open Access we will never have high levels of uptake.

There has been a long standing belief and practice in Open Access that if the research community were only more aware of the issues in scholarly publishing then they would come on board with Open Access. I am entirely guilty of this myself. However after a decade of trying, it is fairly safe to say that engagement has not worked.

One conclusion to take away from this experience is we must enable the academic community to disseminate their work openly. It must happen around them.

Lesson 2 – The research area of scholarly communication is not well recognised

The concept of an academic discipline is fairly slippery, but it is reasonably safe to say that two things define a discipline – the scholarly literature and language.

Academic ‘communities’ manifest in the form of journals or learned societies. But Scholarly Communication research is traditionally either discussed in a disciplinary specific way in a disciplinary journal (such as part of an editorial), or published in journals in the sociology of science, communication, librarianship or the information sciences disciplines.

There are two journals that do specifically look at Scholarly Communication – the Journal of Librarianship and Scholarly Communication and Scholarly and Research Communication. I should note that Publications also looks at many issues in this area.

There are now Offices of Scholarly Communication in universities, especially in the US & increasingly in the UK – the Office of Scholarly Communication at Cambridge being a classic example. However there are no Faculties or Departments or Professorial Chairs of Scholarly Communication in existence – that I can find. I am happy to hear about them if they do exist.

And yet people do undertake research in this area. They publish articles, peer review each other’s work, present at conferences. This is academic work.

It might well be a problem of language. Michael Billig’s book ‘Learn to Write Badly: How to Succeed in the Social Sciences’ makes the argument that creating a language that is impenetrable to others is a way of boundary stamping a discipline.

But in the area of Scholarly Communication, many of the words are vernacular – with common meanings that might be different to their specific meaning in the context of the research. A classic example is ‘publish’ which simply means ‘make public’, but within the academic context means that there has been a process of review and revision, branding and attribution. Words like ‘repository’ and ‘mandate’ have caused me some professional grief.

And we are having some trouble with terminology in the Open Access space with publishers. For example the conflation of ‘deposit’ with ‘make available’ – Wiley instructs authors that they cannot deposit until after the embargo. This is wrong. Authors can deposit whenever they like, as long as they don’t make it available until after the embargo. Green Open Access – which means making a copy of the work freely available – has been rather bizarrely interpreted by Elsevier in their Open Access pages as providing a link to the (subscription) article.

The reason there can be such a high level of inaccuracy around language is that it is not ‘officially’ defined anywhere. I should note that the Consortia Advancing Standards in Research Administration Information (CASRAI) may be doing some work in this area.

Problem 1 – Practice versus study

We concluded in our rebuttal that the practice of scholarly communication (as distinct from the study of it) is shared among all academic fields, librarians, publishers, and administrators. Each of these brings their own levels of understanding, perspectives, and involvement in the scholarly communication system.

This can create a problem because practitioners often think they have a good understanding of the issues surrounding the publication process. But according to a 2012 article in the Journal of Librarianship and Scholarly Communication researchers are generally held to have a low awareness of publishing issues and open access opportunities and are confused over copyright issues.

This is a case of the ‘Unknown Unknowns’ – a term popularised (to much ridicule) by Donald Rumsfeld in 2002.

Regardless of where individuals sit, however, in all instances there needs to be a base level of competence in this area. Yes I know, I have just said we should not try and engage academics to convert them to Open Access. However what we should be doing is ensuring they have at least a basic understanding of this area for their own professional wellbeing.

One of the conclusions of my 2008 PhD ‘The effect of scholarly communication practices on engagement with open access: An Australian study of three disciplines’ – where I undertook in-depth interviews with 43 researchers about their publication and communication practices – was that the Master/Apprentice system is broken (see pp177 – 188). We are not equipping our researchers with the information they need to navigate the publication process successfully. This need for education was echoed in a 2014 paper about open access journal quality indicators (itself published in the Journal of Librarianship and Scholarly Communication – notice a pattern?).

Problem 2 – The library community also needs to know

But this is not just an issue for the research community. Librarians in the academic space also need to know about these issues. Last year the Association of College and Research Libraries (ACRL) released their (excellent) Scholarly Communication Toolkit. The introductory pages note that the “ACRL sees a need to vigorously re-orient all facets of library services and operations to the evolving technologies and models that are affecting the scholarly communication process.” The reason, they say, is that for academic libraries to continue to succeed we need to integrate our work into all aspects of the full cycle of scholarly communication.

The toolkit also notes that there is ‘wide variance’ in the levels of understanding of these issues within our community. If we consider the ‘four stages of competence’ as a rough tool:

  1. Unconsciously unskilled – we don’t know that we don’t have this skill, or that we need to learn it.
  2. Consciously unskilled – we know that we don’t have this skill.
  3. Consciously skilled – we know that we have this skill.
  4. Unconsciously skilled – we don’t know that we have this skill (it just seems easy).

It would be an ideal situation to have our academic library community sitting at stages three and four. In reality many are at stage two and even at stage one.

But bringing everyone up to speed is a huge challenge. Our experiences in Australia have demonstrated it is extremely difficult to get issues related to scholarly communication into curricula for library training. Many of the skills in this area are learnt ‘on the job’.

There are almost no courses on repository management, as demonstrated in this 2012 study published in the (here it is again) Journal of Librarianship and Scholarly Communication. There is a now slightly out-of-date list of courses in scholarly communication here. Professor Stephen Pinfield did point out after my talk that he is incorporating open access into his library courses. Discussions about open access are also included at Charles Sturt University in subjects where it is related, such as Foundations for Information Studies, Collections and Research Data Management, but there has been difficulty in securing a subject explicitly on Open Access or even more broadly on scholarly communication.

Even professional training is limited – CILIP offers ‘Institutional repositories and metadata’ and ‘Digital copyright’ but nothing on publishing or open access. One of the positive outcomes of the conference has been an offer to discuss some of these needs with CILIP.

Solution?

So what is the solution? We must shift from managing the academic literature to participating in the generation of it. Librarians can begin by engaging with the academic literature in their area. Suggestions include:

  • Reading research that is being published (in your area of librarianship)
  • Writing an academic article
  • Presenting work at conferences
  • Offering your services as a peer reviewer
  • Serving on an editorial board
  • Collaborating with your academic community on a project and writing about it

When I suggested this at the conference there was some push-back from the audience, defending the benefits of learning on the job. Afterwards, I was approached by a participant who said she had recently published a paper and found the process incredibly instructive. Interestingly, the same thing happened when a speaker urged colleagues to publish an academic paper at LIBER last year. There was again push-back from the audience until one participant said he seconded her statement. He said he thought he knew all about journals because he worked with them but when he published something he realised ‘I didn’t really know anything about it’.

We might have some way to go.

Annex 1 – The original debate piece

In the original debate piece we provided a background to OA’s development and current state – we did not go into great detail because we were limited by the 10,000 word count and we had made some assumptions about prior knowledge.

The piece examined some of the accusations leveled against OA and described why they were false and indeed indicative of a wider set of problems with scholarly communication:

  • that OA publishers are predatory,
  • that OA is too expensive,
  • that self-depositing papers in OA repositories will bring about the end of scholarly publishing.

We then proposed discussions we considered we should be having about scholarly publishing to take advantage of social and technological innovations and move it into the 21st century. These were the monograph issue, management of APCs, improving institutional repositories, needing to make scholarly publishing inclusive and the reward system.

Annex 2 – The times involved in publication

Here are the dates involved in getting the full debate piece to ‘print’:

  • First approach from the journal – September 2013
  • Agreed to write the piece and first discussion – 10 February 2014
  • Submitted the first argument – 26 May 2014
  • Submitted amendment based on editor’s comments – 29 May 2014
  • Rebuttals sent to us – 18 November 2014
  • Deadline for rejoinder – 19 December 2014
  • Rejoinder sent (!) – 16 February 2015
  • “Publication is with the production editor and will be out ‘anytime’” email – 6 May 2015
  • Copy editor’s questions sent to us – 4 June 2015
  • Corrected pieces (original & rejoinder) sent to editors – 26 June 2015
  • Date of acceptance – 4 July 2015
  • Date of publication – 17 August 2015
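The gaps between these milestones can be worked out directly from the dates listed. Below is a minimal Python sketch using only the standard library; `MILESTONES` and `days_between` are names introduced here purely for illustration, and the first approach (September 2013) is left out because no day is given for it.

```python
from datetime import date

# Milestones copied from the list above. The first approach (September 2013)
# is omitted because the list gives no day for it.
MILESTONES = [
    ("Agreed to write the piece", date(2014, 2, 10)),
    ("Submitted the first argument", date(2014, 5, 26)),
    ("Submitted amendment", date(2014, 5, 29)),
    ("Rebuttals sent to us", date(2014, 11, 18)),
    ("Deadline for rejoinder", date(2014, 12, 19)),
    ("Rejoinder sent", date(2015, 2, 16)),
    ("'Anytime' email", date(2015, 5, 6)),
    ("Copy editor's questions", date(2015, 6, 4)),
    ("Corrected pieces sent", date(2015, 6, 26)),
    ("Date of acceptance", date(2015, 7, 4)),
    ("Date of publication", date(2015, 8, 17)),
]


def days_between(start_label: str, end_label: str) -> int:
    """Return the number of days from one named milestone to another."""
    lookup = dict(MILESTONES)
    return (lookup[end_label] - lookup[start_label]).days


if __name__ == "__main__":
    # Print each milestone with the days elapsed since the previous one.
    previous = None
    for label, when in MILESTONES:
        gap = "" if previous is None else f" (+{(when - previous).days} days)"
        print(f"{when.isoformat()}  {label}{gap}")
        previous = when
    print(
        "First submission to publication:",
        days_between("Submitted the first argument", "Date of publication"),
        "days",
    )
```

On these dates, the time from submitting the first argument to publication comes to 448 days, and the rejoinder went in 59 days after its deadline.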

Published 11 March 2016
Written by Dr Danny Kingsley
Creative Commons License

Forget compliance. Consider the bigger RDM picture

The Office of Scholarly Communication sent Dr Marta Teperek, our Research Data Facility Manager, to the International Digital Curation Conference held in Amsterdam on 22-25 February 2016. This is her report from the event.

Fantastic! This was my first IDCC meeting and already I can’t wait for next year. There was not only amazing content in high quality workshops and conference papers, but also a great opportunity to network with data professionals from across the globe. And it was so refreshing to set aside our UK problem of compliance with data sharing policies, to instead really focus on the bigger picture: why it is so important to manage and share research data and how to do it best.

Three useful workshops

The first day started really intensely – the plan was for one full day or two half-day workshops, but I managed to squeeze in three workshops in one day.

Context is key when it comes to data sharing

The morning workshop was entitled “A Context-driven Approach to Data Curation for Reuse” by Ixchel Faniel (OCLC), Elizabeth Yakel (University of Michigan), Kathleen Fear (University of Rochester) and Eric Kansa (Open Context). We were split into small groups and asked to decide what was the most important information about datasets from the re-user’s point of view. Would the re-user care about the objects themselves? Would s/he want to get hints about how to use the data?

We all had difficulties in arranging the necessary information in order of usefulness. Subsequently, we were asked to re-order the information according to its importance from the point of view of repository managers. And the take-home message was that for all of the groups the information about datasets required by the re-user was not the same as that required by the repository.

In addition, the presenters provided discipline-specific context based on interviews with researchers – depending on the research discipline, different information about datasets was considered the most important. For example, for zoologists, the information about specimens was very important, but it was of negligible importance to social scientists. So context is crucial for the collection of appropriate metadata information. Without sufficient contextual information, data is not useful.

So what can institutional repositories do to address these issues? If research carried out within a given institution only covers certain disciplines, then institutional repositories could relatively easily contextualise metadata information being collected and presented for discovery. However, repositories hosting research from many different disciplines will find this much more difficult to address. For example, the Cambridge repository has to host research spanning particle physics, engineering, economics, archaeology, zoology, clinical medicine and many, many others. This makes it much more difficult (if not impossible) to contextualise the metadata.

It is not surprising that the information most important from the repository’s point of view is different from the information most important to data re-users. In order to ensure that research data can be effectively shared and preserved in the long term, repositories need to collect a certain amount of administrative metadata: who deposited the data, what the file formats are, what the data access conditions are, etc. However, repositories should collect as much administrative metadata as possible in an automated way. For example, if the user logs in to deposit data, all the relevant information about the user should be automatically harvested by feeds from human resources systems.
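To make the idea of automated harvesting a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `HR_FEED` is a hypothetical stand-in for a feed from a human resources system, and `AdminMetadata` and `build_admin_metadata` are invented names, not part of any real repository platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical stand-in for a feed from a human resources system,
# keyed by the identifier the depositor logs in with.
HR_FEED = {
    "ab123": {"name": "A. Researcher", "department": "Zoology"},
}


@dataclass
class AdminMetadata:
    """Administrative metadata the depositor never has to type in."""
    depositor_id: str
    depositor_name: str
    department: str
    deposit_date: date
    access_conditions: str


def build_admin_metadata(
    user_id: str,
    access_conditions: str,
    hr_feed: dict = HR_FEED,
    today: Optional[date] = None,
) -> AdminMetadata:
    """Fill in depositor details from the HR feed instead of a web form."""
    person = hr_feed[user_id]  # raises KeyError if the user is unknown
    return AdminMetadata(
        depositor_id=user_id,
        depositor_name=person["name"],
        department=person["department"],
        deposit_date=today or date.today(),
        access_conditions=access_conditions,
    )


if __name__ == "__main__":
    print(build_admin_metadata("ab123", "open"))
```

In this sketch the depositor supplies only what the system cannot know (here, the access conditions); everything else is looked up from the login.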

EUDAT – Pan-European infrastructure for research data

The next workshop was about EUDAT – the collaborative Pan-European infrastructure providing research data services, training and consultancy for researchers. EUDAT is an impressive project funded by a Horizon 2020 grant and it offers five different types of services to researchers:

  • B2DROP – a secure and trusted data exchange service to keep research data synchronized, up-to-date and easy to exchange with other researchers;
  • B2SHARE – service for storing and sharing small-scale research data from diverse contexts;
  • B2SAFE – service to safely store research data by replicating it and depositing at multiple trusted repositories (additional data backups);
  • B2STAGE – service to transfer datasets between EUDAT storage resources and high-performance computing (HPC) workspaces;
  • B2FIND – discovery service harvesting metadata from research data collections from EUDAT data centres and other repositories.

The project has a wide range of services on offer and is currently looking for institutions to pilot these services with. I personally think these are services which (if successfully implemented) would be of great value to the Pan-European research community.

However, I have two reservations about the project:

  • Researchers are being encouraged to use EUDAT’s platforms to collaborate on their research projects and to share their research data. However, the funding for the project runs out in 2018. The EUDAT team is now investigating options to ensure the sustainability and future funding for the project, but what will happen to researchers’ data if the funding is not secured?
  • Perhaps if the funding is limited it would be more useful to focus the offering on the most useful services, which are not provided elsewhere. For example, another EC-funded project, Zenodo, already offers a user-friendly repository for research data; Open Science Framework offers a platform for collaboration and easy exchange of research data. Perhaps EUDAT could initially focus on developing services which are not provided elsewhere. For example, a Pan-European service harvesting metadata from various data repositories and enabling data discovery is clearly much needed and would be extremely useful to have.

Jisc Shared RDM Services for UK institutions

I then attended the second half of the Jisc workshop on shared Research Data Management services for UK institutions. The University of York and the University of Cambridge are two of the 13 institutions participating in the pilot. Jenny Mitcham from York and I gave presentations on our institutional perspectives on the pilot project: where we are at the moment and what our key expectations from the pilot are. Jenny gave an overview of the impressive work by her and her colleagues on addressing data preservation gaps at the University of York. Data preservation is one of the areas in which Cambridge hopes to get help from the Jisc RDM shared services project. Additionally, as we described before, Cambridge would greatly benefit from solutions for big data and for personal/sensitive data. My presentation from the session is available here.

Presentations were followed by breakout group discussions. Participants were asked to identify the priority areas for the Jisc RDM pilot. The top priority identified by all the groups seemed to be solutions for personal/sensitive data and for effective data access management. This was very interesting to me as at similar workshops held by Jisc in the UK, breakout groups prioritised interoperability with their existing institutional systems and cost-effectiveness. This could be one of the unforeseen effects of strict funders’ research data policies in the UK, which required institutions to provide local repositories to share research data.

As a result of these policies, many institutions were tasked with creating institutional data repositories from scratch in a very short time. Most of the UK universities now have institutional repositories which allow research data to be uploaded and shared. However, very few universities have their repositories well integrated with other institutional systems. Not having the policy pressure in non-UK countries perhaps allowed institutions to think more strategically about developing their RDM service provisions and ensure that developed services are well embedded within the existing institutional infrastructure.

Conference papers and posters

The two following days were full of excellent talks. My main problem was which sessions to attend: talking with other attendees I am aware that the papers presented at parallel sessions were also extremely useful. If the budget allows, I certainly think that it would be useful for more participants from each institution to attend the meeting to cover more parallel sessions.

Below are my main reflections from keynote talks.

Barend Mons – Open Science as a Social Machine

This was a truly inspirational talk, raising a lot of thought-provoking discussions. Barend started from a reflection that more and more brilliant brains, with more and more powerful computers and with billions of smartphones, have created a single, interconnected social super-machine. This machine generates data – vast amounts of data – which is difficult to comprehend and work with, unless proper tools are used.

Barend mentioned that with the current speed of new knowledge being generated and papers being published, it is simply impossible for human brains to assimilate the constantly expanding amount of new knowledge. Brilliant brains need powerful computers to process the growing amount of information. But in order for science to be accessible to computers, we need to move away from PDFs. Our research needs to be machine-readable. And perhaps if publishers do not want to support machine-readability, we need to move away from the current publishing model.

Barend also stressed that if data is to be useful and correctly interpretable, it needs to be accessible not only to machines, but also to humans, and that effort is needed to make data well described. Barend said that research data without proper metadata description is useless (if not harmful). And how to make research data meaningful? Barend proposed a very compelling solution: no more research grants should be awarded without 5% of money dedicated for data stewardship.

I could not agree more with everything that Barend said. I hope that research funders will also support Barend’s statement.

Andrew Sallans – nudging people to improve their RDM practice

Andrew started his talk with a reflection that in order to improve our researchers’ RDM practice we need to do better than talking about compliance and about making data open. How is a researcher supposed to make data accessible if the data was not properly managed in the first place? The Open Science Framework has been created with three mission statements:

  • Technology to enable change;
  • Training to enact change;
  • Incentives to embrace change.

So what is the Open Science Framework (OSF)? It is an open source platform to support researchers during the entire research lifecycle: from the start of the project, through data creation, editing and sharing with collaborators and concluding with data publication. What I find the most compelling about the OSF is that it allows one to easily connect various storage platforms and places where researchers collaborate on their data in one place: researchers can easily plug in their resources stored on Dropbox, Google Drive, GitHub and many others.

To incentivise behavioural change among researchers, the OSF team came up with two other initiatives:

Personally, I couldn’t agree more with Andrew that enabling good data management practice should be the starting point. We can’t expect researchers to share their research data if we have not helped them with providing tools and support for good data management. However, I am not so sure about the idea of cash rewards.

In the end researchers become researchers because they want to share the outcomes of their research with the community. This is the principle behind academic research – the only way of moving ideas forward is to exchange findings with colleagues. Do researchers need to be paid extra to do the right thing? I personally do not think so and I believe that whoever decides to pursue an academic career is prepared to share. And it is our task to make data management and sharing as easy as possible, and the use of OSF will certainly be a great aid to the community.

Susan Halford – the challenge of big data and social research

The last keynote was from Susan Halford. Susan’s talk was again very inspirational and thought-provoking. She talked about the growing excitement around big data and how trendy it has become; almost being perceived as a solution to every problem. However, Susan also pointed out the problems with big data. Simply increasing the computational power and not fully comprehending the questions and the methodology used can lead to serious misinterpretations of results. Susan concluded that when doing big data research one has to be extremely careful about choosing proper methodology for data analysis, reflecting on both the type of data being collected, as well as (inter)disciplinary norms.

Again – I could not agree more. Asking the right question and choosing the right methodology are key to drawing the right conclusions. But are these problems new to big data research? I personally think that we are all quite familiar with these challenges. Questions about the right experimental design and the right methodology have been known to humankind for as long as the scientific method has been used.

Researchers have always needed to design studies carefully before commencing the experiments: what will be the methodology, what are the necessary controls, what should be the sample size, what needs to happen for the study to be conclusive? To me this is not a problem of big data; it is a problem that needs to be addressed by every researcher from the very start of the project, regardless of the amount of data the project generates or analyses.

Birds of a Feather discussions

I had not experienced Birds of a Feather Discussions (BoF) before at a conference and I am absolutely amazed by the idea. Before the conference started the attendees were invited to propose ideas for discussions keeping in mind that BoF sessions might have the following scope:

  • Bringing together a niche community of interest;
  • Exploring an idea for a project, a standard, a piece of software, a book, an event or anything similar.

I proposed a session about sharing of personal/sensitive data. Luckily, the topic was selected for a discussion and I co-chaired the discussion together with Fiona Nielsen from Repositive. We both thought that the discussion was great and our blog post from the session is available here.

And again, I was very sorry to be the only attendee from Cambridge at the conference. There were four parallel discussions and since I was chairing one of them, I was unable to take part in the others. I would have liked to be able to participate in discussions on ‘Data visualisation’ and ‘Metadata Schemas’ as well.

Workshops: Appraisal, Quality Assurance and Risk Assessment

The last day was again devoted to workshops. I attended an excellent workshop from the Pericles project on the appraisal, quality assurance and risk assessment in research data management. The project was about how an institutional repository should conduct data audits when accepting data deposits and also how to measure the risks of datasets becoming obsolete.

These are extremely complex questions and therefore very difficult to address. Still, the project leaders realised the importance of addressing them systematically and ideally in a (semi-)automated way by using specialised software to help repository managers make the right preservation decisions.

In a way I felt sorry for the presenters – their project progress and ambitions were so high that probably none of us attendees were able to critically contribute to the project – we were all deeply impressed by the high level of questions asked, but our own experience with data preservation and policy automation was nowhere near the level demonstrated by the workshop leaders.

My take-home message from the workshop is that proper audit of ingested data is of crucial importance. Even if there is no automation of risk assessment possible, repository managers should at least collect information about files being deposited to be able to assess the likelihood of their obsolescence in the future. Or at least to be able to identify key file formats/software types as selected preservation targets to ensure that the key datasets do not become obsolete. For me the workshop was a real highlight of the conference.
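As a rough illustration of what collecting information about deposited files could look like at its simplest, here is a minimal Python sketch that tallies file extensions in a deposit and flags any on a watch-list. It is not the Pericles project's software: `format_inventory`, `at_risk` and `AT_RISK_EXTENSIONS` are invented names, and the watch-list holds a made-up placeholder extension rather than a real assessment of any format.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical watch-list. ".legacy" is a made-up placeholder; a real
# repository would maintain its own list under its preservation policy.
AT_RISK_EXTENSIONS = {".legacy"}


def format_inventory(filenames):
    """Count deposited files by lower-cased extension."""
    return Counter(
        PurePath(name).suffix.lower() or "(none)" for name in filenames
    )


def at_risk(inventory, watch_list=AT_RISK_EXTENSIONS):
    """Return only the formats in the inventory that are on the watch-list."""
    return {ext: count for ext, count in inventory.items() if ext in watch_list}


if __name__ == "__main__":
    deposit = ["results.csv", "raw.CSV", "model.legacy", "README"]
    inventory = format_inventory(deposit)
    print("Formats in deposit:", dict(inventory))
    print("Formats to review:", at_risk(inventory))
```

Even a tally this simple gives a repository manager something to review later, which is the minimum the workshop suggested when full automation is not possible.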

Networking and the positive energy

Lots of useful workshops, plenty of thought-provoking talks. But for me one of the most important parts of the conference was meeting with great colleagues and having fascinating discussions about data management practices. I never thought I could spend an evening (night?) with people who would be willing to talk about research data without the slightest signs of boredom. And the most joyful and refreshing part of the conference was that, because we were from across the globe, our discussions diverted away from the compliance aspect of data policies. Free from policy, we were able to address issues of how to best support research data management: how to best help researchers, what are our priority needs, what data managers should do first with our limited resources.

I am looking forward to catching up next year with all the colleagues I have met in Amsterdam and to see what progress we will have all made with our projects and what should be our collective next moves.

Summarising, I came back with lots of new ideas and full of energy and good attitude – ready to advocate for the bigger picture and the greater good. I came back exhausted, but I cannot imagine spending four days any more productively and fruitfully than at IDCC.

Thanks so much to the organisers and to all the participants!

Published 8 March 2016
Written by Dr Marta Teperek

Creative Commons License