Tag Archives: publishers

Manuscript detectives – submitted, accepted or published?

In the blog post “It’s hard getting a date (of publication)”, Maria Angelaki discussed how a seemingly straightforward task may turn into a complicated and time-consuming affair for our Open Access Team. As it turns out, it isn’t the only one. The process of identifying the version of a manuscript (whether it is the submitted, accepted or published version) can also require observation and deduction skills on par with Sherlock Holmes’.

Unfortunately, it is something we need to do all the time. We need to make sure that the manuscript we’re processing isn’t the submitted version, as only published or accepted versions are deposited in Apollo. And we need to differentiate between published and accepted manuscripts, as many publishers – including the biggest players Elsevier, Taylor & Francis, Springer Nature and Wiley – only allow self-archiving of accepted manuscripts in institutional repositories, unless the published version has been made Open Access with a Creative Commons licence.

So it’s kind of important to get that right… 

Explaining manuscript versions

Manuscripts (of journal articles, conference papers, book chapters, etc.) come in various shapes and sizes throughout the publication lifecycle. At the onset a manuscript is prepared and submitted for publication in a journal. It then normally goes through one or more rounds of peer-review leading to more or less substantial revisions of the original text, until the editor is satisfied with the revised manuscript and formally accepts it for publication. Following this, the accepted manuscript goes through proofreading, formatting, typesetting and copy-editing by the publisher. The final published version (also called the version of record) is the outcome of this. The whole process is illustrated below.

Identifying published versions

So the published version of a manuscript is the version… that is published? Yes and no, as sometimes manuscripts are published online in their accepted version. What we usually mean by published version is the final version of the manuscript which includes the publisher’s copy-editing, typesetting and copyright statement. It also typically shows citation details such as the DOI, volume and page numbers, and downloadable files will almost invariably be in PDF format. Below are two snapshots of published articles, with citation details and copyright information zoomed in. On the left is an article from the journal Applied Linguistics published by Oxford University Press and on the right an article from the journal Cell Discovery published by Springer Nature.


Published versions are usually obvious to the eye and the easiest to recognise. In a way the published version of a manuscript is a bit like love: you may mistake other things for it but when you find it you just know. In order to decide if we can deposit it in our institutional repository, we need to find out whether the final version was made Open Access with a Creative Commons (CC) licence (or in rarer cases with the publisher’s own licence). This isn’t always straightforward, as we will now see.

Published Open Access with a CC licence?

When an article has been published Open Access with a CC licence, a statement usually appears at the bottom of the article on the journal website. However as we want to deposit a PDF file in the repository, we are concerned with the Open Access statement that is within the PDF document itself. Quite a few articles are said to be Open Access/CC BY on their HTML version but not on the PDF. This is problematic as it means we can’t always assume that we can go ahead with the deposit from the webpage – we need to systematically search the PDF for the Open Access statement. We also need to make sure that the CC licence is clearly mentioned, as it’s sometimes omitted even though it was chosen at the time of paying Open Access charges.

The Open Access statement will appear at various places on the file depending on the publisher and journal, though usually either at the very end of the article or in the footer of the first page as in the following examples from Elsevier (left) and Springer Nature (right).


A common practice among the Open Access team is to search the file for various terms including “creative”, “cc”, “open access”, “license”, “common” and quite often a combination of these. But even this isn’t a foolproof method as the search may retrieve no result despite the search terms appearing within the document. The most common publishers tend to put Open Access statements in consistent places, but others might put them in unusual places such as in a footnote in the middle of a paper. That means we may have to scroll through a whole 30- or 40-page document to find them – quite a time-consuming process.
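The keyword search described above can be sketched in a few lines of Python. This is only an illustration, not the team's actual tooling: it assumes the text of the PDF has already been extracted into a string by some other means, and the names `TERMS`, `normalise` and `find_licence_terms` are hypothetical. Rejoining words hyphenated across line breaks and collapsing whitespace before searching addresses one plausible reason why a search returns nothing even though the words are visibly on the page.

```python
import re

# Search terms taken from the paragraph above. "cc" is matched as a whole
# word so that it does not fire on words such as "access" or "accepted".
TERMS = {
    "creative": r"creative",
    "cc": r"\bcc\b",
    "open access": r"open\s+access",
    "license": r"licen[cs]e",  # covers both "license" and "licence"
    "common": r"common",
}


def normalise(text: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text)


def find_licence_terms(text: str) -> dict:
    """Return each matching term with the character offset of its first hit."""
    clean = normalise(text).lower()
    hits = {}
    for label, pattern in TERMS.items():
        match = re.search(pattern, clean)
        if match:
            hits[label] = match.start()
    return hits


# Hypothetical extracted text with a line break and a hyphenated word.
sample = "This is an open\naccess article under the CC BY li-\ncense."
print(sorted(find_licence_terms(sample)))
```

A hit only says where to look. Someone still has to read the statement and confirm that a CC licence is actually named. The hyphen rejoining is also crude, since it would merge a genuinely hyphenated term that happens to break across lines.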

Identifying accepted versions

The accepted manuscript is the version that has gone through peer-review. The content should be the same as the final published version, but it shouldn’t include any copy-editing, typesetting or copyright marking from the publisher. The file can be either a PDF or a Word document. The most easily recognisable accepted versions are files that are essentially just plain text, without any layout features, as shown below. The majority of accepted manuscripts look like this.

However sometimes accepted manuscripts may at first glance appear to be published versions. This is because authors may be required to use publisher templates at the submission stage of their paper. But whilst looking like published versions, accepted manuscripts will not show the journal/publisher logo, citation details or copyright statement (or they might show incomplete details, e.g. a copyright statement such as © 20xx *publisher name*). Compare the published version (left) and accepted manuscript (right) of the same paper below.


As we can see the accepted manuscript is formatted like the published version, but doesn’t show the journal and publisher logo, the page numbers, issue/volume numbers, DOI or the copyright statement.

So when trying to establish whether a given file is the published or accepted version, looking out for the above is a fairly foolproof method.

Identifying submitted versions

This is where things get rather tricky. Because the difference between an accepted and submitted manuscript lies in the actual content of the paper, it is often impossible to tell them apart based on visual clues. There are usually two ways to find out:

  • Getting confirmation from the author
  • Going through a process of finding and comparing the submission date and acceptance date of the paper (if available), mostly relevant in the case of arXiv files

Getting confirmation from the author of the manuscript is obviously the preferable and time-saving option. Unfortunately many researchers mislabel their files when uploading them to the system, describing their accepted/published version file as submitted (the fact that they do so when submitting the paper to us may partly explain this). So rather than relying on file descriptions, it is better to have an actual statement from the author that the file is the submitted version. In an ideal world, though, this would never happen, as everyone would know that only accepted and published versions should be sent to us.

A common incarnation of submitted manuscripts we receive is arXiv files. These are files that have been deposited in arXiv, an online repository of pre-prints that is widely used by scientists, especially mathematicians and physicists. An example is shown below.

Clicking on the arXiv reference on the left-hand side of the document (circled) leads to the arXiv record page as shown below.

The ‘comments’ and ‘submission history’ sections may give clues as to whether the file is the submitted or accepted manuscript. In the above example the comments indicate that the manuscript was accepted for publication by the MNRAS journal (Monthly Notices of the Royal Astronomical Society). So this arXiv file is probably the accepted manuscript.

The submission history lists the date(s) on which the file (and possible subsequent versions of it) was/were deposited in arXiv. By comparing these dates with the formal acceptance date of the manuscript which can be found on the journal website (if published), we can infer whether the arXiv file is the submitted or accepted version. If the manuscript hasn’t been published and there is no way of comparing dates, in the absence of any other information, we assume that the arXiv file is the submitted version.
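The date comparison can be written down as a small rule. The sketch below is an assumption about how the inference might be made explicit, not a description of the team's actual procedure: it treats a file deposited in arXiv on or after the formal acceptance date as probably the accepted manuscript, and falls back to "submitted" when no acceptance date is available, as described above. The function name `infer_arxiv_version` and the dates are hypothetical.

```python
from datetime import date
from typing import Optional


def infer_arxiv_version(deposit_date: date, acceptance_date: Optional[date]) -> str:
    """Guess whether one arXiv deposit is the submitted or accepted version."""
    if acceptance_date is None:
        # Nothing to compare against: use the cautious default from the text.
        return "submitted"
    if deposit_date >= acceptance_date:
        # Deposited on or after formal acceptance: probably the accepted version.
        return "accepted"
    return "submitted"


# Hypothetical dates: two arXiv deposits and an acceptance date from the journal site.
accepted_on = date(2017, 11, 20)
print(infer_arxiv_version(date(2017, 9, 1), accepted_on))
print(infer_arxiv_version(date(2017, 11, 25), accepted_on))
print(infer_arxiv_version(date(2017, 9, 1), None))
```

The result is only a likely answer. An author can deposit an unrevised file after acceptance, so the arXiv comments field and confirmation from the author remain the better evidence when they exist.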


Distinguishing between different manuscript versions is by no means straightforward. The fact that even our experienced Open Access Team may still encounter cases where they are unsure which version they are looking at shows how confusing it can be. The process of comparing dates can be time-consuming itself, as not all publishers show acceptance dates for papers (ring a bell?).

Depositing a published (not OA) version instead of an accepted manuscript may infringe publisher copyright. Depositing a submitted version instead of an accepted manuscript may mean that research that hasn’t been vetted and scrutinised becomes publicly available through our repository and is possibly mistaken for peer-reviewed work. When processing a manuscript we need to be sure about what version we are dealing with, and ideally we shouldn’t need to go out of our way to find out.

Published 27 March 2018
Written by Dr Melodie Garnier
Creative Commons License

Next steps for Text & Data Mining

Sometimes the best way to find a solution is to just get the different stakeholders talking to each other – and this is what happened at a recent Text and Data Mining symposium held in the Engineering Department at Cambridge.

The attendees were primarily postgraduate students and early career researchers, but senior researchers, administrative staff, librarians and publishers were also represented in the audience.


This symposium grew out of a discussion held earlier this year at Cambridge to consider the issue of TDM and what a TDM library service might look like at Cambridge. The general outcome of that meeting of library staff was that people wanted to know more. Librarians at Cambridge have developed a Text and Data Mining libguide to assist.

So this year the OSC has been doing some work around TDM, including running a workshop at Research Libraries UK annual conference in March. This was a discussion about developing a research library position statement on Text and Data Mining in the UK. The slides from that event are available and we published a blog post about the discussion.

We have also had discussions with different groups about this issue, including the Future TDM project, which has been looking to increase the amount of TDM happening across Europe. This project is now finishing up. The impression we have around the sector is that ‘everyone wants to know what everyone else is doing’.

Symposium structure

With this general level of understanding of TDM as our base point, we structured the day to provide as much information as possible to the attendees. The Twitter hashtag for the event is #osctdm, and the presentations from the event are online.

The keynote presentation was by Kiera McNeice, from the FutureTDM Project, who gave an overview of what TDM is, how it can be achieved and what the barriers are. There is a video of her presentation (note there were some audio issues at the beginning of the recording).

The event broke into two parallel sessions after this. The main room was treated to a presentation about Wikimedia from Cambridge’s Wikimedian in Residence, Charles Matthews. Then Alison O’Mara-Eves discussed Managing the ‘information deluge’: How text mining and machine learning are changing systematic review methods. A video of Alison’s presentation is available.

In the breakout room, Dr Ben Outhwaite discussed Marriage, cheese and pirates: Text-mining the Cairo Genizah before Peter Murray-Rust spoke about ContentMine: mining the scientific literature.

After lunch, Rosemary Dickin from PLOS talked about Facilitating Text and Data Mining: how an open access publisher supports TDM. PhD candidate Callum Court presented ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. This presentation was filmed.

In the breakout room, a discussion about how librarians support TDM was led by Yvonne Nobis and Georgina Cronin. In addition there was a presentation from John McNaught, the Deputy Director of the National Centre for Text and Data Mining (NaCTeM), who presented Text mining: The view from NaCTeM.

Round table discussion

The day concluded with the group reconvening together for a roundtable (which was filmed) to discuss the broader issue of why there is not more TDM happening in the UK.

We kicked off by asking each of the people who had presented during the event to describe what they saw as the major barrier for TDM. The answers ranged from the issue of recruiting and training staff to the legal challenges and policies needed at institutional level to support TDM and the failure of institutions and government to show leadership on the issue. We then opened up the floor to the discussion.

A librarian described what happens when a publisher cuts off access, including the process the library has to go through with various areas of the University to reinstate access. (Note this was the reason why the RLUK workshop concluded with the refrain: ‘Don’t cut us off!’). There was some surprise in the group that this process was so convoluted.

However, the suggestion that researchers let the library know that they want to do TDM and the library will organise permissions was rejected by the group, on both the grounds that it is impractical for researchers to do this, and that the effort associated with obtaining permission would take too long.

A representative from Taylor and Francis suggested that researchers contact the publishers directly and let them know. Again this was rejected as ‘totally impractical’ because of the assumption this made about the nature of research. Far from being a linear and planned activity, research is iterative. Requesting access for a period of three months, and then having to go back to extend this permission if the work took an unexpected turn, would be impractical, particularly across multiple publishers.

One attendee in her blog about the event noted: “The naivety of the publisher, concerning research methodology, in this instance was actually quite staggering and one hopes that this publisher standpoint isn’t repeated across the board.”

Some researchers described the threats they had received from publishers about downloading material. There was anger about the inherent message that the researcher had done something criminal.

There was also some concern raised that TDM will drive price increases as publishers see ‘extra value’ to be extracted from their resources. This sparked off a discussion about how people will experiment if anything is made digitally available.

During the hour-long session the conversation moved from high level problems to workflows. How do we actually do this? As is the way with these types of events, it was really only in the last 10 minutes that the real issues emerged. What was clear was something I have repeatedly observed over the past few years – that the players in this space, including librarians, researchers and publishers, have very little idea of how the others work and what their needs are. I have actually heard people say: ‘If only they understood…’

Perhaps it is time we started having more open conversations?

Next steps

Two things have come out of this event. The first is that people have very much asked for some hands on sessions. We will have to look at how we will deliver this, as it is likely to be quite discipline specific.

The second is that there is clearly a very real need for publishers, researchers and librarians to get into a room together to discuss the practicalities of how we move forward in TDM. One of the comments on Twitter was that we need to have legal expertise in the room for this discussion. We will start planning this ‘stakeholder’ event after the summer break.


The items that people identified as the ‘one most important thing’ they learnt were instructive. The answers reflect how unaware people are of the tools and services available, and of how access to information works. Many of the responses listed specific tools or services they had found out about; others commented on the opportunities for TDM.

There were many comments about publishers, both the bad:

  • Just how much impact the chilling effect of being cut off by publishers has on researchers
  • That researchers have received threats from publishers
  • Very interesting about publishers and ways of working with them to ensure not cut off
  • Lots can be done but it is being hindered by publishers

and the good:

  • That PLOS is an open access journal
  • That there are reasonable publishing companies in the UK
  • That journals make available big data for meta analysis

Commentary about the event

There has been some online discussion and blog posts on the event:

Published 17 August 2017
Written by Dr Danny Kingsley 
Creative Commons License

Whose money is it anyway? Managing offset agreements

Sometimes an innocent question can blow up a huge discussion, and this is what happened recently at an RCUK OA Practitioner’s Group meeting when I asked what was appropriate for institutions to do when managing money they receive as refunds from publishers through offsetting arrangements.

When an institution pays for an article processing charge (APC) in a hybrid journal, it is doing so in addition to the existing subscription. This is generally referred to as ‘double dipping’.  I have written extensively about the issues with hybrid in the past, but here, I’d like to discuss the management of offset agreements.

Offset agreements are a compensation by a publisher to an institution for the extra money they are putting into the system through payment of APCs. Most large publishers have some sort of offset agreement for institutions in the UK which are negotiated by Jisc, based on the principles for offset agreements. (There is one significant publisher which is an exception because it insists there is no need for an offset agreement because it does not double dip.)

Offset agreements are not equal

While offset agreements are negotiated nationally, there is no obligation for any institution to sign up to them. Cambridge makes the decision to sign up to an offset agreement or not through a standard calculation. If we are spending RCUK and COAF funds on the offset it must show benefit to the funds first. If the numbers demonstrate that by signing up to (and sometimes investing in) the agreement, the funds will be better off at the end of the year, then we sign. The fact this agreement may have a broader benefit to the wider University is a secondary consideration. The OSC has a publisher and agreements webpage listing the agreements Cambridge is signed up to.
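The sign-up test amounts to a single comparison. The sketch below is one reading of it rather than the actual Cambridge calculation: the function name `funds_better_off`, its parameters and all the figures are hypothetical.

```python
def funds_better_off(spend_without_deal: float,
                     spend_with_deal: float,
                     upfront_investment: float = 0.0) -> bool:
    """True if the RCUK/COAF funds end the year better off by signing.

    spend_without_deal: expected APC spend with this publisher if we do not sign.
    spend_with_deal:    expected APC spend under the agreement's terms.
    upfront_investment: anything paid from the funds to join the agreement.
    """
    return spend_with_deal + upfront_investment < spend_without_deal


# Hypothetical figures for one publisher over one year.
print(funds_better_off(spend_without_deal=100_000,
                       spend_with_deal=70_000,
                       upfront_investment=20_000))  # True: 90,000 < 100,000
```

Any benefit to the wider University is deliberately left out of the comparison, matching the point above that it is a secondary consideration.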

In a fit of spectacular inefficiency, all offsets work slightly differently. Here’s a run down of different types:

  • In some instances the costs are melded into one payment and there are no separate transactions for open access. The Springer Compact is an example of this. At Cambridge we have split the cost of this deal: the base is our subscription spend from the previous year, and the top-up comes from our RCUK and COAF funds, in proportion to the amount we publish with Springer under each of these two funders.
  • Other offsets are internal – where the money does not leave the publisher’s system. The Wiley OA Agreement is this type. By signing up we receive a 25% discount on each APC that is managed through their dashboard. We also receive a 50% discount in a given year based on the number of APCs we bought the previous year. This money is calculated at the beginning of the year and the ‘money’ is put into a ‘fund’ held by Wiley. The APC payments for future articles can be made out of this credit. It is a bit like a betting app – you can’t get the money out without some difficulty, you can only ‘reinvest’ it.
  • There is a different kind of internal offset where the calculation is made up front based on how much you spent the previous year on APCs. These manifest as a discount on each APC paid. Taylor and Francis’ offset works this way which is a bit of a hassle because you still have to process each APC regardless of whether you spend $2000 or $200 on it. But again there is no extra money anywhere in this equation because the discount is applied before the invoice is issued. 
  • A different kind of arrangement relates more to fully open access journals. These include memberships that give a discount on APCs. Sometimes there is a payment associated with this (BMC, for example, where an upfront membership payment earns a 15% discount), and sometimes there is no payment (MDPI – 10% discount for now). Alternatively you can ‘buy’ membership for researchers in exchange for the right to publish for free (PeerJ).
  • The last type of offset is the most straightforward – where the institution gets a cheque back based on the extra spend on APCs over the subscription. Currently IoP is the only publisher with whom Cambridge has this type of agreement.
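For the arrangements that work as a percentage off each APC, the arithmetic is simple. The sketch below uses the per-APC discount rates quoted in the list above (Wiley 25%, BMC 15%, MDPI 10%). The list price is a hypothetical figure, and the names `DISCOUNTS` and `discounted_apc` are illustrative only. It leaves out the Wiley credit fund, any membership fee and the lump-sum arrangements, which do not reduce to a single per-article rate.

```python
# Per-APC discount rates as quoted in the list above.
DISCOUNTS = {"Wiley": 0.25, "BMC": 0.15, "MDPI": 0.10}


def discounted_apc(list_price: float, publisher: str) -> float:
    """Price invoiced for one APC after the publisher's per-article discount."""
    rate = DISCOUNTS.get(publisher, 0.0)  # no agreement means no discount
    return round(list_price * (1 - rate), 2)


# Hypothetical list price of 2000 for a single article.
for name in ("Wiley", "BMC", "MDPI", "Other"):
    print(name, discounted_apc(2000, name))
```

Because the discount is applied before the invoice is issued, no money comes back to the institution under these models. Only the cheque-based arrangement in the last bullet produces a refund that then has to be managed.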

Managing offset refunds

When Cambridge received its first IoP cheque in 2015 there were questions about what we could or could not do with it. The Open Access Project Board discussed the issue and decided that the money needed to remain within the context of open access. Suggestions included paying our Platinum membership of arXiv.org with it, because this would be supporting open access.

The minutes from the meeting on 31 March 2015 noted: “Any funds returned from publishers as part of deals to offset the cost of article processing charges should be retained for the payment of open access costs, but ring-fenced from the block grants and kept available for emergency uses under the supervision of the Project Board.” We have since twice used this money to pay for fully open access journal APCs when our block grant funds were low. 

Whose money is it anyway?

When the issue of offset refunds and what institutions were doing with them was raised at a recent RCUK OA Practitioners Group meeting it became clear that practices vary considerably from institution to institution. One of the points of discussion was whether it would be appropriate to use this money to support subscriptions. The general (strong) sentiment from RCUK was that this would not be within the spirit, and indeed against the principles, of the RCUK policy.

I subsequently sent a request out to a repository discussion list to ask colleagues across the UK what they were doing with this money. To date there have only been a handful of responses.

In one instance with a medium-sized university the IoP money is placed into a small Library fund that is ring-fenced to pay for Open Access in fully Open Access journals only. This fund has the strategic aim to enable a transition to Open Access by supporting new business models and contributing to initiatives such as Knowledge Unlatched, hosting Open Journal Systems, as well as supporting authors to publish in Open Access venues when they have no other source of funding.

A large research institution responded to say they had a specific account set up into which the money was deposited, noting, as did the other respondents, that the financial arrangements of the University would mean that if it were deposited centrally it would never be seen again. This institution noted they were considering using the funds to offset the subscription to IoP in the upcoming year due to a low uptake of the deal.

Another large research institution said the IoP cheques were being ‘saved’ in the subscriptions budget.

Sussex University

In their recent paper “Bringing together the work of subscription and open access specialists: challenges and changes at the University of Sussex” there is a section on how they are managing the offset money. They note: “It seemed a missed opportunity to simply feed it back into the RCUK block grant, but equally inappropriate to use for journal subscriptions or general Library spending”.

The decision was to use the money to support APCs for postgraduate researchers (PGRs) who did not have any other access to money for gold open access, and to spend it only on fully open access journals. They noted that this was a welcome opportunity to be able to offer something tangible and helpful in their advocacy dealings with postgraduate researchers.

Only the start of the conversation

This discussion has raised questions about the decision making process for supporting access to the literature.

Subscriptions are paid for at Cambridge through a fund that is not owned by the Library – the fund consists of contributions from all the Schools plus central funds. Representatives of the Schools, Colleges and library staff sit on the Journal Coordination Scheme committee to decide on subscriptions. However decisions about open access memberships and offsets are made by the Office of Scholarly Communication. Given the increased entanglement of these two routes to access the literature, this situation is one the University is aware needs addressing. The Sussex University paper discusses the processes they went through to merge the two decision making bodies.

This is a rich area for investigation – as we move away from subscription-only spend and into joint decision-making between the subscription team and the Open Access team we need to understand what offsets offer and what they mean for the Library. This discussion is just the beginning.

Published 30 June 2017
Written by Dr Danny Kingsley 
Creative Commons License