It’s hard getting a date (of publication)

October 27, 2017 | Uncategorized | Office of Scholarly Communication

As part of Open Access Week 2017, the Office of Scholarly Communication is publishing a series of blog posts on open access and open research. In this post Maria Angelaki describes how challenging it can be to interpret publication dates featured on some publishers’ websites.

More than three weeks a year. That’s how much time we spend doing nothing but determining the publication date of the articles we process in the Open Access team.

To be clear about what we are talking about here: All we need to know for HEFCE compliance is when the final Version of Record was made available on the publisher’s website. Also, if there is a printed version of the journal, for our own metadata, we need to know the Issue publication date too.

Surely, it can’t be that hard.

Defining publication date

The Policy for open access in the Research Excellence Framework 2021 requires the deposit of authors’ outputs within three months of acceptance. However, the first two years of this policy have allowed deposits as late as three months from the date of publication.

It sounds simple, doesn’t it? But what does “date of publication” mean? According to HEFCE the Date of Publication of a journal article is “the earliest date that the final version-of-record is made available on the publisher’s website. This generally means that the ‘early online’ date, rather than the print publication date, should be taken as the date of publication.”

When we create a record in Apollo, the University of Cambridge’s institutional repository, we input the acceptance date, the online publication date and the publication date.

We define the “online publication date” as the earliest online date the article has appeared on the publisher’s website and “publication date” as the date the article appeared in a print issue. These two dates are important since we rely on them to set the correct embargoes and assess compliance with open access requirements.

The problems can be identified as:

There are publishers that do not clearly display the “online date” and the “paper issue date”. We will see examples further on.
To make things more complicated, some publishers do not always specify which version of the article was published on the “online date”. It can variously mean the author’s accepted manuscript (AAM), corrected proof, or the Version of Record (VoR), and with the latter there are sometimes questions as to whether it includes full citation details.
Lastly, there are cases where the article is first published in a print issue and then published online. Often print publications are only identified as “Spring issue” or the like.

How can we comply with HEFCE’s deposit timeframes if we do not have a full publication date cited on the publisher’s website? Ideally, it would only take a minute or so for anybody depositing articles in an institutional repository to find the “correct” publication date. But these confusing cases mean the minute inevitably becomes several minutes, and when you are uploading 5,000-odd papers a year this turns into 17 whole days.
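As a rough check on that figure, the arithmetic can be run backwards to see how much time per paper 17 days implies. This is only a sketch: the 7.5-hour working day is an assumption rather than a figure from our records, and the function name is purely illustrative.

```python
def minutes_per_paper(days: float, papers: int, hours_per_day: float = 7.5) -> float:
    """Average minutes spent per paper, given the total working days spent.

    hours_per_day is an assumed length for a working day.
    """
    total_minutes = days * hours_per_day * 60
    return total_minutes / papers


# 17 working days spread over roughly 5,000 papers a year
print(f"{minutes_per_paper(17, 5000):.2f} minutes per paper")
```

Under that assumption the average comes out at about a minute and a half per paper; a different working-day length shifts the result proportionally.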

Setting rules for consistency

In the face of all of this ambiguity, we have had to devise a system of ‘rules’ to ensure we are consistent. For example:

If a publication year is given, but no month or day, we assume that it was 1st January.
If a publication year and month are given but no day, we assume that it was the 1st of the month.
If we have an online date of, say, 10th May 2017 and a print issue month of May 2017, we will use the most specific date (10th May 2017) rather than assuming 1st May 2017 (though it is earlier).
Unless the publisher specifies that the online version is the accepted manuscript, we regard it as the final VoR with or without citation details.
If we cannot find a date from any other source, we try to check when the pdf featured on the website was created.
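The first three rules above could be sketched in code roughly as follows. This is an illustration rather than the system we actually run: the `PartialDate` class and the `earliest_publication_date` function are hypothetical names, and falling back to the earlier of the two dates when they do not overlap is an assumption, since the rules only describe the overlapping case.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A publication date as a publisher states it; month and day may be missing."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    @property
    def precision(self) -> int:
        # 1 = year only, 2 = year and month, 3 = full date
        if self.month is None:
            return 1
        return 2 if self.day is None else 3

    def resolve(self) -> date:
        # Rules 1 and 2: a missing month becomes January, a missing day the 1st
        return date(self.year, self.month or 1, self.day or 1)

    def covers(self, other: "PartialDate") -> bool:
        """True if `other` falls inside the year or month this date names."""
        if self.year != other.year:
            return False
        if self.month is None:
            return True
        if self.month != other.month:
            return False
        return self.day is None or self.day == other.day


def earliest_publication_date(
    online: Optional[PartialDate], print_issue: Optional[PartialDate]
) -> Optional[date]:
    candidates = [d for d in (online, print_issue) if d is not None]
    if not candidates:
        return None  # no date found: the PDF creation date has to be checked by hand
    if len(candidates) == 2:
        vaguer, sharper = sorted(candidates, key=lambda d: d.precision)
        # Rule 3: prefer the most specific date when the vaguer one contains it
        if vaguer.precision < sharper.precision and vaguer.covers(sharper):
            return sharper.resolve()
    # Assumption: otherwise take the earlier of the resolved dates
    return min(d.resolve() for d in candidates)


print(earliest_publication_date(PartialDate(2017, 5, 10), PartialDate(2017, 5)))
```

The fourth rule (treating an unlabelled online version as the VoR) and the fifth (checking the pdf) depend on reading the publisher's page, so they are left out of the sketch.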

This last example does start to give a clue to why we have to spend so much time on the date problem.

By way of illustration, we have listed below some examples by publisher of how this affects us. This is a deliberate attempt to name and shame, but if a publisher is missing from this list, it is not because they are clear and straightforward on this topic. We just ran out of space. To be fair though, we have also listed one publisher as an example to show how simple it is to have a clear and transparent article publication history.

Taylor & Francis – ‘published online’

Publication date of an article online

There are several ways you can read an article. If the article is open access or if you subscribe, then you can download a pdf of the article from the publisher website. Otherwise, you see the online version on the website. The two versions of a particular article are below, the pdf and the online HTML version.

Both the pdf and the online version of the article list the article history as:
Received 14 March 2016
Accepted 23 December 2016
Published online 12 January 2017

and also cite the Volume, year of publication and issue.

But does the ‘Published online’ date refer to when the Version of Record was made available online or the first time the Accepted Manuscript was made available online? We can’t distinguish this to provide the date for HEFCE.

Publication date of the printed journal

While we know the volume, year of publication and issue number, we don’t know what the exact publication date of the printed journal is for our metadata records. If we drill down a bit more and visit past volumes of the journal, we can see that the previous complete year (2016) features 12 issues. So we can make an educated guess that the issue number refers to the publication month (in our example it is issue 5, so it is May 2017).

However, we are wrong. The 12 issues refer to the online publication issues and not the print issues. According to Taylor & Francis’ agents customer service page they “have a number of journals where the print publication schedule differs to the online”. They have a list of those journals available and in our case we can see that this particular journal has 12 online issues but 4 paper issues in a year. So when did this actual article appear in print? Who knows.

Implications

Remember the 17 days a year? This is the type of activity that fills the time. Do we really need to do this time-consuming exercise? Some might suggest that we contact the publisher and ask, but that is itself time-consuming and not always successful.

Elsevier’s Articles in Press

Elsevier’s description of Articles in Press states they are “articles that have been accepted for publication in Elsevier journals but have not yet been assigned to specific issues”. They could be any of an Accepted Manuscript, a Corrected Proof or an Uncorrected Proof. Elsevier have a page that answers questions about ‘grey areas’ and in a section discussing whether it is permissible for Elsevier to remove an article for some reason, they state they do not remove articles that have been published but “…papers made available in our “Articles in Press” (AiP) service do not have the same status as a formally published article…”

This means the same article could be an ‘Article in Press’ in three different stages, none of which are ‘published’. Even when an article has moved beyond “In Press” mode and has been published in an issue we are not informed which version Elsevier refers to when the “available online” date is featured.

Let’s look at an example. Is the ‘Available online’ date of 13 December 2016 when it was available online as an Accepted Manuscript, a Corrected Proof or an Uncorrected Proof? This is very unclear.

So we have a disconnect. The earliest online date is not that of the final published version, which is what HEFCE requires. There is no way of determining the date when the final published version does actually appear online, so we need to wait until the article is allocated an issue and volume for us to determine the date. This could be some considerable time AFTER the work has been finalised. So open access is delayed, we risk non-compliance and waste huge amounts of time.

Well done, Wiley

Wiley displays every stage of an article’s publication history, making it easy to distinguish the VoR online publication date, which is exactly what HEFCE (and we) require.

Article published in an issue

This is an example of when an article is published online and the print issue is published too.

Article published online (awaiting a print issue date)

Wiley states the publication history clearly even when an article is published online but not yet included in a publication issue.

If you have a closer look at the screenshot, Wiley regards as “First published” the VoR online publication date (shown also on the left under Publication History) and not the Accepted Manuscript online date.

In this case, the publisher clearly states which version they refer to when the term “First Published” is used and also gives the reader the full history of the article’s “life stages”, as well as informing us that the article is not yet included in an issue (circle on the right).

Conclusions

If you have made it this far through the blog post, you are probably working in this area and have some experience of this issue. If you are new to the topic, hopefully the above examples have illustrated how frustrating it is sometimes to find the correct information in order to comply with not only HEFCE’s timeframe requirements, but other open access compliance issues, especially when you set embargoes.

A simple task can become an expensive exercise because we are wasting valuable working hours. We are in the business of supporting the research community to openly share research outputs, not in the business of deciphering information on publishers’ websites.

We need clear information in order to effectively deposit an article to our institutional repository and meet whatever requirements need to be met. It is not unreasonable to expect consistency and standards in the display of publication history and dates of articles.

Published 27 October 2017
Written by Maria Angelaki

Flipping journals or filling pockets? Publisher manipulation of OA policies

October 26, 2017 | Uncategorized | Office of Scholarly Communication

As part of Open Access Week 2017, the Office of Scholarly Communication is publishing a series of blog posts on open access and open research. In this post Drs André Sartori and Danny Kingsley look at examples of where publishers have structured pricing to take full advantage of funds available through UK open access policies.

We are spending a lot on open access in the UK. The 2017-2018 RCUK block grant allocations alone to support the RCUK Policy on Open Access add up to more than £8 million. So, what happens when a country makes a decision to introduce a significant extra boost to the publication budget?

As was predicted in early 2013 by the Chairman of the House of Commons Business, Innovation and Skills Committee: “Current UK open access policy risks incentivising publishers to introduce or increase embargo periods”. By September 2013, there was clear evidence this was happening.

Now, in the final year of the RCUK transition period, the situation is far, far worse.

No flipping going on here

Five years on from publication of the Finch report, whose recommendations helped to shape open access policies in the UK, it appears that relatively few journals have flipped from toll access to fully open access. For instance, a comprehensive dataset of embargo periods imposed by Elsevier journals indicates that only 42 of the publishing giant’s 2,300 active journals flipped from toll access to open access in the period 2013-2017. Precise figures for other publishers are not readily available, but compiled lists of converted journals are all very short, as described in this paper.

What several publishers have done instead is to adapt their policies to maximise the ability of their journals to capture the additional funds being injected into open access, by either imposing non-compliant embargo periods or charging more for mandated licences.

An embargo period is the time counted from the publication date of an article during which the author’s accepted version may not be distributed in open access repositories. There is a distinction between a press embargo and a publication embargo. The latter is what is being discussed here. We should also note that there continues to be no evidence to support publishers’ justification for imposing embargo periods.

Several funders (e.g. European Research Council, National Institute for Health Research, RCUK and Charities Open Access Fund partners including the Wellcome Trust) stipulate that open access to funded scientific research must be provided no later than 6 months after publication (with some funders allowing up to 12 months for humanities), either by self-archiving or by purchasing immediate open access.

Hence, any hybrid journal imposing an embargo period exceeding the maximum allowed by these funders will require authors of funded research to purchase immediate open access in order to comply with the funder’s policy. And, sure enough, this was exactly the response of several publishing Goliaths to the introduction of funders’ open access policies.

Increased embargo periods = revenue

For instance, from 2004 to 2011 Elsevier, the largest of them all, allowed posting of accepted manuscripts on personal websites or institutional repositories without an embargo. In 2011, Elsevier required that papers affected by a funder or institutional mandate could only be deposited if there was a specific agreement with Elsevier. In 2013, shortly after RCUK announced its open access policy, Elsevier published the first version of its embargo list, which listed only six journals (you read it right – six of 2,732 journals) with an embargo period within the 6 month maximum allowed by RCUK for policy compliance via self-archiving. The number of journals compatible with RCUK’s self-archiving route to compliance has increased to 10 since then.

Springer, the world’s second largest journal publisher, also allowed authors to deposit their work in institutional repositories with no embargo until 2013, when it introduced an embargo period of 12 months for all their journals, effectively blocking the green route for compliance with major funders’ policies for all articles in STEM (Science, Technology, Engineering and Mathematics) subjects.

The other three of the big five publishers—Wiley-Blackwell, Taylor & Francis and Sage—also impose embargo periods that are mostly incompatible with compliance via self-archiving for the funders listed above. Wiley has adopted, since April 2013, embargo periods of 12 months for STEM and 24 months for HASS (Humanities, Arts, and Social Sciences) journals. Only 44 of Taylor & Francis’ 2,577 journals are hybrids that support self-archiving without embargo. Finally, Sage mirrors Springer’s policy of a 12-month embargo period for all their journals.
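The compliance logic described in this section can be sketched as a simple comparison between a journal's embargo period and the maximum a funder accepts. This is a simplified illustration: the two-way STEM/HASS split, the use of 12 months as the HASS limit (which only some funders allow) and all of the names in the code are assumptions made for the example, and the blanket figures are the ones quoted above rather than a full picture of each publisher's list.

```python
# Longest embargo (in months) accepted for compliance via self-archiving.
# Assumption: 6 months for STEM, and the more generous 12 months for HASS
# that some funders allow.
FUNDER_MAX_EMBARGO = {"STEM": 6, "HASS": 12}


def green_route_compliant(journal_embargo_months: int, subject: str) -> bool:
    """True if self-archiving alone would satisfy the funder's policy."""
    return journal_embargo_months <= FUNDER_MAX_EMBARGO[subject]


# Blanket embargo periods as quoted in this post
PUBLISHER_EMBARGOES = {
    ("Springer", "STEM"): 12,
    ("Wiley", "STEM"): 12,
    ("Wiley", "HASS"): 24,
    ("Sage", "STEM"): 12,
}

for (publisher, subject), months in PUBLISHER_EMBARGOES.items():
    route = "green route open" if green_route_compliant(months, subject) else "must pay for gold"
    print(f"{publisher} ({subject}, {months} months): {route}")
```

Every row in this example fails the check, which is the point: funded authors publishing in these journals are left with only the paid-for option.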

Introducing or increasing embargo periods is a very effective method of encouraging funded authors to select a paid-for open access option, but it lacks the creativity of some of the strategies considered below.

Higher charges for a CC BY licence

Funders aspire to maximum reuse of published results of the research they have invested in, so many require a Creative Commons Attribution (CC BY) licence when they are paying for open access. Examples are the Bill & Melinda Gates Foundation, RCUK and COAF partners. Charging a premium for this licence type is therefore yet another method used by publishers to take advantage of funding for open access.

Below are a few examples of publishers charging extra for a CC BY licence.

American Association for the Advancement of Science (Science Advances)
- OA charge $3692 (CC BY-NC)
- Surcharge for CC-BY $1230
American Chemical Society
- OA charge $4000 (custom)
- Surcharge for CC-BY $1000
American Society for Nutrition
- OA charge $3000 (custom)
- Surcharge for CC-BY $2000
Optical Society of America (Biomedical Optics Express)
- OA charge $868 (custom)
- Surcharge for CC-BY $750
Oxford University Press (Cerebral Cortex)
- OA charge £2100 (CC BY-NC)
- Surcharge for CC-BY £263
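To put the figures listed above on a common footing, each surcharge can be expressed as a percentage of the base open access charge. The short sketch below does only that; the variable and function names are illustrative, and the amounts are copied from the list without currency conversion.

```python
# (publisher or journal, currency symbol, base OA charge, CC BY surcharge)
CHARGES = [
    ("AAAS (Science Advances)", "$", 3692, 1230),
    ("American Chemical Society", "$", 4000, 1000),
    ("American Society for Nutrition", "$", 3000, 2000),
    ("Optical Society of America (Biomedical Optics Express)", "$", 868, 750),
    ("Oxford University Press (Cerebral Cortex)", "£", 2100, 263),
]


def surcharge_percent(base: float, surcharge: float) -> float:
    """CC BY surcharge as a percentage of the base open access charge."""
    return 100 * surcharge / base


for name, currency, base, extra in CHARGES:
    pct = surcharge_percent(base, extra)
    print(f"{name}: {currency}{base} + {currency}{extra} ({pct:.0f}% extra for CC BY)")
```

The premium for the same article under a different licence ranges from around an eighth of the base charge to most of it again.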

As pointed out here, publishers rarely feel the need to explain the reasons for the differential pricing. Of the examples above, only AAAS justifies the surcharge by stating that “We assess the surcharge to account for potential lost secondary revenue such as permissions and reprint sales”.

Let’s for a moment ignore the fact that their base APC ($3,692) is well above the average charged by open access journals, and consider the potential revenue from the sale of reprints. Given the alternative licence offered by Science Advances (CC BY-NC) allows anyone to copy and redistribute the material in any medium or format (but not to sell it), what revenue could reasonably be expected from reprint sales?

Targeted embargo periods

A third and more complex strategy to capitalise on research funders’ policies, and one which fortunately appears to be losing ground, is to have policies specifying stricter self-archiving conditions for authors of funded research, or longer embargo periods for deposits in PubMed Central and Europe PMC, the subject repositories mandated by several major funders of biomedical research (e.g. BBSRC, MRC, NIHR and COAF partners).

BMJ journals, for example, set a special embargo of 12 months on deposits in PMC, while allowing deposits in other open access repositories without any embargo.

Elsevier, Wiley and more recently Emerald are all examples of publishers that have at some point dictated different conditions for authors following open access mandates, but as of the date of this post do not discriminate between authors on the basis of their funding.

Call us cynics

This last technique to squeeze every penny out of government funds is possibly the most cynical and further gives the lie to the claims publishers make about the necessity for embargo periods. Either making an author’s accepted manuscript available in a repository causes the cancellation of journal subscriptions or it doesn’t. The funding behind the research described in the paper is irrelevant.

And yet we continue to comply and we continue to pay. The RCUK is morphing into UK Research and Innovation on 1 April 2018. This is the time to take serious stock of the policies that have lined the pockets of big academic publishing companies and change them to achieve the actual end goal, which is the dissemination of research. Green over gold, people.

Published 26 October 2017
Written by Dr Andre Sartori and Dr Danny Kingsley

Choosing from a cornucopia: a thesis digitisation project

October 25, 2017 | Uncategorized | British Library, digitisation, microfilm, OCR, optical character recognition, theses | Office of Scholarly Communication

As part of Open Access Week 2017, the Office of Scholarly Communication is publishing a series of blog posts on open access and open research. In this post Drs Danny Kingsley and Matthias Ammon describe the process of choosing theses to digitise.

For decades microfilm was the way documents were photographed and stored. The British Library holds a collection of 14,000 Cambridge PhD theses on microfilm. These date back to the 1960s and go through to 2008 when digitisation took over from microfilm. In 2016 the Office of Scholarly Communication (OSC) was contacted by the British Library with an offer of low cost digitisation of these theses.

Clearly, being able to upload these theses to the University’s repository, Apollo, would make the works more visible. It would also be a major improvement on having to request that the works be digitised from paper, because the cost was significantly lower. Even though we did not have permission from the authors to make the work openly available, they would be requestable. The OSC decided to invest £20,000, which would pay for 10% of the theses, a total of 1,400.

There were two primary criteria that we were considering in choosing which theses to upload – the quality of the finished product and the likelihood of the theses being requested.

Quality of digitisation

Before word processing, theses were typewritten. The typeprint in the originals is not always clear and even. In addition, images were glued into the works and those images themselves were not always originals, so the quality of a copy is poor.

We needed to look at whether these types of issues affected the quality and readability of a thesis digitised from microfilm. In addition, one advantage of having a digitised thesis is the ability to run Optical Character Recognition (OCR) over it so the work becomes searchable. However OCR does not work on handwriting or if the type is uneven.

To test this, the OSC decided to ask the British Library to digitise a few samples from older theses and from theses that contained unusual characters or maps to ascertain the quality of the digitisation. Louise Clarke, the Superintendent in the Manuscripts Reading Room, considered the British Library list and found some examples of theses that had not been digitised at our end. She identified some sample pages to be scanned that might prove to be challenging.

When the scans arrived, Sarah Middle, our repository manager, assessed the visual quality and tried to run OCR over the scans to test for accuracy.

The scan of a 1997 thesis that contained photos, diagrams and equations had fuzzy text at the edges but was generally legible, and the OCR samples were accurate. However, the photos looked very dark.
The 1968 thesis that contained typed Greek characters had a poorer scan quality. It was ‘shadowy’ and some of the letters in the English text blurred together. This meant OCR was almost pointless in some places as the accuracy was so low. In addition, OCR did not pick up all the Greek characters, although we were not sure this would be better if the scan was done in-house from the original.
A 1977 thesis that included handwritten Hebrew characters had much better visual quality than the 1968 thesis, but OCR didn’t pick up the handwritten Hebrew at all, and while the accuracy of OCR on the English text was much higher than for the 1968 thesis, it was not as good as for the 1997 thesis.
The final thesis was a 1989 thesis that contained images of handwriting and equations in the text. This handwriting had been rendered as an image so OCR was not applied. Given that in this particular work the handwriting was there for illustrative purposes, this was not in itself an issue. Something that was odd was that the OCR on the text seemed to include a lot of Greek characters, even though there were none in the sample. We hypothesised that, because some of the equations contained Greek characters, this might have confused the language settings. The mathematical formulae rendered about as well as expected from OCR.

We then went back to Louise Clarke and asked her to scan the pages from the original as a comparison. Even allowing for the fact that a professional scan by the Digital Content Unit would have been of higher quality, it did help the assessment. We found that the photo was lighter (and therefore clearer) in the digital scan from the original, but the text from the scanned microfilm was much clearer than a 200dpi scan from the original.

This process led to the conclusion that we would have the best results if we focused on more recent theses.

Which subjects?

We decided to take advantage of some information already in house on which theses were likely to be more read. Until recently, Cambridge PhD students only had to provide a hardbound copy of their thesis for graduation. While in the past few years, some PhD students have uploaded their theses to the repository to make them open access, the majority are not available in this format.

If a researcher wanted to look at a Cambridge PhD they either had to come to the University Library and read the work in the Manuscripts Reading Room, or order a digital copy. The Digital Content Unit in the Library manages these requests for digitisation. Indeed last year during Open Access Week we blogged about the project to upload the collection of scanned theses into the repository and the attempt to find the authors for permission to make them open access.

What this gave us, however, was an indication of the theses that people wanted to read. We were particularly interested to know if there was a pattern in terms of the subjects that were being requested for digitisation.

Our repository manager went through the list of all theses that had been requested and found 452 distinct classmarks in the correct format, which seemed like a good sample size. Our initial plan was to see if the classmarks (which are codes used to identify the subject of a book or manuscript) provided for each thesis could be compared to the catalogue to retrieve department information/subject headings, which we could in turn use as a basis to select the theses.

Unfortunately our technical team was tied up at the time with the implementation of a new library management system so we had to revert to a manual process. This meant looking the thesis up manually in the catalogue and noting the department. In the end Louise Clarke checked 200 of the theses requested between July 2015 and July 2016 to establish which departments the theses belonged to.

Based on these statistics History was a clear outlier as by far the most requested subject. Also popular, but to a less statistically significant level, were subjects such as Engineering, Social Anthropology, Chemistry and Divinity. It should be noted that Engineering produces by far the largest number of theses overall, so the inclusion of Engineering theses in this list would be expected.

So far, so good. We knew the subjects that we should focus on, and that we should aim to digitise more recent theses that had been created with word processing. Now for the grunt work of choosing the 10%.

Choosing the 10%

Obviously the first thing we needed to do was exclude all of the theses we held open access in the repository and any that we had digitised ourselves from the original from the list of 14,000 microfilmed theses.

The British Library holdings contained Dewey numbers. While Dewey numbers are only an approximation of departmental divisions within the University, it was still a mechanism to identify the subjects. Our repository manager Sarah Middle collated the Dewey numbers for the British Library holdings and the project manager Matthias Ammon performed a rough sorting of theses according to the main Dewey number headings.

We decided to include all of the History theses going back to 1980. These corresponded to Dewey classes 90x, 92x, 93x, 94x, 95x, 96x, 97x, 98x and 99x. There were a total of 756 theses, just over half of the total list.

In the end, the rest up to 1,400 was filled up with subjects that appeared to be popular based on the sample analysis and roughly adjusted for the total number of theses produced in each subject in the University, with more recent cut-off points for the science subjects. While Classics was a popularly requested subject, the number of items available in the British Library’s microfilm holdings in the corresponding Dewey classes (87x and 88x) was small.

The decision was taken to include:

Chemistry back to 1990 (a total of 181)
Engineering back to 1995 (a total of 140)
Sociology and Anthropology, covering several departments of the University (a total of 216)
Philosophy (63)
Religion (30)
Classics (14)
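The counts above, together with the 756 History theses, can be tallied as a quick check against the 1,400 theses the budget allowed for. The snippet below is just that arithmetic, with illustrative variable names.

```python
# Number of theses selected per subject, as listed in this post
selection = {
    "History": 756,
    "Chemistry": 181,
    "Engineering": 140,
    "Sociology and Anthropology": 216,
    "Philosophy": 63,
    "Religion": 30,
    "Classics": 14,
}

total = sum(selection.values())
history_share = selection["History"] / total

print(f"Total selected: {total}")
print(f"History share of the selection: {history_share:.0%}")
```

The total comes to exactly 1,400, with History making up 54% of the selection, which matches the "just over half" noted earlier.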

While there was a certain element of arbitrariness in the process, this was considered a starting point. We are hopeful the remainder of the theses held on microfilm by the British Library will be digitised in due course.

Making the theses available

The British Library subsequently scanned our selection and provided us with the files on an external drive earlier this year. We were able to extract the metadata from EThOS to allow for a bulk upload of the works. However, this project has made us assess the way we were managing access to theses in the Library. This policy thinking has now been completed and we are developing an online request system for these restricted theses. The whole set of 1,400 theses should be available in the repository for request during November.

The Office of Scholarly Communication is grateful for the support of the Arcadia Fund, a charitable foundation of Lisbet Rausing and Peter Baldwin, for this project.

Published 25 October 2017
Written by Dr Danny Kingsley and Dr Matthias Ammon

Unlocking Research

Open Research at Cambridge