Tag Archives: British Library

Where are we now? Cambridge theses deposits one year in

As the nights draw in and the academic year 2018/19 begins, we are preparing to enter our second year of compulsory e-theses deposits. Our university repository, Apollo, is close to holding 6000 digital PhD theses and it is the intention of the University that this valuable research asset continues to grow into the future. The Apollo repository will play a large part in making this happen. Until recently only hardbound copies of theses were collected and catalogued by the University Library. Users could read theses on-site in Cambridge or order a digitisation of the thesis, but the introduction of e-thesis deposit to Apollo has meant that University of Cambridge theses are more accessible than ever before. It’s been an incredibly busy year and we have made some great steps forward in our management of theses in Cambridge.

e-theses at Cambridge – the background

The e-theses deposit story at Cambridge started in October 2016, when the Office of Scholarly Communication upgraded Apollo to allow the deposit of theses and began a digital thesis pilot for the academic year 2016/17. Eleven departments in the University participated in the pilot, asking their PhD students to deposit an e-thesis alongside a hardcopy thesis. Theses deposited in Apollo during the pilot could either be made open access at the request of the author or were treated as historical theses had been up until that point, whereby hardbound copies were held in the University Library and requestors could sign a declaration stating that they wished to consult a thesis for private study or non-commercial research. Following the success of the pilot, the Board of Graduate Studies, at its meeting on 4 July 2017, made the decision that from 1 October 2017 all PhD students would be required to deposit both a hard copy and an electronic copy of their thesis to the University Library.

What we learnt during the academic year 2017/18

The experience of depositing theses during the pilot had highlighted some issues that needed addressing. We had to make decisions on how to deal with third party copyright, sensitive material, library copy and supply rules, and the alignment of access levels for hardbound and electronic theses. In response to this, we decided that we should think through each of the different ways in which a thesis could be deposited in the repository, and consider the range of contentious material that could be contained within a thesis.

How do theses enter the repository?

Whilst students that are depositing in order to graduate do this directly, we also have the capacity to scan theses on request here in the library, and these scanned theses are subsequently deposited in Apollo. In addition to this, we led a drive to digitise University of Cambridge theses held by the British Library on microfilm and gave alumni the option to digitise their thesis and make it open access at no cost to them.

British Library theses

This year the OSC has made a bulk deposit of theses scanned by the British Library, which significantly augments the number of theses stored in the repository. In the culmination of a two-year project, nearly 1300 additional Cambridge PhD theses are now available on request in the Apollo repository.

Prior to being made available in the repository, these Cambridge theses were held on microfilm at the British Library. They date from the 1960s through to 2008, when digitisation took over from microfilm as a means of document storage. The British Library holds 14,000 Cambridge PhD theses on microfilm; in 2016 it embarked on a project with the OSC to digitise ten percent of the collection at low cost – read more about this in an earlier post, Choosing from a cornucopia: a digitisation project.

You can explore the collection in Apollo: Historical Digital Theses: British Library collection. The theses are under controlled access, which means they are available on request for non-commercial research purposes, subject to a £15 admin fee.

Establishing access levels

We established that the level of access we could allow to a thesis could be determined by the route by which it entered the repository, its content, or in some cases the author’s wish to publish. To address all of the potential issues, we decided to define a set of access levels which would determine what we, as managers of the repository, were able to do with a thesis and the way in which it could be accessed by a requestor.

The access levels were put into action in spring 2018 and this was followed by a survey of Degree Committees, conducted by the e-theses working group consisting of members of the University Library and Student Registry. The survey asked for feedback on the suitability of the access levels for research outputs for all departments in the University; the outcome confirmed that the access levels were working and covered the options well, although a few tweaks were needed. In light of the feedback, a set of recommendations was put to the Board of Graduate Studies by the e-theses working group, and these recommendations were considered and accepted at the Board’s meeting on 3 July 2018, ready to be put in place for the 2018/19 academic year.

eSales for theses under controlled access

At the same time as we were establishing our access levels, we were also working on devising an eSales process to facilitate the supply of theses under controlled access. Controlled access replicates the way that historical, hardbound theses were managed in the library, with the addition of an electronic version of the thesis being held in the repository, and follows the library copy and supply rules for unpublished works under copyright law. A thesis scanned by the library would be deposited under controlled access so it remains unpublished, but this access level is also available to students depositing their thesis directly. The eSales process we devised went live in July 2018 and this meant a large number of theses held in the repository were made more accessible, including those digitised by the British Library. As of 18 October, we have supplied 14 theses via the eSales route and the requests keep coming in at a steady pace.

Looking forward to the 2018/19 academic year

As we begin the 2018/19 academic year, our theses management is looking in good shape but we will continue to improve and refine our internal and external services. In consultation with the University’s Student Registry we are making the final changes to our deposit forms, access levels and communications and we endeavour to make this academic year the smoothest yet for e-theses management. University of Cambridge theses are more accessible than they have ever been. The collection will grow as more students deposit each year, and the valuable research of PhD students will continue to be disseminated.

Published 25 October 2018
Written by Zoë Walker-Fagg
Creative Commons License

Choosing from a cornucopia: a digitisation project

As part of Open Access Week 2016, the Office of Scholarly Communication is publishing a series of blog posts on open access and open research. In this post Drs Danny Kingsley and Matthias Ammon describe the process of choosing theses to digitise.

For decades microfilm was the way documents were photographed and stored. The British Library holds a collection of 14,000 Cambridge PhD theses on microfilm. These date back to the 1960s and go through to 2008 when digitisation took over from microfilm. In 2016 the Office of Scholarly Communication (OSC) was contacted by the British Library with an offer of low cost digitisation of these theses.

Clearly, being able to upload these theses to the University’s repository, Apollo, would make the works more visible. It would also be a major improvement on having to request the works be digitised from paper, because the cost was significantly lower. Even though we did not have permission from the authors to make the work openly available, they would be requestable. The OSC decided to invest £20,000, which would pay for 10% of the theses, a total of 1,400.

There were two primary criteria that we were considering in choosing which theses to upload – the quality of the finished product and the likelihood of the theses being requested.

Quality of digitisation

Before word processing, theses were typewritten. The typeprint in the originals is not always clear and even. In addition, images were glued into the works and those images themselves were not always originals, so the quality of a copy is poor.

We needed to look at whether these types of issues affected the quality and readability of a thesis digitised from microfilm. In addition, one advantage of having a digitised thesis is the ability to run Optical Character Recognition (OCR) over it so the work becomes searchable. However, OCR does not work on handwriting or if the type is uneven.

To test this, the OSC decided to ask the British Library to digitise a few samples from older theses and from theses that contained unusual characters or maps to ascertain the quality of the digitisation. Louise Clarke, the Superintendent in the Manuscripts Reading Room, considered the British Library list and found some examples of theses that had not been digitised at our end. She identified some sample pages to be scanned that might prove to be challenging.

When the scans arrived, Sarah Middle, our repository manager, assessed the visual quality and tried to run OCR over the scans to test for accuracy.

  • The scan of a 1997 thesis that contained photos, diagrams and equations had fuzzy text at the edges but was generally legible, and the OCR samples were accurate. However, the photos looked very dark.
  • The 1968 thesis that contained typed Greek characters had a poorer scan quality. It was ‘shadowy’ and some of the letters in the English text blurred together. This meant OCR was almost pointless in some places as the accuracy was so low. In addition, OCR did not pick up all the Greek characters, although we were not sure this would be better if the scan was done in-house from the original.
  • A 1977 thesis that included handwritten Hebrew characters had much better visual quality than the 1968 thesis, but OCR didn’t pick up the handwritten Hebrew at all. The accuracy of OCR on the English text was much higher than for the 1968 thesis but not as good as for the 1997 thesis.
  • The final thesis was a 1989 thesis that contained images of handwriting and equations in the text. This handwriting had been rendered as an image so OCR was not applied. Given that in this particular work the handwriting was there for illustrative purposes, this was not in itself an issue. One oddity was that the OCR on the text seemed to include a lot of Greek characters, even though there were none in the sample. We hypothesised that some of the equations contained Greek characters, which might have confused the language settings. The mathematical formulae rendered about as well as expected from OCR.

We then went back to Louise Clarke and asked her to scan the pages from the original as a comparison. Even allowing for the fact that a professional scan by the Digital Content Unit would have been of higher quality, it did help the assessment. We found that the photo was lighter (and therefore clearer) in the digital scan from the original, but the text from the scanned microfilm was much clearer than in a 200 dpi scan from the original.

This process led to the conclusion that we would have the best results if we focused on more recent theses.

Which subjects?

We decided to take advantage of some information already in house on which theses were likely to be more read. Until recently, Cambridge PhD students only had to provide a hardbound copy of their thesis for graduation. While in the past few years, some PhD students have uploaded their theses to the repository to make them open access, the majority are not available in this format.

If a researcher wanted to look at a Cambridge PhD thesis, they either had to come to the University Library and read the work in the Manuscripts Reading Room, or order a digital copy. The Digital Content Unit in the Library manages these requests for digitisation. Indeed, last year during Open Access Week we blogged about the project to upload the collection of scanned theses into the repository and the attempt to find the authors for permission to make them open access.

What this gave us, however, was an indication of the theses that people wanted to read. We were particularly interested to know if there was a pattern in terms of the subjects that were being requested for digitisation.

Our repository manager went through the list of all theses that had been requested and found 452 distinct classmarks in the correct format, which seemed like a good sample size. Our initial plan was to see if the classmarks (which are codes used to identify the subject of a book or manuscript) provided for each thesis could be compared to the catalogue to retrieve department information/subject headings, which we could in turn use as a basis to select the theses.

Unfortunately, our technical team was tied up at the time with the implementation of a new library management system, so we had to revert to a manual process. This meant looking each thesis up manually in the catalogue and noting the department. In the end Louise Clarke checked 200 of the theses requested between July 2015 and July 2016 to establish which departments the theses belonged to.

Based on these statistics History was a clear outlier as by far the most requested subject. Also popular, but to a less statistically significant level, were subjects such as Engineering, Social Anthropology, Chemistry and Divinity. It should be noted that Engineering produces by far the largest number of theses overall, so the inclusion of Engineering theses in this list would be expected.

So far, so good. We knew the subjects that we should focus on, and that we should aim to digitise more recent theses that had been created with word processing. Now for the grunt work of choosing the 10%.

Choosing the 10%

Obviously the first thing we needed to do was exclude all of the theses we held open access in the repository and any that we had digitised ourselves from the original from the list of 14,000 microfilmed theses.

The British Library holdings contained Dewey numbers. While Dewey numbers are only an approximation of departmental divisions within the University, it was still a mechanism to identify the subjects. Our repository manager Sarah Middle collated the Dewey numbers for the British Library holdings and the project manager Matthias Ammon performed a rough sorting of theses according to the main Dewey number headings.

We decided to include all of the History theses going back to 1980 (Dewey classes 90x, 92x, 93x, 94x, 95x, 96x, 97x, 98x and 99x). There were a total of 756 theses, just over half of the total list.
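The History rule described above can be sketched in a few lines of Python. This is only an illustration, not the process the team actually ran: the record layout (title, Dewey number, year) and the function names are hypothetical, and treating "going back to 1980" as including 1980 itself is an assumption.

```python
# Hypothetical sketch of the History selection rule.
# The (title, dewey, year) layout is illustrative, not the British Library's format.

# Dewey classes 90x and 92x-99x, as listed in the post (91x is not included).
HISTORY_PREFIXES = ("90", "92", "93", "94", "95", "96", "97", "98", "99")
CUTOFF_YEAR = 1980  # assumption: 1980 itself counts as "going back to 1980"


def is_history_candidate(dewey: str, year: int) -> bool:
    """Return True if the Dewey number is in a History class and the year is 1980 or later."""
    return dewey.startswith(HISTORY_PREFIXES) and year >= CUTOFF_YEAR


def select_history(records):
    """Keep only the records that satisfy the History rule."""
    return [r for r in records if is_history_candidate(r[1], r[2])]


# Made-up sample records for demonstration only.
sample = [
    ("Thesis A", "942.05", 1985),  # History class, recent enough
    ("Thesis B", "910.4", 1990),   # 91x, not in the listed classes
    ("Thesis C", "547.1", 1992),   # Chemistry
    ("Thesis D", "994", 1975),     # History class, but before the cut-off
]
selected = select_history(sample)
```

With the made-up sample, only "Thesis A" passes both the class test and the year test.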

In the end, the remainder up to 1,400 was filled with subjects that appeared to be popular based on the sample analysis, roughly adjusted for the total number of theses produced in each subject in the University, with more recent cut-off points for the science subjects. While Classics was a popularly requested subject, the number of items available in the British Library’s microfilm holdings in the corresponding Dewey classes (84x and 88x) was small.

The decision was taken to include:

  • Chemistry back to 1990 (a total of 181)
  • Engineering back to 1995 (a total of 140)
  • Sociology and Anthropology, covering several departments of the University (a total of 216)
  • Philosophy (63)
  • Religion (30)
  • Classics (14)
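As a quick check on the figures above, the subject counts can be added to the 756 History theses to confirm that they reach the 1,400 target. The sketch below uses only numbers quoted in this post; the dictionary labels are shorthand chosen for illustration.

```python
# Counts quoted in the post; the labels are shorthand for illustration.
selection = {
    "History": 756,
    "Chemistry": 181,
    "Engineering": 140,
    "Sociology and Anthropology": 216,
    "Philosophy": 63,
    "Religion": 30,
    "Classics": 14,
}

total = sum(selection.values())
history_share = selection["History"] / total

print(total)                   # 1400
print(f"{history_share:.0%}")  # 54%
```

The total matches the 1,400 theses funded, and History accounts for 54% of the selection, consistent with "just over half".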

While there was a certain element of arbitrariness in the process, this was considered a starting point. We are hopeful the remainder of the theses held on microfilm by the British Library will be digitised in due course.

Making the theses available

The British Library subsequently scanned our selection and provided us with the files on an external drive earlier this year. We were able to extract the metadata from EThOS to allow for a bulk upload of the works. However, this project has made us assess the way we were managing access to theses in the Library. This policy thinking has now been completed and we are developing an online request system for these restricted theses. The whole set of 1,400 theses should be available in the repository for request during November.

The Office of Scholarly Communication is grateful for the support of the Arcadia fund, a charitable foundation of Lisbet Rausing and Peter Baldwin, for this project.

Published 25 October 2017
Written by Dr Danny Kingsley and Dr Matthias Ammon
Creative Commons License