All posts by Office of Scholarly Communication

Choosing from a cornucopia: a thesis digitisation project

As part of Open Access Week 2016, the Office of Scholarly Communication is publishing a series of blog posts on open access and open research. In this post Drs Danny Kingsley and Matthias Ammon describe the process of choosing theses to digitise.

For decades microfilm was the way documents were photographed and stored. The British Library holds a collection of 14,000 Cambridge PhD theses on microfilm. These date back to the 1960s and go through to 2008 when digitisation took over from microfilm. In 2016 the Office of Scholarly Communication (OSC) was contacted by the British Library with an offer of low cost digitisation of these theses.

Clearly, being able to upload these theses to the University’s repository, Apollo, would make the works more visible. It would also be a major improvement on having to request the works be digitised from paper, because the cost was significantly lower. Even though we did not have permission from the authors to make the work openly available, they would be requestable. The OSC decided to invest £20,000, which would pay for 10% of the theses, a total of 1,400.

There were two primary criteria that we were considering in choosing which theses to upload – the quality of the finished product and the likelihood of the theses being requested.

Quality of digitisation

Before word processing, theses were typewritten. The typeprint in the originals is not always clear and even. In addition, images were glued into the works and those images themselves were not always originals, so the quality of a copy is poor.

We needed to look at whether these types of issues affected the quality and readability of a thesis digitised from microfilm. In addition, one advantage of having a digitised thesis is the ability to run Optical Character Recognition (OCR) over it so the work becomes searchable. However OCR does not work on handwriting or if the type is uneven.

To test this, the OSC decided to ask the British Library to digitise a few samples from older theses and from theses that contained unusual characters or maps to ascertain the quality of the digitisation. Louise Clarke, the Superintendent in the Manuscripts Reading Room, considered the British Library list and found some examples of theses that had not been digitised at our end. She identified some sample pages to be scanned that might prove to be challenging.

When the scans arrived, Sarah Middle, our repository manager, assessed the visual quality and tried to run OCR over the scans to test for accuracy.

  • The scan of a 1997 thesis that contained photos, diagrams and equations had fuzzy text at the edges but was generally legible, and the OCR samples were accurate. However, the photos looked very dark.
  • The 1968 thesis that contained typed Greek characters had a poorer scan quality. It was ‘shadowy’ and some of the letters in the English text blurred together. This meant OCR was almost pointless in some places as the accuracy was so low. In addition, OCR did not pick up all the Greek characters, although we were not sure this would be better if the scan was done in-house from the original.
  • A 1977 thesis that included handwritten Hebrew characters had much better visual quality than the 1968 thesis, but OCR didn’t pick up the handwritten Hebrew at all. While the accuracy of OCR on the English text was much higher than for the 1968 thesis, it was not as good as for the 1997 thesis.
  • The final thesis was a 1989 thesis that contained images of handwriting and equations in the text. This handwriting had been rendered as an image so OCR was not applied. Given that in this particular work the handwriting was there for illustrative purposes, this was not in itself an issue. Something that was odd was that the OCR on the text seemed to include a lot of Greek characters, even though there were none in the sample. We hypothesised that, because some of the equations contained Greek characters, this might have confused the language settings. The mathematical formulae rendered about as well as expected from OCR.

We then went back to Louise Clarke and asked her to scan the pages from the original as a comparison. Even allowing for the fact that a professional scan by the Digital Content Unit would have been of higher quality, it did help the assessment. We found that the photo was lighter (and therefore clearer) in the digital scan from the original, but the text from the scanned microfilm was much clearer than a 200dpi scan from the original.

This process led to the conclusion that we would have the best results if we focused on more recent theses.

Which subjects?

We decided to take advantage of some information already in house on which theses were likely to be more read. Until recently, Cambridge PhD students only had to provide a hardbound copy of their thesis for graduation. While in the past few years, some PhD students have uploaded their theses to the repository to make them open access, the majority are not available in this format.

If a researcher wanted to look at a Cambridge PhD they either had to come to the University Library and read the work in the Manuscripts Reading Room, or order a digital copy. The Digital Content Unit in the Library manages these requests for digitisation. Indeed last year during Open Access Week we blogged about the project to upload the collection of scanned theses into the repository and the attempt to find the authors for permission to make them open access.

What this gave us, however, was an indication of the theses that people wanted to read. We were particularly interested to know if there was a pattern in terms of the subjects that were being requested for digitisation.

Our repository manager went through the list of all theses that had been requested and found 452 distinct classmarks in the correct format, which seemed like a good sample size. Our initial plan was to see if the classmarks (which are codes used to identify the subject of a book or manuscript) provided for each thesis could be compared to the catalogue to retrieve department information/subject headings, which we could in turn use as a basis to select the theses.

Unfortunately our technical team was tied up at the time with the implementation of a new library management system so we had to revert to a manual process. This meant looking the thesis up manually in the catalogue and noting the department. In the end Louise Clarke checked 200 of the theses requested between July 2015 and July 2016 to establish which departments the theses belonged to.

Based on these statistics History was a clear outlier as by far the most requested subject. Also popular, but to a less statistically significant level, were subjects such as Engineering, Social Anthropology, Chemistry and Divinity. It should be noted that Engineering produces by far the largest number of theses overall, so the inclusion of Engineering theses in this list would be expected.

So far, so good. We knew the subjects that we should focus on, and that we should aim to digitise more recent theses that had been created with word processing. Now for the grunt work of choosing the 10%.

Choosing the 10%

Obviously the first thing we needed to do was exclude all of the theses we held open access in the repository and any that we had digitised ourselves from the original from the list of 14,000 microfilmed theses.

The British Library holdings contained Dewey numbers. While Dewey numbers are only an approximation of departmental divisions within the University, it was still a mechanism to identify the subjects. Our repository manager Sarah Middle collated the Dewey numbers for the British Library holdings and the project manager Matthias Ammon performed a rough sorting of theses according to the main Dewey number headings.

We decided to include all of the History theses going back to 1980. These corresponded to Dewey classes 90x, 92x, 93x, 94x, 95x, 96x, 97x, 98x and 99x. There were a total of 756 theses, just over half of the total list.
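The History selection rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a Dewey number as a string, plus a year), the function name and the sample records are all hypothetical and are not the format of the British Library list. It also assumes that "going back to 1980" means 1980 or later, inclusive.

```python
# Dewey tens-groups named in the post as History: 90x and 92x-99x (91x is not listed).
HISTORY_PREFIXES = {"90", "92", "93", "94", "95", "96", "97", "98", "99"}
CUTOFF_YEAR = 1980  # assumption: the cut-off year itself is included


def is_history_candidate(dewey: str, year: int) -> bool:
    """Return True if a thesis is in a History Dewey class and dated 1980 or later."""
    main_class = dewey.strip().split(".")[0]  # drop anything after a decimal point
    return (
        len(main_class) == 3
        and main_class[:2] in HISTORY_PREFIXES
        and year >= CUTOFF_YEAR
    )


# Hypothetical records for illustration only: (Dewey number, year of thesis)
sample = [("942.05", 1985), ("910", 1990), ("930", 1975), ("540", 1995)]
selected = [record for record in sample if is_history_candidate(*record)]
```

In this sample only the first record passes: the second falls in 91x, the third predates the cut-off and the fourth is not a History class.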

In the end, the rest up to 1400 was filled up with subjects that appeared to be popular based on the sample analysis and roughly adjusted for the total number of theses produced in each subject in the University, with more recent cut-off points for the science subjects. While Classics was a popularly requested subject, the number of items available in the British Library’s microfilm holdings in the corresponding Dewey classes (84x and 88x) was small.

The decision was taken to include:

  • Chemistry back to 1990 (a total of 181)
  • Engineering back to 1995 (a total of 140)
  • Sociology and Anthropology, covering several departments of the University (a total of 216)
  • Philosophy (63)
  • Religion (30)
  • Classics (14)
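As a quick check on the figures in this post, the subject counts can be tallied in Python. The counts and the £20,000 budget are taken from the post; the cost per thesis is a derived figure, not one quoted by the British Library.

```python
# Counts as stated in the post (History is the 756 theses going back to 1980).
selection = {
    "History": 756,
    "Chemistry": 181,
    "Engineering": 140,
    "Sociology and Anthropology": 216,
    "Philosophy": 63,
    "Religion": 30,
    "Classics": 14,
}

total = sum(selection.values())
history_share = selection["History"] / total

budget_gbp = 20_000  # OSC investment stated in the post
cost_per_thesis = budget_gbp / total  # derived, not a quoted price

print(f"Total selected: {total}")
print(f"History share: {history_share:.0%}")
print(f"Approximate cost per thesis: £{cost_per_thesis:.2f}")
```

The counts sum to exactly 1,400, with History making up 54% of the selection, and the budget works out at roughly £14.29 per thesis.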

While there was a certain element of arbitrariness in the process, this was considered a starting point. We are hopeful the remainder of the theses held on microfilm by the British Library will be digitised in due course.

Making the theses available

The British Library subsequently scanned our selection and provided us with the files on an external drive earlier this year. We were able to extract the metadata from EThOS to allow for a bulk upload of the works. However, this project has made us assess the way we were managing access to theses in the Library. This policy thinking has now been completed and we are developing an online request system for these restricted theses. The whole set of 1,400 theses should be available in the repository for request during November.

The Office of Scholarly Communication is grateful for the support of the Arcadia Fund, a charitable foundation of Lisbet Rausing and Peter Baldwin, for this project.

Published 25 October 2017
Written by Dr Danny Kingsley and Dr Matthias Ammon
Creative Commons License

Benchmarking RDM Training

This blog reports on the progress of the international project to benchmark Research Data Management training across institutions. It is a collaboration of Cambridge Research Data Facility staff with international colleagues – a full list is at the bottom of the post. This is a reblog; the original appeared on 6 October 2017.

How effective is your RDM training?

When developing new training programmes, one often asks oneself a question about the quality of training. Is it good? How good is it? Trainers often develop feedback questionnaires and ask participants to evaluate their training. However, feedback gathered from participants attending courses does not answer the question of how good this training was compared with other training on similar topics available elsewhere. As a result, improvement and innovation become difficult. So how can the quality of training be assessed objectively?

In this blog post we describe how, by working collaboratively, we created tools for objective assessment of RDM training quality.

Crowdsourcing

In order to objectively assess something, objective measures need to exist. Being unaware of any objective measures for benchmarking of a training programme, we asked Jisc’s Research Data Management mailing list for help. It turned out that a lot of resources with useful advice and guidance on creation of informative feedback forms were readily available, and we gathered all information received in a single document. However, none of the answers received provided us with the information we were looking for. On the contrary, several people said they would be interested in such metrics. This meant that objective metrics to address the quality of RDM training either did not exist, or the community was not aware of them. Therefore, we decided to create RDM training evaluation metrics.

Cross-institutional and cross-national collaboration

For metrics to be objective, and to allow benchmarking and comparisons of various RDM courses, they need to be developed collaboratively by a community who would be willing to use them. Therefore, the next question we asked Jisc’s Research Data Management mailing list was whether people would be willing to work together to develop and agree on a joint set of RDM training assessment metrics and a system, which would allow cross-comparisons and training improvements. Thankfully, the RDM community tends to be very collaborative, which was the case also this time – more than 40 people were willing to take part in this exercise and a dedicated mailing list was created to facilitate collaborative working.

Agreeing on the objectives

To ensure effective working, we first needed to agree on common goals and objectives. We agreed that the purpose of creating the minimal set of questions for benchmarking is to identify what works best for RDM training. We worked with the idea that this was for ‘basic’ face-to-face RDM training for researchers or support staff but it can be extended to other types and formats of training session. We reasoned that the same set of questions used in feedback forms across institutions, combined with sharing of training materials and contextual information about sessions, should facilitate exchange of good practice and ideas. As an end result, this should allow constant improvement and innovation in RDM training. We therefore had joint objectives, but how to achieve this in practice?

Methodology

Deciding on common questions to be asked in RDM training feedback forms

In order to establish joint metrics, we first had to decide on a joint set of questions that we would all agree to use in our participant feedback forms. To do this we organised a joint catch up call during which we discussed the various questions we were asking in our feedback forms and why we thought these were important and should be mandatory in the agreed metrics. There were lots of good ideas and valuable suggestions. However, by the end of the call and after eliminating all the non-mandatory questions, we ended up with a list of thirteen questions, which we thought were all important. These, however, were too many to ask participants to answer, especially as many institutions would need to add their own institution-specific feedback questions.

In order to bring down the number of questions which should be made mandatory in feedback forms, a short survey was created and sent to all collaborators, asking respondents to judge how important each question was (scale 1-5, 1 being ‘not important at all that this question is mandatory’ and 5 being ‘this should definitely be mandatory’). Twenty people participated in the survey. The total score received from all respondents for each question was calculated. Subsequently, the top six questions with the highest scores were selected to be made mandatory.
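The scoring step described above can be sketched in Python. This is a minimal illustration, not the tool we actually used: the function name, the question labels and the example ratings are hypothetical, and the way ties are broken (earlier questions win) is an assumption, since the survey description does not cover ties.

```python
from typing import Dict, List


def top_questions(ratings: Dict[str, List[int]], keep: int = 6) -> List[str]:
    """Sum each question's 1-5 ratings and return the `keep` highest-scoring questions."""
    totals = {question: sum(scores) for question, scores in ratings.items()}
    # Highest total first; sorted() is stable, so tied questions keep their input order.
    ranked = sorted(totals, key=lambda question: totals[question], reverse=True)
    return ranked[:keep]


# Hypothetical ratings from three respondents for three candidate questions.
example_ratings = {
    "Q1": [5, 4, 5],  # total 14
    "Q2": [2, 3, 1],  # total 6
    "Q3": [4, 4, 4],  # total 12
}
shortlist = top_questions(example_ratings, keep=2)
```

With the real survey, `ratings` would hold twenty scores for each of the thirteen candidate questions and `keep` would stay at its default of six.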

Ways of sharing responses and training materials

We next had to decide on the way in which we would share feedback responses from our courses and the training materials themselves. We unanimously decided that the Open Science Framework (OSF) supports the goals of openness, transparency and sharing, allows collaborative working and therefore is a good place to go. We therefore created a dedicated space for the project on the OSF, with separate components for the joint resources developed, for sharing training materials and for sharing anonymised feedback responses.

Next steps

With the benchmarking questions agreed and with the space created for sharing anonymised feedback and training materials, we were ready to start collecting first feedback for the collective training assessment. We also thought that this was a good opportunity to re-iterate our short-, mid- and long-term goals.

Short-term goals

Our short-term goal is to revise our existing training materials to incorporate the agreed feedback questions into RDM training courses starting in Autumn 2017. This would allow us to obtain the first comparative metrics at the beginning of 2018 and would allow us to evaluate if our designed methodology and tools are working and if they are fit for purpose. This would also allow us to iterate over our materials and methods as needed.

Mid-term goals

Our mid-term goal is to see if the metrics, combined with shared training materials, could allow us to identify parts of RDM training that work best and to collectively improve the quality of our training as a whole. This should be possible in mid/late-2018, allowing time to adapt training materials as a result of comparative feedback gathered at the beginning of 2018 and to assess whether training adaptation resulted in better participant feedback.

Long-term goals

Our long-term goal is to collaboratively investigate and develop metrics which could allow us to measure and monitor long-term effects of our training. Feedback forms and satisfaction surveys immediately after training are useful and help to assess the overall quality of sessions delivered. However, the ultimate goal of any RDM training should be the improvement of researchers’ day to day RDM practice. Is our training really having any effect on this? In order to assess this, different kinds of metrics are needed, which would need to be coupled with long-term follow up with participants. We decided that any ideas developed on how to best address this will also be gathered in the OSF and we have created a dedicated space for the work in progress.

Reflections

When reflecting on the work we did together, we all agreed that we were quite efficient. We started in June 2017, and it took us two joint catch up calls and a couple of email exchanges to develop and agree on joint metrics for assessment of RDM training. Time will show whether the resources we create will help us meet our goals, but we all thought that during the process we have already learned a lot from each other by sharing good practice and experience. Collaboration turned out to be an excellent solution for us. Likewise, our discussions are open to everyone to join, so if you are reading this blog post and would like to collaborate with us (or to follow our conversations), simply sign up to the mailing list.

Published 9 October 2017
Written by (in alphabetical order by surname): Cadwallader Lauren, Higman Rosie, Lawler Heather, Neish Peter, Peters Wayne, Schwamm Hardy, Teperek Marta, Verbakel Ellen, Williamson Laurian, Busse-Wicher Marta

Milestone – 1000 datasets in Cambridge’s repository

Last week, Cambridge celebrated a huge milestone – the deposit of the 1000th dataset to our repository Apollo since the launch of the Research Data Facility in early 2015. This is the culmination of a huge amount of work by the team in the Office of Scholarly Communication, in terms of developing systems, workflows, policies and through an extensive advocacy campaign. The Research Data team have run 118 events over the past couple of years and published 39 blogs.

In the past 12 months alone there have been 26000 downloads of the data in Apollo. In some cases the dataset has been downloaded many times – 170 – and the data has featured in news, blogs and Twitter.

An event was held at Cambridge University Library last week to celebrate this milestone.

Opening remarks

The Director of Library Services, Dr Jess Gardner, opened proceedings with a speech in which she noted “the Research Data Services and all who sail in her are at the core of our mission in our research library”.

Dr Gardner referred to the library’s long and proud history of collecting and managing research data that “began on vellum, paper, stone and bone”. The research data of luminaries such as Isaac Newton and Charles Darwin was on paper and, she noted, “we have preserved that with great care and share it openly on line through our digital library.”

Turning to the future, Dr Gardner observed: “But our responsibility now is today’s researcher and today’s scientists and people working across all disciplines across our great university. Our preservation stewardship of that research data from the digital humanities across the biomedical is a core part of what we now do.”

“In the 21st century our support and our overriding philosophy is all about supporting open research and opening data as widely as possible,” she noted. “It is about sharing freely wherever it is appropriate to do so”. [Dr Gardner’s speech is in full at the end of this post.]

Perspectives from a researcher

The second speaker was Zoe Adams, a PhD student at Cambridge, who talked about the work she has done with Professor Simon Deakin on the Labour Regulation Index in association with the Centre for Business Research.

Ms Adams noted it was only in retrospect she could “appreciate the benefit of working in a collaborative project and open research generally”. She discussed how helpful it had been as an early career researcher to be “associated with something that was freely available”. She observed that few of her peers had many citations, and the reason she did was because “the dataset is online, people use the data, they cite the data, and cite me”.

Working openly has also improved the way she works, she explained, saying “It has given me a new perspective on what research should be about. … It gives me a sense that people are relying on this data to be accurate and that does change the way you approach it.”

View from the team

The final speaker was Dr Lauren Cadwallader, Joint Deputy Head of the OSC with responsibility for the Research Data Facility, who discussed the “showcase dataset of the data that we can produce in the OSC” which is taken from usage of our Request a Copy service.

Dr Cadwallader noted there has been an increase in the requests for theses over time. “This is a really exciting observation because the Board of Graduate Studies have agreed that all students should deposit a digital copy of their thesis in our repository,” she said. “So it is really nice evidence that we can show our PhD students that by putting a copy in the repository people can read it and people do want to read theses in our repository.”

One observation was that several of the theses that were requested were written 60 years ago, so the repository is sharing older research as well. The topics of these theses included algebra and Yorkshire evangelists, and one of the oldest requested theses was written in 1927 about the Falkland Islands. “So there is a longevity in research and we have a duty to provide access to that research,” she said.

Thanks go to…

The dataset itself is one created by the OSC team looking at the usage of our Request a Copy service. The analysis was undertaken by Peter Sutton Long, and we recently published a blog post about the findings.

The music played at the event was compiled by Tony Malone and covers almost 1000 years of music, from Laura Cannell’s reworking of Hildegard of Bingen, to Jane Weaver’s Modern Cosmology. There are acknowledgments to Apollo, and Cambridge too. The soundtrack is available for those interested in listening.

This achievement is entirely due to the incredible work of the team in the Research Data Facility and their ability to engage with colleagues across the institution, the nation and the world. In particular the vision and dedication of Dr Marta Teperek cannot be overstated.

In the words of Dr Gardner: “They have made our mission different, they have made our mission better, through the work they have achieved and the commitment they have.”

The event was supported by the Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin.

Published 21 September 2017
Written by Dr Danny Kingsley

Speech by Dr Jess Gardner

First let us begin with some headline numbers. One thousand datasets. This is hugely significant and a very high level when looking at research repositories around the country. There is every reason to be proud of that achievement and what it means for open research.

There have been 26000 downloads of that data in the past 12 months alone – that is about use and reuse of our research data and is changing the face of how we do research. Some of these datasets have been downloaded 117 times and used in news, blogs and Twitter. The Research Data team have written 39 blogs about research data and have run 118 events, most of these have been with researchers.

While the headline numbers give us a sense of volume, perhaps let’s talk about the underlying rationale and philosophy behind this, which is core.

Cambridge University Library has a 600-year-old history we are very proud of. In that time we have had an abiding responsibility to collect, care for and make available for use and reuse, information and research objects that form part of the intrinsic international scholarly record of which Cambridge has been such a strong part. And the ability for those ideas to inspire new ideas. The collection began on vellum, paper, and stone and bone.

And today much of that of course is digital. You can’t see that in the same way you can see the manuscripts and collections. It is sometimes hard to grasp when we are in this grand old dame of a building that I dare you not to love. It is home to the physical papers of such greats as Isaac Newton and Charles Darwin. Their research data was on paper and we have preserved that with great care and share it openly on line through our digital library. But our responsibility now is today’s researcher and today’s scientists and people working across all disciplines across our great university. Our preservation stewardship of that research data from the digital humanities across the biomedical is a core part of what we now do.

And the people in this room have changed that. They have made our mission different, they have made our mission better through the work they have achieved and the commitment they have.

Philosophically this is a very natural extension of what we have done in the Library and the open library and its great research community for which this very building is designed. Some of you may know there is a philosophy behind this building and the famous ‘open library Cambridge’. In the 19th century and 20th century that was mostly about our open stack of books and we have quite a few of them, we are a little weighed down by them.

Our research data weighs less but it is just as significant and in the 21st century our support and our overriding philosophy is all about supporting open research and opening data as widely as possible. It is about sharing freely wherever it is appropriate to do so and there are many reasons why data isn’t open sometimes, and that is fine. What we are looking for is managing so we can make those choices appropriately, just as we have with the archive for many, many years.

So whilst there is a fantastic achievement to mark tonight with those 1000 datasets, and it really is significant, we are really celebrating a deeper milestone with our research partners, our data champions, our colleagues in the research office and in the libraries across Cambridge, and that is about the changing role in research support and library research support in the digital age, and I think that is something we should be very proud of in terms of what we have achieved at Cambridge. I certainly am.

I am relatively new here at Cambridge. One of the things that was said to me when I was first appointed to the job was how lucky I was to be working at this University but also with the Office of Scholarly Communication in particular and that has proved to be absolutely true. I would like to take this opportunity to note that achievement of 1000 datasets and to state very publicly that the Research Data Services and all who sail in her are at the core of our mission in our research library. But also to thank you and the teams involved for your superb achievements. It really is something to be very proud of and I thank you.