The Office of Scholarly Communication provides information, advice and training on research data management. So when running a research project that involves a considerable amount of data, it seems only right to see whether we can practise what we preach.
This blog post is a short list of how we have approached managing data for analysis. Judging by our colleagues’ faces when we described some of the advice here, this is blindingly obvious to some people. But it was news to us, so we are sharing it in case it is helpful to others.
Organising and storing the data
As is good practice, we started with a Data Management Plan. In fact we ended up having to write two: one for the qualitative and one for the quantitative aspect of the project.
We have also had to think through where the data is being stored and backed up. All of the collected data is currently held on a shared Cambridge University Google Drive where only invited users with a Cambridge University email address can view it. This is because it can handle Level 2 confidential information and is accessible both on and off campus. Some of the data is confidential and publishers have asked us to keep it private.
The data is also stored in the Documents folder of a staff member's password-protected laptop, which is backed up by the Library daily. A second copy is kept on the Office of Scholarly Communication's (OSC) Shared Drive, so that there are two backups in two different locations.
One dataset has proven difficult to use: at 48MB, it is larger than Google Drive seems able to handle comfortably.
Each dataset was renamed using the OSC's file naming convention: a three-letter prefix (e.g. RAW for raw data), a short description, a version number, and the date the data was received, with each section separated by underscores and no spaces. An example is MEM_JCSBlogData_V1_20180618.docx
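For anyone wanting to apply a similar convention programmatically, here is a minimal sketch in Python. The function and pattern names are our own invention, not part of any OSC tooling, but the format they produce matches the example above:

```python
import re
from datetime import date

def make_filename(prefix, description, version, received, ext):
    """Build a file name in the PREFIX_Description_Vn_YYYYMMDD.ext pattern."""
    return f"{prefix}_{description}_V{version}_{received:%Y%m%d}.{ext}"

# Reproduces the example name from the post
name = make_filename("MEM", "JCSBlogData", 1, date(2018, 6, 18), "docx")
print(name)  # MEM_JCSBlogData_V1_20180618.docx

# A simple check that an existing name follows the convention:
# three capital letters, a description, a version, an eight-digit date
PATTERN = re.compile(r"^[A-Z]{3}_[A-Za-z0-9]+_V\d+_\d{8}\.\w+$")
print(bool(PATTERN.match(name)))  # True
```

A check like `PATTERN` is handy for spotting files that have drifted from the convention before they multiply.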
To organise and summarise the metadata, we have created two spreadsheets. One is a logbook that records the name of the file, a description of the data, size of the file, if it is confidential, and what years it covers. The second spreadsheet records what information each dataset covered, i.e. Peer Review, Editing, Citing, APCs, and Usage. The spreadsheet also records correspondence with the publishers.
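The logbook itself can live in any spreadsheet tool, but it is easy to generate or maintain as a plain CSV file too. The sketch below uses Python's standard `csv` module; the column names and the sample row are hypothetical, simply mirroring the fields listed above:

```python
import csv

# Hypothetical column names, mirroring the logbook fields described above
FIELDS = ["file_name", "description", "size_mb", "confidential", "years_covered"]

rows = [
    {"file_name": "MEM_JCSBlogData_V1_20180618.docx",
     "description": "Example dataset entry",
     "size_mb": 1.2,
     "confidential": "yes",
     "years_covered": "2015-2017"},
]

# Write the logbook out as a CSV with a single header row
with open("logbook.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Keeping the logbook as CSV also makes it trivial to read back into an analysis script later.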
Assessing our data
At first glance, we were unsure whether we could do cross comparisons between publishers with the data that we had collected. Although most datasets were provided in Excel (the exceptions being the Springer 2017 report on gold open access and eLife), they were formatted differently and covered different areas.
Dr Laurent Gatto, one of Cambridge’s Data Champions, very kindly agreed to meet with us and look over the data that we had collected so far. He suggested a number of ways that we could clean up the data so that we could do some cross comparison analysis. Somewhat to our surprise he was generally positive about the quality and analysability of the data that we had collected.
Cleaning up data for analysis
After an initial look at the data, Laurent gave us some excellent suggestions on managing and analysing the data. These are summarised below:
- Have a separate folder where the original datasets will be saved. These files will remain untouched.
- When doing any re-formatting, create a new file using the same naming convention but with an updated version number. Any changes to the dataset should be recorded in a spreadsheet.
- Ensure that all of the headers are uniform across the different spreadsheets, to allow analysis across datasets. Each header must match exactly, down to the case of every letter, and cannot include any spaces.
- Dates must also be uniform, using the Year-Month-Day format.
- Only the first row of a spreadsheet can include the header. Having more than one row with header information will cause problems when you are starting to code.
- Create a readme file where every header will be recorded with a short description.
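The header and date rules above lend themselves to small helper functions. Here is a minimal sketch of how we might normalise them in Python; the example column names and the assumed source date format (day/month/year) are illustrative, not taken from any particular publisher's file:

```python
from datetime import datetime

def normalise_header(name):
    """Lowercase a column header and replace spaces with underscores,
    so headers match exactly across spreadsheets."""
    return name.strip().lower().replace(" ", "_")

def normalise_date(value, source_format="%d/%m/%Y"):
    """Convert a date string from an assumed source format to ISO
    Year-Month-Day, as recommended above."""
    return datetime.strptime(value, source_format).strftime("%Y-%m-%d")

# Hypothetical headers as they might appear in a publisher's spreadsheet
headers = ["Journal Title", "APC Paid", "Date of Payment"]
print([normalise_header(h) for h in headers])
# ['journal_title', 'apc_paid', 'date_of_payment']

print(normalise_date("18/06/2018"))  # 2018-06-18
```

Running every incoming spreadsheet's headers and date columns through helpers like these keeps the cleaned versions consistent without touching the untouched originals.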
After speaking with Laurent we are more optimistic about the data we have collected than we were before. We had been concerned that there was not enough information to do analysis across publishers; we are now more confident that this is not the case. Starting the analysis will also give us a better understanding of what data is missing.
We will provide an update as we close in on our findings.