Archiving webpages – securing the digital discourse

We are having discussions around Cambridge about the research activity that occurs through social media. These digital conversations are the ephemera of the 21st century, the equivalent of the Darwin Manuscripts that the University has spent considerable energy preserving and digitising. However, we are not currently archiving or preserving this material.

As a starting point, we are sharing here some of the insights Dr Marta Teperek gained from attending the DPTP workshop on Web Archiving on 12 May 2015, led by Ed Pinsent and Peter Webster.

Digital dissemination

Increasingly, researchers are realising that online resources are important for disseminating their findings – the subject of our recent blog ‘What is ‘research impact’ in an interconnected world?’. It is common to use blogs and Twitter to share discoveries.

Some researchers even have dedicated websites to publish information about their research. In the era of Open Science, webpages are also used to share research data, especially by programmers, who often use them as powerful tools for providing rich metadata descriptions of their software. It is not uncommon to include a link to a webpage in publications as the source of additional information supporting a paper. In these cases, other researchers need to be able to cite the webpage as it was at the time of publication. This ensures the content is stable – be it information, a dataset, or a piece of software.

The question arises then about preventing ‘linkrot’ and preserving webpages – to ensure the content of a webpage is still going to be accessible (and unaltered) in several years’ time.

What does it mean to archive a webpage?

Archiving means preserving an exact copy of a webpage, as it is at a given moment in time. The most commonly used format for webpage archives is the .warc file. These files contain all the information about the page: its content, layout, structure, interactivity etc. They can be easily re-played to re-create the exact content of the archived webpage, as it was at the time of recording. These .warc files can be shared with colleagues or with the public by various means, for example, by preserving a copy in a data repository.
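To give a feel for what sits inside a .warc file, here is a minimal sketch in Python that builds a single uncompressed WARC/1.0 record by hand. It is an illustration only, not the output of any tool mentioned in this post: the function name `build_warc_record` and the example URL are our own assumptions, and real archives usually store full HTTP responses and are often compressed.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Build one uncompressed WARC/1.0 'resource' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line separates the headers from the content block,
    # and two CRLFs close the record.
    return head + CRLF + payload + CRLF + CRLF


# Example page content and URL are placeholders, not a real capture.
record = build_warc_record("http://example.org/", b"<html><body>Hello</body></html>")
print(record.decode("utf-8"))
```

The point of the sketch is that each record carries the captured content together with metadata saying what was captured and when, which is what makes a later replay possible.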

The right to archive

One of the most interesting topics emerging from almost every talk was who has the right to archive a webpage. The answer would seem simple – the webpage creator. However, webpages often contain information with reference to, or with input from various external resources. Most pages nowadays have feeds from Twitter, allow comments from external users, or have discussion fora. Does the website creator have the rights to archive all these?

In general, anyone can archive the page. Problems start if there are intentions to make the archive available to others – which is typically the driver for archiving the page in the first place. In theory, in order to disseminate the archived page, the archiver should ask all copyright owners of the content of that page for their consent. However, obtaining consent from all copyright owners might be impossible – imagine trying to approach authors of every single tweet on a given webpage.

The recommendation is that people should obtain consent for all elements of the webpage for which it is reasonably possible to get the consent. When making the archive available, there should also be a statement that the best effort was made to obtain consent from all copyright owners. It is good practice to ask any webpage contributors to sign a consent form for archiving and sharing of their contributed content.

Alternative approach to copyright

Some websites have decided to take an alternative approach to dealing with copyright. The Internet Archive simply archives everything, without worrying about copyright. Instead, they have a takedown policy if someone asks them to remove the shared archive. As a consequence of their approach, they are currently the biggest website archive in the world, which as of August 2014 used 50 petabytes of storage.

Anyone can archive their websites on the Internet Archive: create an account, enter the URL of the webpage to be archived, and click a button to archive the page. The archive will then be created and shared.

The workshop inspired us at Cambridge to archive the data website, which is now available on the Internet Archive. Snapshots from each of the archiving events can be easily replayed by simply clicking on them.
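For those who prefer links to buttons, the sketch below builds the kinds of URLs that the Internet Archive's Wayback Machine uses for saving a page, replaying a snapshot and looking up the closest snapshot. The URL patterns reflect our understanding of the public service and may change; the helper names, the example page and the timestamp are illustrative assumptions. No network request is made.

```python
from typing import Optional
from urllib.parse import urlencode

WAYBACK = "https://web.archive.org"


def save_url(page_url: str) -> str:
    """URL that asks the Wayback Machine to capture page_url (assumed pattern)."""
    return f"{WAYBACK}/save/{page_url}"


def snapshot_url(page_url: str, timestamp: str) -> str:
    """URL of the snapshot closest to timestamp (format YYYYMMDDhhmmss)."""
    return f"{WAYBACK}/web/{timestamp}/{page_url}"


def availability_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Query URL for the availability lookup, which reports the closest snapshot."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


# Placeholder page and date, chosen only for illustration.
print(save_url("http://example.org/"))
print(snapshot_url("http://example.org/", "20150512000000"))
print(availability_query("http://example.org/", "20150512"))
```

Citing the snapshot URL rather than the live address is what lets a reader see the page as it was at the time of publication.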

Can a non-specialist archive the website?

But what if you would like to archive a website yourself – store and share it on your conditions, perhaps using a data repository? Various options for website preservation were discussed during the workshop.

For a non-specialist, the best option is one which does not require any specialist knowledge or specialist software installation. A startup company called WebRecorder have created a website which allows anyone to easily archive any page. There is no need to create an account. The user can simply copy the URL of the page to be archived and press ‘record’. This will generate a .warc file of the website. The disadvantage is that this needs to be done for every page of the website separately. WebRecorder allows free downloads of .warc files – the files can be downloaded and archived/shared however the user chooses.

If anybody then wants to re-run the website from a .warc file, there are plenty of free software options available to re-play the webpage. Again, an easy solution for non-specialists is to go to WebRecorder. WebRecorder allows one to upload a .warc file and will then replay the webpage with a single click on the ‘Replay’ button.
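Before replaying a .warc file, it can be useful to check what it contains. The following is a rough sketch, assuming an uncompressed WARC file, of how the records could be listed using only the Python standard library. The function `iter_warc_records` and the sample record are hypothetical; for real (often gzip-compressed) archives a dedicated WARC library or replay tool is the safer choice.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length says exactly how many bytes of content follow.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A made-up, single-record archive used only to exercise the parser.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Date: 2015-05-12T00:00:00Z\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"<p>Hello</p>"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Date"], headers["WARC-Target-URI"], len(payload), "bytes")
```

To inspect a file on disk, the same function could be given `open("page.warc", "rb")` instead of the in-memory sample, where `page.warc` stands for whatever file was downloaded.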

A bouquet for the DPTP workshop

This was an excellent and extremely efficient one-day workshop, due to its dynamic organisation. The workshop was broken down into six main parts, and each of these parts consisted of several very short (usually 10 minutes long) presentations and case studies directly related to the subject (no time to drift away!). After every short talk there was time for questions. Furthermore, there were breaks between the main parts of the workshop to allow focused discussions on the subject. This dynamic organisation ensured that every question was addressed, and that all issues were thematically grouped – which in turn helped deliver powerful take-home messages from each section.

Furthermore, the speakers (who, by the way, had expert knowledge of the subject) did not recommend any particular solutions, but instead reviewed the types of solutions available, discussing their major advantages and disadvantages. This provided the attendees with enough guidance to make informed decisions about the solutions most appropriate to their particular situations.

What also greatly contributed to the success of the workshop was the diverse background of attendees: from librarians and other research data managers, to researchers, museum website curators, and European Union projects’ archivists. All these people had different approaches to, and different needs for, web archiving. Perhaps this is why the breakout sessions were so valuable and deeply insightful.

Published 3 October 2015
Written by Dr Marta Teperek and Dr Danny Kingsley
Creative Commons License