This post is a revised version of a briefing I wrote for colleagues a few months ago. My research included asking members of the archives-nra mailing list whether and how their organisation was archiving its corporate website. It emerged that few organisations are actively doing this, but many archivists would like to know more. So I am putting this piece online to help others in need of a short intro and some pointers for further exploration. It’s written from a university perspective, but is probably relevant to many other public sector organisations.
What is web archiving?
‘Archiving’ is used here in the archival sense: preserving websites indefinitely as archives for use. Preserving websites raises profound questions around temporal coherence, authenticity, quality assurance, selection and appraisal, and how to justify particular costs. However, standards and methods have been developed in recent years. Web archivists work to the ISO 28500 standard, which specifies the WARC format (which allows multi-format materials to be aggregated into one archival file). The scale of websites means that web archiving is generally automated: sites are crawled/harvested at agreed intervals. There are three main methods: client-side, transaction-based and server-side archiving.
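To make the idea of an aggregated archival file more concrete, here is a minimal sketch in Python of what a single WARC record looks like: a version line, a handful of named header fields, a blank line, then the captured content. The function name `build_warc_record` and the example URLs are hypothetical, and this illustrates the record layout only; in practice a crawler or an established library writes these files for you.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type="text/html", captured_at=None):
    """Return one WARC 'resource' record as bytes (illustrative sketch only).

    captured_at should be a timezone-aware datetime; it defaults to now (UTC).
    """
    captured_at = (captured_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    header_lines = [
        "WARC/1.1",                                    # version line
        "WARC-Type: resource",                         # kind of record
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique identifier
        f"WARC-Date: {captured_at.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",              # where the content came from
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # size of the block in bytes
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the content block.
    return head + payload + b"\r\n\r\n"


# Records of different formats are simply written one after another,
# which is how one file can hold HTML, images, PDFs and so on.
pages = [
    ("https://www.example.ac.uk/", b"<html>Home</html>", "text/html"),
    ("https://www.example.ac.uk/logo.png", b"\x89PNG...", "image/png"),
]
archive_bytes = b"".join(build_warc_record(uri, body, ctype) for uri, body, ctype in pages)
```

The point of the sketch is that each capture carries its own date and source address alongside the content, which is what supports later questions about authenticity and what was online when.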
Why archive the corporate website?
Since the late 1990s universities have increasingly used their corporate website for sharing information, marketing and engagement. In many cases, pages and documents published on the site have taken over the functional role of the paper records and publications managed in a university archive. Web technology has also enabled new kinds of records which may in some cases merit permanent retention.
Archiving the website therefore allows an organisation to document key decisions and developments in support of its business activities.
It enables the university to meet legal and contractual requirements – for example, in the case of disputes with students, staff can use the archived website to prove what information was made available to them.
The website represents the way the university has presented itself to its audiences, which is a significant part of its history. It will include details of innovations and projects which may not otherwise be recorded.
What should be archived, when, and how often?
University websites typically publish documents which are part of the institutional records, such as meeting minutes. These can be managed using the organisation’s retention schedule and archived outside a web archiving process, following standard digital preservation techniques. It would be more usual to ingest such material into the archive service by transfer from the originating department, but in some circumstances (e.g. if the transfer was disrupted) the document published online could be taken instead.
‘Web pages’ cannot be managed in this way as web archiving is not a simple one-off procedure. It is an ongoing suite of processes requiring multiple choices, with no right or easy answers. Motivation, selection and methods are inter-connected and it is important to focus on what is significant and must be preserved.
The client-side crawler approach may be the best fit for university websites and for the preservation intent described above. Various companies offer ISO 28500 client-side web archiving services, e.g. Archive-It and Hanzo Preserve. Such firms base their subscription charges on variables such as the size and nature of the website and how often it is to be crawled.
Aren’t the Internet Archive doing this anyway? Yes, the Internet Archive crawls corporate websites. However, this may only cover parts of a site, and the organisation has no control over the process e.g. timing, content etc.
Can’t we use the Content Management System? No, CMS services focus on management and marketing and are unlikely to offer permanent preservation services to archival standards.
What about other aspects of university web presence? These should also be considered when making decisions about web archiving.
What about copyright? Web archiving raises serious issues around intellectual property. However, this should not pose a major problem for corporate websites, as the content will be copyright of the university. Third-party content shown on the site would have to be cleared for web archiving or redacted by the archiving process.
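On the Internet Archive question above, it is possible to get a rough sense of how much of a site has already been captured by asking the Wayback Machine's public availability service about individual pages. The following is a minimal Python sketch, assuming the endpoint and the JSON response shape as I understand them; the function names and example addresses are hypothetical, and a missing snapshot for one page says nothing definite about the rest of the site.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response):
    """Return the closest snapshot's details from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def check_pages(page_urls, timestamp=None):
    """Query the service for each page (needs network access)."""
    results = {}
    for page_url in page_urls:
        with urlopen(availability_url(page_url, timestamp), timeout=30) as reply:
            results[page_url] = closest_snapshot(json.load(reply))
    return results


if __name__ == "__main__":
    # Hypothetical pages to check, closest to 1 January 2020.
    pages = ["https://www.example.ac.uk/", "https://www.example.ac.uk/governance/"]
    for page, snapshot in check_pages(pages, "20200101").items():
        print(page, "->", snapshot["timestamp"] if snapshot else "no snapshot found")
```

Running this over a list of key pages (governance, prospectus, policies) gives a quick picture of gaps, which may help when making the case for an archiving process the organisation controls.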
British Standards Institution (2017). Information and documentation: WARC file format. ISO 28500:2017. BSI.
National Archives, The (2011). Web archiving guidance. http://www.nationalarchives.gov.uk/documents/information-management/web-archiving-guidance.pdf
Pennock, M. (2013). Web archiving. Digital Preservation Coalition. http://www.dpconline.org/docs/technology-watch-reports/865-dpctw13-01-pdf/file