Web Archiving and Government Information

Previously we explored some of the nuances of government information, including what it is and how it is made available. Now we will discuss web archiving as an approach to preservation and look at some of the limitations, particularly those relevant to government information.

Web Archiving: What is it? What can it do? 

The vast majority of public information is initially disseminated through official websites and other online sources. A variety of technologies can be used to capture content from the web (“right click, save as” is a great example!). At the PEGI Project, we focus on web archiving, also called web harvesting or crawling, as a current solution for capturing content from a website and keeping it available after that website has changed or is no longer in service. In this post, we will describe the current web archiving landscape and then discuss its limitations.

According to the International Internet Preservation Consortium (IIPC), “Web archiving is the process of collecting portions of the World Wide Web, preserving the collections in an archival format, and then serving the archives for access and use.” A number of organizations build and maintain collaborative tools and services to do this work. One of the most familiar examples of web archiving is the Internet Archive’s Wayback Machine.

The web archiving workflow starts with a “seed” URL as input, and the crawler captures the page and any associated files (images, PDFs, etc.) following rules that are set in the tool. The crawl is further shaped by scoping rules that address the different types of files that appear on a website, and by logic about which links to follow and how many pages deep to go. The final output is a Web ARChive file, or WARC.
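To make the output of that workflow concrete, here is a rough sketch (in Python, using the open-source warcio library) of fetching a single, hypothetical seed URL and writing the response as a WARC record. A production crawler such as Heritrix layers scoping, link-following, deduplication, and politeness rules on top of this basic capture-and-package step.

```python
# A minimal capture sketch, assuming the warcio and requests libraries.
# The seed URL is hypothetical; a real crawl would follow links and scope rules.
import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

SEED = 'https://www.example.gov/reports/index.html'  # hypothetical seed URL

with open('seed-capture.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # Fetch the seed without transparent decompression so the raw bytes are kept.
    resp = requests.get(SEED, headers={'Accept-Encoding': 'identity'}, stream=True)

    # Preserve the original HTTP response headers alongside the payload.
    http_headers = StatusAndHeaders('200 OK', resp.raw.headers.items(),
                                    protocol='HTTP/1.1')

    record = writer.create_warc_record(SEED, 'response',
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
```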

There’s also a lot of human intervention in the process of collecting the web. Humans have to “tell” the harvester what to collect, and after the harvesting is done, humans have to do “quality assurance” to make sure the harvester collected what the humans wanted it to collect.

Memory institutions rely on web archiving as a primary means of preserving web-based content. These efforts provide excellent starting points for long-term preservation – after all, collecting is a key part of any preservation effort. We focus on web archiving because it is in widespread use and has significant advantages in terms of scale and reliability for preserving electronic government (and other) information. It also has drawbacks, some of which can be addressed through other means. 

Government information context 

The internet has become the publishing medium of choice for government entities, and libraries have therefore found themselves in a quandary about how to continue building their collections of government publications. Web archiving is the de facto tool for libraries attempting to collect information of importance to their communities. Libraries and government agencies alike have begun to use the Internet Archive’s Archive-It service -- including GPO’s FDLP Web Archive. Since 2008, the End of Term (EOT) Web Archive project has sought to collect a significant and near-comprehensive snapshot of the federal .gov/.mil domain at four-year intervals.

Web archiving can be conducted in tandem with item-by-item collection development practices. Because the majority of content on the web is designed for point-in-time access rather than long-term preservation, collecting and preserving individual files separately often results in a loss of highly relevant context. For example, while a single PDF report made available as a surrogate for a print report might include the publisher and date of publication, a report made available as a series of PDF sections might lack this information in each section, relying instead on a web-based table of contents for navigation. Web archiving captures the context along with the content as a point-in-time snapshot, meaning that it is possible to “browse” the content as it originally appeared. However, that same archived website may be very difficult to discover using current methods of access and delivery.

Capture

Whether information is made public as a single report in PDF format, a series of interconnected static pages, an audiovisual production, or within a database or interactive application, it is initially published in digital form. Programmatic web archiving can collect a great deal of content and package it for future use. Relatively simple websites and discrete file formats (PDF, MP4, etc.) can be effectively captured and recaptured over time, and users can then display the site as of the date of any capture – many will be familiar with this through the Internet Archive’s Wayback Machine. The captured and packaged content can be preserved like other digital objects using known processes and systems. However, as sites to be archived become more complex and dynamic, this programmatic archiving begins to break down because it is technically difficult to capture, preserve, and display the underlying software that drives the site. Interactive, data-driven sites like data.census.gov or the USGS National Map are particularly difficult to capture in an automated way and difficult to “play back” to future users.
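As a small illustration of that “display the site from the date of any capture” pattern, the sketch below queries the Wayback Machine’s public availability API for the capture of a page closest to a chosen date and prints its playback URL; the URL and date are only illustrative.

```python
# A minimal playback-lookup sketch, assuming the requests library.
# The URL and date are illustrative; the availability API returns the capture
# closest to the requested timestamp, if one exists.
import requests

resp = requests.get('https://archive.org/wayback/available',
                    params={'url': 'www.example.gov', 'timestamp': '20170101'})
closest = resp.json().get('archived_snapshots', {}).get('closest')

if closest and closest.get('available'):
    print('Capture from', closest['timestamp'], 'can be replayed at', closest['url'])
else:
    print('No capture found for that URL.')
```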

Some types of content benefit significantly from being presented in the same form in which it was distributed, but this is only one access use case. A website is designed to be viewed in a browser, and extracting its text and visual content may not be sufficient to make the information easily interpretable and usable. For example, printing a website with dynamic links onto paper results in a loss of information about the relationships the links represent, even if the text markup is included. For situations like this, web archiving tools such as Heritrix, Brozzler, or Conifer (formerly known as Webrecorder) may be used to capture and replay the information.

Access

Access to web-archived content is the most problematic aspect of web archiving. Access to web archives happens primarily through Wayback-style playback and bulk data analysis. Bulk analysis works very well for one specific use case: machine-assisted corpus analysis. For example, the Archives Unleashed program teaches historians and others the tools needed to access and analyze web archives in bulk and via software.
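As a rough sketch of what bulk, software-mediated access looks like, the example below (using the warcio library and the hypothetical WARC file from the earlier capture sketch) walks the records in a WARC file and pulls out the URL and size of each HTML response. Toolkits such as the Archives Unleashed Toolkit perform this kind of extraction at much larger scale.

```python
# A minimal bulk-analysis sketch, assuming the warcio library.
# 'seed-capture.warc.gz' is the hypothetical file written in the capture sketch.
from warcio.archiveiterator import ArchiveIterator

with open('seed-capture.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Skip request, metadata, and other non-response records.
        if record.rec_type != 'response':
            continue
        content_type = record.http_headers.get_header('Content-Type', '')
        if 'text/html' not in content_type:
            continue
        uri = record.rec_headers.get_header('WARC-Target-URI')
        body = record.content_stream().read()
        print(uri, '-', len(body), 'bytes of HTML')
```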

Currently, there is no mature full-text search across web archive collections; this means that if a term is no longer on the live site (e.g., “climate change” on epa.gov) or if the site is no longer available (e.g., nclis.gov), users will not be able to find the content of an archived website through searching. Besides the Wayback Machine, MementoWeb offers a lookup across web archives, but it works only from known seed URLs. GPO is cataloging its seeds for the Catalog of Government Publications (CGP), but it catalogs only at the level of the seed and does not create records for documents hosted on the archived version of the website. There have been some research projects to extract specific file types or to connect various web archives -- notably Internet Archive Scholar, which extracts journal literature by DOI; the Internet Archive’s Military Industrial Powerpoint Complex collection; and the University of North Texas (UNT) project that extracted PDFs from the 2016 End of Term crawl. But all of these projects only highlight the access issues with web archives.
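To see how dependent this access model is on already knowing a URL, the sketch below asks the Wayback Machine’s public CDX API for captures of nclis.gov (the defunct site mentioned above) and prints a replayable Wayback URL for each; there is no comparable way to search the archived full text when the URL is unknown.

```python
# A minimal known-URL lookup sketch, assuming the requests library.
# The CDX API only answers questions about URLs the searcher already knows.
import requests

resp = requests.get('https://web.archive.org/cdx/search/cdx',
                    params={'url': 'www.nclis.gov', 'output': 'json', 'limit': 5})
rows = resp.json()

if rows:
    header, captures = rows[0], rows[1:]
    for capture in captures:
        fields = dict(zip(header, capture))
        # Each capture can be replayed at a timestamped Wayback Machine URL.
        print('https://web.archive.org/web/{timestamp}/{original}'.format(**fields))
```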

Access relies on the searcher’s knowledge of the Wayback Machine, serendipity, or searches for known titles. 

Conclusion

We have written previously about the importance of organizing resources as collections, a process sometimes referred to as curation. For curation to meaningfully provide long-term access, there are a number of essential intermediary steps. For example, organizing links to resources is not the same as curating a collection. Collecting for long-term access entails more than pointing to publications in a library catalog record. Preservation requires that resources be held in the custody of one or more trusted entities that can conduct preservation activities ensuring long-term stewardship and access. Once born-digital content is captured, there are steps that must be taken to preserve the content for future access and use. Pointing is not collecting, and collecting is not preserving, but collecting is the first step toward preserving. Web archiving is not the only step in preserving web content. It’s just one of the first steps – collecting content – in a wider digital curation and preservation program in which libraries would acquire, describe, preserve, and provide access to born-digital information.