Magazines > Computers in Libraries > December 2011

Vol. 31 No. 10 — December 2011
FEATURE
Web Archiving at the Library of Congress
by Abbie Grotke

In the 20 years since the World Wide Web was invented, it has become woven into everyday life—a platform by which a huge number of individuals and more traditional publishers distribute information and communicate with one another around the world. While the size of the web can be difficult to articulate, it is generally understood that it is large and ever-changing and that content is added and removed continually. With so much global cultural heritage being documented online, librarians, archivists, and others are increasingly becoming aware of the need to preserve this resource for future generations.

The record of the web began only about 15 years ago. In 1996, Brewster Kahle founded Alexa Internet, a for-profit web crawling company, and the Internet Archive, a nonprofit digital library, and began archiving the internet.

The results of the web crawls performed by Alexa were donated to the Internet Archive’s web collection, and so began an effort to document the web and preserve it for future researchers. The crawler made snapshots of as much of the web as it could find, downloaded the content, and eventually made it accessible via a tool called the Wayback Machine. In the early days, Alexa provided the Internet Archive with bimonthly snapshots that were embargoed for 6 months before becoming public. In 2009, it began providing incremental submissions as content was crawled, though still under the same 6-month embargo. Sites were visited again, and new versions were stored so that comparisons of a particular site were possible and the changes were documented.

About the same time that the Internet Archive began crawling the web, national libraries and archives around the world also began to see the importance of preserving this global resource. Many countries even sought changes in their laws to allow them (or mandate them) to collect and preserve the output of their nation’s creativity, such as the Legal Deposit of Published Materials Act that went into force on July 1, 2005, in Denmark.

Often the question is asked, if the Internet Archive is crawling the web, why are others archiving at all? The answer is simple: No one institution can collect an archival replica of the whole web at the frequency and depth needed to reflect the true evolution of society, government, and culture online. A hybrid approach is needed to ensure that a representative sample of the web is preserved, combining the complementary approaches of broad crawls such as those the Internet Archive performs with deep, curated collections by theme or by site, tackled by other cultural heritage organizations.

Archiving at the Library of Congress

In 2000, the U.S. Congress established the Library of Congress’ National Digital Information Infrastructure and Preservation Program (otherwise known as NDIIPP) to develop a national strategy to collect, preserve, and make available significant digital content—especially information that was created in digital form only—for current and future generations.

While NDIIPP was building a network of digital preservation partners around the country, there was also work happening internally to address the issue of preservation and access to born-digital content. There was a recognition that a lot of material normally collected in print form by the Library of Congress (LC) was appearing only on the web.

So that same year, the LC established a pilot web archiving project, originally called MINERVA (Mapping the INternet Electronic Resources Virtual Archive); today, it is simply called The Library of Congress Web Archives. A team was formed to study methods to evaluate, select, collect, catalog, provide access to, and preserve this content for future generations. The LC partnered with the Internet Archive to test the waters by archiving websites related to the presidential election in 2000. As staff was planning the Election 2002 archive, 9/11 happened, and the project team was thrust into archiving the events and reactions unfolding on the web on that tragic day and in the months after. The pilot project quickly became a program as staff had to grapple with the reality of selection, collection, and preservation of this important content while acting quickly to archive content that was rapidly changing and disappearing in the aftermath of the terrorist attacks. Partnering with the Internet Archive and others enabled the collection of more than 30,000 websites during that time.

Challenges in Web Archiving

While the LC has made much progress with its web archiving program, it still faces a number of social, technological, and legal challenges.

What to preserve. One of the biggest challenges is in determining exactly what to collect. With such a mass of data being produced, selection and curation are an important aspect of the work that librarians and archivists do when it comes to archiving the internet.

There are a few distinct approaches to web archiving: bulk or domain harvesting, and selective, thematic, or event-based harvesting.

An example of bulk archiving is the approach that the Internet Archive takes, trying to archive as much of the public web as possible. Some libraries focus on domain crawling—archiving an entire domain such as that of a particular country. For instance, Iceland’s and France’s country domains are easily identified (.is and .fr are fairly good indicators), and those domains are small enough to crawl in full once or twice a year. However, identifying what makes up the “U.S. web” is nearly impossible. There is, unfortunately, no master list of all of the websites generated by content creators in this country, and harvesting all websites hosted in the U.S. would not be satisfactory either, since where a site is hosted says little about who created it.

Taking this into account, along with limited staff resources and, particularly, the legal challenges outlined below, the LC has no choice but to take a highly selective approach. The selection of sites is not something that the LC automates; recommending officers (ROs) do this work.

Recommending materials for addition to the LC’s collections is the responsibility of ROs in the area, subject, and format divisions of Library Services and the Law Library. Not only do they recommend analog materials for the LC’s collections, these subject experts, curators, and reference librarians are increasingly tasked with selecting born-digital content for the LC’s collections. Some suggest ideas for themes and events to document, and many select the specific sites that end up in our web archives.

Since the LC’s pilot program began in 2000, more than 250 terabytes of content have been archived in almost 40 event and thematic collections. The LC’s collection strengths are in government, public policy, and law: The LC archives U.S. national elections; House, Senate, and congressional committee websites; changes in the Supreme Court; and “blawgs” (aka law blogs). The LC has worked with partners to archive web content related to “spontaneous events”—events that the LC can’t plan for but must react quickly to preserve: 9/11, Hurricane Katrina, and the earthquake in Haiti are among those collections, as is content being generated by participants in the recent events in the Middle East. The LC’s web archives also include collections that support special collection divisions—the Manuscript division is archiving sites related to the organizations for which it holds the physical papers. Our Prints and Photographs division has archived websites of photographers, cartoonists, and architectural firms. And the Music division has just started to archive the websites of organizations and institutions with content that relates to the LC’s Performing Arts collections.

In recent years LC staff in our overseas offices in Egypt, Brazil, Indonesia, India, and Pakistan have selected born-digital content documenting elections and other events and topics of interest. Collecting digital materials is a natural extension of the collection of ephemera, books, and other physical objects from these regions of the world.

Technological challenges. Even with a selective approach, the work is pretty massive—collecting about 4–5 terabytes of data a month requires around-the-clock technical expertise. While the LC has some capability for crawling in-house, the scale of activity requires that most of our archiving be done with the help of a contractor.

In 11 years of archiving, there have been a lot of changes in the web, in the archiving tools, and in the community that creates and preserves born-digital content.

The LC uses the Heritrix web crawler—an open source crawler developed by the Internet Archive and national libraries and archives. For access, an open source version of the Wayback Machine is used. The contents of our archives are stored in an ISO standard format called WARC.
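The WARC format mentioned above stores each crawled resource as a self-describing record. The sketch below is illustrative only—it is not LC or Internet Archive tooling, and the sample record is invented—but the record layout (a version line, name:value headers, a blank line, then a payload of exactly Content-Length bytes) follows the WARC standard:

```python
# Illustrative sketch of the WARC record layout -- not the Library of
# Congress's actual tooling.  A WARC file is a sequence of records:
# a version line, name:value headers, a blank line, then a payload of
# exactly Content-Length bytes, followed by two blank lines.

SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.example.com/\r\n"
    b"WARC-Date: 2011-12-01T00:00:00Z\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, web!\r\n"
    b"\r\n\r\n"
)

def read_record(buf, offset=0):
    """Parse one WARC record starting at offset.

    Returns (version, headers, payload, next_offset)."""
    header_end = buf.index(b"\r\n\r\n", offset)
    lines = buf[offset:header_end].split(b"\r\n")
    version = lines[0].decode("ascii")          # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.decode("ascii").partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    body_start = header_end + 4                 # skip the blank line
    payload = buf[body_start:body_start + length]
    # Records are separated by two CRLF pairs after the payload.
    return version, headers, payload, body_start + length + 4

version, headers, payload, _ = read_record(SAMPLE)
print(version, headers["WARC-Target-URI"], len(payload))
```

Because every record carries its target URI and capture date, a single WARC file can hold many snapshots of many sites while remaining independently replayable.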

The LC’s goal is to create an archival copy—essentially a snapshot—of how a site appeared at a particular point in time. We capture websites multiple times—weekly, monthly, or once or twice a year, depending on the site and the event or subject area. We archive as much of the site as possible, including HTML, images, Flash, PDFs, and audio and video files, to provide context for future researchers.

However, there are some limits to what can be archived. The crawler technology is usually a few steps behind the technology of the current web. Heritrix is currently unable to archive streaming media or “deep web” and database content that requires user input.

Website owners are likely not thinking about preservation when they are creating their sites. There will always be some websites that take advantage of emerging or unusual technologies that the crawler cannot anticipate. When we began this program, there was no YouTube, Twitter, Facebook, or Flickr. Social media present obvious challenges as new tools and services are developed and used by more and more content producers, and they are often an important piece of the web presence of the organizations or individuals we are trying to archive. For example, when archiving the websites of our congressional members or committees, we want to ensure that we are also preserving content they are generating and posting on third-party websites. This requires identification of that content so we can instruct our crawlers to go get it. We are also dealing with the challenges of the latest, greatest social media tools that web producers latch on to.

Additional technical challenges relate to the transfer and management of content. The web archive data that our contractor collects for us (roughly 5 terabytes per month) must be transferred from the West Coast, where our contractor is located, to the Library of Congress. This turns out to be nontrivial: It may take the better part of a month of near-constant transfers over an Internet2 connection to move 10 terabytes of data.
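A back-of-envelope calculation shows why transfers at this scale take weeks. The sustained-throughput figure below is an assumption for illustration, not a measured Library of Congress number:

```python
# Back-of-envelope check on the transfer times described above.  The
# 50 Mbit/s sustained rate is an assumed figure for illustration, not
# a measured Library of Congress number.

TB = 10 ** 12                      # terabyte, in bytes (decimal)

data_bytes = 10 * TB               # a two-month backlog at ~5 TB/month
throughput_mbps = 50               # assumed sustained rate, Mbit/s

seconds = data_bytes * 8 / (throughput_mbps * 10 ** 6)
days = seconds / 86400
print(f"{days:.0f} days")          # roughly 19 days at 50 Mbit/s
```

Even generous link capacity is dominated by the sustained rate actually achieved end to end, which is why near-constant transfers for the better part of a month are plausible.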

Of course, transfer is just the initial stage in our management of the web archive data once it arrives at the Library of Congress; we must also account for redundant storage on tape and/or spinning disk, internal network bandwidth, and processor cycles for copying, indexing, and validation. All are important to consider when managing large sets of data.

Copyright challenges. There also are copyright challenges to web archiving. While some countries have a legal mandate or legislation that allows libraries to make preservation copies of websites without permission, there is currently no mandatory legal deposit process in place under Section 407 of the U.S. Copyright Act that addresses Library of Congress web archiving. While Section 108 of the Copyright Act provides library exceptions for all libraries, it doesn’t address digital preservation and web archiving, although the Section 108 Study Group recommended this option. And while Section 107 on fair use might cover web archiving, this is not yet established.

So for all but U.S. government sites, the LC does quite a bit of work to identify site owners. This can take anywhere from a few minutes to a half hour per contact. The LC sends a notice telling site owners about our activity, and in some cases, the LC asks permission to archive or to provide access so that researchers outside the buildings in Washington, D.C., can view archived content through the LC’s website. In general, site owners respond positively when they do respond. The challenge comes when they don’t respond at all, which is unfortunately more common than not: Some content is not preserved at all, and other sites have limited access for research purposes.

There are varying approaches when it comes to permissions and web archiving, abroad and in the U.S. Other institutions may choose not to seek permissions, depending on their own legal assessments of how to handle the capture of web content.

Collaboration Is Key

With all these challenges, it is clear that one institution cannot archive the web alone. That is why the Library of Congress collaborates with partner libraries, archives, and other organizations in the U.S. and around the globe.

In July 2003, the LC joined with 10 other national libraries and the Internet Archive and formed the International Internet Preservation Consortium (IIPC), acknowledging the importance of international collaboration for the preservation of internet content. The goals of the consortium include collecting a rich body of internet content from around the world and fostering the development and use of common tools, techniques, and standards that enable the creation of international archives.

Today, the consortium has almost 40 member organizations that meet regularly, virtually and in person, to discuss and solve issues related to access, harvesting, and preservation. IIPC members collaborate on collections as well, such as building an archive of Olympics-related websites or, in the case of U.S. members, archiving the entire U.S. government domain during the transition from the Bush administration to the Obama administration.

The LC also partners and aligns with a variety of organizations in the U.S. through the National Digital Stewardship Alliance (NDSA). The NDSA was launched in July 2010 as an initiative of the NDIIPP program; it is a vibrant network of digital preservation experts from universities, consortia, professional societies, commercial businesses, government agencies, and more. The NDSA is open to any organization that has demonstrated a commitment to digital preservation and that shares the stated goals of the alliance.

What Does the Future for Web Archiving Hold?

Recent years have seen an explosion of the number of institutions involved in or beginning to think about web archiving. Many NDSA members, as well as other universities, historical societies, and state and local governments, have recognized the need for and importance of preserving a variety of web content for future generations. We continue to explore collaborations with partners as the demands for already stretched resources increase.

Through the NDIIPP program and the NDSA, the LC has explored the preservation of born-digital news content, including citizen journalism, and is currently working on the development of a blog preservation plug-in that would enable content creators to easily opt in for preservation.

IIPC members are currently engaged in a number of exciting projects: launching a worldwide education and training program that will feature technical and curatorial workshops and staff exchanges; planning an international collaborative collection project around the 2012 Summer Olympics; publishing information about the preservation of web archives in many institutional contexts; and establishing a technical program to fund exploratory projects and report on new techniques and tools to archive the fast-changing web. A new IIPC project in which the Library of Congress is actively engaged works with the Memento protocol to enable federated, seamless access to past versions of (or at least metadata about) websites collected by IIPC members, through the MementoFox add-on for Firefox.
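In the Memento protocol, a client asks for "the page as it was on this date" by sending an Accept-Datetime header to a TimeGate, which redirects it to the closest archived snapshot. The sketch below only constructs that header (no request is sent), using the RFC 1123 date format the protocol expects:

```python
# Sketch of how a Memento client expresses "the page as of this date":
# it sends an Accept-Datetime header (RFC 1123 date format) to a
# TimeGate service, which redirects to the closest archived snapshot.
# Header construction only -- no request is sent here.
from datetime import datetime

def accept_datetime(dt):
    """Format a datetime (treated as UTC) for the Accept-Datetime header."""
    return dt.strftime("%a, %d %b %Y %H:%M:%S GMT")

headers = {"Accept-Datetime": accept_datetime(datetime(2011, 12, 1))}
print(headers["Accept-Datetime"])  # Thu, 01 Dec 2011 00:00:00 GMT
```

Because the header is ordinary HTTP content negotiation, any archive exposing a TimeGate can answer it, which is what makes federated access across IIPC members' archives possible.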

There’s also a growing trend in personal archiving. Nonspecialists are learning how to archive their own digital output—websites, blogs, photos, and more—as a digital legacy of their own family history for future generations.

While some think that the web is already dead, web archivists believe they still have plenty to do. One of the exciting parts about this activity is that web archivists are never quite sure where this road will lead.

Interested in preserving content on the web but not sure where to start?

Here are some of the things you may want to consider:

1. Determine resources that are available for web archiving. What types of staff are available? Curators are needed to pick content, but technical expertise is necessary for much of the work. Will staff work full-time or part-time? This will determine how much can be accomplished.

2. Determine an approach for capturing content. Many organizations that are just getting started decide to outsource some aspect of the work, unless there are technically savvy staff members available to manage the crawls and an infrastructure in place to store large quantities of data. Outsourcing or collaboration on projects allows organizations to gain experience and learn more while setting up internal infrastructure to manage in-house web archiving projects.

3. Identify tools that can help with selection or workflow management. There are a number of curator tools and services available to help manage different processes such as nomination of URLs, permissions, crawling, quality review, and description. The IIPC (www.netpreserve.org) is a great resource for learning more about what other organizations are using for different workflows.

4. Examine existing selection policies. Web archiving may not be covered by your organization’s existing policies. Selection is key and should make sense for your organization. Decide what themes, subjects, or types of sites you’ll archive. This can help focus the activity, which is particularly important if resources are limited.

5. Know your rights. What permissions are asked for and what you can archive depend a lot on the policies at a given organization. Include lawyers on your project team, and think about access rights as well as permissions that may be required for crawling content. Get familiar with robots.txt (www.robotstxt.org)—having the crawler obey robots.txt exclusions (or not) is an organizational decision that will affect the results of your crawl.

6. Monitor and conduct quality reviews. Web archiving is a fluid process. URLs change and disappear, web technology gets more complicated, and policies change over time. It’s important to re-evaluate what’s being collected over time to make sure that it is still in scope for your project and that the crawler is archiving as much of a given site as possible.

7. Consider access. Researcher access sometimes is an afterthought in the frenzy of just trying to capture the content before it goes away. But if you’re starting up a new program, think about how researchers will access your web archives. Will the archived sites be cataloged? Will users search or browse for sites? How will the web archives be integrated (or not) with your existing digital collections?
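The robots.txt check in point 5 can be sketched with Python's standard library; the rules and URLs below are hypothetical examples, and a production crawler such as Heritrix has its own configurable robots handling:

```python
# Sketch of a robots.txt check using Python's standard library.
# The rules, user-agent name, and URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("my-archive-crawler", "http://example.org/index.html"))  # True
print(rp.can_fetch("my-archive-crawler", "http://example.org/private/a"))   # False
```

Whether an archiving crawler should honor such exclusions is the organizational decision noted above: obeying them respects site owners' wishes, while ignoring them (where legally permitted) captures a more complete record.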

Abbie Grotke (abgr@loc.gov) is the web archiving team lead in the Office of Strategic Initiatives, Library of Congress. She has been involved in digital initiatives at the Library of Congress for more than 13 years, initially as a digital conversion specialist with the American Memory program. Since 2002 she has been involved in the LC’s web archiving activities, and she currently manages various web archive collection activities and projects. She is also co-chair of the National Digital Stewardship Alliance Content Working Group.
