Where Have All the Archives Gone?
by Barbara Quint
Newspaper publishers seem to have taken the hardest hit from the technological and economic challenges of publishing. Tales of venerable titles closing their doors after decades of service to their communities have a doomsday tone, less “end of an era” and more “end of life as we’ve known it.”
Major cities face existence without a daily newspaper. Even papers that stay in operation may only do so by making severe cuts in expenditures and staffing. If you don’t believe me, ask a news librarian, if you can find one still working.
Nevertheless, there’s still one marketplace that is charging subscription fees for online publications: the library market, where you can even get money for licensing archives of popular publications such as newspapers. So what are libraries getting for their precious and shrinking institutional dollars? The most interesting factor is what they are not getting, and all too often, that is a complete archive.
Newspaper Archives These Days
Of course, digital newspaper archives have never been complete. Traditional full-text services such as LexisNexis, Factiva, and Dialog have carried most of the editorial content but certainly not all of it. Sections including letters to the editor, births/deaths, or even police blotter reports have rarely appeared. The classified advertising sections that have made up most of the body weight for printed papers have never been searchable online, although ProQuest reports that the company’s cover-to-cover Historical Newspapers (HNP) collection’s digitized images include even the ads.
When the Supreme Court decided the New York Times v. Tasini case in 2001, it led newspapers all over the country to make a choice: They could either close archives completely or plow through back copies pulling out freelance articles or anything for which they did not have a clear copyright title. Some sections dominated by freelance contributions were hit particularly hard, such as travel or book reviews. Some newspapers, such as The Washington Post, started eliminating outside contributor content early, well before the final decision date.
However, since the Tasini case, a lot of content from outside publishers’ purviews has become available as contracts between publishers and outside contributors began stipulating digital access rights for the publishers and their third-party vendors. The bottom line for working reference librarians: You can say it couldn’t be found, but that doesn’t mean it didn’t exist.
ProQuest’s Historical Newspaper archives escaped the rending and unraveling of archives due to an exemption in the Tasini decision for microfilm and microfilmlike collections that sold the entire newspaper rather than individual articles. If the digital archive looked similar to the original print copies (page layout and so forth) then that was all right.
User Habits Change
However, time marched on, as did technology and users’ information habits. Newspapers adapted to the new technologies, using them to present stories in new, expanded formats with links to original documentary support, to multimedia feeds, and to high-quality, vetted blogs.
They also published content on their websites written specifically for those websites, content that may never have appeared in a print edition. For example, in print, The New York Times Book section names the 10 best-selling fiction and the 10 best-selling nonfiction books. But NYTimes.com has room to name 25 or more in each category. After all, the web removes the space limits imposed by print costs, creating a “bottomless news hole” in a sense. But retrieving anything from a bottomless hole is very tricky by definition. Basically, if you don’t snatch it as it goes into the hole, wave goodbye because you’ll probably never see it again.
And that’s pretty much what has happened (or hasn’t happened) with digital newspaper archives. Unless the newspapers’ own webmasters retained searchable copies of everything and made those copies available for archiving on their own sites (which is not usually the case), digital-only copy never made it into the “official” archive.
Vince Price, vice president of content operations at ProQuest, pointed out that the historical archives for most of its newspapers “stop sometime in the mid-’80s, just about when the Internet started, so there is no online overlap. For some papers we add a year each year. A facsimile of the print is our policy, although now it’s a print from pdf files for a number of newspapers.”
Price’s comments may reflect the policy for ProQuest’s HNP, but that’s not the only archival product the company provides for newspapers. In fact, the ProQuest Newsstand lists 770 newspaper titles, and about 370 are listed as current, with more than 100 stopping in the 1990s and 30-plus ending in this decade. The Newsstand carries the full text of editorial content supplied by electronic feeds from the newspapers, but it doesn’t carry full images as the Historical Newspapers archive does.
The newspapers that still provide annual updates in HNP tend to be the so-called national newspapers, the papers that sell outside of their regions both in print and through ProQuest HNP digital subscriptions. These include titles such as The New York Times. The e-feeds from publishers may contain some of the digital-only content available on newspaper websites, but which publishers supply that content, what they supply, and how much of it they supply remain a mystery.
Newspaper Websites Loom Larger
But let’s get back to those digital editions in general. Newspapers struggling to survive by slashing costs and gathering web advertising have lately been trying out a new strategy referred to as “dayscrapping” (i.e., reducing the number of days they produce print editions for home distribution). Detroit’s Free Press (www.freep.com) and The Detroit News have both done so, but the papers still offer daily replica e-editions (e.g., http://digitalfreepress.com). Other papers that have switched to all-digital, or mostly all-digital, include the Seattle Post-Intelligencer, the Kansas City Kansan (Kan.), and Madison’s (Wis.) Capital Times. Other communities seeing a decline in their daily newspaper service include a suburb of Phoenix, Independence (Mo.), Cincinnati, Idaho Falls (Idaho), and Klamath Falls (Ore.).
But what happens when a paper goes all-digital with maybe a little print on the side? This year, the 100-year-old Christian Science Monitor, to quote its FAQ, became “the first nationally circulated newspaper to replace its daily print edition with its website.” Subscribers can receive daily email editions matching their interests and/or a weekend print product with original content not available on the website. A PDA (personal digital assistant) edition and a downloadable PDF of the print version are also available. (I can only assume that the PDF version offers a view of what the print edition might have looked like.) But the “issue of record” for the Monitor is its digital website (www.csmonitor.com).
So how does this affect the full digitized run of the Monitor in ProQuest’s HNP archive and Newsstand? Both these ProQuest services offer their top-of-the-line support to this national newspaper. But what does that entail? Holdings of the Monitor in HNP date from 1908 to 1995, while current issues of the newspaper in full text date from 1988.
At the HNP service, change comes slowly. Apparently, despite offers in the past to transmit daily PDFs for its print edition (as described by Leigh Montgomery, librarian at the Monitor), ProQuest always insisted on receiving print copies mailed in monthly. Those copies were then forwarded to the National Archives Publishing Co. (NAPC) for microfilming and digitizing. Once the copies were digitized, ProQuest used OCR (optical character recognition) to produce text indexing.
Now that the Monitor has no daily print edition but only the PDF (or the so-called treeless) edition, procedures have not changed significantly, according to a ProQuest spokesperson. The PDFs are still sent to NAPC for transfer into microfilm and then digitized from the microfilm. The digitized microfilm copies are still sent to ProQuest for OCR-ing.
The Joys of PDFs
Clearly these somewhat Byzantine procedures, processes used on other titles in HNP as well, are under review. Price pointed out that the problems involved with using PDF files are not inconsiderable. “PDFs typically arrive as a production issue. The ASCII e-feeds are highly structured, normalized, and very efficient and affordable, but PDF files have [every day’s] articles bunched up. We would need to mine individual articles and populate them with the full bibliographic information. In most cases, it is more difficult to reliably extract from a printer PDF than from the ASCII feed. Also, the e-feed only gives us what we are allowed to have. Most newspaper publishers don’t have the rights for all they publish.” Still, he expected they would some day “mine sites and display what we have rights to. We’ll figure it out and develop mechanisms.”
And in the case of HNP, ProQuest has time to consider the issue. Updates to the HNP’s Christian Science Monitor may be made annually, but none has been made for the past year. In fact, the June 2009 update of the Monitor covered 1996, so even that update may have lost some digital-only content. The Monitor was one of the earliest and most assertively digital of the nation’s newspapers.
The e-feeds for the Monitor that are sent to ProQuest Newsstand cover editorial content that may or may not have included any digital-only content in the past. Of course, now they definitely contain digital-only content since almost all content from the Monitor is digital-only. The Monitor is not the only newspaper going all-digital, and newspapers aren’t the only publications moving in this direction: The much younger but equally venerable (in its own way) PC Magazine also went all-digital this year.
Whither ProQuest’s Archiving?
But let’s go back to that bottom line. For subscribers to digitized newspapers, it seems that ProQuest Newsstand could offer a more complete archive as time goes by than the augustly titled Historical Newspapers archive. And sadly, no matter what new policies and procedures may emerge in response to the dramatic changes in publisher practices in these challenging times, web-based content for the last half of the 1990s and the first decade of the 21st century from established, high-quality publications has been lost forever. Neither publishers nor third-party aggregators have stepped up to their responsibility to protect digital content as well as print content.
ProQuest is moving in the right direction. In March, it announced the expansion of its ProQuest Digital Microfilm full-image newspaper service covering backfiles for 15 papers dating from 2008 to current editions. Unfortunately, the Monitor is not among them, and the product works off traditional microfilm holdings.
Rodrigue Gauvin, senior vice president of publishing at ProQuest, says ProQuest bought a webscraping technology service this year. The small Canadian company, PressDisplay, “gathers native PDFs from news publishers each night and deconstructs some 350 web newspapers to the article level with automatic translation for searching the database, browsing, and full-image. It’s a very cool technology but with a very limited archive, built for the tourism industry. It found its way to us and the library marketplace. We could apply it to everything, but these guys haven’t gone to archiving yet.” He added, “It’s not as easy as perceived.”
The technology is not really the problem, according to Kris Carpenter, director of the web group at Internet Archive, producer of the Wayback Machine that takes snapshots of the entire open web.
“Web crawling and harvesting technology has been around since the ’90s,” she says, “though some might say it was only in the last five years that it became so usable you could do it without special technical expertise in-house. We don’t do news now because we don’t have permission, but if permissions were in place and it would match our mission, we would probably do it.”
Although the Internet Archive’s own products are open-web-oriented, it does have a U.K.-based commercial partner that can handle for-profit efforts. Price confirms that many of the problems in advancing archive procedures stem from publisher resistance: “Newspapers have gotten protective and more defensive. They want every penny as they try to survive. It’s a total value perception. Over the years, publishers have thought they were sitting on a gold mine.”
How long will this continue? How long before content owners and licensors solve the problems involved? Both build their businesses boasting of the importance and quality of their content. How important can it be if even its owners and carriers forget to save it?