REPORT FROM THE FIELD
Progress Report: The British Library and Microsoft Digitization Partnership
by Jim Ashling
Microsoft made it clear that it wasn’t going to let Google tackle mass book digitization exclusively when it announced a partnership with The British Library (BL) in November 2005.
The BL/Microsoft project is designed to digitize 25 million pages from 100,000 out-of-copyright titles in the BL’s collection of 19th-century literature. Access will be provided via Microsoft’s Live Search Books site (http://books.live.com) and the BL’s Web site (www.bl.uk). Live Search Books now includes many partners: The University of California Libraries, Cornell University Library, the University of Toronto Library, The New York Public Library, and the American Museum of Veterinary Medicine have all joined, as well as more than 50 publishers.
Wider Access to Lesser-Known Authors
Kristian Jensen, head of British Early Printed Collections, reviewed the selection process. Unlike previous BL digitization projects where material had been selected on an item-by-item basis, the sheer size of this project made such selectivity impossible. Instead, the focus is on English-language material, collected by the BL during the 19th century. Jensen compared the process to mass microfilming. “Nonselectivity widens access,” he said.
Being less selective creates certain advantages. First, it lessens the domination of the well-known author, or the high profile enjoyed by the “already famous.” The works of virtually unknown writers will be brought to the attention of scholars as easily as material by Charles Dickens. The collection is being processed according to the same classifications used at the time of original acquisition. So classes that seem unusual today, such as “19th-century female poets,” become accessible as research areas. Educationalists will welcome the chance to look at the literature as it would have been available at the time, and delicate literary items will also benefit from not being overhandled in the future.
Another benefit of the selection process is that entire shelf runs can be taken for scanning at one time. After trying a couple of pilot runs to assess quality standards, Microsoft and the BL chose CCS (Content Conversion Specialists) as the scanning contractors. Richard Helle, CCS managing director, provided a tour of the digitization studio at the BL’s press event in September.
The target is to scan 50,000 pages per day with a 2-year timetable for completion. Despite such fast output, none of the valuable material is put at risk. Helle emphasized that these “treasures” were being scanned nondestructively and that all staff involved had received careful training. Book movement pilots were run in advance to determine the amount of staff time required during the full process, from selection and retrieval to delivery, scanning, and reshelving.
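As a rough check on that timetable, the short Python sketch below divides the project's 25 million pages by the 50,000-page daily target. It assumes, purely for illustration, that the target is met on every calendar day and that the 2-year timetable is counted as 730 days; the function and constant names are hypothetical.

```python
import math

TOTAL_PAGES = 25_000_000   # project size stated in the article
PAGES_PER_DAY = 50_000     # stated daily scanning target
TIMETABLE_DAYS = 2 * 365   # 2-year timetable as calendar days (assumption)


def scanning_days(total_pages: int, pages_per_day: int) -> int:
    """Return the whole number of scanning days needed at a steady daily rate."""
    return math.ceil(total_pages / pages_per_day)


days_needed = scanning_days(TOTAL_PAGES, PAGES_PER_DAY)
slack_days = TIMETABLE_DAYS - days_needed

print(f"Scanning days needed: {days_needed}")
print(f"Days of slack in a 2-year timetable: {slack_days}")
```

On those assumptions, the scanning itself takes 500 days, which leaves about 230 days of the timetable for downtime, retrieval, and reshelving.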
Scanning and OCR Conversion
Limits have been established for maximum and minimum book size in terms of what can be scanned now, which prevents digitization of about 20 percent of the relevant collection. Everything is tracked with bar codes, and the condition of each book is checked to ensure that it can stand up to the scanning process. Four Kirtas Technologies BookScan machines are now being used.
These provide semiautomated scanning with an operator in place to ensure that all pages are turned accurately, to preview the quality of the images, and to adjust color settings that can vary with temperature. A separate scanner is used to handle books with fold-out pages.
Scanning produces high-resolution images (300 dpi) that are then transferred to a suite of 12 computers for OCR (optical character recognition) conversion. These computers, which run 24/7, are specially tuned to deal with the spelling variations and old-fashioned typefaces used in the 1800s. The process creates multiple versions, including PDFs and OCR text for display in the online services, as well as an open XML file for long-term storage and potential conversion to any new formats that may become future standards. In all, the data will amount to 30 to 40 terabytes.
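To put the storage estimate in per-page terms, the sketch below spreads 30 to 40 terabytes across the project's 25 million pages. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and an even distribution across pages; neither assumption comes from the BL, and the names are hypothetical.

```python
TOTAL_PAGES = 25_000_000   # project size stated in the article
BYTES_PER_TB = 10**12      # decimal terabyte (assumption)
BYTES_PER_MB = 10**6       # decimal megabyte (assumption)


def mb_per_page(total_tb: float, total_pages: int = TOTAL_PAGES) -> float:
    """Average storage per page, in megabytes, for a given total in terabytes."""
    return total_tb * BYTES_PER_TB / total_pages / BYTES_PER_MB


low = mb_per_page(30)
high = mb_per_page(40)

print(f"Average storage per page: {low:.1f} to {high:.1f} MB")
```

That works out to roughly 1.2 to 1.6 MB per page, covering every version of the page (image, PDF, OCR text, and XML) together.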
In addressing digital preservation issues, Rory McLeod, the BL’s digital preservation manager, pointed out that the BL’s mission is the same for digital material as it is for print: preservation in perpetuity. To determine the costs of preservation for the long term, the BL is running the LIFE Project (Lifecycle Information for E-Literature) in collaboration with University College London. The LIFE Project is designed to calculate the cost of preserving digital information for the next 5, 10, or 100 years.
PLANETS (Preservation and Long-term Access through NETworked Services), another project in collaboration with other EU institutions, is designed to produce practical services and tools to help ensure long-term access. Security will be provided by keeping copies at the BL’s Boston Spa location and with the National Library of Wales. McLeod stressed that the BL is working with other digitization project teams to encourage the use of open standards in digital preservation and sustainability.
Mass book digitization projects have previously triggered controversy over copyright issues. While promoting the benefits that digitization brings to authors by exposing their material to new and larger audiences, the BL insists that it does not wish to ride roughshod over authors’ rights. Nonetheless, intellectual property laws were written for the analog world, and, worse still, British/EU legislation keeps material in copyright for the author’s lifetime plus 70 years.
Obviously, then, an issue exists here for a collection of 19th-century literature when some authors may have lived beyond the late 1930s. An estimated 40 percent of the titles are also orphan works. Those two issues mean that item-by-item copyright checking would be an unmanageable task. Estimates for the total time required to check on the copyright issues involved vary from a couple of decades to a couple of hundred years. The BL’s approach is to use two databases of authors to identify those who were still living in 1936 and to remove their work from the collection before scanning. That, coupled with wide publicity to encourage any rightsholders to step forward, may solve the problem.
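The author-based filter can be sketched in a few lines of Python. This is an illustration only, not the BL's actual procedure: the record structure, the placeholder names, and the decision to hold back authors with no known death date are all assumptions. The 1936 cutoff is consistent with the life-plus-70 term, since 1936 + 70 = 2006.

```python
from dataclasses import dataclass
from typing import List, Optional

CUTOFF_YEAR = 1936  # authors still living in this year are removed (from the article)


@dataclass
class AuthorRecord:
    name: str
    death_year: Optional[int]  # None means no death date is on record


def cleared_for_scanning(record: AuthorRecord, cutoff: int = CUTOFF_YEAR) -> bool:
    """True only if the author is known to have died before the cutoff year.

    Holding back authors with unknown death dates is a conservative
    assumption made for this sketch.
    """
    return record.death_year is not None and record.death_year < cutoff


# Hypothetical records standing in for entries in the author databases.
records: List[AuthorRecord] = [
    AuthorRecord("Author A", 1870),
    AuthorRecord("Author B", 1936),
    AuthorRecord("Author C", 1950),
    AuthorRecord("Author D", None),
]

cleared = [r for r in records if cleared_for_scanning(r)]
print([r.name for r in cleared])
```

Checking once per author rather than once per title is what keeps the task manageable, at the price of excluding some works that an item-by-item check might have cleared.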
In summarizing the day, Neil Fitzgerald, the BL’s digitization project manager, said that he hoped they had given us a sense of the complexity of the project. Although the project has been well planned and well piloted, lessons are still being learned all the time, and the BL and Microsoft are keen to share their experiences. He stressed that “partnership” is the keyword. Partnership has been crucial internally, involving all departments in the process, and it is just as crucial externally, with partners used every step of the way.
As for the future, content should be available in Microsoft Live Search Books by the time this article appears in print, and on the BL site in December 2007 or January 2008. Potential topics for the next mass digitization project are under review. It looks as though the 19th-century focus will continue, with geography next and possibly sociology soon after. Curator Jensen relished the challenge of defining sociology, which is a 20th-century concept, for 19th-century literature.