The Race to the Shelf Continues
The Open Content Alliance and Amazon.com
by Beth Ashmore, Cataloging Librarian, Samford University &
Jill E. Grogg, Electronic Resources Librarian, University of Alabama Libraries
Internet giants such as Google, Yahoo!, Microsoft, and Amazon are in the middle of nothing short of a modern-day space race: Who can scan the most and the best books in alliance with the biggest and brightest libraries in the U.S. — nay, the world! — while simultaneously providing print on demand, “find in a library,” and “buy the book” links as well? The amount of press and controversy surrounding the Google Book Search Library Project tends to overshadow one detail — while these companies may have begun the race to the shelf, they certainly did not invent book digitization. Look no further than Michael Hart’s Project Gutenberg, which celebrated its 35th anniversary in 2006 and expanded its reach to Canada in July 2007, to know that book digitization is nothing new. But, as with almost all things these big internet companies touch, the stakes have been raised significantly.
While Google seems to rack up an increasingly impressive list of library and industry partners [See “Google Book Search Libraries and Their Digital Copies: What Now?” at https://www.infotoday.com/searcher/apr07/Grogg_Ashmore.shtml for a description of the Google Book Search library partners — then], the Open Content Alliance, or OCA, is giving Google a run for its money. OCA comes armed with an open access philosophy and its own impressive stable of partners, including Yahoo! and, at least initially, Microsoft. Amazon, the dark horse in the race, as scanning and making books available for free online would seem antithetical to its book-selling roots, has gotten into the act, offering to partner with libraries to help scan and sell rare and hard-to-find books from library collections. Under Amazon’s model, the libraries retain their own digital copies along with a portion of any print-on-demand profits. Ultimately, librarians now have choices when it comes to large-scale digitization partnerships.
Open Content Alliance: Background
The Internet Archive (IA), under the leadership of cofounder and director Brewster Kahle, began the OCA in 2005 with the initial goal of extending IA’s reach “to help bring digital materials online or take the ‘born digital’ and help make them more readily accessible” into the world of book digitization. The 501(c)(3) nonprofit Internet Archive did not stick to the nonprofit sector for partners: it counts Yahoo! as one of its founders and Microsoft as an early (though now former) partner in the Alliance. Kahle described the IA’s mission as “born out of the digital opportunity of universal access to all knowledge. The idea of getting all the books, music, video on the Net and making them accessible to people anywhere … that’s where we’re coming from.”
The OCA counts no fewer than 44 libraries in the Alliance. At first glance, this may seem relatively similar to the Google Book Search Library Project equation: Internet giant + massive library collections = a world of previously untapped searchable content. But the similarities pretty much end there. From partners to access models to future plans, OCA offers a different spin on the world’s print knowledge brought online. The OCA is designed to bring together libraries, publishers, and online stakeholders to create joint open repositories of content, with particular emphasis on open. OCA’s approach differs from the Google Book Search Library Project in two major ways: it does not scan in-copyright materials from library collections (at least not yet), and open access is its guiding principle, meaning that even Google itself could (and does) crawl titles from the OCA repository.
With the Internet Archive’s digital collections as a starting point, OCA brings some big names in university libraries to the table:
• Boston Library Consortium
• Boston College
• Boston University
• Brandeis University
• Northeastern University
• Tufts University
• University of Connecticut
• University of Massachusetts (four campuses and the medical library)
• Wellesley College
• Williams College
• Columbia University
• Emory University
• Indiana University
• Johns Hopkins University Libraries
• McMaster University
• Memorial University of Newfoundland
• Perseus Digital Library at Tufts University
• Simon Fraser University
• University of Alberta
• University of British Columbia
• University of California
• The University of Chicago Library
• University of Georgia
• University of Illinois at Urbana-Champaign
• University of North Carolina at Chapel Hill
• University of Ottawa
• University of Pittsburgh
• University of Texas
• University of Toronto
• University of Virginia
• Washington University
• York University
Some of these names may look familiar as Google partners as well (California, Illinois, Indiana, Texas, Virginia), and that situation has its own problems. As Kahle described it, in the case of the University of California, one of the earliest partners, “They [University of California and Google] would not like to scan any book twice. That basically means there’s going to be a choice. So, we would like to see libraries go open.” Kahle sees the danger in libraries partnering with Google to digitize collections as a question of whether or not Google will allow for the widest availability of materials, including the ability for other search engines to crawl digitized texts, as well as mass downloading for data mining purposes.
OCA has also managed to recruit some libraries outside of academe. These organizations include Boston Public Library, the British Library, European Archive, Marine Biological Laboratory/Woods Hole Oceanographic Institution Library, Missouri Botanical Garden, National Archives (United Kingdom), National Library of Australia, Prelinger Library and Archives, San Francisco Public Library, Smithsonian Institution Libraries, State Library of Massachusetts, and the William and Flora Hewlett Foundation. A relatively new and unique partner is the Biodiversity Heritage Library, a cooperative project of the American Museum of Natural History, Harvard University Botany Libraries, Ernst Mayr Library of the Museum of Comparative Zoology, Missouri Botanical Garden, Natural History Museum–London, The New York Botanical Garden, Royal Botanic Gardens in Kew, and Smithsonian Institution Libraries. Kahle is particularly proud of this partnership as it represents a trend that could see other disciplines banding together to bring a wealth of knowledge on a particular topic to the open access world. As Kahle explains: “This is a whole branch of science deciding to go open … it is a massive program to digitize tens of millions of pages, basically all of the literature about species. This is important to have in the open because it can be repatriated to the developing countries that actually have these organisms, as well as making it possible to do data mining research on it … It is a commitment of the major natural history museums, natural history libraries and botanical gardens to go and make the information about species public.”
Academics are not the only ones seeing gold in open access. A number of industry heavyweights have thrown their support — and in some cases their money — behind the OCA. Industry members of the alliance include Adobe Systems Incorporated, HP Labs, O’Reilly Media, the Xerox Corporation, and Yahoo!. How each of these companies contributes to the OCA varies widely. For example, Yahoo! has committed to indexing all the scanned materials as well as having funded the initial setup of the alliance and the scanning of an American literature collection from the University of California.
To appreciate the scale of the OCA partnerships, one must understand just how, where, and how much scanning is getting done. OCA adds about 12,000 books a month to its collection. Kahle announced in October 2007 that so far 200,000 books have been scanned, an equivalent of 50 million pages [http://www.openlibrary.org/details/oca_test_004]. OCA has managed to accumulate this number through scanning centers set up at various partner sites across North America and Europe. Unlike Google’s super-secret scanning warehouses, OCA scanning centers are located at and run by some of its larger partners at eight locations: San Francisco; Los Angeles; Urbana, Illinois; Toronto; Boston; New York; Washington; and London. “They are inside library facilities and those libraries have made the facilities available for anybody to come scan books as long as they cover the incremental costs,” says Kahle. “The Internet Archives costs 10 cents a page and if we can get it below that we’ll charge less. Right now, however, it costs 10 cents a page to scan a book, assign metadata to it, compress it, run OCR on it, package it in a couple of different access formats, and host it on two continents forever. Ten cents a page does all of these services and of course the library can have any and all of these materials back for themselves or any other reason.”
So, why would a librarian choose to go with the OCA over the other partners currently available? Two words: open access. If the goal is to support open access principles and to get scanned copies of out-of-copyright books indexed in as many search engines as possible, then OCA is the right choice. With minimum requirements such as attribution and maximum requirements of no re-hosting, OCA leaves the greatest number of opportunities for users to discover and re-use the text that they find, making it ideal for those interested in data mining.
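For librarians weighing that price, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: the 10-cents-a-page rate and the 200,000-book, 50-million-page totals come from the figures above, while the 250-pages-per-book average is derived from those totals rather than stated by OCA, and the function name and the 5,000-volume sample collection are hypothetical.

```python
# Figures quoted in the article (October 2007).
COST_PER_PAGE_CENTS = 10      # per-page cost quoted by Kahle
BOOKS_SCANNED = 200_000
PAGES_SCANNED = 50_000_000

# Derived average (an assumption, not a figure stated by OCA).
AVG_PAGES_PER_BOOK = PAGES_SCANNED // BOOKS_SCANNED


def estimate_scanning_cost_dollars(num_books, avg_pages=AVG_PAGES_PER_BOOK):
    """Estimate the incremental cost, in dollars, of scanning num_books."""
    # Work in whole cents to avoid floating-point rounding surprises.
    total_cents = num_books * avg_pages * COST_PER_PAGE_CENTS
    return total_cents / 100


if __name__ == "__main__":
    # Hypothetical collection of 5,000 out-of-copyright volumes.
    print(estimate_scanning_cost_dollars(5_000))
```

A library with unusually long or short volumes would want to pass its own avg_pages value rather than rely on the derived average.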
Another selling point of OCA is its affiliation with the Internet Archive. IA has long been interested in opening up existing digital collections. Many libraries have their own digitization projects and OCA partnering thus represents an opportunity for libraries to consolidate projects with one large-scale partner. Not only can libraries begin scanning with the support of the OCA, they can add existing collections that may need greater exposure on the Web, as well as collections that contain far more than text. The Internet Archive houses Web, text, moving images, audio, and software collections. “If we can help by playing some of the roles of back-end infrastructure, we’d love to,” Kahle says. “One thing that we found is that all of these little, and sometimes not so little, projects have happened, but they are difficult to find. They have different interfaces, different service layers.” In addition to providing a common interface, OCA aims to preserve materials in multiple formats. One project under the OCA banner, funded by the Hewlett Foundation, deals with approximately 200 groups with a variety of needs for bringing their collections online. Some involve basic digitization, while others require metadata and still others have files that need migrating to new formats.
OCA may not have the speed or financial resources of Google Book Search to whisk away a library’s holdings and scan them. Nor can OCA scan collections for free, like Google, and we all know how seductive free can be to budget-stretched libraries. OCA is a decidedly community-based effort. It represents a model for the future of digitization efforts that appears viable, provided libraries can cover the associated costs. This model includes popular commercial entities, such as Yahoo!, but lets librarians set the agenda to include the greatest number of access points possible — through search engines, library catalogs, and even bookmobiles. “The Print on Demand Bookmobile is actually doing pretty well in a couple of other countries,” Kahle notes. “One of the things that we are finding kind of exciting is the development of print on demand technologies … where a machine puts it together for you, more like a kiosk. OnDemandBooks.com has a thing called the Espresso Book Machine …[and] we are making it so that all the Open Content Alliance books can be printed through that machine.”
Kahle even sees a future for OCA in copyrighted works: “Our approach at the Internet Archive is to start with out of copyright and then move into orphan works, then out-of-print and then in-print. I’m hoping that by the time we get to in-print commercial publishers, we’ll have moved along to help promote their books online and allow them to be downloaded.” OCA is already making forays into out-of-print works with the October 2007 announcement of a program to lend these titles digitally through interlibrary loan. Kahle describes the importance of the program: “We believe this can be a tremendously valuable way to increase scholarly and public access to hard-to-find resources. Out-of-print books can represent huge portions of library collections … By scanning these volumes, libraries will be better able to fulfill their mission of providing access to scholars and the public” [http://www.openlibrary.org/details/oca_test_004]. In the end, Kahle believes that the OCA’s survival and attraction may lie in its ability to provide the service layers that users require. “This is public domain material. Have the public domain material stay in the public domain and have organizations compete on the service layers. This is the architecture of the World Wide Web.”
Librarians interested in becoming a part of the Open Content Alliance should contact email@example.com.
Amazon’s first volley in the book digitization race looked like just an attempt to sell more books. It introduced Search Inside the Book on Oct. 23, 2003. At the time, it was heralded as an important step in extending the accessibility enjoyed with other material (journals, newspapers, and web sites) to books. Amazon managed to convince publishers that it would help sales more than it would harm them. With an initial collection of 120,000 books in the program, Jeff Bezos announced in his Oct. 23, 2003, letter: “We’re working hard to make your shopping experience better and to make Amazon.com the best place on the planet to find, discover and buy books” [http://www.amazon.com/gp/feature.html?docId=507108]. He was certainly heading in the right direction to reach this goal. When asked how many current titles are available for Search Inside the Book, Kurt Beidler, senior business development manager for Amazon.com, replied, “We do not disclose this information.” Soon after the Search Inside the Book feature launched, Google announced Google Print (now Google Book Search), changing the book digitization milieu radically. In the meantime, users embraced this new Search Inside the Book access and librarians experimented with how they might co-opt this tool to help them in their daily reference and collection development duties.
From the user perspective, Search Inside the Book works seamlessly. The feature is integrated into Amazon’s default search with keywords automatically run through the texts included in the Search Inside database. For users interested in why the feature returns a title, Amazon has provided added features such as “See more references,” which lets users view the number of times a keyword appears in a particular title. Another function allows a user to search within a chosen title. Other functions include links to the first sentence/first page and sample pages; Statistically Improbable Phrases (SIPs); links to books on related topics; and Text Stats, which provides basic statistics on number of characters, words, and sentences, as well as a variety of measures of readability and complexity. In short, Amazon created a fairly comprehensive tool that allows users to get to know a book without actually having to read it.
Since libraries were neither the source of the scanned books nor likely to get traffic from Amazon users looking for copies of the titles they just viewed (not unless they were really savvy/cheap Amazon users), librarians generally did not give the Search Inside the Book infrastructure much more thought. In 2006, Amazon’s Back in Print initiative demonstrated how rights owners of out-of-print titles could make their titles available through print on demand (POD) via Amazon’s BookSurge division, acquired in 2005. However, it was not until Amazon announced that it would be working through its BookSurge division with Kirtas Technologies and libraries to identify these out-of-print, out-of-copyright titles and add them to BookSurge’s POD service that the library community became active partners.
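The basic counts behind a feature like Text Stats are simple to approximate. The Python sketch below is a naive illustration only: Amazon does not describe how it computes these figures, so the function name, the word pattern, and the rule that a sentence ends at a period, exclamation point, or question mark are all assumptions, and the readability measures are left out because the article does not name them.

```python
import re


def text_stats(text):
    """Return basic counts of characters, words, and sentences in text."""
    # Assumed definition of a word: a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Assumed definition of a sentence: text ending at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago I went to sea!"
    print(text_stats(sample))
```

Real book text, with abbreviations, initials, and curly quotation marks, would need more careful tokenizing than these two patterns provide.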
Compared to the laundry lists of Google and OCA partners, the Amazon/BookSurge and Kirtas Technologies project looks fairly timid. Two universities, Emory University and the University of Maine at Orono, have signed deals, as have the Toronto Public Library, the Public Library of Cincinnati and Hamilton County, and the New York Botanical Garden. Amazon, through its division BookSurge, works with Kirtas Technologies. Kirtas actually forms the relationships with the libraries, and libraries work through Kirtas rather than directly with Amazon. As Beidler explains: “The relationship is between BookSurge and Kirtas Technologies, which is a scanning and manufacturing and service provider. And that is the only relationship that we [Amazon] have. Kirtas then forms relationships with individual libraries … they [libraries] sign up directly with Kirtas. As far as we’re concerned, whether there’s one library or a thousand libraries, they all funnel to us through a single point of contact, Kirtas.”
Linda Becker, vice president for sales and marketing at Kirtas Technologies, Inc., further explains Kirtas’ role: “Customers have two choices. One, they could send us their books and we can digitize the books for them and put them on Amazon. This is what we are doing for New York Botanical Gardens and Cincinnati Public Library. Or, they could purchase a system to digitize materials themselves and send us the work. Then, we do the backend work to get it ready for print on demand and we send it on to Amazon.” This second option is the method by which Emory University, University of Maine, and Toronto Public Library are participating. Becker notes that the project was launched as a pilot in June 2007 with the five libraries mentioned above, but Kirtas is currently talking to approximately 20 more libraries.
In either option, the library is in control of what gets scanned. Beidler points out that the libraries maintain complete control and ownership of the entire process and also the end files that result from the digitization. According to Beidler, libraries are likely choosing “books they feel there might be a market for or they want to make available to the general public in a broader way than they can do through their own patronage.” Very simply, libraries put these titles in the Amazon POD program and those books are then available for Amazon customers to purchase directly through Amazon. When asked if a book can be purchased in digital format from Amazon as well as POD, Beidler responds, “No. This is a rumor on which we have not made any comment. We have not made announcements of any kind.”
Libraries can choose to participate in either the POD or Search Inside! the Book programs on a title-by-title basis, but, according to Beidler, the most common scenario is for participating libraries to place titles in both of these Amazon-provided programs. Beidler said that the benefit for libraries trying to increase access to materials is that the Search Inside! the Book program aids in the discovery process for the end user: “So when you go to Amazon and you do a keyword search, you’re not only searching the metadata for those titles and trying to hit on keywords that happen to be in a title of a book. You are also actually searching the full text of books that are in that [Search Inside! the Book] program.”
In the Amazon/BookSurge/Kirtas model, the libraries function as the publishers, so they create an imprint of sorts that identifies the contributing library as the owner of the material. This means that the library carries the burden for copyright compliance, making sure that the library either owns the copyright or the material is in the public domain. The library also sets the list price for a given title, which varies based on its value, meaning its size, rarity, and other criteria. Can a library set the list price for a book as free or give a book away for free itself? Beidler says: “All our content providers, including libraries, may set list price without restriction. However, because of cost constraints, there may be titles with list prices so low that we would not be able to sell them profitably. In these cases, we might choose not to distribute certain titles. A title with a list price of zero would probably fall into this category.”
The costs of this program are shared among the participants. BookSurge absorbs part of the file preparation for titles placed in the POD or Search Inside! the Book programs. Kirtas also absorbs some of the file preparation costs, as it performs some quality control before sending the files to BookSurge. Kirtas either sells its equipment to the library at a discount or provides the service of scanning the books for the individual libraries. Finally, the library determines its own level of participation, researching which books to include and scan and deciding whether to purchase scanning equipment or send materials to Kirtas for scanning.
Ultimately, however, this program has the potential to provide an ongoing revenue stream for libraries while concurrently giving the library an extremely high-quality digitized copy of rare and, in some cases, deteriorating material. High quality is the key. Becker emphasizes that the Kirtas family of products produces only the highest quality digitized files, “as good as the book looked when it was originally printed” — and sometimes even better. The file size is small because Kirtas uses technologies that allow the text to appear in black and white, while the pictures and other items can appear in color. Furthermore, Kirtas equipment handles the book carefully and offers software to make the title searchable. The benefits do not end there, however. The improved access to these rare materials is unprecedented. Often, such rare materials are not candidates for interlibrary loan, so using these items might require a physical trip to a distant library. Not so with POD: A user orders a copy and usually has it in his or her hands in approximately the same timeframe as your average interlibrary loan book — only they get to keep the copy.
Beidler said that no one in this partnership has stipulated that titles must be rare, but many librarians choose to digitize those materials first, as these are the most difficult to access and at the highest risk for damage and deterioration. Of course, no program is without its drawbacks. Currently there are no links back to the library, such as the OCLC Find in a Library program. Additionally, the cost of purchasing equipment or other related costs may be prohibitive. However, because Amazon, Kirtas, and the contributing library share the profits of any print-on-demand sales, this project, unlike OCA and Google, could become a self-supporting or even revenue-generating endeavor. That is fairly unusual in the library world, which generally has to rely upon grants and operating budgets to cover any new projects. Perhaps David Symonds, BookSurge’s general manager, says it best: “This is a really exciting opportunity to save valuable works and make them available to people. We can’t overemphasize how nice it is to save a book that in the past, someone may have had to travel to one coast or another or even around the world to actually get access to see the book. And now they can actually look at a copy of the book that may have been manufactured the day before. This is completely invisible to the end consumer. The only thing that they see is that now they have access to content where previously they did not.”
Librarians interested in signing up for the service with BookSurge should contact PODinfo@booksurge.com or visit http://www.booksurge.com. Librarians can also contact Kirtas directly at firstname.lastname@example.org.
Which Project to Pick?
Libraries are living in a digitized and, more importantly, a digitizing world. If you don’t believe it, check out Alan Taylor’s Book Search Mashup at http://kokogiak.com/booksearch. Try not to marvel at the wonder of the weird variety of titles that have been digitized on “eggplants” or “sock monkeys.” It’s always good to have choices, and for librarians looking to digitize their book collections, there are at least three viable options eager to dig into the stacks — and maybe more on the way. Let’s not forget Microsoft’s Live Search Books, for example.
Financial concerns certainly must be considered, but there are also some weighty philosophical issues that emerge. The titles included in the Google Book Search program are unavailable to other Web services. Is this a real problem or does Google’s search engine supremacy make this a nonissue? Does OCA have a sustainable model of open access in place and can it continue to scale? Would selling print-on-demand copies of your rare books through Amazon make your digitization project financially feasible? And what do we do about copyright? Some libraries have taken a stance of sorts on these types of issues, as reported in an Oct. 22, 2007, New York Times article, “Libraries Shun Deals to Place Books on Web” [http://www.nytimes.com/2007/10/22/technology/22library.html?ex=1193716800&en=abc109c23daee1fe&ei=5070&emc=eta1]. In this article, the author explains the reluctance of some libraries, such as the Boston Public Library and the Smithsonian Libraries, to sign up with Google. Regardless, answers to these questions will not come soon or easily. In the meantime, librarians will have to make some tough choices, but keep seeking the benefits that can be drawn from these endeavors.