THE DIGITAL ARCHIVIST
Crowdsourcing Cultural Heritage: 'Citizen Archivists' for the Future
by Jan Zastrow
While crowdsourcing has been a catchphrase in libraries for almost a decade—the term was coined by writer Jeff Howe in Wired in 2006—it’s taken much longer for the concept to be embraced by cultural heritage institutions such as archives, special collections, historical societies, and museums. And for good reason: They all provide access to original records and unique manuscripts in a multitude of formats, and one misspelled indexing tag by a careless volunteer can obscure a document from researchers forever. Outright resistance is frequently the professional response. In fact, when I mentioned the subject of this column to one of my colleagues, his reaction was, “Oh, not that again … they’ve been hounding me about doing that at work—a terrible idea!”
Not everyone agrees. This month, I look at some of the exciting ways crowdsourcing is being used to increase online access to unique resources in cultural heritage collections, reflect on the ROI of such activities, discuss the challenges, and hypothesize possible future directions.
So Many Projects!
Crowdsourcing in archives and special collections can take the form of transcribing handwritten documents, indexing genealogical records, identifying people and places in photos, correcting optical character recognition (OCR) errors in digitized newspaper collections, tagging or captioning historical images, adding pictorial content to maps, transcribing oral histories, and much more. The best use of crowdsourcing is when human judgment is required on a large scale in such a way that can be structured into relatively simple, fun tasks. And fun is a key concept here; the gamification of microtasks keeps people coming back for more.
There are so many cool projects going on around the world (several of them are international collaborations), but I’ll narrow it down to U.S. projects for now. Some smaller institutions are doing fascinating projects such as the University of Iowa’s DIY History project to transcribe Civil War diaries, handwritten recipe collections, railroad correspondence, and financial papers (Saylor and Wolfe, 2011). The University of Texas–Austin is crowdsourcing the transcription and identification of “manuscript waste” used to bind books in the Middle Ages (Hartman, 2014). Unsurprisingly, it’s the big players that have the most prominent projects—the Library of Congress (LC), the National Archives and Records Administration (NARA), the Smithsonian, and The New York Public Library (NYPL)—which are the ones we’ll focus on here.
1. The LC’s Flickr Commons Project
The LC started its crowdsourcing project in January 2008 with just a few thousand pictures. Now, there are more than 20,000 images with 6,000 catalog records and more than 60 million page views. The goal was to identify and explain historical photos by tagging them with keywords and identifying names, occupations, and life dates of people in the photos.
Owing to deep staff reductions during the previous decade, the Prints & Photographs Division wanted to try crowdsourcing as a “virtual volunteer corps.” It discovered a larger and more diverse audience on Flickr than on the LC’s own website. The trial was cost-effective at only $25 per year for a Flickr Commons account and about 10 hours a week of staff time—a bargain by any standard. This is one way even small institutions can give crowdsourcing a try without a big upfront investment.
2. NARA’s Citizen Archivist Dashboard
Crowdsourcing is an important strategy at NARA. It invites the public to contribute to the records by tagging, transcribing, and editing articles, as well as uploading digital images. NARA staff finds that researchers often know more about the records than the archivists do, because researchers spend so much more time with individual records. They view crowdsourcing as a way to harness that researcher knowledge rather than letting it “walk out the door.”
NARA calls its user projects “missions,” and the designers organize documents by level of transcription difficulty: beginning, intermediate, and advanced. The goal is not professional transcriptions, but “web acceptability”—i.e., to be able to search a document and read it. In fact, this is a theme in most archival crowdsourcing projects—practical usability over scholarly perfection.
Old Weather is another NARA “citizen science” initiative, in partnership with NOAA (National Oceanic and Atmospheric Administration). NARA digitized historic Navy, Coast Guard, and Revenue Cutter ship logs dating from the pre-Civil War period through World War II. The transcriptions provide data for climate model projections and will improve understanding of past environmental conditions. Used by scientists, geographers, historians, and the public, the project could also serve as a teaching tool, incorporating information about local shipping history or current climate change research, among other things.
One lesson learned was the importance of higher-resolution images that permit zooming in on the words; NARA found that some of the older, lower-quality scans were unusable for this purpose. Far from being an add-on or a pilot project, crowdsourcing at NARA is an important strategy for providing greater online access to official records of the U.S.
3. The Smithsonian’s Transcription Center
A relative latecomer to the scene, the Smithsonian benefited from the experience of other crowdsourcing projects when it started planning in March 2012. Ten of the Smithsonian’s 19 museums and archives participate in the Transcription Center, so a very flexible system was required to encompass translation, transcription, and discussion activities. Each document type has a different data structure: field notebooks, diaries, botanical specimen sheets (many with handwritten notes), and numismatic proof sheets.
The dashboard allows contributors to select a project they want to work on—the theme is “choose your own adventure”—by organization, by theme, by level of completion, or by most recent activity. In order to achieve a high level of quality control, the Smithsonian came up with an ingenious three-step process: transcription by one set of crowdsourcers, review by another set of registered users, and finally approval by staff or trained in-house volunteers. Back-end administrative tools allow sorting by ownership of materials, by name, by ID, by status (active or not), etc., which makes it easy for staff to manage projects. Next up is the joint Harvard-Smithsonian Center for Astrophysics project to crowdsource metadata creation for 19 historic logbooks of galaxy images.
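The three-step quality-control process described above can be pictured as a small state machine: a page moves from open transcription, to peer review by a different registered user, to final staff approval. The sketch below is purely illustrative—every class, state name, and identifier is hypothetical, not the Transcription Center’s actual code.

```python
class Page:
    """One document page moving through a three-step review pipeline."""
    STATES = ["open", "in_review", "reviewed", "approved"]

    def __init__(self, page_id):
        self.page_id = page_id
        self.state = "open"
        self.transcription = None
        self.transcriber = None

    def transcribe(self, text, volunteer):
        # Step 1: any volunteer may submit a transcription.
        if self.state != "open":
            raise ValueError("page is not open for transcription")
        self.transcription = text
        self.transcriber = volunteer
        self.state = "in_review"

    def review(self, reviewer, ok=True):
        # Step 2: a *different* registered user checks the work;
        # rejection sends the page back for re-transcription.
        if self.state != "in_review":
            raise ValueError("page is not awaiting review")
        if reviewer == self.transcriber:
            raise ValueError("reviewer must differ from transcriber")
        self.state = "reviewed" if ok else "open"

    def approve(self, staff_member):
        # Step 3: staff (or trained in-house volunteers) sign off.
        if self.state != "reviewed":
            raise ValueError("page has not passed peer review")
        self.state = "approved"


page = Page("diary-001")
page.transcribe("Arrived at camp before dawn...", volunteer="sister_a")
page.review(reviewer="sister_b", ok=True)
page.approve(staff_member="curator_1")
print(page.state)  # → approved
```

The key design point is the separation of roles: no single participant can carry a page all the way to “approved,” which is what lets the institution trust work done by anonymous volunteers.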
The Smithsonian emphasizes a collaborative space in order to create high-quality, accurate transcriptions. And while most volunteers are from the U.S., the institution knows of at least one pair of sisters who transcribe projects online together although they live in different countries. How’s that for keeping in touch?
4. The NYPL’s Labs
The NYPL invites the public to “Kill time, make history” by helping with a smorgasbord of astounding tools: transcribing historical menus (“What’s on the Menu?”), improving information extracted from 19th-century New York City insurance atlases (“Building Inspector”), cross-referencing U.S. Census data with old phone books (“Direct Me NYC: 1940”), transforming more than 40,000 historical stereographs into web-friendly 3D formats (“The Stereogranimator”), and digitally aligning/rectifying historical maps with present-day locations (“Map Warper”).
This last one leverages a geographic information system (GIS) so you can use a slider to locate historical landmarks that are long gone and see how the landscape has changed over time. Perhaps this is the launch of “virtual walks” through history that would include the maps and photos of today, as well as interactive and immersive sights, sounds, and even scents from times past. Talk about bringing history alive!
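The math behind such map “warping” (georectification) is simple at heart: volunteers mark control points matching spots on the scanned map to real-world coordinates, and the system solves for a transform mapping any map pixel to a geographic location. Map Warper itself is far more sophisticated; this is only a minimal sketch of the underlying idea, using three hypothetical control points and an affine fit.

```python
def solve3(m, v):
    """Solve a 3x3 linear system m·x = v by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
              - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
              + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    result = []
    for i in range(3):
        mi = [row[:] for row in m]          # copy the matrix...
        for r in range(3):
            mi[r][i] = v[r]                 # ...replace column i with v
        result.append(det(mi) / d)
    return result

def fit_affine(pixels, coords):
    """From three (px, py) -> (lon, lat) control points, find a..f so
    lon = a*px + b*py + c and lat = d*px + e*py + f."""
    m = [[px, py, 1.0] for px, py in pixels]
    a, b, c = solve3(m, [lon for lon, _ in coords])
    d, e, f = solve3(m, [lat for _, lat in coords])
    return a, b, c, d, e, f

def warp(params, px, py):
    """Map a pixel on the scanned map to geographic coordinates."""
    a, b, c, d, e, f = params
    return a * px + b * py + c, d * px + e * py + f

# Hypothetical control points: map corners pinned to (longitude, latitude).
pixels = [(0, 0), (1000, 0), (0, 800)]
coords = [(-74.02, 40.75), (-73.96, 40.75), (-74.02, 40.70)]
params = fit_affine(pixels, coords)
print(warp(params, 500, 400))  # a pixel near the map's center
```

With more than three control points, real georectification tools do a least-squares fit (or a rubber-sheet warp) instead, which is what lets a crumpled 19th-century survey line up with a modern street grid.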
Why Do It? Is It Worth It?
Considering all that goes into a crowdsourcing project (the planning, the scanning, the metadata, the marketing, the quality control, the training, the version control, the editing/censoring/assessment, and all the rest—not to mention managing an amorphous gaggle of online volunteers), why would heritage institutions want to engage in this? With the minuscule budgets typical in such settings, it’s tricky not to overburden limited staff resources with these projects while other collections sit unprocessed. Here are some good reasons to give it a try:
- For institutions that are rich in content yet poor in resources, crowdsourcing can be a way to chip away at backlogs of unprocessed primary source materials.
Rose Holley, senior project manager for software implementation at the National Archives of Australia, explains, “In our digital world [users] want to: review books, share information, add value to our data by adding their own content, add comments and annotations and ‘digital post-its’ to e-books, correct our data errors, and converse with other users. And now they are telling us they can do even more, they can organise themselves to work together to … make our information even more accessible, accurate and interesting. Why are we not snapping up this great offer immediately?” (2010).
- Crowdsourcing projects provide tools for volunteers to participate and engage. The public is hungry to help, whether as “citizen archivists,” do-it-yourself historians, or amateur info pros.
“Social media tools are making [crowdsourcing] possible on the necessary scale, something prohibitively expensive by conventional means. … Citizen contributors can follow the stories revealed by historic documents. Some become invested in those stories or motivated by furthering the mission of research by enhancing access to important historic documents” (Saylor and Wolfe, 2011).
- Crowdsourcing can serve as a kind of preservation technique for archival materials; it provides access to a digital stand-in, thereby greatly reducing the need to see the original.
Barrie Howard, IT project manager at the LC, puts it this way: “The real win here is that the object, through its digital surrogate, is getting more exposure to researchers and the public than is possible with manual point-of-service transactions. Through crowdsourcing, the researchers and public become strong advocates for the object(s) and the collection(s), and this is the reason why reformatting and further preservation actions are worth it in the long run” (personal communication, Aug. 12, 2014).
The ROI seems well worth it. “Most of the crowdsourcing sites had either no paid staff or very limited staffing to manage and co-ordinate hundreds and thousands of volunteers,” confirms Holley. “The main task of the paid staff … was to create, establish or endorse guidelines, FAQs, and policies for the digital volunteer processes.” Howard agrees, “With the advancing flood of new content, we are awash in a data deluge so increasing our effective FTE buoys up our efforts to stay afloat.”
On the other hand, some institutions have decided it’s not worth it, especially if they have to write their own software. But now there are lots of cloud-based open source products—many free—that make such initiatives attainable even for smaller shops. I began making a list, but then came across Ben Brumfield’s excellent talk and spreadsheet of 36 collaborative transcription tools. Go there to find lots of technical details, web resources, and unique features (see Selected References).
Having said that, another colleague shared the difficulty of implementing the technological infrastructure that crowdsourcing requires. Not all institutions have IT staff available to help. And while Flickr and social media solutions can work for less complex tasks such as tagging, managing larger-scale projects requires some degree of automation and control on the back end to be feasible. But there’s good news for the future: NARA released its Drupal code for others to use and wants an “ecosystem” of cultural institutions to build and share crowdsourcing tools.
In addition to the technology barrier, the two biggest hurdles to crowdsourcing for cultural institutions are managing volunteers and quality control. “The term ‘crowdsourcing’ itself can connote exploitation of volunteers, and raises the problem of lack of quality control” (Bartlett, 2014).
To overcome this, one project at MIT—the Edgerton Digital Collections project to digitize and transcribe the notebooks of a popular MIT professor (Harold Edgerton)—advertised for participants through its alumni magazine, upping the likelihood of obtaining a scientifically knowledgeable group of volunteers. In fact, several transcribers were former students of Edgerton. This “nichesourcing” of complex tasks among a small group of volunteer experts is the probable trend for cultural heritage collections.
Bartlett continues, “With appropriate training and oversight, crowdsourcing projects can be an excellent way to encourage patron collaboration and cooperation, foster a sense of public ownership, complete projects that the library might not have the resources to accomplish otherwise, and add value to our collections.”
Who knows where all this will lead? The next wave of crowdsourcing will likely be timecoding audio and video files for indexed search (in fact, University of Kentucky Libraries is already doing this), annotating non-native linguistic accents, transcribing antique melodic notation into modern musical scores, and lots more GIS mapping. Our collective creativity—and innate human propensity for community—will undoubtedly stimulate as-yet-unimagined ways to harness knowledge bytes into remarkable resources. In this era of burgeoning digital information creation, we are all becoming, de facto, “citizen archivists.”