Vol. 20, No. 2 • February 2000
The Digital Atheneum:
New Technologies for Restoring and Preserving Old Documents
by W. Brent Seales, James Griffioen, Kevin Kiernan, Cheng Jiun Yuan, and Linda Cantara
We have observed that basic
problems of access faced by humanities scholars frequently make for daunting
technical challenges for computer scientists as well. In that spirit, the
Digital Atheneum project is developing leading-edge computer techniques
and algorithms to digitize damaged manuscripts and then restore markings
and other content that are no longer visible. The project is also researching
new methods that will help editors to access, search, transcribe, edit,
view, and enhance or annotate restored collections. An overall goal of
the project is to package these new algorithms and access methods into
a toolkit for creating digital editions, thereby making it easier for other
humanities scholars to create digital editions tailored to their own needs.
The Cotton Library Collection
This great collection of ancient and medieval manuscripts was acquired by the 17th-century antiquary, Sir Robert Cotton, in the century following the dissolution of the monasteries in England. His magnificent collection eventually became one of the founding collections of the British Museum when it opened in 1753. Twenty-two years earlier, however, a fire had ravaged the Cotton Library, destroying some manuscripts, damaging many (including Beowulf), and severely devastating others, seemingly beyond the possibility of restoration. The burnt fragments of the most severely damaged, sometimes unidentified, manuscripts were placed in drawers in a garret of the British Museum where they remained forgotten for nearly a century. In the mid-19th century, a comprehensive program was undertaken to restore these newly found manuscripts by inlaying each damaged vellum leaf in a separate paper frame, then rebinding the framed leaves as manuscript books. The inlaid frames kept the brittle edges of the vellum leaves from crumbling away, while the rebinding of the loose framed leaves as books prevented folios from being lost or misplaced.
The manuscripts we are working
with now date from approximately the 10th to the 11th centuries and are
written primarily in Old English, although some are written in Latin and
others, such as a Latin-Old English glossary, include both. One of the
manuscripts we are working on is a unique prosimetrical version (written
in both prose and poetry) of King Alfred the Great’s Old English translation
of The Consolation of Philosophy, a work by the Roman philosopher
Boethius that was later also translated by both Geoffrey Chaucer and Queen
Elizabeth I. Other manuscript fragments in the group include saints’ lives,
biblical texts, homilies, the Anglo-Saxon Chronicle, and Bede’s Ecclesiastical
History of the English People.
The Nature of the Damage
Although the 19th-century restoration was a masterful accomplishment, many of the manuscripts remain quite illegible. Few modern scholars have attempted to read them, much less edit and publish them. The inaccessibility of the texts stems primarily from damage sustained in the fire and its aftermath, including the water used to extinguish it. For example, in many instances, the scorching and charring of the vellum render letters illegible or invisible in ordinary light. Words frequently curl around singed or crumbled edges. Holes, gaps, and fissures caused by burning obliterate partial and entire letters and words. In some cases, the letters of a single word are widely separated from each other and individual letters are frequently split apart. Shrinkage of the vellum often distorts once horizontally aligned script into puzzling undulations. And, of course, much of the vellum has been totally annihilated, the text written on it gone forever.
In many cases, the earlier
attempts at preservation have themselves contributed to the illegibility
and inaccessibility of the texts. The protective paper frames, for example,
necessarily cover many letters and parts of letters along the damaged edges.
During the 19th-century restoration, some illegible fragments were inadvertently
bound in the wrong order, sometimes upside down or backwards, sometimes
even in the wrong manuscript. In other instances, multiple fragments of
a single manuscript leaf were misidentified and erroneously bound as separate
pages. Further damage was caused occasionally by tape, paste, and gauze
applied in later times to re-secure parts of text that had come loose,
or by chemical reagents applied in usually disastrous efforts to recover the illegible text.
How We’re Restoring the Illegible Text
Digitizing the manuscripts makes it possible to restore the correct order of the pages and provides improved access to them. However, even the best digital camera cannot restore text that is hidden or invisible to the human eye. One focus of our work, then, is on extracting previously hidden text from these badly damaged manuscripts. We are using fiber-optic light to illuminate letters covered by the paper binding frames, information otherwise hidden from both the camera and the naked eye.
Here’s how: A page is secured
vertically with clamps and the digital camera is set on a tripod facing
it. Fiber-optic light (a cold, bright light source) behind the paper frame
reveals the covered letters and the camera digitizes them.
Ultraviolet fluorescence is
particularly useful for recovering faded or erased text. Outside the spectrum
of human vision, ultraviolet often causes the faded or erased iron-based
inks of these manuscripts to fluoresce, and thus to show up clearly. Conventional
ultraviolet photography requires long exposure times and is prohibitively
expensive, time-consuming, and potentially destructive. We have found,
however, that a digital camera can quickly capture the effects caused by
ultraviolet fluorescence at its higher scan rate, thus eliminating the
need for long exposures. Image-processing techniques subsequently produce
images that often clearly reveal formerly invisible text. (See Figure 1.)
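The post-capture image processing can be as simple as stretching the contrast of the digital image so that faintly fluorescing ink stands out from the vellum. The sketch below is illustrative only (the project's actual processing pipeline is not specified in this article); it treats a grayscale capture as a grid of pixel values and linearly rescales them to the full 0–255 range.

```python
def stretch_contrast(image, low=None, high=None):
    """Linearly rescale grayscale values so the darkest pixel maps to 0
    and the brightest to 255, exaggerating faint ink against vellum."""
    flat = [p for row in image for p in row]
    low = min(flat) if low is None else low
    high = max(flat) if high is None else high
    span = max(high - low, 1)
    return [[round((p - low) * 255 / span) for p in row] for row in image]

# A faint 'stroke' (value 120) on a slightly brighter background (value 140)
# becomes a stark black-on-white stroke after stretching:
uv_capture = [
    [140, 120, 140],
    [140, 120, 140],
]
enhanced = stretch_contrast(uv_capture)  # → [[255, 0, 255], [255, 0, 255]]
```

Real pipelines would work on much larger images and often combine several exposures, but the principle of remapping a narrow band of intensities onto the full display range is the same.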
Reconstructing the Badly Damaged Manuscripts
The leaves of burned vellum manuscripts rarely lie completely flat, in spite of conservators’ generally successful efforts to smooth them out. Moreover, acidic paper was used for some of the inlaid frames. Besides turning yellow, the frames sometimes buckle and the vellum leaves shift. We are exploring digital ways of flattening the leaves to take account of these three-dimensional distortions. One potential technique attempts to recover the original shape of the manuscript leaves by capturing depth dimensions. Depth information may help us determine how the surface of a leaf has warped or wrinkled from extreme heat or water damage, as well as the effects these deformities have had on the text itself or indeed on the images acquired by the digital camera.
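To see why depth information helps, consider a single scanline across a warped leaf: a wrinkle adds surface distance that the camera's flat projection compresses, so letters appear squeezed. A minimal sketch, assuming per-pixel depth measurements are available (the project's actual capture method is not detailed here), computes the true arc length along the surface; resampling text positions by arc length would undo the foreshortening.

```python
import math

def arc_lengths(depths, pixel_pitch=1.0):
    """Cumulative surface distance along one scanline of a warped leaf.
    depths[i] is the measured height of the vellum at pixel i; a flat
    leaf yields evenly spaced arc lengths, while a wrinkle stretches them."""
    lengths = [0.0]
    for i in range(1, len(depths)):
        dz = depths[i] - depths[i - 1]
        lengths.append(lengths[-1] + math.hypot(pixel_pitch, dz))
    return lengths

flat = arc_lengths([0, 0, 0, 0])    # evenly spaced: [0.0, 1.0, 2.0, 3.0]
warped = arc_lengths([0, 1, 0, 0])  # the wrinkle adds surface distance
```

Digital flattening would then map each pixel to its arc-length coordinate, restoring the even spacing the scribe's ruled guidelines originally imposed.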
Depth information may also
help solve the problem of accurately reuniting physically separate fragments.
During the 19th-century restoration, some fragments were correctly bound
together on the same page but, because of the physical medium, could not be rejoined,
increasing the difficulty of reading the text. In Figure
2, a digitized image from preservation microfilm on the left shows
how two fragments of one page were separately bound together, while the
ultraviolet digital image on the right shows the same manuscript page with
the smaller fragment moved to its correct position in relation to the larger
fragment. Using a process called “mosaicing” in conjunction with depth
dimension, we are investigating the feasibility of creating transformations
that seamlessly rejoin such separated fragments.
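At its simplest, rejoining two digitized fragments means transforming one fragment's pixels into the coordinate frame of the other and compositing them. The sketch below shows only the final compositing step with a plain translation; finding the right offset (and, in practice, rotation and warp) is the hard part that mosaicing research addresses, and is not shown.

```python
def place_fragment(canvas, fragment, offset_row, offset_col):
    """Composite a smaller fragment onto a canvas at the offset found by
    matching its torn edge against the larger fragment. Zero pixels in
    the fragment are treated as 'no vellum' and left untouched."""
    out = [row[:] for row in canvas]
    for r, row in enumerate(fragment):
        for c, val in enumerate(row):
            if val:
                out[offset_row + r][offset_col + c] = val
    return out

canvas = [[0] * 4 for _ in range(3)]   # the larger fragment's image
fragment = [[7, 7], [0, 7]]            # the displaced smaller fragment
merged = place_fragment(canvas, fragment, 1, 2)
```

A seamless result additionally requires blending intensities across the join so the seam does not read as a crack in the restored page.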
Searching Images with Computational Techniques
Computers have historically been quite adept at storing, searching, and retrieving alphanumeric data: Searching textual documents, particularly when encoded with a standard markup system like SGML (Standard Generalized Markup Language), can quickly retrieve large quantities of specific information. Directly searching images for specific content, however, presents major challenges. Unlike alphanumeric letters or words, image content, such as a handwritten letterform, never looks exactly the same. Consequently, a specified image must first be identified, and the search must look for a region of an image that approximately matches the specified image. Because searching images requires image matching and processing, searching image data is far more computationally intensive than searching alphanumeric data. To speed up searching, image data is typically pre-processed to identify content that users are likely to seek again. However, content that is likely to be of special interest depends on the collection, so the search system must be easily configured to identify collection-specific content.
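The core operation behind "approximately matching a specified image" is template matching: slide the query region over the page image and score each window by how closely it resembles the query. A toy sketch using the sum of squared differences (one of several possible similarity measures; the project's actual matcher is not specified here) makes the idea concrete:

```python
def best_match(image, template):
    """Slide the template over the image and return the top-left corner
    of the window with the smallest sum of squared differences (SSD)."""
    th, tw = len(template), len(template[0])
    best, best_pos = float("inf"), None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th) for j in range(tw)
            )
            if ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos

page = [
    [9, 9, 9, 9],
    [9, 1, 2, 9],
    [9, 3, 4, 9],
]
letter = [[1, 2], [3, 4]]
# the letterform sits at row 1, column 1
```

Even this brute-force version shows why image search is computationally expensive: every window of every page must be scored, which is exactly why collections are pre-processed to index likely query content in advance.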
We are developing a framework
for creating document-specific image processing algorithms that can locate,
identify, and classify individual letterforms. In some cases a transcription
may be incomplete or inaccurate because the letterforms are badly damaged
or distorted and therefore difficult to identify. Although no two handwritten
letters are ever exactly alike, the problem is greatly aggravated in the
case of damaged or distorted text. By analyzing several representative
letterforms, we hope to build computer models that can be used to perform
probabilistic pattern matching of damaged letterforms. Developing such
a system is prerequisite to our being able to identify fragmentary text
in these manuscripts.
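One simple way to build such a probabilistic model, sketched here under the assumption of per-pixel Gaussian statistics (the project may well use a richer model), is to learn the mean and variance of each pixel over several examples of a letterform, then score a damaged candidate by its log-likelihood under that model:

```python
import math

def train_model(examples):
    """Per-pixel mean and variance over several instances of one letter."""
    n = len(examples)
    h, w = len(examples[0]), len(examples[0][0])
    mean = [[sum(ex[r][c] for ex in examples) / n for c in range(w)]
            for r in range(h)]
    var = [[max(sum((ex[r][c] - mean[r][c]) ** 2 for ex in examples) / n, 1e-3)
            for c in range(w)] for r in range(h)]
    return mean, var

def log_likelihood(model, candidate):
    """Gaussian log-likelihood of a (possibly damaged) candidate letterform."""
    mean, var = model
    ll = 0.0
    for r in range(len(mean)):
        for c in range(len(mean[0])):
            d = candidate[r][c] - mean[r][c]
            ll += -0.5 * (math.log(2 * math.pi * var[r][c]) + d * d / var[r][c])
    return ll

# Two tiny 1x2 training examples of the 'same' letterform:
model = train_model([[[10, 200]], [[12, 198]]])
score_close = log_likelihood(model, [[11, 199]])   # resembles the model
score_far = log_likelihood(model, [[200, 10]])     # a different stroke
```

Because the score degrades gracefully as pixels deviate from the model, a partially burned letter can still rank highly against the correct letterform while scoring poorly against the others.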
A transcription significantly
augments the search capability of an image-based digital edition. Linking
a transcription to the corresponding part of an image narrows the search
space and also assists an editor who’s struggling to decipher a charred
leaf. (See Figure 3.) For example, we know
that the lines of script were originally very uniform because the scribes
who wrote the manuscripts routinely scored guidelines directly into the
vellum before beginning to write the text. In the damaged manuscripts we
are using, some lines of script are still evenly spaced, but many others
are extremely distorted by the heat of the fire. Because keeping one’s
place when transcribing such manuscripts is difficult, we are exploring
techniques to facilitate linking a line of script in a manuscript image
with the editor’s textual transcription.
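A standard first step in linking image lines to transcription lines, sketched here as one plausible approach rather than the project's actual method, is a horizontal projection profile: sum the ink in each pixel row, and runs of inky rows mark the lines of script. On undistorted leaves the runs are evenly spaced; heat-warped leaves would need the profile computed along the curved guidelines instead.

```python
def find_text_lines(image, threshold=1):
    """Locate rows of script via a horizontal projection profile:
    sum ink pixels per row, then report (start, end) runs of inky rows.
    Each run can be paired with one line of the editor's transcription."""
    profile = [sum(row) for row in image]
    lines, start = [], None
    for r, ink in enumerate(profile):
        if ink >= threshold and start is None:
            start = r
        elif ink < threshold and start is not None:
            lines.append((start, r - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

page = [
    [0, 0, 0],
    [1, 1, 0],   # first line of script
    [0, 0, 0],
    [0, 1, 1],   # second line of script
    [1, 0, 1],
]
```

Pairing the n-th detected run with the n-th transcription line gives the editor a clickable correspondence between charred leaf and text, which is precisely what keeps one's place during transcription.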
Editing and Annotating the Damaged Manuscripts
Using these new processing techniques that we’re developing specifically for scholars in the humanities, our Digital Atheneum team plans to create and widely disseminate a digital library of electronic editions of these previously inaccessible Cotton Library manuscripts that we’ve digitally restored and reconstructed. As aids to research, we also intend to provide structured information such as electronic transcripts and edited texts, commentaries and annotations, links from portions of images to text and from text to images, and ancillary materials such as glossaries and bibliographies. We are encoding the transcripts and edited texts in SGML to facilitate comprehensive searches for detailed information in both the texts and the images, and are converting both the transcripts and editions to HTML or XML so they can be displayed by Internet browsers.
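The conversion from structured transcript to browser-displayable HTML can be as direct as wrapping each transcribed line in markup that links back to its folio image. The element and attribute names below are illustrative, not the project's actual SGML schema:

```python
import html

def transcript_line_to_html(folio, line_no, text, image_url):
    """Render one transcribed line as HTML, linked back to the digitized
    folio image so a browser can show text beside the facsimile.
    The id scheme and class name here are hypothetical examples."""
    return (
        f'<p class="transcript" id="f{folio}l{line_no}">'
        f'<a href="{html.escape(image_url)}">{folio}.{line_no}</a> '
        f'{html.escape(text)}</p>'
    )

line_html = transcript_line_to_html(132, 4, "hwaet we gardena", "f132.jpg")
```

Generating stable ids for each line is what lets annotations, glossary entries, and image regions all point at the same piece of text.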
Another important application
we’re developing as part of this project is a generic toolkit to assist
other editors in assembling complex editions from high-resolution digital
manuscript data. The toolkit is being designed for scholars in the humanities
who would like to produce electronic editions, but do not have access to
programming support. An editor can then collect and create the components
of an electronic edition for any work (digital images, transcriptions,
edited text, glossaries, annotations, and so forth) and use the generic
toolkit to fashion a sophisticated interface to electronically display
or publish the edition. The increased ability to create electronic editions
will enable more libraries to provide access to previously unusable or
untouchable collections of primary resources in the humanities.
The Digital Atheneum’s Funding and Support
The Digital Atheneum is funded by the National Science Foundation’s Digital Libraries Initiative (NSF-DLI2) with major support from IBM’s Shared University Research (SUR) program. The funding lasts until March of 2002, and although our team hopes to have the project completed before then, there are no guarantees with this kind of work. The British Library is providing privileged access to the manuscripts in the Cotton Collection as well as to curatorial expertise and its digitization resources. Much of the work on the Digital Atheneum is being conducted in a new collaboratory for Research in Computing for Humanities (RCH) located in the William T. Young Library at the University of Kentucky.
The five authors are the principal investigators for the Digital Atheneum: W. Brent Seales (Ph.D., Wisconsin) and James Griffioen (Ph.D., Purdue) are associate professors of computer science; Kevin Kiernan (Ph.D., Case Western Reserve) is a professor of English. Cheng Jiun Yuan is a doctoral student and research assistant in computer science, and Linda Cantara (M.S.L.S., Kentucky) is a master’s student and research assistant in English. All of them work at the University of Kentucky in Lexington. The Digital Atheneum Web site is http://www.digitalatheneum.org.