Securing Your Digital Future
by Bill Greenwood
In the mid-1970s, three new file formats hit the information industry: Dialog, VRS, and ITSC: Orbit. Eager to hop on board with the latest technological trends, many organizations began using these standards to encode information for future use. But while users enjoyed this current information with all the benefits of a modern file format, backfiles were often left unconverted in their original form. Over time, the technology used to view those files gradually disappeared, rendering them inaccessible, according to Marjorie Hlava, president of Access Innovations, Inc. and chairwoman of the NFAIS standards committee.
“In terms of the digital backfiles, the big ones are good to the mid-’70s at the moment, and then it’s like this big black void,” she says. “Some files have gone back further, and some are beginning to go back further, but anything prior to the ’70s, you’re better off looking at print.”
More information was lost as computer programs were updated and improved, leaving users with no way to open files stored in certain proprietary formats. According to Bill Trippe, lead analyst for XML technologies and content strategies at Gilbane Group, Inc., proprietary formats remain problematic today, though the market’s consolidation around Microsoft has reduced the total number of proprietary processes. Still, he says these formats should be avoided because it is unlikely that any individual or company will have the exact same software and operating system required to access them in 10 to 15 years.
Meanwhile, data stored on physical media such as CDs, DVDs, and magnetic tape degrades over time, and some media, such as floppy disks, are virtually inaccessible by traditional means, according to Hlava. And as formats continue to improve, more data is being lost.
Follow the Format
Companies are now beginning to use the latest industry standard format to encode their data. Hlava says most companies are currently in the process of switching to Unicode—which lets users read 8-, 16-, and 32-bit data—as opposed to ASCII, which supports only 8-bit data. She says some institutions that haven’t made the upgrade or haven’t started using Unicode fully have found themselves at a disadvantage.
“The Library of Congress decided to take an 8-bit standard for Unicode, which means that the new 16-bit data—and I’d say 16-bit, at this point, is more common—they can’t read,” she says. “And so they may have to revisit that standard fairly soon.”
Many libraries are making the transition from MARC records to XML, a combination of SGML and HTML that addresses both formats’ faults, says Hlava. XML can be written in ASCII and Unicode.
“SGML was incredibly versatile, which made it incredibly complicated, and everybody implemented it differently,” she says. “And then HTML came along in about ’94 as a way to code up things and make it display on the web, and that was awesome and simple, but it was really format only. It didn’t have anything to do with context, whereas SGML had everything to do with context. And so XML came out as kind of a hybrid so that you could put in both the format tags and the context tags.”
Trippe says XML has not peaked yet because many companies are still switching over to the format. So it should be around for “quite some time,” making it a good format for library data.
“It seems to be the preferred format for all of the information providers and everything like that, and it’s pretty powerful because you can drill down to different things, so you can get to any level you want to,” he says.
For archiving purposes, a nonproprietary standard such as TEI (Text Encoding Initiative) or EAD (Encoded Archival Description) should be used, says Hlava. These formats work across various systems as a flat ASCII or rich text format file would, but TEI and EAD files retain formatting and enable metadata to be added, she says.
“A flat file has no tagging,” she says. “It doesn’t identify the author or the date of publication or any of that kind of stuff. It’s just the text, and it’s stripped of all formatting and so on. So a lot of people will use something a little more involved than that, particularly if they want to save it for reuse.”
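The difference between a flat file and a tagged one can be illustrated with a minimal Python sketch. The element names used here (record, author, published, body) are hypothetical and chosen only for illustration; they are not the TEI or EAD vocabularies, and the author and date values are placeholders.

```python
import xml.etree.ElementTree as ET


def tag_flat_text(text, author, published):
    """Wrap plain text in a small XML record that names its author and date."""
    # Hypothetical element names for illustration; not TEI or EAD.
    record = ET.Element("record")
    ET.SubElement(record, "author").text = author
    ET.SubElement(record, "published").text = published
    ET.SubElement(record, "body").text = text
    return ET.tostring(record, encoding="unicode")


# A flat file carries only the text itself.
flat = "Information may be power, but it is also extremely fragile."

# The tagged version keeps the same text and adds descriptive metadata.
tagged = tag_flat_text(flat, author="Example Author", published="2009")
print(tagged)
```

Because the metadata travels inside the file, a later reader can recover the author and date without consulting a separate catalog.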
To use these formats, Hlava says archivists must first determine the current format and the target format. They can then use programs such as Sun Microsystems, Inc.’s OpenOffice.org to convert files to the appropriate format. She says those who are in the museum business or are members of organizations such as the Society of American Archivists should use the TEI standard; the library world is best served by EAD.
“They were built for archiving,” she says. “They were built so that people could retain as much of the original coding and perhaps enrich the data as they archived it.”
Hlava says she exports her files to a neutral format once every 2 years and recommends others follow a similar schedule. She says data should be stored either on a computer that is regularly backed up or in a national archiving facility, such as the one in Boyers, Pa., where the U.S. government stores Social Security and patent data.
“Frequently, it’s a university or a government entity where you know that they have a long-term interest in your data,” she says.
‘A’ Is for Archive
Another format that is gaining popularity for digital document archiving is PDF/A, a subset of Adobe’s Portable Document Format that stores all the information needed to display the document within the file itself, eliminating reliance on outside programs that may not exist in the future. According to Trippe, AIIM (Association for Information and Image Management) has already adopted this format.
“I think it’s a very good sort of low cost, low impact format,” he says. “I think it’s really attractive to be able to say, ‘OK, we’ve now collected all of this material. Let’s now have a workflow into PDF/A, and we know how to deal with PDF/A, and we have high confidence that PDF/A will work for the foreseeable future.’”
Rita Knox, research vice president at Gartner, Inc., agrees, adding that she tells her clients to store their content in PDF/A for a variety of reasons.
“I usually encourage PDF just because it’s searchable,” she says. “It’s text underneath. Also, it has the look and feel of paper, which people often like it preserved as a layout. So I think when you look at some of the longer-term archiving that happens going forward, that that would be the standard format that’s used.”
Alive Through Emulation
In Europe, a research team from nine universities, national libraries, companies, and organizations is collaborating on another way to stop data loss and address proprietary formats through emulation. The KEEP (Keeping Emulation Environments Portable) project is a €4.02 million (roughly $5.34 million) medium-scale research program that started Feb. 1. Its mission is to develop an architecture that acts as a foundation for existing emulators, says David Anderson, principal investigator for the Portsmouth branch of KEEP and a principal lecturer in creative technologies at the U.K.’s University of Portsmouth School of Creative Technologies. In time, this architecture’s language can be upgraded to run on new systems, but the emulators themselves will not need to be touched.
“We’re clearly going to be dependent to some extent on the emulator-running community,” he says. “Each of the emulators that is out there currently has some deficiencies, and that’s likely to be the case for some time. The idea would be if we can provide a stable backbone, people will have some reason to try and commit to writing emulators to go on top of it and to improve the emulators we’ve already got.”
According to Anderson, this approach is meant to address quality loss that occurs when companies and individuals migrate their digital files onto new systems. He says that after 10 to 30 years of technological change, many files don’t “look or seem or feel very much like the original at all.” However, by keeping files in their original form and using emulation to replicate old computer systems, users can maintain the original richness of their digital content.
Trippe says he is “fascinated” by the concept of emulation and could see it working for some businesses where time frames for archiving are “shorter and more practical,” about 3–7 years. But government institutions, which often have a constitutionally mandated responsibility to preserve data, may have to look for other solutions, he says.
Hlava says it is better for organizations to continue converting files to neutral formats. However, she does see a potential use for emulation in the engineering realm.
“If you have CAD/CAM drawings of the workings of the power plant or something, then I think it’s probably worth keeping that emulation going because there aren’t many things that are going to be able to read that stuff,” she says.
Keeping It Physical
In terms of physical media, the best way to curb degradation and avoid obsolete formats is to rewrite digital files on a regular schedule, says Hlava. The expected life of a CD or DVD is 5 years, she says, while magnetic tape usually lasts 10 years and hard drives will last 2 years.
“The best practice these days is you write it out to hard disk, and then you rewrite the file on a periodic schedule, and you can even automate that process,” she says. “The big shops do that. The big archival files write a program to rearchive the data, especially since disk drives die, so you want to be sure you keep more than one copy.”
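An automated rewrite of the kind Hlava describes might look something like the following Python sketch. It is an illustration under stated assumptions, not a description of any particular archive's system: the function names, the directory names disk_a and disk_b, and the use of a SHA-256 checksum to confirm each copy are all choices made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source, destination_dir):
    """Copy source into destination_dir and confirm the copy is bit-identical."""
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Copy of {source} does not match the original")
    return target


# Demonstration in a temporary directory: write one file, then keep two
# verified copies, since a single drive can fail.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "report.txt"
    original.write_text("archived content")
    copy_a = refresh_copy(original, Path(workdir) / "disk_a")
    copy_b = refresh_copy(original, Path(workdir) / "disk_b")
    copies_match = sha256_of(copy_a) == sha256_of(copy_b) == sha256_of(original)
```

In practice, a script like this would be run on a recurring schedule by the operating system's task scheduler, with the two destinations on separate physical drives.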
Michael Mahoney, a senior research analyst at Nerac, agrees that the best bet is for companies to research the life expectancy of the physical formats they store data on and then transfer their files to a newer medium before that time expires. He says this also solves the problem of hardware inaccessibility that currently plagues floppy disks.
“As things are becoming close to obsolete or reaching end of shelf life, they need to put some kind of process in place to transfer them to something else,” he says. “I don’t think that any time with digital preservation that you’re done with it. There’s got to be some cycle of when you have to do something again to preserve that data over a long period of time. Nothing lasts forever.”
According to Anderson, emulation can also help with this problem. He says many museums and other archivists are making good progress transferring files from obsolete physical formats such as floppy disks to new media such as CDs and DVDs. Once the original bitstream is saved on a usable medium, he says, the files can be run in their original format with emulators.
For corporate data, Hlava suggests a backup system consisting of a mirror server and a backup server, both of which are live. Enterprises can then back up data on a weekly and monthly basis, with this information stored off-site. The archives can then be rotated on a set schedule so the media doesn’t go bad.
“It’s saved my bacon more than once,” Hlava says.
Regardless of how digital data storage is approached, one thing is certain: Individuals and companies need to be vigilant about preserving their files. Information may be power, but it is also extremely fragile.
“Just because you’ve got it on a CD in a drawer doesn’t mean it’s going to be good 5 years from now, particularly next to the heater,” Hlava says. “So you shouldn’t expect it to.”