[ONLINE] feature

The Web as Database: New Extraction Technologies and Content Management

Katherine C. Adams

ONLINE, March 2001
Copyright © 2001 Information Today, Inc.


Information extraction (IE) software is an important part of any knowledge management system. Working in conjunction with information retrieval and organization tools, machine-driven extraction is a powerful means of finding content on the Web. Information extraction software pulls information from texts in heterogeneous formats, such as PDF files, emails, and Web pages, and converts it to a single homogeneous form. In functional terms, this turns the Web into a database that end-users can search or organize into taxonomies. The precision and efficiency of information access improve when digital content is organized into tables within a relational database. The two main methods of information extraction technology, natural language processing and wrapper induction, each offer a number of important benefits.


Information extraction (IE) software identifies and extracts relevant information from texts, pulling information from a variety of sources and aggregating it to create a single view. IE translates content into a homogeneous form through technologies like XML (eXtensible Markup Language). The goal of IE software is to transform texts composed of everyday language into a structured, database format [1]. In this way, heterogeneous documents are summarized and presented in a uniform manner.

To improve accuracy and ease development, IE software is usually domain or topic specific. An IE system designed to monitor technical articles about Information Science, for example, could pull out the names of professors, research studies, topics of interest, conferences, and forthcoming publications from press releases, news stories, or emails and encode this information in a database. End-users can then search across this database by textual attribute or feature. A typical search could be for all forthcoming publications about information retrieval or to locate all conference presentations on a specific information science topic. In addition, the structured information contained within a database could be ordered into a taxonomy.
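The workflow described above can be sketched in a few lines: pull fields out of a text, store them as a database row, and query by attribute. The field names, regular expressions, and sample text here are purely illustrative, not taken from any real IE product.

```python
import re
import sqlite3

# Hypothetical press-release snippet; the fields extracted below are illustrative.
text = ("Dr. Jane Smith of State University will present "
        "'Scaling Information Retrieval' at the ASIS Annual Meeting.")

record = {
    "speaker": re.search(r"Dr\.\s+([A-Z]\w+\s+[A-Z]\w+)", text).group(1),
    "affiliation": re.search(r"of\s+([A-Z][\w ]+?)\s+will", text).group(1),
    "title": re.search(r"'([^']+)'", text).group(1),
}

# Store the extracted record in a relational table and search by attribute,
# rather than querying the full text by keyword.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE talks (speaker TEXT, affiliation TEXT, title TEXT)")
db.execute("INSERT INTO talks VALUES (:speaker, :affiliation, :title)", record)
rows = db.execute(
    "SELECT speaker FROM talks WHERE title LIKE '%Information Retrieval%'"
).fetchall()
print(rows)  # [('Jane Smith',)]
```

Once many such records accumulate, the same SQL queries scale across thousands of documents, which is exactly the "Web as database" idea.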


Information retrieval (IR) recovers a subset of documents that match an end-user's query, while IE recovers individual facts from documents. The difference between IR and IE is one of granularity regarding information access. IR is document retrieval and IE is fact retrieval [2].

Information extraction software requires that end-users specify in advance the categories of information they want to capture from a text. For instance, a system devoted to scanning financial news stories could extract all company names, interest rate changes, SEC announcements, or stock market quotes from texts. Because the parameters that define a particular topic are determined a priori, IE systems are fully customizable. IR and IE are different, but complementary. Together they create powerful new tools for accessing and organizing information stored on Web servers.


Both IR and IE are difficult because they must overcome the ambiguities inherent in language. The complexities of representation make information access tricky. Indexing specific keywords can produce poor results because individual terms don't always line up with concepts. For example, the most popular type of IR available on the Web, and in traditional online IR, is keyword searching. The problem of information retrieval via keywords centers on two issues: synonymy and homonymy. Indexing the keyword "ambulance" won't retrieve documents that only use the synonymous "emergency vehicle," but it will retrieve homonymous but irrelevant documents concerning, for example, a disreputable lawyer who is an "ambulance chaser."
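The ambulance example above can be made concrete with a toy keyword matcher. The three documents are invented for illustration; the point is only that literal string matching misses the synonym and catches the homonym.

```python
# Toy keyword search illustrating the synonymy/homonymy problem.
docs = {
    1: "The emergency vehicle arrived within minutes.",     # relevant, but missed
    2: "An ambulance rushed the patient to the hospital.",  # relevant, found
    3: "The lawyer was dismissed as an ambulance chaser.",  # irrelevant, but found
}

# A literal keyword match over the document text.
hits = [doc_id for doc_id, text in docs.items() if "ambulance" in text.lower()]
print(hits)  # [2, 3]: synonymy hides doc 1, homonymy pulls in doc 3
```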

In addition to synonymy and homonymy, IE also must contend with co-reference recognition. Co-reference recognition determines when an expression, such as pronouns like "he," "she," or "it" and noun phrases like "the company," refers to the same thing in a sentence. For IE to work correctly, various entities within documents (locations, people, places, events) must be identified within a block of text. Information extraction involves discourse analysis, and co-reference recognition links an expression back to an entity introduced earlier in the discourse [3]. In the following example, co-reference recognition would disambiguate the personal pronouns in the third sentence: "Barbara is a professional chef. Lael likes to make desserts. She made a birthday cake for her." The Web is filled with such terse and abbreviated communications.


Information extraction research in the United States has a fascinating history. It is a product of the Cold War. In the late 1980s, a number of academic and industrial research sites were working on extracting information from naval messages in projects sponsored by the U.S. Navy. To compare the performance of these software systems, the Message Understanding Conferences (MUC) were started. These conferences were the first large-scale effort to evaluate natural language processing (NLP) systems and they continue to this day. The MUCs are important because they established a mechanism that systematically evaluates and compares IE technology.

All participants in the MUCs develop software systems that extract information from texts composed of everyday speech. Participants develop software to perform a pre-determined IE task and then convene to compare notes. Conference organizers determine the topic of study–past MUCs have analyzed news releases about terrorist activities in Latin America, corporate joint ventures, company management changes, and microelectronics.

Formally evaluating NLP information extraction software requires compiling a corpus of texts and manually creating an answer key. Texts are run through an IE software system and a template of answers is produced. This template is then measured against an answer key that specifies what information should be extracted from a text [4]. Each software system fills empty template slots with appropriate values derived from the test documents.

Natural language processing is a complex task and involves many steps. The text is divided into sentences, and each word is tagged according to its part of speech (verb, adverb, noun). This syntactic structure is matched to pre-existing linguistic patterns and relevant content is determined. Semantic content (or a text's meaning) is determined via syntactic patterns. Then, information is extracted and a summary is produced.


Evaluation metrics for information extraction have been refined with each MUC. Information extraction standards of measurement grew out of IR metrics, but the definitions of these measurements were altered. In brief, recall measures how much information was extracted, while precision measures how much information extracted was correct information and over-generation measures how much superfluous information was extracted.
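The three measurements defined above can be written out directly. This sketch treats extracted slot fills as simple sets and ignores the partial-credit rules that actual MUC scoring applies; the company names are invented.

```python
def score(extracted, answer_key):
    """Simplified MUC-style scoring over sets of extracted slot values.

    Recall: how much of the key was extracted. Precision: how much of
    the extraction was correct. Over-generation: how much was spurious.
    """
    correct = len(extracted & answer_key)
    recall = correct / len(answer_key)
    precision = correct / len(extracted)
    over_generation = len(extracted - answer_key) / len(extracted)
    return recall, precision, over_generation

answer_key = {"Acme Corp", "Widgetco", "Globex"}
system_output = {"Acme Corp", "Widgetco", "Initech"}  # one miss, one spurious
print(score(system_output, answer_key))  # roughly (0.67, 0.67, 0.33)
```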

When evaluating test results for NLP systems, it's instructive to consider human performance as a point of comparison. When humans perform information extraction tasks they tend to fall short of perfect performance for a number of reasons. Boredom, time constraints, and an inadequate subject background can contribute to less than stellar precision and recall. In studies, human workers tested on 120 documents performed at 79% recall and 82% precision [5]. NLP systems often score in the mid-50s on both precision and recall [6]. While humans can outperform most machine-driven extraction systems, they cannot beat software in terms of speed and scalability.

Furthermore, it's important not to equate these metrics with standard grading scales, where 90-100% is excellent, 80-90% is good, and 70-80% is acceptable [7]. By that yardstick, highly trained human labor can manage no more than "acceptable" recall and "good" precision when measured against an answer key.


Wrapper induction, the other tradition in information extraction, evolved independently of NLP. A wrapper is a procedure designed to extract content from a particular Web resource using predefined templates [8]. In contrast to NLP, wrapper induction operates independently of specific domain knowledge. Instead of analyzing the meaning of discourse at the sentence level, this software identifies relevant content based on the textual qualities that surround desired data. Wrappers operate on surface features (often document structure) that characterize training examples. A number of vendors, such as Jango (purchased by Excite), Junglee (purchased by Amazon), and Mohomine employ wrapper induction technology.

While the MUCs encouraged the development of IE within the natural language processing community, the explosive growth of the Web is responsible for the increasing popularity of wrappers. The need for tools that could extract and integrate data from multiple Web sources led to the development of the wrapper generation field. Wrappers are less dependent on full grammatical sentences than NLP techniques and, as noted earlier, this is important when extracting content from resources like emails and press releases. Many information resources on the Web do not exhibit the rich grammatical structure that NLP was designed to exploit. Furthermore, linguistic approaches tend to have long processing times. Wrappers are fast to create and test.

Furthermore, wrappers demonstrate that extensive linguistic knowledge is not necessary for successful IE. Instead, shallow pattern-matching techniques can be very effective. Information can be extracted from texts based on document formats rather than what the sentences "actually mean." This type of IE analysis is ideally suited to the Web because online information is a combination of text and document structure. Almost all documents located on Web servers offer clues to their meaning in the form of textual formatting. For example, research on a collection of email conference announcements shows that speakers' names are often prefixed by "Who" and many names begin with the title "Dr." One can easily exploit document structure in deciding where relevant content is located. Such announcements often follow a discernable pattern and this means that relevant information, such as location, affiliation, or job title, can be located within a text based on formatting alone [9].
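The "Who"/"Dr." observation above translates directly into a shallow pattern. The announcement text below is invented, but the extraction relies only on formatting cues, exactly as the research describes, with no understanding of sentence meaning.

```python
import re

# A hypothetical seminar announcement: the speaker is found purely from
# the "Who:" prefix and the optional "Dr." title, never from semantics.
announcement = """\
Who: Dr. Alan Turing
When: Friday, 3pm
Where: Room 101
"""

match = re.search(r"^Who:\s*(?:Dr\.\s*)?(.+)$", announcement, re.MULTILINE)
speaker = match.group(1) if match else None
print(speaker)  # Alan Turing
```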


Both schools of IE (NLP and wrapper induction) depend on a set of extraction patterns to retrieve relevant information from each document. These patterns are established through machine learning. A machine learning program is software that improves its performance based on experience. Machine learning involves the use of manually indexed documents to "teach" the software what attributes make up desired content. Machine learning identifies patterns in sample documents and makes predictions about as yet unprocessed texts [10]. Instead of isolating keywords, this technology looks for patterns that exist in documents and uses this information to determine the meaning of a text. In summary, training documents build models that describe relevant information, and texts are run through these models to extract information.
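A minimal illustration of learning extraction patterns from labeled examples, in the spirit of delimiter-based wrapper induction: given training pages with the target value marked, find the literal context shared to the left and right of every example, then use those delimiters on an unseen page. The HTML snippets are invented, and real wrapper-induction systems handle far more delimiter structure than this sketch.

```python
import os

# Training examples: (page, target value). The learner never sees the
# meaning of the text, only the formatting around the target.
training = [
    ("<b>Price:</b> <i>$12</i><br>", "$12"),
    ("<b>Price:</b> <i>$7</i><br>", "$7"),
]

def learn_delimiters(examples):
    lefts = [page[:page.index(value)] for page, value in examples]
    rights = [page[page.index(value) + len(value):] for page, value in examples]
    # The learned delimiters are the longest contexts shared by every example
    # (common suffix of the left contexts, common prefix of the right contexts).
    left = os.path.commonprefix([s[::-1] for s in lefts])[::-1]
    right = os.path.commonprefix(rights)
    return left, right

def apply_wrapper(page, left, right):
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

left, right = learn_delimiters(training)
print(apply_wrapper("<b>Price:</b> <i>$99</i><br>", left, right))  # $99
```

The brittleness discussed later falls out of this design: if the site's template changes so that `left` or `right` no longer appears, the wrapper breaks and must be retrained.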

At the most fundamental level, machine learning rests on the assumption that mathematical concepts capture relevant properties of reality, and that you can translate back and forth between reality and mathematics. In short, concepts are translated into mathematical patterns.

Machine learning is important because content management on the Web must be automated as much as possible. Automation is key because human labor cannot scale to an ever-increasing number of documents and users. Automation significantly reduces the amount of money and number of hours required to manage content effectively.


Scalability and portability are the challenges that face both NLP and wrapper induction technologies. NLP establishes patterns that are valid for a specific domain and for a particular task only. As soon as the topic changes, entirely new patterns need to be established. For instance, the verb "to place" within the domain of terrorist activities is always linked with bombs [11]. Making this assumption outside this topic would lead to trouble. To place the ball on the ground, the vase on the table, or the person in the job is unrelated to terrorism. No NLP software system can claim to tackle general language in an open-ended task. This is the chief problem of all practical natural language processing systems. NLP is effective only in a narrowly restricted domain. Unrestricted natural language processing is still a long way from being solved, yet IE methods are effective because they rely on topical restrictions.

On the wrapper side of IE, these systems require large amounts of training data, and collecting these examples can be time-consuming. To avoid gathering a lot of training data, the domain in which the wrapper is expected to be effective must be limited. In addition, Web sites are occasionally remodeled, and when the user interface changes, a site's wrapper is broken. Because wrappers rely on low-level formatting details, they are brittle [12].


XML is an important step towards offering efficient resource discovery on the Web, although it does not completely solve the problem. XML is important because it facilitates increased access to and description of the content contained within documents. The technology separates the intellectual content of a text from its surrounding structure, meaning that information can be converted into a uniform structure.
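Extracted content can be given that uniform structure with a few lines of standard library code. The element names and field values here are illustrative; the point is that the same XML shape can hold content pulled from an email, a PDF, or a Web page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fields produced by an extraction pass over some document.
fields = {"speaker": "Jane Smith", "topic": "Information Retrieval"}

# Wrap the content in a uniform XML structure, separating the
# intellectual content from its original presentation.
talk = ET.Element("talk")
for name, value in fields.items():
    ET.SubElement(talk, name).text = value

print(ET.tostring(talk, encoding="unicode"))
# <talk><speaker>Jane Smith</speaker><topic>Information Retrieval</topic></talk>
```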

XML makes it easier for developers to take the pieces of a document apart and reassemble them, yet information extraction still needs to be accomplished. XML is an enabling technology: like a building permit at a construction site, it allows information aggregation and synthesis to take place without performing the work itself.


In the current "infoglut" context, content management is a key component of basic literacy. A Fall 2000 study numbered the Web at 2.5 billion documents with 7.3 million unique pages being added every day [13]. The study's authors concluded, "We are all drowning in a sea of information–the challenge is to learn how to swim" [14]. The ability to access, organize, and think critically about information is becoming both more important and more difficult. Because IE technology integrates content scattered across Web servers, it is an important first step in taming the Internet wilderness.

Information access would improve if information in many different formats could be extracted and integrated into a structured form. Because IE is an attempt to convert information from various text documents into database entries, it plays a key role in improving online knowledge discovery. Information extraction software has the potential to convert the Web into a structured database. This is an exciting vision for reordering how end-users retrieve and organize digital information. Once information is encoded in a database, it could be organized into a taxonomy or searched over by textual attribute or feature. This stands as a vast improvement over the usual search protocol: index content and query full-text documents by keyword.


The "hidden Web" refers to Web pages that are dynamically generated from databases. Web technology is shifting away from putting content into static pages and towards placing information in relational databases. These databases are flexible structures that assemble content "on-the-fly" and deliver it to end-users. Such Web sites are organized as a series of templates. Based on a user query, content is pulled from databases and placed in a template. Conventional search engines cannot index the "hidden Web." When spiders come across such databases, they are locked out. They can record the database's address, but cannot index its documents.

IE can access the useful information hidden away in relational databases. In fact, wrapper induction technology is especially well-suited to this problem because the wrapper only has to learn a Web site's template. Templates are easy for wrappers to train on. The "hidden Web" is another example of how document structure can be exploited by IE technology.


Information extraction technology is part of a promising new trend that breaks up content and reassembles it into smaller chunks [15]. As content "goes to pieces" on the Web, information extraction technology grows in importance. The Web's current navigation model of browsing from site to site does not facilitate retrieving and integrating data from multiple sites. IE tears down the barriers that separate information residing on different Web sites. Because this technology aggregates and synthesizes content from various sources on the Web, it introduces greater efficiency and granularity to the task of finding digital information.

Call it what you will–integral to knowledge management initiatives, an alternative to traditional search functions, the catalyst for knowledge mining, an access point to the hidden Web, the solution to information overload–data extraction technologies foreshadow exciting new developments for information professionals.

Acknowledgements: Many thanks to Chris Harris for his thoughtful comments. I appreciate Neil Senturia's continuing support.


[1] Gaizauskas, Robert and Yorick Wilks. "Information Extraction: Beyond Document Retrieval." Journal of Documentation. 54, no. 1 (January 1998): pp. 70-105.

[2] Ibid.

[3] Cardie, Claire. "Empirical Methods in Information Extraction." AI Magazine. 18, no. 4 (Winter 1997): pp. 68-80.

[4] Lehnert, Wendy and Beth Sundheim. "A Performance Evaluation of Text-Analysis Technologies." AI Magazine (Fall 1991): http://www-nlp.cs.umass.edu/ciir-pubs/aimag.pdf.

[5] Lehnert, Wendy. "Information Extraction." Etext from the Natural Language Processing Laboratory, University of Massachusetts: http://www-nlp.cs.umass.edu/nlpie.html.

[6] Ibid.

[7] Lehnert, Wendy. "Cognition, Computers and Car Bombs: How Yale Prepared Me for the '90s" in Beliefs, Reasoning, and Decision Making: Psycho-logic in Honor of Bob Abelson (eds: Schank & Langer), Lawrence Erlbaum Associates, Hillsdale, NJ. (1994): pp. 143-173. Shorter version of the essay without graphics: http://ciir.cs.umass.edu/pubfiles/cognition3.pdf.

[8] Eikvil, Line. "Information Extraction from World Wide Web: a Survey." Report No. 495, Norwegian Computing Center (July 1999).

[9] Freitag, Dayne. "Two Approaches to Learning for Information Extraction." Talk at UC San Diego (San Diego, CA) on October 16, 2000.

[10] Kushmerick, Nicholas. "Wrapping up the Web." Synergy: Newsletter of the EC Computational Intelligence and Learning Cluster Issue 2 (Spring 2000): http://www.dcs.napier.ac.uk/coil/news/feature46.html.

[11] See note 7.

[12] Kushmerick, Nicholas. "Gleaning the Web." IEEE Intelligent Systems, vol. 14, no. 2 (1999): http://www.cs.ucd.ie/staff/nick/home/research/download/kushmerick-ieeeis99.pdf.

[13] Lyman, Peter & Hal R. Varian. "How Much Information?" http://www.sims.berkeley.edu/how-much-info.

[14] "Online Content: How Much Information can you Handle?" Internet Content (Oct. 23, 2000): http://www.internetcontent.net.

[15] Luh, James. "Content Goes to Pieces." Internet World (July 1, 2000): http://www.internetworld.com/070100/7.01Cover1.asp.


CORPUS A set of documents. For example, the Monster.com resume database, the works of Shakespeare, and the Web itself.

EXTRACTION PATTERN A pattern that represents a pre-determined entity or event (corporate names, conferences, and workshops, etc.) in a natural language text.

NATURAL LANGUAGE PROCESSING (NLP) Using software to "understand" the meaning contained within texts. Everyday speech is broken down into patterns. Typically, these systems employ syntactic analysis to infer the semantic meaning embedded in documents. NLP identifies patterns in sample texts and makes predictions about unseen texts. Also called computational linguistics.

SEMANTIC The part of language concerned with meaning. For example, the phrases "my mother's brother" and "my uncle" are two ways of saying the same thing and, therefore, have the same semantic value.

SEMI-STRUCTURED TEXTS Most information contained on the Web is embedded in semi-structured texts. This includes email, news stories, resumes, magazine articles, press releases, etc. The information contained within these documents is not as rigidly ordered as database entries, but does contain some reliable formatting.

STRUCTURED TEXTS The various types of documents available on the Internet are often (erroneously) characterized as structured or unstructured. Structured documents refer to database entries and information in tabular form. For example, MARC records in an OPAC database, search results on Yahoo!, Ebay product postings, etc. See semi-structured texts.

SYNTACTIC The part of language concerned with syntax, sentence structure. For example, the phrases "my mother's brother" and "my uncle" express the same relationship, but the way in which the information is expressed differs.

UNSTRUCTURED TEXTS Sometimes called "natural language" texts, these are documents that exist with minimal formatting. The difference between unstructured and structured texts is a matter of document formatting. See semi-structured texts.

Katherine C. Adams (kadams@monohime.com) is an information architect for Mohomine.


