The Eighth Search Engine Meeting
By Donald T. Hawkins
After last year's detour to San Francisco, the Search Engine Meeting returned
to Boston April 7-8 for its eighth annual gathering. Despite these shaky
economic times, organizer Harry Collier of Infonortics, Ltd. was pleased with
the turnout of 135 attendees. These participants were primarily technologists
and researchers working at the leading edge of information searching.
Delegates were treated to a feast of 20 presentations and a panel discussion. Adobe
PDF files of the presentations are available at the Infonortics Web site (http://www.infonortics.com).
One of the major themes from last year, the differences between general
Web search engines and those on intranets, was continued at this year's
meeting. That idea seems to have broadened somewhat to focus on search engines
as components of corporate intranets' information portals. A new topic, unified
systems for searching both structured and unstructured information, has
emerged as the theme of much current research.
David Evans, CEO of Clairvoyance Corp., opened the meeting with a keynote
address that reviewed search engine history and incorporated some personal
reminiscences from 1994 to the present. During this session, he wondered which
search engine was the first, but his search for the answer was not entirely
successful. He found a number of conflicting opinions and had difficulty defining
a search engine in the context of the Web's early days. Some go back to the
pre-Web period and suggest that Archie, Veronica, and Gopher qualified as search
engines, while others say that AltaVista and Excite were the first. Still others
date searching to the early days of UNIX and its "grep" command. (Trivia question:
What does grep stand for? The answer is at the end of this article.)
By searching Google's newsgroups, Evans found a message from Aug. 14, 1994,
announcing Lycos, a new search service that offered "probabilistic retrieval of
over 390,000 WWW documents." (Compare that with today's marketing claims by
search engine companies, which tout the billions of documents available
through their services!) Searching was not nearly as widespread then as it
is now. Probably fewer than 100,000 people conducted searches regularly, as
opposed to millions today. Americans are now said to spend more than 1 1/2 hours
a week searching for information on the Web. All this searching has led to
the emergence of the term "search rage," which describes the feelings experienced
when searchers don't find what they're looking for within 12 minutes.
The years 1990-1994 saw the rise of human-produced subject directories.
This led to the introduction of Yahoo!, which achieved its first million-hit
day in 1994. By 1996, new search engines with additional features and increased
functionality had appeared on the market, and the business was in full swing.
The concept of paying for clicks or a search term began shortly thereafter
and is a major industry force today.
We now have improved technologies that lead to higher relevancy of search
results, the spread of search engines into intranets, new avenues for obtaining
revenue from searching, and, interestingly, a return to ontologies and taxonomies
as ways to organize information.
Evans' keynote was followed by four presentations that covered search process
problems. Elizabeth Liddy of Syracuse University, a pioneer in search engine
research, described three projects that address the automation of metadata
generation. Metadata is now generated manually, often by professional indexers, a
costly and labor-intensive process.
Research at the University of Washington, Cornell University, and Syracuse
is attempting to develop metadata-generation algorithms. Liddy's experiments
compare search results using algorithmically generated metatags with those
obtained using human-assigned tags. Further work is needed before such systems
can be put into general use, but the research shows promise.
Claude Vogel, chief scientist at Convera, described a different approach
to document classification. Documents are often indexed using a thesaurus (i.e.,
pre-coordinated indexing), which may not give the best retrieval results. Better
results are obtained by combining pre-coordinated indexing with terms that
have been dynamically generated from a full-text search using rule-based techniques
(post-coordinated indexing). The problem with this approach is its complexity.
Vogel suggested that a search engine could assign a "semantic signature" to
documents and then use it to organize them, thus improving results. Prototype
systems built by Convera and Endeca take slightly different approaches to this
technology. Endeca reorganizes search results using indexing built from the
retrieval set, while Convera utilizes an ontology to organize the results.
Information overload is a common problem in Web searches. Raul Valdes-Perez,
president of Vivísimo, suggested that many people solve information
overload by "information overlook," simply ignoring much of the data they
retrieve. Using a statistic from Evans' keynote address, Valdes-Perez asked, "How
many documents can you open in the 12 minutes before search rage occurs?"
Information overlook has the following significant business costs:
Employees don't get the information they need to do their jobs.
Customers can't solve their problems.
Publishers may lose readership.
Web advertisers may lose click-through revenue.
Users may miss discoveries and opportunities.
We can stop overload by eliminating useless and irrelevant information or
by helping people become more efficient. Manual tagging is labor-intensive
and expensive. A Forrester Research report estimates that it costs up to $50
to tag a large document. Companies that have employed automatic tagging include
Northern Light, whose search engine (which is no longer publicly available)
placed search results in "folders," and Vivísimo, which uses document
clustering that lets searchers organize information dynamically without the
need to construct and maintain taxonomies.
New Searching Tools
Several presentations described some practical applications of new searching
technology. Frank Smadja, chief technology officer of Elron Software, noted
that with the recent rapid growth of e-mail spam, a huge opportunity exists
for text-categorization and filtering tools.
Chahab Nastar, president of LTU Technologies, updated his presentation from
last year's meeting that described his work retrieving images from the Corbis
collection. Many "image retrieval" systems are simply doing text retrieval
by searching the text of a caption or a description of the image. LTU's system
looks at the actual pixels of the image and creates a "DNA signature" for them,
which is then searched. (A demo database of 70,000 images is available at http://corbis.ltutech.com.)
Alan Smeaton described Dublin City University's research on the more difficult
problem of searching video archives. The school's system, Físchlár,
uses various characteristics of video encoding to let its users search a library
of TV programs. More than 2,000 people on campus use Físchlár
for research, teaching, and entertainment.
Scientific documents are difficult to index because they have multi-word
concepts and multilevel hierarchies. Written at an expert level, they are generally
longer than other documents and many of their concepts are not stated explicitly.
Because of this complexity, automated indexing may be impossible. In its merger
with Union Carbide Corp. 2 years ago, Dow Chemical Co. faced those challenges.
Union Carbide's information was largely in print form and not well-indexed.
A team whose members had a wide variety of skills integrated Dow's and Union
Carbide's documents and created a globally accessible electronic repository.
The entire collection was re-indexed using automated and human-assisted techniques.
Dealing with the federal government is far different than dealing with academic
or corporate institutions. Steve Arnold, a well-known industry observer, discussed
several of the pitfalls. He also listed some Web sites where searchers can
find information on government procurement. Focusing on searching, Arnold said
that there are three broad areas in which the government is interested: GSA
schedules, records from a single agency, and classified material.
Arnold also noted that we must be aware of four major benchmarks that the
government applies to search software proposals: relevance, database content,
integration, and the interface. In his opinion, many of today's database search
engines do not have the functionality that government agencies demand, primarily
because much of the government's computing platform is based on UNIX, not Windows.
On the second day of the meeting, Arnold led a panel that examined pay-per-click
advertising as a major new trend and identified four important issues: crawling
technology, index freshness, fraud prevention, and analysis of click data.
The second most heavily used Web functionality (after e-mail) is search, and
companies using pay-per-click technology have figured out how to monetize it. This
model may well drive the future of search engines.
A group of presentations focused on search applications for enterprises.
Martin White gave an excellent list of criteria for selecting an intranet search
engine. He said that because there's often little to link to on an intranet,
search, not Web surfing, is the most important function. White has
found that many CIOs do not understand searching, the concepts of precision and
recall, or taxonomy requirements. He suggested that the evaluation of a search
engine and a taxonomy development system should be done separately, not as
an integrated package from a single vendor.
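For readers less familiar with the two measures White mentioned, precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The short Python sketch below illustrates the arithmetic. The function name and the document identifiers are hypothetical and were not part of White's presentation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical intranet query: 4 documents returned, 5 actually relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc5", "doc6", "doc7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

In this made-up case, half of what the engine returned was useful, but it found only two of the five relevant documents. That trade-off is the kind of thing an evaluation of an intranet search engine would need to weigh.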
White's session was followed by descriptions of case studies from AT&T
and Deutsche Telekom. At AT&T, six teams were brought together to evaluate
search, and the participants exploited synergies to develop a common set of
requirements. Deutsche Telekom developed an intranet search engine that incorporates
semantic features to classify e-mail and other information. The search engine
was successfully integrated into the intranet and handles approximately 80,000
Matthew Koll, a pioneer in search engine development, now leads start-up
company Wondir. He has observed that much information is not being found because
it's in the invisible Web or in people's heads. Many ask-an-expert sites are
available, but they're rarely used because searchers must know about them in
advance. The growth of instant messaging services shows that you can find information
by asking others. Wondir meets this need by providing an electronic meeting
place where people can ask questions and receive answers from experts.
Koll thinks question-answering will one day be as easy as searching, with
the Wondir system integrated into existing search engines as a value-added
service. He said that Wondir could become "the last information service whose
results remain totally free of commercial influence."
Collections of information with diverse types of data are common today, but
such archives present significant challenges for search engines. Much of today's
data isn't well-structured and doesn't lend itself to storage in relational
databases. Paul Odom, president of Pliant Technologies, echoed many of the
other speakers in describing the problems involved. Frequently, one must determine
the intent of the user's query and deal with different word meanings, variant
endings, concepts, and even contexts. Pliant's retrieval technology uses semantics
and knowledge-based navigation combined with taxonomies. Generally, high-relevance
hits can be retrieved with fewer than five mouse clicks.
Sue Feldman, vice president of content technologies research at IDC, categorized the
tasks commonly done by knowledge workers as they explore, retrieve, analyze,
and distribute information. Noting that we're drowning in a sea of content,
she distinguished between content, data, and related technologies. Data- and
content-centric applications have different technology requirements, but there's
a need for integrated systems that can handle both. An integrated system could
access various types of information with a single query and use standard tools
to manipulate the results. Thus, the strengths of both data and content applications
can be exploited. Feldman presented data showing a significant increase in
retrieval using a combined system and discussed strategies for combining the
The final group of presentations dealt with search models and applications.
Peter Bell, co-founder of Endeca Technologies, talked about the integration
of searching and navigation. Navigation helps users find information, but it's
difficult to do well because there are often only one or two paths to each
record. Using aids that combine full-text retrieval with information facets
(broad subject categories), Endeca has developed a navigation system to guide
searchers through information. This approach allows searching of both structured
and unstructured data, which are often in different databases.
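The general idea of combining full-text retrieval with facets can be sketched in a few lines of Python. This is only an illustration under simple assumptions (exact word matching, a tiny in-memory record list, made-up facet names), not a description of Endeca's actual system.

```python
from collections import Counter

# Hypothetical records: unstructured text plus structured facet values.
RECORDS = [
    {"id": 1, "text": "dry red wine", "facets": {"type": "wine", "region": "france"}},
    {"id": 2, "text": "sweet white wine", "facets": {"type": "wine", "region": "germany"}},
    {"id": 3, "text": "red wine vinegar", "facets": {"type": "vinegar", "region": "france"}},
    {"id": 4, "text": "sparkling red wine", "facets": {"type": "wine", "region": "italy"}},
]


def search(records, query, selected=None):
    """Keep records whose text contains every query word and whose
    facets match every selected facet value."""
    words = query.lower().split()
    selected = selected or {}
    results = []
    for rec in records:
        tokens = set(rec["text"].lower().split())
        text_ok = all(w in tokens for w in words)
        facets_ok = all(rec["facets"].get(k) == v for k, v in selected.items())
        if text_ok and facets_ok:
            results.append(rec)
    return results


def facet_counts(results):
    """Count the facet values present in the result set; these counts
    are what a navigation panel would offer as the next refinement."""
    counts = {}
    for rec in results:
        for name, value in rec["facets"].items():
            counts.setdefault(name, Counter())[value] += 1
    return counts


hits = search(RECORDS, "red wine")
print([rec["id"] for rec in hits])      # [1, 3, 4]
print(dict(facet_counts(hits)["type"]))  # {'wine': 2, 'vinegar': 1}

# The user then narrows the same query by choosing a facet value.
narrowed = search(RECORDS, "red wine", {"type": "wine"})
print([rec["id"] for rec in narrowed])   # [1, 4]
```

Because the facet counts are computed from the current result set, every refinement offered to the searcher is guaranteed to lead to at least one record.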
Raymond Lau, chief technology officer of iPhrase Technologies, addressed
the problem of self-service information retrieval. He noted that the current
model of searching is inefficient and places the burden of success on the user.
Much content is buried deep in Web sites (more than three clicks from the home
page), and few users click down to find it. The search engine often presents
its results in a list that extends for several pages. Studies have shown that
85 percent of users abandon a search before looking at all the hits. Lau offered
some technological solutions to these challenges, including natural language
processing, dynamically designed presentation formats and user guidance, and
single access to all information sources.
Prabhakar Raghavan, chief technology officer of Verity, Inc., continued the
discussion of access to structured and unstructured information. He suggested
that exploiting the structure (classification, tags, taxonomy, etc.)
of information and tracking usage is the key to effective retrieval. Because
XML is becoming a standard for tagging data, it provides a means of creating
an information structure. However, most of today's search engines do not search
XML data directly.
Michael Wollowski and Robert Signorelli of the Rose-Hulman Institute of Technology
described their efforts to develop an XML search engine. Their prototype uses
the structure inherent in XML documents, and it can display search results
as plain text. The developers conducted a test with two groups of students,
one using Google and the other using the XML engine. They found that the latter
worked well on certain types of documents but failed as a general search engine.
Searchers liked the interface once they learned how to use it.
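To show what using the structure of XML documents can mean in practice, here is a minimal Python sketch that restricts a search term to one named element. The sample document, tag names, and function are hypothetical; this is not the Rose-Hulman prototype, only an illustration built on Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the word "search" appears in two different roles.
DOC = """<library>
  <book><title>Search Engines</title><author>Smith</author></book>
  <book><title>Gardening</title><author>Search</author></book>
</library>"""


def search_in_element(xml_text, tag, term):
    """Return the text of every <tag> element whose text contains
    the term (case-insensitive)."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    return [
        el.text
        for el in root.iter(tag)
        if el.text and term in el.text.lower()
    ]


print(search_in_element(DOC, "title", "search"))   # ['Search Engines']
print(search_in_element(DOC, "author", "search"))  # ['Search']
```

A plain full-text engine would return both books for the word "search." A structure-aware query can distinguish a book about searching from a book written by someone named Search.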
In the final presentation, Jean Poncet of Pertimm suggested that in order
to search effectively, we must consider information's context as well as its
concepts. We communicate with complex mixtures of words and phrases, not Boolean
logic. Pertimm uses linguistic data from seven languages in its retrieval system.
By utilizing a document's text and structure, a "semantic glimpse" can be created
without opening the file. The glimpse is then used to develop a search query.
Excellent search results can be obtained with this approach.
The Search Engine meetings annually show that searching is by no means a
fully developed technology. It's an exciting field that continues to progress.
The Ninth Search Engine Meeting will be held April 19-20, 2004, in The
(Trivia question answer: grep stands for "global regular expression print." It's
a UNIX system command that's used to search for character strings in text files.)
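For the curious, a grep-style search is easy to approximate. The Python sketch below, which uses the standard re module, returns each line that matches a regular expression along with its line number. It is a simplified illustration, not a reimplementation of the UNIX command, and the sample lines are made up.

```python
import re


def grep(pattern, lines):
    """Return (line_number, line) pairs for every line in which the
    regular expression matches somewhere."""
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


sample = [
    "Lycos announced in 1994",
    "AltaVista and Excite",
    "Archie, Veronica, and Gopher",
]

# Find lines containing a word that starts with a capital A.
for number, line in grep(r"A\w+", sample):
    print(f"{number}:{line}")
```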
Donald T. Hawkins is director of intranet content for Information
Today, Inc. and editor in chief of Information Science & Technology Abstracts.
His e-mail address is firstname.lastname@example.org.