Search of ...The Good Search: The Invisible Elephant
by Stephen E. Arnold President, Arnold Information
Search is a problem. Or perhaps we should rephrase the sentiment as, "Search
remains a challenging human-computer issue." Listen to one MBA's comment about
the Louisville Free Public Library's online search functions. "Too many hits." Accurate
Such candor is generally in limited supply at conferences,
in journal articles, and during sales presentations
from vendors of search-and-retrieval systems. We have
an elephant in our midst, and no one wants to ask, "What's
this elephant doing here?" I suggest we make an attempt
to acknowledge the situation and do whatever it takes
to put the fellow back in the zoo.
Many extremely complex and costly search-and-retrieval
systems are in use in many large organizations. A typical
news-focused system supporting about 500 users costs
one U.S. government agency more than $3 million per
year. According to one expert close to the agency, "About
95 percent of the system's functionality is not used.
People type one or two words and take whatever is provided.
The users seem happy with good enough." But there are
different types of search engines and different "ecologies" for
Meanwhile, ever more advanced linguistic-statistical,
knowledge-based, adaptive search systems are showcased
at trade shows, in impassioned sales presentations,
and on often inscrutable Web sites. White papers explain
and re-explain such concepts as "an ontology generation engine" and/or "real-time
linguistic analysis of diverse document types." Data
are displayed in animated hyperbolic maps or relevancy-ranked
lists with the key concepts highlighted for the busy
user. Some of the "advanced" systems create reports
in the form of packets of Adobe Portable Document Format
pages or, at the other extreme, collections of "key
paragraphs" from longer documents. Dot points, extracts,
and flagged items are supposed to make perusing a list
of "hits" a more productive task.
Eyes glaze and potential buyers in marketing departments
looking for information wonder, sometimes out loud, "What's
all this jargon hiding? Will this system search, find
results, work almost all the time, and fit my budget?" Not
surprisingly, these are difficult questions to answer,
and the answers are very, very hard to get.
What's the Jargon?
Ontology generation. This bit of jargon covers activities
ranging from creating a list of subject categories
for a particular content collection to downloading
the Library of Congress subject headings and making
additions and deletions as required. Automatic
ontology generation means no librarians, please.
Real time remains, of course, the term du
jour for updating an index when new content becomes
available. Real time is relative and, as a
bit of testing on a breaking news story like the 9-11
attack proved, essentially untrue except for a handful
of specialized services.
Linguistic analysis is quite a slippery
phrase. When a pitchman uses it to describe a new search
engine, the listener is supposed to conclude that the
software "understands" words and phrases in a manner
similar to a human. Software remains rule-based, and
the fancier algorithms using "ant technology" or "swarming
techniques" remain locked in research and development
laboratories. Linguistic analysis boils down to knowledge
bases or statistical routines hooked together in a
clever new way.
Hyperbolic maps and other visualization techniques
are increasingly available. The idea is to display
a list of "hits" for a query in a visually instructional
manner. For a look at what programmers can do with
Macromedia's Flash technology, click to Kartoo (see
Figure 1 on page 41). Are these techniques likely to
tackle finding a single electronic mail message with
a PowerPoint attachment? No. And not for quite a while.
Search has become a digital form of roulette. The
customer picks a product, spins the wheel, releases
the ball, and hopes for a winner. Buying search-and-retrieval
software is a similar bet. As in casinos, the customer (an
information technology manager in charge of the search
software acquisition) usually walks away disappointed.
Anyone who has had to implement a large-scale system
indexing content from 40,000 or more servers and processing
50 million or more documents knows that the search-and-retrieval
software for a system of this size will differ quite
a bit from the free search routine included with Windows
XP or the "Find" option in Outlook Express.
The surprising truth is that only a very small
number of companies have products that can handle a
50 million document baseline, keep it up and running
99 percent of the time, and update the index in less
than 48 hours.
There are reasons why Google grows its index in chunks,
jumping from 1 billion to 3 billion documents over
a period of years. Those reasons include planning the
growth, engineering the many subsystems that make up
search, and accruing the money to extend the infrastructure
so that it can add the content while maintaining the
Anyone making it to the third or fourth year of a
computer science program at a major university can
implement the software to find, index, and make searchable
content. What computer science classes and a quick
course in Excel macros do not cover is how the costs
work in a large-scale Internet or commercial intranet
search-and-retrieval product. Costs, not technology,
account for much of the attrition in the search-and-retrieval
The Darwinian nature of the search business allows
boutique search companies to appear and often disappear
as quickly. Among the companies whose investors have
considerable optimism are MyAmigo (Australia), Pertimm
(France), ClearForest (United States), and iPhrase
(United States). Hopefully, most of these companies
will survive and thrive. It is, however, doubtful that,
in the near term, any of the newcomers will challenge
the dominance of a handful of search-and-retrieval
companies. Verity, Inc., PC Docs, Autonomy Ltd, and
a few others dominate the commercial market. Google
and FAST Search & Retrieval are best positioned
for a run at the market leaders. Overture, a company
that has quietly transformed search to a form of advertising,
has revenues and earnings that dwarf virtually all
search-and-retrieval companies. Overture, however,
downplays its search technology and focuses on its
revenues of more than $500 million. Only Google, with
an estimated 2002 revenue of $300 million, seems positioned
to mount a threat. The rest of the thousands of search
vendors are finding themselves new homes nestled inside
such fuzzy product packages as customer relationship
management, knowledge management, and content management
Search and retrieval is at once everywhere in the
form of free Web searching via Google, FAST's "alltheweb.com," and
Yahoo!'s Inktomi service. But it is also nowhere, hidden in
the warp and woof of the fabric of Windows XP, e-mail
programs, and the ubiquitous "search" box on intranet
and extranet Web pages. Search is ubiquitous, so most
users do not see it as a separate function, but just
as a handy and necessary tool.
The Market Segments for Search
Search has its distinct niches or market segments.
The diagram entitled "Search and Soft Market Segment
Boundaries" (see page 43) provides a simplified view
of customer and user clusters. First, every computer
or mobile device has some type of search function.
In a mobile phone, search may underlie hitting a number
key and seeing the name and phone number stored at
that location in the phone's memory. Search may mean
using the built-in tools in commercial application
software. Even Excel has a "find" command. Within this
segment, complete micro-ecologies of search software
exist. For chemists, Reed Elsevier and Chemical Abstracts
offer specialized tools that meet users' needs in their
laboratory or on an organization's intranet. An "intranet" is
a network that operates within an entity with access
requiring a user name and a password.
Second, there is the Internet. Internet search and
retrieval has been free, although the financial model
for monetizing search has been cobbled together from
the failure of many early entrants. Google, Alta Vista,
Yahoo!, and newcomers such as Bitpipe offer "free searches." The
thirst of search engines for money is unslakeable.
Search engines generate revenue by selling advertising,
selling a "hit" when a user does a search on a particular
topic, or reselling searches of content to other Web
sites. The variations for monetizing search are proliferating.
The issue with monetizing is, of course, objectivity.
In the Internet "free" search segment, objectivity
is not the common coin of the realm. Paying for clicks
and traffic is more important than relevance. When
Google goes public in the next year or so, the need
for revenue means that objectivity will take a back
seat to monetizing.
The third segment I call "special domains." These
are the collections of content that defy the mainstream,
text-centric search engine. Music, videos, computer-aided
drafting diagrams with a database of parts and prices,
medical images, and audio content are not searchable
with the software that falls in the purview of librarians
or expert searchers. These special domains account
for as much as 90 percent of the digital content produced
at this time, based on a study we conducted over a
period of 6 months in 2002 for a major technology firm.
Chinese language Web pages, an electronic mail message
with an Excel attachment in a forwarded message, purchase
order information in CICS system files, and streaming
audio from radio stations are just four examples of
content that becomes ever more plentiful, while remaining
difficult, if not impossible, to search.
The critical portion of the diagram is the boundaries
among and between segments. These boundaries are like
a paramecium's. The boundary is semi-rigid, permeable,
and subject to its environment. Search, therefore,
can be explained in an infinite number of ways, making
comparisons difficult. What was true yesterday of Google
may not be true today. Google's catalog service was
essentially unusable. However, the Froogle service
is a useful, high-value service. Consultant analyses
and comparisons of search software are the intellectual
Twinkies of this software sector. One can eat many
Twinkies and go into sugar shock, but the essential
nutrients are simply not there, and the growling stomach
remains unsatisfied. Fuzzy boundaries make comparing
search software in an "apples to apples" way almost
This diagram illustrates in part why the search landscape
and the dominant companies in the search business change
over time. Consider the Canadian search company Fulcrum.
Fulcrum's software is quite good among today's intranet
tools. Several years ago, it was bought by another
Canadian company (Hummingbird). Hummingbird provided
software to permit a PC user to access data on a corporate
mainframe via a screen scraping program. Hummingbird
was, in turn, acquired by another Canadian company
(PC Docs), a document management outfit that wanted
to upgrade its search-and-retrieval functions and leverage
Hummingbird's customer base. Now, Fulcrum search is
a facet of the PC Docs product suite. Or look at Open Text.
Originally a Web indexing and SGML database with a
search function, Open Text now consists of pieces of
Tim Bray's search engine and the BASIS database search
tool plus other search functions to handle the collaborative
content in Live Link. Inktomi sold its intranet search
business to Verity and then allowed itself to be purchased
for about $250 million by Yahoo! Other companies have
simply retreated from search, repositioned themselves,
and emerged as taxonomy and ontology companies. Examples
of this include Semio (France and California) and Applied
Semantics (formerly Oingo, operating in Los Angeles).
There is more horse swapping and cattle rustling in
these three segments than almost any other software
sector. Confusing? Absolutely.
The Institutional Self-Discovery Market
In an interesting twist, the search business has
begun morphing search and retrieval into a system that
discovers what information a company already has. As
silly as this sounds, there are organizations that
simply do not know what information exists on the organization's
own servers. (If this sounds like a commercial for
knowledge management, it is not.) Search-and-retrieval
software has been packaged as a way for a security
officer at a large company to know what the Des Moines,
Iowa, office put on the Internet.
Most organizations have allowed their Web server
population to grow like Manchester, England, at the
height of the industrial revolution. In our security-crazed,
post 9-11 and post-Enron world, boards of directors
have to know what information exists, in what form,
and, of course, accessible to whom. As more companies
reinvent themselves as "knowledge organizations" or "information
companies," few if any employees know such basics as:
What information is in digital form
Which information is the most recent or "correct" version
Where a particular piece of information
If this reminds the reader of "content management," it
is an easy mental leap to the role of search and retrieval
in this market sector. In the pre-digital age, people
could stay late and look through stacks of paper. Today,
not even the most caffeinated Type A can browse hundreds
or thousands of files on different machines in many
different formats. The job is too onerous, too tedious.
With a few deft marketing touches, a search engine
can be paraded as an information discovery engine.
The idea is that the search-and-retrieval system
looks at a company's information objects, figures out
what each object is "about," and then clusters the
objects in a Dewey Decimal type of scheme. There is
a word for this type of work, and that word is indexing.
Indexing professionals, librarians, and content specialists,
e.g., those working for the National Library of Medicine,
used to do this kind of work. Now that such individuals
are deemed non-essential or simply too expensive, software
is supposed to do the job. Not a chance.
Verity, the current industry leader, makes it very
clear that part of the firm's professional service
fees includes payment for humans who "tune" and "train" the
Verity system. For those who can't afford Verity, the
transformed search companies or specialist firms can
deliver software that indexes and classifies so someone,
somewhere knows what is on a corporate intranet. ClearForest
is one company leading in this "discovery" niche. For
military intelligence and government security applications,
i2 Ltd. (Cambridge, England) provides a tightly integrated
suite of tools that allows discovery to run as a process
with results depicted with icons, connector lines,
and "hooks" to non-text objects.
The companies that have done the most effective job
of getting their technology embedded in content management,
customer relationship management, and (my favorite
meaningless discipline) knowledge management
are Verity (Mountain View, California) and Autonomy
Ltd. (Cambridge, England). These "M" businesses (document
management, customer relationship management, knowledge
management, and content management) need reliable
The segment leaders, Verity and Autonomy, have about
70 percent of the U.S. and European corporate and government
market. Both companies have products that "work." The
precise meaning of "work" is somewhat difficult to
define, because the lists of each company's customers
have about a one-third overlap. For basic search and
retrieval, these companies are market leaders. Unlike
Overture, Verity and Autonomy follow a business model
of licensing software and then selling support, customization,
and services. Both firms will provide the services
required to satisfy the customer. The price for "search
that works" can reach seven figures.
Verity's and Autonomy's strengths do not lie in the
firms' respective technologies. Verity relies on thesauri
and what might be called traditional indexing by extracting
terms. Newer algorithms have been added, and the company
can process database files in the recently upgraded
K2 engine. Autonomy relies on statistical techniques
originally based on Bayesian statistics. Like Verity,
Autonomy has embraced other approaches and acquired
companies in order to gain customers and technologies
in speech recognition. Both Verity and Autonomy can
support corporate customers. Smaller companies with
lower fees usually find that the juicy accounts go
to Verity or Autonomy because those firms can install,
support, and service enterprise clients. One systems
manager said in a focus group in 2002, "No one gets
fired for licensing Verity or Autonomy."
Most commercial search software with an intranet
version works at what might be called the 70 percent
level. For a query, more than two-thirds of the content
will be available when the query is passed. The results
will be about 70 percent on the topic. The very best
engines push into the 80-percent range. It is very
difficult with today's technology to get consistently
high scores unless you restrict content domain tightly,
freeze updates, and craft correctly formed, fielded
queries. The reader familiar with SQL queries or Dialog
Boolean queries will immediately see why typing one
or two terms, hitting the enter key, and looking at
a list of hundreds of results requires considerable
Do commercial search engines work? Yes. Effective
search-and-retrieval software gets about 80 percent
of the relevant material, as shown by the results of
TREC competitions. Stated another way, the most effective
searches usually miss at least 20 percent of the content
that could be highly pertinent to the user's query.
Verity delivers this type of search effectiveness when
the company's software is properly set up. But, as
many Verity customers have discovered, this means employing
considerable human, manual effort. Searches in limited
domains with tightly controlled word lists are more
satisfying than searches run across heterogeneous domains
of content. For most users, precision and recall at
or near the B minus or C plus level is "good enough."
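
For readers who want the arithmetic behind those grades, here is a minimal sketch of how precision and recall are computed for a single query. The document IDs and the function name are invented for illustration; they are not TREC data or any vendor's scoring code.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of the retrieved documents that are relevant
    recall    = share of the relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, made up for this example only.
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d11", "d12", "d13"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case, seven of the ten results are on topic and seven of the ten relevant documents are found, so both scores sit at the 70 percent level described above.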
"Good enough," in fact, describes how most search-and-retrieval
engines work. Google is "good enough" because the results
are ranked by a voting algorithm that weights pages
with many links over pages with few links. What if
a page does not have links, but does have outstanding
content? Google may index the page, but unless the
query for that page is well formed, the page without
links may end up buried deep in the list of results
or not displayed at all. Most searches follow the Alexis
de Tocqueville rule that when the majority votes, the
result is mediocrity. Excuse the heresy as Google's
popularity continues to grow, but potentially useful
pages may disappear beneath more popular pages.
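
The voting idea can be sketched in a few lines. This is a deliberately simplified illustration that counts inbound links as votes; it is not Google's actual algorithm, and the site names and link graph are hypothetical.

```python
from collections import Counter


def rank_by_inlinks(links):
    """Rank pages by how many distinct other pages link to them."""
    votes = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # a page cannot vote for itself
                votes[target] += 1
    # Most votes first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link graph: page -> pages it links to.
links = {
    "popular.example": ["portal.example"],
    "portal.example": ["popular.example"],
    "blog.example": ["popular.example", "portal.example"],
    "expert.example": ["popular.example"],  # strong content, no inbound links
}

ranking = rank_by_inlinks(links)
for page, count in ranking:
    print(page, count)
```

Under this assumption, the hypothetical expert.example page lands at the bottom of the list no matter how good its content is, because nobody has "voted" for it.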
Back to Basics
In the early days of search and retrieval, there
was only one way to find information. A proprietary
system offered a command line interface. To find a
document, the searcher (certainly not a pejorative
term in most Fortune 500 companies or at NASA, where
search began in the late 1960s) crafted a query.
The query required a reference interview with the
person wanting the information or a conversation with
a colleague who understood a particular domain's jargon
and the context of the query for a particular client.
The search query was assembled using appropriate terms,
usually selected from a printed controlled vocabulary.
(In the early days of search, it was considered a point
of professional pride to have professional indexers
assign keywords using a thesaurus to the documents
or entries in a database. The word list, called
a controlled vocabulary, served as a road map
to information in the database.)
If synonyms were common, as they were in medical,
technical, general business, or news databases, "use
for" and "see also" references were inserted into the
thesaurus. The searcher then crafted a well-formed
Boolean query using the syntax of the online system.
Well formed means that the logic of the query would
return a narrow set of results or hits, and the provision
of such precise, on-point result sets proved that expert
searchers were at the controls.
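
A rough sketch of that workflow, assuming a tiny hand-indexed database, may help readers who never used a command line system. The thesaurus entries, record numbers, and function names are hypothetical, and the set operations stand in for the AND, OR, and NOT operators of an online system rather than reproducing any vendor's syntax.

```python
# "Use for" references: non-preferred synonym -> preferred thesaurus term.
USE_FOR = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}

# Records with descriptors assigned by (hypothetical) professional indexers.
RECORDS = {
    1: {"myocardial infarction", "aspirin"},
    2: {"myocardial infarction", "surgery"},
    3: {"aspirin", "headache"},
    4: {"myocardial infarction", "aspirin", "pediatrics"},
}


def preferred(term):
    """Map a searcher's term to the controlled-vocabulary term."""
    term = term.lower().strip()
    return USE_FOR.get(term, term)


def build_index(records):
    """Build an inverted index: descriptor -> set of record numbers."""
    index = {}
    for rec_id, descriptors in records.items():
        for descriptor in descriptors:
            index.setdefault(descriptor, set()).add(rec_id)
    return index


def postings(index, term):
    """Return the record numbers indexed under a term (after mapping)."""
    return index.get(preferred(term), set())


index = build_index(RECORDS)

# Boolean query: ("heart attack" AND aspirin) NOT pediatrics
hits = (postings(index, "heart attack") & postings(index, "aspirin")) - postings(index, "pediatrics")
print(sorted(hits))
```

The point of the sketch is the narrowing: each operator shrinks the result set, which is why a well-formed query returned a short, on-point list rather than hundreds of hits.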
The expert researcher then selected specific databases
(now called a corpus in today's search jargon) [2] and
ran the well-formed query against the appropriate files.
In 1970, only a handful of these online databases existed, a
number that grew to about 2,000 by 1985. (Today, the
Web has given rise to database proliferation, where
a single Web page counts as a database, and there are
more than 3 billion Web pages indexed by Google alone.)
The searcher reviewed the results and selected the
most relevant for their clients. Additional queries
were run by constructing search syntax that essentially
told the online system "give me more like this." How
far have we come since 1970?
Not far. Today's software is supposed to look at
the user's query, automatically expand the terms, run
the search, bring back relevant documents, rank them,
highlight the most important sections, and display
them. The blunt truth is that, for most online users
today, looking for information boils down to a pretty
straightforward set of actions. Hold on to your hat
and fasten your seat belt; users today do one or more
of three things:
Type one or two terms into a Google-style
search box and pick a likely "hit" from the first
page of results. (The most searched term on Google,
I have been told, is Yahoo!; see item 2 in
Go to a site with a Yahoo!-style taxonomy
or ontology and click on a likely heading and keep
clicking until a "hit" looks promising. (This is
the point-and-click approach much beloved by millions
and millions of Web users each day. It meets the "I'll
know it when I see it" criterion so important in research.)
Look at a pre-formed page and use what's
there. This is the digital equivalent of grabbing
a trade magazine from one's inbox and flipping pages
until a fact or table that answers a question catches
one's eye. Believe it or not, a 1981 Booz, Allen & Hamilton
study found that this was the second most popular
way to answer a question among polled executives.
The most popular way was to ask a colleague. An updated
study showed that today asking is still first, but
looking at a Web page has become the second most
popular way to get information.
Why Search and Retrieval Is Difficult
Search and retrieval is a much more complex problem
than most information professionals, systems engineers,
and even MBA-equipped presidents of search engine companies
grasp. Search brushes up against some problems computer
scientists may call intractable. An intractable problem
is one that cannot be solved given the present state
of computing resources available to solve the problem.
Let me highlight a few examples to put the search challenge
First, language or languages. The "answer" may not
be stated explicitly. Years ago, Autonomy Ltd. pitched
its search engine by saying that its Bayesian approach
would find information about penguins if the user entered, "Black
and white birds that cannot fly." I think the demonstration
worked, but in the real world, the Autonomy system
performs best on closed content domains of homogeneous
information. Language is a problem because of metaphor,
structure, and neologisms, and it becomes intractable
when one tries to support, say, a French query delivered
against content consisting of Arabic, Chinese, and
Korean material. To be fair, most people looking for
a pizza in Cleveland, Ohio, want to use English and
get the information with a single click. Interface
and presentation of search must balance power and ease
Second, most people doing searches don't know what
the answer is. The human mind can synthesize and recognize,
but is less adept at foretelling the future. So searching
requires looking at information and exploring. Much
of the Web's popularity stems from its support of browsing,
exploring, and wandering in interesting content spaces.
Clicking on a list of "hits" that numbers 100,000 or
more is mind-numbing. User behavior is predictable:
users want something useful so they can get on with
their lives. Search-and-retrieval systems must permit
a chance encounter with information to illuminate a
problem. Showing 10 hits may be inappropriate in some
cases and just right in others.
Third, as demographics change and thumb-typing young
people join the workforce, we need new types of search
systems. It is difficult to tell what long-term
impact Napster's peer-to-peer model will have on
information retrieval. One pundit (Howard Rheingold)
opines that swarm behavior will become the norm, not
solitary search and retrieval. I think of this approach
to answering questions as Google's popularity algorithm
on steroids. The answer is what people believe the
answer to be. One part of my mind wants to stop this
type of information retrieval in its tracks. The other
part says, "Maybe swarm searching is good enough: 'let
many flowers bloom,' as a famous resident of China
once said." The idea is that one asks a question and
passes it among many system users. The answers that
come back reflect a swarming process. Swarm technology
has been replicated to a degree in the search-and-retrieval
technology developed by NuTech Solutions in Charlotte,
North Carolina [3]. NuTech
uses the term "mereology" to describe its approach.
Fourth, the emergence of an ambient computing environment
supports the pushing of information to individuals.
With IPv6, every digital gizmo can have its own Internet address.
Personalization technology is becoming sufficiently
robust to deliver on-point information to an individual's
mobile phone without the "user" having to trigger any
query. In Scott McNealy's vision, an automobile that
needs fuel will query a database for the nearest gasoline
station. When a station comes in range, a map of where
to turn to get gas will appear on the automobile's
digital display. Search in this model reverts to what
used to be called Selective Dissemination of Information.
Today, the words used to describe the SDI approach
range from text mining to agent-based search and even
more esoteric terminology. Search will mesh with decision
support or Amazon-like recommendation systems. Most
people looking for information today seem to open their
arms to environments that wake up when the "searcher" turns
on a wireless device. The screen says, in effect, "Hello,
here's what you need to know right now." The Research
in Motion Blackberry device showed that pushing e-mail
and stock quotes was a potent online combination for
go-go executives in financial services and management
Fifth, in some search-and-retrieval situations, source
identification and verification (what art
dealers call provenance) is difficult. Few point-and-click
Google searchers or employees browsing filtered news
on a personalized portal page know or care about a
commercial database's editorial guidelines. If a consulting
firm's table of statistics appears on a Web site, it "must
be" accurate. Some pundits have winced when thinking
about Enron executives making decisions based on a
casual Web search or television talk show. Bad information
and loose ethical boundaries are combustible [4].
Most search engines for intranets drag in whatever
they find. My dog often brings me dead groundhogs.
Thoughtful of the dog, but not germane to my needs.
I am not sure software alone can address this challenge,
but it warrants thought.
This list of challenges can be extended almost indefinitely.
And we haven't even touched the cost of bandwidth to
index large content domains, the size and computational
capability of the indexing environment that must process
data, make judgments, and deliver results often to
thousands of users hitting a system simultaneously,
or the performance issues associated with making updates
and results display before the user clicks away in
frustration over slow response. The costs associated
with search are often difficult to control and, when
search firms run out of money, they close.
Approach Search's Weaknesses Objectively
Search vendors are scrupulously polite about their
competitors' technologies. That politesse stems from
the results of large-scale tests of search engines.
Look at 3 or 4 years of TREC results. Most of the technologies
perform in a clump. Precision and recall scores remain
essentially the same. What's more interesting is that
the scores top out in the 80 percent range for precision
and recall and have done so for several years.
Two observations are warranted by the TREC data and
actual experience with brand-name search "systems." First,
search has run into a wall when it comes to finding
relevant documents and extracting all the potentially
relevant documents from a corpus. Despite the best
efforts of statisticians, linguists, and computer scientists
of every stripe, improving the TREC score or satisfying
the busy professional looking for a Word document works
less than 100 percent of the time. As noted, the use
of voting algorithms has created a self-fulfilling
prophecy whereby users are thrilled with a C or B-minus
performance. The more people who find this level of
performance satisfactory, the more their feedback guarantees
mediocrity. Second, the compound noun neologisms of
marketers cannot change the fact that commercial search
systems work on text. Most commercial and university
think tank search engines, including
the ones wrapped in secrecy at the Defense Advanced
Research Projects Agency (DARPA), cannot handle
video, audio, still images, and compound files (a Word
document that includes an OLE object like an Excel
spreadsheet or a video clip). There are search engines
for these files. Just ask any 13-year-old with an MP3
player or your college student living in a dormitory
with a DVD recorder, ripping software, and a broadband
Multilingual material complicates text search in
certain circumstances. Accessing information in other
languages is gaining importance in some organizations.
When carefully set up, search engines can handle the
major languages, but queries run across
multilingual collections perform less well
than queries run against a corpus composed of
text files in a single language. This means that finding
associated or relevant documents across corpuses, each
in a different language, is essentially a job for human
analysts. Said another way, search produces manual
In the post 9-11 world, the inability to address
Arabic, Farsi, Chinese, and other "difficult" languages
from a single interface is a problem for intelligence
analysts in many countries. Toss in digital content
with corpuses composed of audio clips, digitized video
of newscasts, electronic mail, and legacy system file
types, and we have a significant opportunity for search
innovation. From the point of view of a small company,
solving the problem of searching electronic mail might
be enough to make 2003 a better year.
Search is serious. It is a baseline function that
must become better quickly. Search will not improve
as long as buyers and users are happy with "good enough." A
handful of information professionals understand the
problem. In the rush of everyday business, voices are
not heard when questions arise about purchased relevance
versus content relevance, bait-and-switch tactics,
and the thick "cloud of unknowing" that swirls around
data provenance, accuracy, completeness, and freshness.
New technology acts like a magnet. Novelty revives
the hope that the newest search technology will have
the silver bullet for search problems. Pursuit of the
novel amid the word-spinning done by the purveyors
of new search technology may attract users, but few
step back and ask hard questions about search, whether free,
intranet, Internet, peer-to-peer, or wireless.
To see how the snappy can befuddle understanding
of the limitations in present search-and-retrieval
technology, look at the positive reception given Kartoo
(Paris, France) and Groxim (Sausalito, California).
Strictly speaking, these two companies have interesting
and closely related technology. A query is launched
and the "hits" are grouped into colorful balls. Each ball represents
hits related in some way to a particular concept. The
links among the balls show relationships among the
concepts. Sounds useful, and the technology is appropriate
for certain types of content and users. Visualization of
results in clusters, of course, relies on underlying
clustering technology, which must be sufficiently acute
to "understand" extreme nuance. To get a sense of how
well that technology works, run a query on Kartoo in
an area where you think you know the subject matter
well. Now explore the balls. Are the "hits" clustered
correctly? In our tests of Kartoo, we found that more
than half the balls contained some useful information.
But Kartoo and Groxim still return results that are
too coarse to serve an expert in a particular domain5.
Results: Biased and Sometimes Useless
Search and retrieval is believed to be unbiased.
It is not. Virtually all search systems come with knobs
and dials that can be adjusted to improve precision
or adjust recall. In a commercially successful search
engine such as FAST Search & Retrieval (Wellesley,
Massachusetts, and Oslo and Trondheim, Norway), the
company can adjust the many settings that dictate how
much or how little a particular algorithm can affect
search results. Yahoo!-Inktomi's, Open Text's,
and AltaVista's search engines have similar knobs and
dials. Getting the settings "just right" is a major
part of a software deployment. For intranet search,
Verity is the equivalent of the control room of a nuclear
Once one knows about the knobs and dials, one starts
asking questions. For example, "Is it possible to set
the knobs and dials to bias or weight the precision
and recall a certain way so when I type 'airline,' I
display the link of the company paying the most money
for the word 'airline'?" The answer is, "Absolutely.
Would you like to buy hotel, travel, trip, vacation,
rental car?" One can see this type of adjustment operating
in Google when the little blue boxes with the green
relevancy lines appear on a page of hits. Indeed, the
very heart of Google is to use weighting that emphasizes
popular sites. "Popularity" is defined by an algorithm
that considers the number of links pointing to a site.
For a different view of Google's controls, go to the
BBC's Web site [http://www.bbc.co.uk].
Enter the word "travel" in the search box. The hits
for both the BBC Web site and the "entire Web" are
BBC affiliate sites. Coincidence? No search bias?
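The effect of turning one such knob can be sketched in a few lines
of Python. This is a hypothetical illustration only, not the scoring
used by Google, FAST, or any other vendor. The function name rank,
the bid_weight setting, and the sample sites and numbers are all
invented for the example.

```python
# Hypothetical sketch: one "knob" (bid_weight) blends content relevance
# with the amount an advertiser paid for a query word.

def rank(results, bid_weight=0.0):
    """Sort results by content relevance plus a weighted paid boost.

    results: list of (url, relevance, bid) tuples, where relevance is a
    content score in [0, 1] and bid is the amount paid for the query word.
    """
    def score(item):
        _url, relevance, bid = item
        return relevance + bid_weight * bid

    return [url for url, _, _ in sorted(results, key=score, reverse=True)]


# Invented data for the query "airline".
hits = [
    ("independent-review.example", 0.90, 0.0),
    ("big-carrier.example", 0.40, 5.0),
    ("travel-forum.example", 0.70, 0.0),
]

unbiased = rank(hits)                # knob at zero: content relevance only
biased = rank(hits, bid_weight=0.2)  # knob turned up: the payer rises
```

With the knob at zero, the site with the strongest content leads the
list. With the knob turned up, the paying site moves to the top even
though its content score is the weakest of the three.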
The clever reader will ask, "What about sites that
have great content, no links pointing in or out, and
relatively modest traffic? These sites are handled
in an objective manner, aren't they?" Go to Ixquick,
a metasearch site with a combination of links and traffic
popularity algorithms. Enter the term "mereology." No
hits on http://www.mereology.org.
No hits for the NuTech Solutions Web site whose founder
brought "mereology" from obscurity to the front lines
of advanced numerical analysis. Serious omissions?
Absolutely. Such problems are typical among specialist
resources for very advanced fields in physics, mathematics,
and other disciplines. However, similar problems surface
with search tools used on intranet content. The research
and development content as well as most of the data
residing in accounting remain black holes in many organizations.
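A minimal sketch shows why a popularity weighting based on inbound
links buries a site that no one links to, whatever the quality of
its content. The link graph, the site names, and the function
inlink_counts are invented for illustration. Production engines use
far more elaborate link analysis than a simple count.

```python
# Hypothetical sketch of link-count "popularity": each page is scored by
# the number of other pages that link to it.
from collections import Counter


def inlink_counts(links):
    """links: dict mapping a page to the set of pages it links to."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in targets:
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


# Invented link graph; "specialist.example" has no inbound links.
web = {
    "portal.example": {"news.example", "shop.example"},
    "news.example": {"portal.example", "shop.example"},
    "shop.example": {"portal.example"},
    "specialist.example": set(),
}

popularity = inlink_counts(web)
ranking = sorted(web, key=lambda page: popularity[page], reverse=True)
```

The specialist site scores zero and lands at the bottom of the
ranking. Nothing in the calculation looks at what the page says.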
For expert searchers, locating the right information
pivots on Boolean queries and highly precise terms.
This assumes, of course, that the desired content resides
in the index at all. Verity's PDF search engine stumbles
over PDF files for one good reason6.
The content of PDF files is not designed for search
and retrieval. PDF files are designed for page rendering.
Textual information runs across columns, not up and
down columns. PDF search and retrieval requires deft
programming and canny searchers. For intranets, indexing
corporate content is somewhat less troublesome than
indexing the pages on a public Web server or the billions
of pages on the hundreds of thousands of public Web
servers, but comprehensive and accurate indexing of
even small bodies of content should not be taken for granted.
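The PDF column problem can be illustrated with a small sketch. It
assumes an extractor has already produced text fragments with page
coordinates, and that the gutter between two columns sits at a known
position. Both are simplifying assumptions, and the fragment data
and function names are invented. Real PDF text extraction is
considerably harder.

```python
# Hypothetical sketch: text fragments from a two-column page, each with
# the (x, y) position at which the page-rendering instructions place it.
# y grows downward; the column gutter is assumed to sit at x = 300.

fragments = [
    (50, 100, "Search is a"),
    (350, 100, "PDF files are"),
    (50, 120, "baseline function."),
    (350, 120, "built for rendering."),
]


def row_order(frags):
    """Naive reading: top to bottom, then left to right across columns."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return " ".join(text for _x, _y, text in ordered)


def column_order(frags, gutter_x=300):
    """Column-aware reading: finish the left column, then the right."""
    ordered = sorted(frags, key=lambda f: (f[0] >= gutter_x, f[1], f[0]))
    return " ".join(text for _x, _y, text in ordered)


scrambled = row_order(fragments)
readable = column_order(fragments)
```

Reading across the page interleaves the two columns, so a phrase
search for "baseline function" preceded by "Search is a" fails
against the scrambled text. The column-aware pass restores the
sentences, but only because the gutter position was known in advance.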
For a common example of deliberately biased search
results, look at the display of for-fee "hits." Companies
selling traffic allow a buyer (essentially, an
advertiser) to "buy" a word or phrase. When a
search involves that word or phrase, the hits feature
the Web site of the buyer. Such featured results are
usually segregated from the "other results," but searchers
may not notice the distinctions. Google and Overture
are locked in a fierce battle for the pay-for-click
markets. FindWhat.com is a lesser player. In the U.K.,
eSpotting.com is a strong contender in the "we will
bring you interested clients" arena.
What about sites that offer to priority index a Web
page if the Webmaster pays a submission fee? AltaVista
offers a pay-for-favorable-indexing option. Yahoo!
offers a similar service as well, even while the company
shifts "search" from its directory listings to spidered
search results. In fact, most search engines have spawned
an ecology of services that provides tricks and tactics
for Webmasters who want to get their pages prominently
indexed in a public search engine. Not surprisingly,
discussions abound on the use of weighting algorithms
in public sector search services as well. For example,
an agency might use those "knobs and dials" to ensure
that income tax information was pushed to the top of
results lists in the month before taxes are due. (I
hasten to stress that this is a hypothetical example.)
Innovation Checklist: The Ideal Search Engine's
Over the last 2 years, my colleagues and I have compiled
a list of what I call yet-to-be-mined gold ore in search.
The list includes functions not currently available
in search software on the market today. Some might
view this as a checklist for innovators. I view it
as a reminder of how much work remains to be done in
search and retrieval.
Table 3 on page 50
provides a summary of what search-and-retrieval systems
cannot do at this time.
Professionals Must Take
What should trained information professionals,
expert searchers, and the search engine providers
themselves do? This is the equivalent of practicing
good hygiene. It may be tilting at windmills
but here are the action items I have identified:
1. Explain, demonstrate, teach by example
the basic principles in thinking critically
2. Emphasize that the source of the information, its
provenance, is more important than the
convenience of the fact the source provides
3. Be realistic about what can be indexed
within a given budget. Articulate the strengths
and weaknesses of a particular approach to
search and retrieval. (If you don't know what
these strengths and weaknesses are, digging
into the underpinnings of the search engine
software's technology is the only way to bridge
this knowledge gap.)
4. Do not believe that the newest search
technology will solve the difficult problems.
The performance of the leading search engines
is roughly the same. Unproven engines first
have to prove that they can do better than
the engines now in place. This means pilot
projects, testing, and analysis. Signing a
purchase order for the latest product is expedient,
but it usually does not address the underlying
problems of search.
5. Debunk search solutions embedded in larger
software solutions. Every content management
system, every knowledge management system,
every XP installation, and every customer relationship
management system has a search function. So
what? Too often these "baby search systems" are
assumed to be mature. They aren't; they won't
be; and more importantly, they can't be.
6. Professional information associations
must become proactive with regard to content
standards for online resources. Within one's
own organization, ask questions and speak out
for trials and pilot projects.
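The testing called for in item 4 can start small. The sketch below
compares two engines on one query against a set of documents that
human reviewers have judged relevant, using the standard definitions
of precision and recall. The document identifiers, the result lists,
and the function name precision_recall are invented for
illustration. A real pilot would use many queries and an agreed
procedure for the relevance judgments.

```python
# Hypothetical pilot-test sketch: compare two engines on one query
# against a set of documents that human reviewers judged relevant.

def precision_recall(returned, relevant):
    """Return (precision, recall) for one result list."""
    returned, relevant = set(returned), set(relevant)
    found = returned & relevant
    precision = len(found) / len(returned) if returned else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented judgments and result lists.
relevant_docs = {"d1", "d2", "d3", "d4"}
engine_in_place = ["d1", "d2", "d7", "d8"]
engine_new = ["d1", "d2", "d3", "d9", "d10", "d11"]

old_p, old_r = precision_recall(engine_in_place, relevant_docs)
new_p, new_r = precision_recall(engine_new, relevant_docs)
```

In this invented case the new engine finds one more relevant
document, but it does so by returning more hits, and its precision
is no better than that of the engine already in place. Numbers of
this kind, not a demonstration, are what a purchase decision should
rest on.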
Search has been a problem and will remain
a problem. Professionals must locate information
in order to learn and grow. The learning curve
is sufficiently steep that neither a few Web
Search University sessions nor scanning the
most recent issue of SearchEngineWatch will
suffice. We have to turn our attention to the
instruction in library schools, computer science
programs, and information systems courses.
Progress comes with putting hard data in front
of those interested in the field. Loading a
copy of Groxim's software won't do much more
than turn most off-point "hits" into an interesting
picture. Intellectual integrity deserves more.
Let's deliver it.
1 Readers interested
in the performance of various search engines should
review the results of the Text Retrieval Conference
(TREC), co-sponsored by the National Institute of
Standards and Technology (NIST), Information Technology
2 The Citeseer Web
site provides useful information about corpus. A shorthand
definition is "the body of material being examined." See http://citeseer.nj.nec.com/hawking99results.html.
Links were updated in 2002.
3 See NuTech Solutions'
description of its technology at http://www.nutechsolutions.com.
The search product is marketed as Excavio. For a demonstration,
go to http://www.excavio.com.
4 I received a round
of applause at the Library of Congress during my talk
on wireless computing when I said, "Where should
information quality decisions be made? In the boardroom
like Arthur Andersen and Enron or in management meetings
where trained information professionals vet data?"
5 Kartoo's engine
is located at http://www.kartoo.com.
Groxim requires that the user download a software module
and run the program on the user's machine. Groxim's
software is located at http://www.groxim.com.
6 PDF is the acronym
for Adobe's Portable Document Format. Adobe has
placed the PDF specification in the public domain to encourage
wide adoption. Like PostScript, the PDF file focuses
on rendering a page for rasterization, not search