Magazines > Searcher > March 2003
Vol. 11 No. 3 — March 2003
Feature
In Search of ...The Good Search: The Invisible Elephant
by Stephen E. Arnold • President, Arnold Information Technology

Search is a problem. Or perhaps we should rephrase the sentiment as, "Search remains a challenging human-computer issue." Listen to one MBA's comment about the Louisville Free Public Library's online search functions. "Too many hits." Accurate observation.

Such candor is generally in limited supply at conferences, in journal articles, and during sales presentations from vendors of search-and-retrieval systems. We have an elephant in our midst, and no one wants to ask, "What's this elephant doing here?" I suggest we make an attempt to acknowledge the situation and do whatever it takes to put the fellow back in the zoo.

Search Realities

Many extremely complex and costly search-and-retrieval systems are in use in many large organizations. A typical news-focused system supporting about 500 users costs one U.S. government agency more than $3 million per year. According to one expert close to the agency, "About 95 percent of the system's functionality is not used. People type one or two words and take whatever is provided. The users seem happy with good enough." But there are different types of search engines and different "ecologies" for each.

Meanwhile, ever more advanced linguistic-statistical, knowledge-based, adaptive search systems are showcased at trade shows, in impassioned sales presentations, and on often inscrutable Web sites. White papers explain and re-explain such concepts as "an ontology generation engine" and/or "real-time linguistic analysis of diverse document types." Data are displayed in animated hyperbolic maps or relevancy-ranked lists with the key concepts highlighted for the busy user. Some of the "advanced" systems create reports in the form of packets of Adobe Portable Document Format pages or, at the other extreme, collections of "key paragraphs" from longer documents. Dot points, extracts, and flagged items are supposed to make perusing a list of "hits" a more productive task.

Eyes glaze and potential buyers in marketing departments looking for information wonder, sometimes out loud, "What's all this jargon hiding? Will this system search, find results, work almost all the time, and fit my budget?" Not surprisingly, these are difficult questions to answer, and the answers are very, very hard to get.

What's the Jargon?

Ontology generation. This bit of jargon covers activities ranging from creating a list of subject categories for a particular content collection to downloading the Library of Congress subject headings and making additions and deletions as required. Automatic ontology generation means no librarians, please.

Real time remains, of course, the term du jour for updating an index when new content becomes available. Real time is relative and, as a bit of testing on a breaking news story like the 9-11 attack proved, essentially untrue except for a handful of specialized services.

Linguistic analysis is quite a slippery phrase. When used by a pitchman describing a new search engine, the listener is supposed to conclude that the software "understands" words and phrases in a manner similar to a human. Software remains rule-based, and the fancier algorithms using "ant technology" or "swarming techniques" remain locked in research and development laboratories. Linguistic analysis boils down to knowledge bases or statistical routines hooked together in a clever new way.

Hyperbolic maps and other visualization techniques are increasingly available. The idea is to display a list of "hits" for a query in a visually instructional manner. For a look at what programmers can do with Macromedia's Flash technology, click to Kartoo (see Figure 1 on page 41). Are these techniques likely to tackle finding a single electronic mail message with a PowerPoint attachment? No. And not for quite a while.

Search has become a digital form of roulette. The customer picks a product, spins the wheel, releases the ball, and hopes for a winner. Search-and-retrieval software is a similar bet. As in casinos, the customer — an information technology manager in charge of the search software acquisition — usually walks away disappointed.

Anyone who has had to implement a large-scale system indexing content from 40,000 or more servers and processing 50 million or more documents knows that the search-and-retrieval software for a system of this size will differ quite a bit from the free search routine included with Windows XP or the "Find" option in Outlook Express.

The surprising truth is that there are a very small number of companies with products that can handle a 50 million document baseline, keep it up and running 99 percent of the time, and update the index in less than 48 hours.

There are reasons why Google grows its index in chunks, jumping from 1 billion to 3 billion documents over a period of years. Those reasons include planning the growth, engineering the many subsystems that make up search, and accruing the money to extend the infrastructure so that it can add the content while maintaining the response time.

Anyone making it to the third or fourth year of a computer science program at a major university can implement software to find, index, and make content searchable. What computer science classes and a quick course in Excel macros do not cover is how the costs work in a large-scale Internet or commercial intranet search-and-retrieval product. Costs, not technology, account for much of the attrition in the search-and-retrieval sector.

The Darwinian nature of the search business allows boutique search companies to appear and often disappear as quickly. Among the companies whose investors have considerable optimism are MyAmigo (Australia), Pertimm (France), ClearForest (United States), and iPhrase (United States). Hopefully, most of these companies will survive and thrive. It is, however, doubtful that, in the near term, any of the newcomers will challenge the dominance of a handful of search-and-retrieval companies. Verity, Inc., PC Docs, Autonomy Ltd, and a few others dominate the commercial market. Google and FAST Search & Retrieval are best positioned for a run at the market leaders. Overture, a company that has quietly transformed search to a form of advertising, has revenues and earnings that dwarf virtually all search-and-retrieval companies. Overture, however, downplays its search technology and focuses on its revenues of more than $500 million. Only Google, with an estimated 2002 revenue of $300 million, seems positioned to mount a threat. The rest of the thousands of search vendors are finding themselves new homes nestled inside such fuzzy product packages as customer relationship management, knowledge management, and content management software.

Search and retrieval is at once everywhere in the form of free Web searching via Google, FAST's "alltheweb.com," and Yahoo!'s Inktomi service. But it is also nowhere in the warp and woof of the fabric of Windows XP, e-mail programs, and the ubiquitous "search" box on intranet and extranet Web pages. Search is ubiquitous, so most users do not see it as a separate function, but just as a handy and necessary tool.

The Market Segments for Search

Search has its distinct niches or market segments. The diagram entitled "Search and Soft Market Segment Boundaries" (see page 43) provides a simplified view of customer and user clusters. First, every computer or mobile device has some type of search function. In a mobile phone, search may underlie hitting a number key and seeing the name and phone number stored at that location in the phone's memory. Search may mean using the built-in tools in commercial application software. Even Excel has a "find" command. Within this segment, complete micro-ecologies of search software exist. For chemists, Reed Elsevier and Chemical Abstracts offer specialized tools that meet users' needs in their laboratory or on an organization's intranet. An "intranet" is a network that operates within an entity with access requiring a user name and a password.

Second, there is the Internet. Internet search and retrieval has been free, although the financial model for monetizing search has been cobbled together from the failure of many early entrants. Google, Alta Vista, Yahoo!, and newcomers such as Bitpipe offer "free searches." The thirst of search engines for money is unslakeable. Search engines generate revenue by selling advertising, selling a "hit" when a user does a search on a particular topic, or reselling searches of content to other Web sites. The variations for monetizing search are proliferating. The issue with monetizing is, of course, objectivity. In the Internet "free" search segment, objectivity is not the common coin of the realm. Paying for clicks and traffic is more important than relevance. When Google goes public in the next year or so, the need for revenue means that objectivity will take a back seat to monetizing.

The third segment, I call "special domains." These are the collections of content that defy the mainstream, text-centric search engine. Music, videos, computer-aided drafting diagrams with a database of parts and prices, medical images, and audio content are not searchable with the software that falls in the purview of librarians or expert searchers. These special domains account for as much as 90 percent of the digital content produced at this time, based on a study we conducted over a period of 6 months in 2002 for a major technology firm. Chinese language Web pages, an electronic mail message with an Excel attachment in a forwarded message, purchase order information in CICS system files, and streaming audio from radio stations are just four examples of content that becomes ever more plentiful, while remaining difficult, if not impossible, to search.

The critical portion of the diagram is the boundaries among and between segments. These boundaries are like a paramecium's membrane: semi-rigid, permeable, and subject to its environment. Search, therefore, can be explained in an infinite number of ways, making comparisons difficult. What was true yesterday of Google may not be true today. Google's catalog service was essentially unusable. However, the Froogle service is a useful, high-value service. Consultant analyses and comparisons of search software are the intellectual Twinkies of this software sector. One can eat many Twinkies and go into sugar shock, but the essential nutrients are simply not there, and the growling stomach remains unsatisfied. Fuzzy boundaries make comparing search software in an "apples to apples" way almost impossible.

This diagram illustrates in part why the search landscape and the dominant companies in the search business change over time. Consider the Canadian search company Fulcrum. Fulcrum's software is quite good among today's intranet tools. Several years ago, it was bought by another Canadian company (Hummingbird). Hummingbird provided software to permit a PC user to access data on a corporate mainframe via a screen scraping program. Hummingbird was, in turn, acquired by another Canadian company (PC Docs), a document management outfit that wanted to upgrade its search-and-retrieval functions and leverage Hummingbird's customer base. Now, Fulcrum search is a facet of the PC Docs product suite. Or look at Open Text. Originally a Web indexing and SGML database with a search function, Open Text now consists of pieces of Tim Bray's search engine and the BASIS database search tool plus other search functions to handle the collaborative content in Live Link. Inktomi sold its intranet search business to Verity and then allowed itself to be purchased for about $250 million by Yahoo! Other companies have simply retreated from search, repositioned themselves, and emerged as taxonomy and ontology companies. Examples of this include Semio (France and California) and Applied Semantics (formerly Oingo, operating in Los Angeles). There is more horse swapping and cattle rustling in these three segments than in almost any other software sector. Confusing. Absolutely.

The Institutional Self-Discovery Market

In an interesting twist, the search business has begun morphing search and retrieval into a system that discovers what information a company already has. As silly as this sounds, there are organizations that simply do not know what information exists on the organization's own servers. (If this sounds like a commercial for knowledge management, it is not.) Search-and-retrieval software has been packaged as a way for a security officer at a large company to know what the Des Moines, Iowa, office put on the Internet.

Most organizations have allowed their Web server population to grow like Manchester, England, at the height of the industrial revolution. In our security-crazed, post-9-11 and post-Enron world, boards of directors have to know what information exists, in what form, and, of course, who can access it. As more companies reinvent themselves as "knowledge organizations" or "information companies," few if any employees know such basics as:

• What information is in digital form

• Which information is the most recent or "correct" version

• Where a particular piece of information is.

If this reminds the reader of "content management," it is an easy mental leap to the role of search and retrieval in this market sector. In the pre-digital age, people could stay late and look through stacks of paper. Today, not even the most caffeinated Type A can browse hundreds or thousands of files on different machines in many different formats. The job is too onerous, too tedious. With a few deft marketing touches, a search engine can be paraded as an information discovery engine.

The idea is that the search-and-retrieval system looks at a company's information objects, figures out what each object is "about," and then clusters the objects in a Dewey Decimal type of scheme. There is a word for this type of work, and that word is indexing. Indexing professionals, librarians, and content specialists, e.g., those working for the National Library of Medicine, used to do this kind of work. Now that such individuals are deemed non-essential or simply too expensive, software is supposed to do the job. Not a chance.
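To make concrete what such "automatic indexing" amounts to, here is a minimal, purely illustrative sketch in Python. The category names and keyword lists are invented; real products use far richer term statistics, but the core move — matching a document's words against category word lists — is the same, and so is the absence of human judgment.

```python
# Invented category keyword lists, standing in for a Dewey-like scheme.
CATEGORIES = {
    "finance":     {"invoice", "payment", "ledger", "audit"},
    "engineering": {"server", "deploy", "schema", "patch"},
    "legal":       {"contract", "clause", "liability", "counsel"},
}

def auto_classify(text):
    # Assign the document to the category whose keyword list it
    # overlaps most -- the crude core of "automatic indexing."
    words = set(text.lower().split())
    return max(CATEGORIES, key=lambda cat: len(CATEGORIES[cat] & words))

print(auto_classify("audit of invoice payment terms"))  # finance
# A document mixing vocabularies is filed by raw word overlap alone,
# with none of the judgment a human indexer would bring:
print(auto_classify("the contract for the new server deploy"))  # engineering
```

A trained indexer weighs context, aboutness, and the needs of the collection's users; the sketch above, like its commercial cousins, only counts words.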

Verity, the current industry leader, makes it very clear that part of the firm's professional service fees include payment for humans who "tune" and "train" the Verity system. For those who can't afford Verity, the transformed search companies or specialist firms can deliver software that indexes and classifies so someone, somewhere knows what is on a corporate intranet. ClearForest is one company leading in this "discovery" niche. For military intelligence and government security applications, i2 Ltd. (Cambridge, England) provides a tightly integrated suite of tools that allows discovery to run as a process with results depicted with icons, connector lines, and "hooks" to non-text objects.

The companies that have done the most effective job of getting their technology embedded in content management, customer relationship management, and — my favorite meaningless discipline — knowledge management, are Verity (Mountain View, California) and Autonomy Ltd. (Cambridge, England). These "M" businesses — document management, customer relationship management, knowledge management, and content management — need reliable search.

The segment leaders, Verity and Autonomy, have about 70 percent of the U.S. and European corporate and government market. Both companies have products that "work." The precise meaning of "work" is somewhat difficult to define, because the lists of each company's customers have about a one-third overlap. For basic search and retrieval, these companies are market leaders. Unlike Overture, Verity and Autonomy follow a business model of licensing software and then selling support, customization, and services. Both firms will provide the services required to satisfy the customer. The price for "search that works" can reach seven figures.

Verity's and Autonomy's strengths do not lie in the firms' respective technologies. Verity relies on thesauri and what might be called traditional indexing by extracting terms. Newer algorithms have been added, and the company can process database files in the recently upgraded K2 engine. Autonomy relies on statistical techniques originally based on Bayesian statistics. Like Verity, Autonomy has embraced other approaches and acquired companies in order to gain customers and technologies in speech recognition. Both Verity and Autonomy can support corporate customers. Smaller companies with lower fees usually find that the juicy accounts go to Verity or Autonomy because those firms can install, support, and service enterprise clients. One systems manager said in a focus group in 2002, "No one gets fired for licensing Verity or Autonomy."

Most commercial search software with an intranet version works at what might be called the 70 percent level. For a query, more than two-thirds of the content will be available when the query is passed. The results will be about 70 percent on the topic. The very best engines push into the 80-percent range. It is very difficult with today's technology to get consistently high scores unless you restrict the content domain tightly, freeze updates, and craft correctly formed, fielded queries. The reader familiar with SQL queries or Dialog Boolean queries will immediately see why typing one or two terms, hitting the enter key, and looking at a list of hundreds of results requires considerable manual filtering1.

Do commercial search engines work? Yes. Effective search-and-retrieval software gets about 80 percent of the relevant material, as shown by the results of TREC competitions. Stated another way, the most effective searches usually miss at least 20 percent of the content that could be highly pertinent to the user's query. Verity delivers this type of search effectiveness when the company's software is properly set up. But, as many Verity customers have discovered, this means employing considerable human, manual effort. Searches in limited domains with tightly controlled word lists are more satisfying than searches run across heterogeneous domains of content. For most users, precision and recall at or near the B minus or C plus level is "good enough."
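The precision and recall figures cited here are simple ratios, and it is worth seeing how little machinery they involve. A minimal sketch, using invented document identifiers:

```python
def precision_recall(retrieved, relevant):
    # Precision: what fraction of the returned "hits" are on topic.
    # Recall: what fraction of the on-topic documents were returned.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: the engine returns 10 documents, 8 are relevant,
# and 2 relevant documents are missed entirely -- the "80 percent" level.
returned = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
on_topic = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d11", "d12"]
print(precision_recall(returned, on_topic))  # (0.8, 0.8)
```

Note that the user never sees the denominator of recall: the 20 percent of relevant material that was missed simply never appears, which is why "good enough" feels good enough.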

"Good enough," in fact, describes how most search-and-retrieval engines work. Google is "good enough" because the results are ranked by a voting algorithm that weights pages with many links over pages with few links. What if a page does not have links, but does have outstanding content? Google may index the page, but unless the query for that page is well formed, the page without links may end up buried deep in the list of results or not displayed at all. Most searches follow the Alexis de Toqueville rule that when the majority votes, the result is mediocrity. Excuse the heresy as Google's popularity continues to grow, but potentially useful pages may disappear beneath more popular pages.

Back to Basics

In the early days of search and retrieval, there was only one way to find information. A proprietary system offered a command line interface. To find a document, the searcher (certainly not a pejorative term in most Fortune 500 companies or at NASA, where search began in the late 1960s) crafted a query.

The query required a reference interview with the person wanting the information or a conversation with a colleague who understood a particular domain's jargon and the context of the query for a particular client. The search query was assembled using appropriate terms, usually selected from a printed controlled vocabulary. (In the early days of search, it was considered a point of professional pride to have professional indexers assign keywords using a thesaurus to the documents or entries in a database. The word list — called a controlled vocabulary — served as a road map to information in the database.)

If synonyms were common, as they were in medical, technical, general business, or news databases, "use for" and "see also" references were inserted into the thesaurus. The searcher then crafted a well-formed Boolean query using the syntax of the online system. Well formed means that the logic of the query would return a narrow set of results or hits; the provision of such precise, on-point result sets proved that expert searchers were at the controls.

The expert researcher then selected specific databases (now called a corpus in today's search jargon)2 and ran the well-formed query against the appropriate files. In 1970, only a handful of these online databases existed — a number which grew to about 2,000 by 1985. (Today, the Web has given rise to database proliferation, where a single Web page counts as a database, and there are more than 3 billion Web pages indexed by Google alone.) The searcher reviewed the results and selected the most relevant for their clients. Additional queries were run by constructing search syntax that essentially told the online system "give me more like this." How far have we come since 1970?
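The well-formed Boolean query of that era maps directly onto set operations over an inverted index. A minimal sketch with an invented four-document corpus — AND narrows the result set for precision, OR broadens it for recall:

```python
from collections import defaultdict

def build_index(corpus):
    # Map each term to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

corpus = {  # invented documents
    "doc1": "myocardial infarction treatment aspirin",
    "doc2": "aspirin dosage headache",
    "doc3": "myocardial infarction statin trial",
    "doc4": "headache migraine treatment",
}
index = build_index(corpus)

# Boolean AND narrows the result set: myocardial AND aspirin
print(sorted(index["myocardial"] & index["aspirin"]))  # ['doc1']
# Boolean OR broadens it: aspirin OR statin
print(sorted(index["aspirin"] | index["statin"]))  # ['doc1', 'doc2', 'doc3']
```

The hard part was never the set arithmetic; it was choosing the right terms from the controlled vocabulary, which is exactly the work the reference interview and the thesaurus supported.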

Not far. Today's software is supposed to look at the user's query, automatically expand the terms, run the search, bring back relevant documents, rank them, highlight the most important sections, and display them. The blunt truth is that, for most online users today, looking for information boils down to a pretty straightforward set of actions. Hold on to your hat and fasten your seat belt; users today do one or more of three things:

• Type one or two terms into a Google-style search box and pick a likely "hit" from the first page of results. (The most searched term on Google, I have been told, is Yahoo! — see item 2 in this list.)

• Go to a site with a Yahoo!-style taxonomy or ontology and click on a likely heading and keep clicking until a "hit" looks promising. (This is the point-and-click approach much beloved by millions and millions of Web users each day. It meets the "I'll know it when I see it" criterion so important in research.)

• Look at a pre-formed page and use what's there. This is the digital equivalent of grabbing a trade magazine from one's inbox and flipping pages until a fact or table that answers a question catches one's eye. Believe it or not, a 1981 Booz, Allen & Hamilton study found that this was the second most popular way to answer a question among polled executives. The most popular way was to ask a colleague. An updated study showed that today asking is still first, but looking at a Web page has become the second most popular way to get information.

Why Search and Retrieval Is Difficult

Search and retrieval is a much more complex problem than most information professionals, systems engineers, and even MBA-equipped presidents of search engine companies grasp. Search brushes up against some problems computer scientists may call intractable. An intractable problem is one that cannot be solved given the present state of computing resources available to solve the problem. Let me highlight a few examples to put the search challenge in context.

First, language or languages. The "answer" may not be stated explicitly. Years ago, Autonomy Ltd. pitched its search engine by saying that its Bayesian approach would find information about penguins if the user entered, "Black and white birds that cannot fly." I think the demonstration worked, but in the real world, the Autonomy system performs best on closed content domains of homogeneous information. Language is a problem because of metaphor, structure, and neologisms, and it becomes intractable when one tries to support, say, a French query delivered against content consisting of Arabic, Chinese, and Korean material. To be fair, most people looking for a pizza in Cleveland, Ohio, want to use English and get the information with a single click. Interface and presentation of search must balance power and ease of use.
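The penguin demonstration can be approximated with a toy naive Bayes scorer. The topic models and probabilities below are invented for illustration — real systems estimate them from large corpora — but the sketch shows how a query that never mentions "penguin" can still score highest against a penguin topic, and why the approach favors closed, homogeneous content domains: any term outside the models falls back to an uninformative smoothing constant.

```python
import math

# Invented topic models: P(term | topic) for a handful of terms.
# Terms absent from a model fall back to a small smoothing probability.
TOPICS = {
    "penguin": {"black": 0.10, "white": 0.10, "birds": 0.12, "fly": 0.02, "ice": 0.08},
    "crow":    {"black": 0.15, "birds": 0.12, "fly": 0.10, "caw": 0.05},
    "pizza":   {"cheese": 0.20, "pepperoni": 0.10, "delivery": 0.10},
}
SMOOTH = 0.001  # probability assigned to out-of-model terms

def score(query, model):
    # Sum of log-probabilities of the query terms under one topic model
    # (a naive Bayes log-likelihood with uniform topic priors).
    return sum(math.log(model.get(term, SMOOTH)) for term in query.lower().split())

def best_topic(query):
    return max(TOPICS, key=lambda name: score(query, TOPICS[name]))

print(best_topic("black and white birds that cannot fly"))  # penguin
```

Here "white" is what tips the score toward penguins over crows; in a heterogeneous, multilingual corpus, the models cover so little of the vocabulary that the smoothing constant dominates and the rankings degrade.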

Second, most people doing searches don't know what the answer is. The human mind can synthesize and recognize, but is less adept at foretelling the future. So searching requires looking at information and exploring. Much of the Web's popularity stems from its support of browsing, exploring, and wandering in interesting content spaces. Clicking through a list of "hits" that numbers 100,000 or more is mind-numbing. Users are predictable: they want something useful so they can get on with their lives. Search-and-retrieval systems must permit a chance encounter with information to illuminate a problem. Showing 10 hits may be inappropriate in some cases and just right in others.

Third, as demographics change and thumb-typing young people join the workforce, we need new types of search systems. It is difficult to tell what long-term impact Napster's peer-to-peer model will have on information retrieval. One pundit (Howard Rheingold) opines that swarm behavior will become the norm, not solitary search and retrieval. I think of this approach to answering questions as Google's popularity algorithm on steroids. The answer is what people believe the answer to be. One part of my mind wants to stop this type of information retrieval in its tracks. The other part says, "Maybe swarm searching is good enough — 'let many flowers bloom,' as a famous resident of China once said." The idea is that one asks a question and passes it among many system users. The answers that come back reflect a swarming process. Swarm technology has been replicated to a degree in the search-and-retrieval technology developed by NuTech Solutions in Charlotte, North Carolina3. NuTech uses the term "mereology" to describe its approach.

Fourth, the emergence of an ambient computing environment supports the pushing of information to individuals. With IPv6, every digital gizmo can have a Web address. Personalization technology is becoming sufficiently robust to deliver on-point information to an individual's mobile phone without the "user" having to trigger any query. In Scott McNealy's vision, an automobile that needs fuel will query a database for the nearest gasoline station. When a station comes in range, a map of where to turn to get gas will appear on the automobile's digital display. Search in this model reverts to what used to be called Selective Dissemination of Information. Today, the words used to describe the SDI approach range from text mining to agent-based search and even more esoteric terminology. Search will mesh with decision support or Amazon-like recommendation systems. Most people looking for information today seem to open their arms to environments that wake up when the "searcher" turns on a wireless device. The screen says, in effect, "Hello, here's what you need to know right now." The Research in Motion Blackberry device showed that pushing e-mail and stock quotes was a potent online combination for go-go executives in financial services and management consulting.

Fifth, in some search-and-retrieval situations, source identification and verification — or what art dealers call provenance — is difficult. Few point-and-click Google searchers or employees browsing filtered news on a personalized portal page know or care about a commercial database's editorial guidelines. If a consulting firm's table of statistics appears on a Web site, it "must be" accurate. Some pundits have winced when thinking about Enron executives making decisions based on a casual Web search or television talk show. Bad information and loose ethical boundaries are combustible4. Most search engines for intranets drag in whatever they find. My dog often brings me dead groundhogs. Thoughtful of the dog, but not germane to my needs. I am not sure software alone can address this challenge, but it warrants thought.

This list of challenges can be extended almost indefinitely. And we haven't even touched the cost of bandwidth to index large content domains, the size and computational capability of the indexing environment that must process data, make judgments, and deliver results often to thousands of users hitting a system simultaneously, or the performance issues associated with making updates and results display before the user clicks away in frustration over slow response. The costs associated with search are often difficult to control and, when search firms run out of money, they close.

Bang.

Approach Search's Weaknesses Objectively

Search vendors are scrupulously polite about their competitors' technologies. That politesse stems from the results of large-scale tests of search engines. Look at 3 or 4 years of TREC results. Most of the technologies perform in a clump. Precision and recall scores remain essentially the same. What's more interesting is that the scores top out in the 80 percent range for precision and recall and have done so for several years.

Two observations are warranted by the TREC data and actual experience with brand-name search "systems." First, search has run into a wall when it comes to finding relevant documents and extracting all the potentially relevant documents from a corpus. Despite the best efforts of statisticians, linguists, and computer scientists of every stripe, improving the TREC score or satisfying the busy professional looking for a Word document works less than 100 percent of the time. As noted, the use of voting algorithms has created a self-fulfilling prophecy whereby users are thrilled with a C or B-minus performance. The more people who find this level of performance satisfactory, the more their feedback guarantees mediocrity. Second, the compound noun neologisms of marketers cannot change the fact that commercial search systems work on text. Most commercial and university search engine software — including the systems wrapped in secrecy at the Defense Advanced Research Projects Agency (DARPA) — cannot handle video, audio, still images, and compound files (a Word document that includes an OLE object like an Excel spreadsheet or a video clip). There are search engines for these files. Just ask any 13-year-old with an MP3 player or your college student living in a dormitory with a DVD recorder, ripping software, and a broadband connection.

Multilingual material complicates text search in certain circumstances. Accessing information in other languages is gaining importance in some organizations. When carefully set up, search engines can handle the major languages, but a search engine run across multilingual content performs less well than one run on a corpus composed of text files in a single language. This means that finding associated or relevant documents across corpuses, each in a different language, is essentially a job for human analysts. Said another way, search produces manual work.

In the post 9-11 world, the inability to address Arabic, Farsi, Chinese, and other "difficult" languages from a single interface is a problem for intelligence analysts in many countries. Toss in digital content with corpuses composed of audio clips, digitized video of newscasts, electronic mail, and legacy system file types, and we have a significant opportunity for search innovation. From the point of view of a small company, solving the problem of searching electronic mail might be enough to make 2003 a better year.

Search is serious. It is a baseline function that must become better quickly. Search will not improve as long as buyers and users are happy with "good enough." A handful of information professionals understand the problem, but in the rush of everyday business their voices go unheard when questions arise about purchased relevance versus content relevance, bait-and-switch tactics, and the thick "cloud of unknowing" that swirls around data provenance, accuracy, completeness, and freshness.

New technology acts like a magnet. Novelty revives the hope that the newest search technology holds the silver bullet for search problems. The word-spinning of the purveyors of new search technology may attract users, but few step back and ask hard questions about search — free, intranet, Internet, peer-to-peer, or wireless.

To see how the snappy can befuddle understanding of the limitations in present search-and-retrieval technology, look at the positive reception given Kartoo (Paris, France) and Groxim (Sausalito, California). Strictly speaking, these two companies have interesting and closely related technology. A query is launched, and the "hits" are grouped into colorful balls. Each ball represents hits related in some way to a particular concept. The links among the balls show relationships among the concepts. Sounds useful, and the technology is appropriate for certain types of content and users. Visualization of results in clusters, of course, relies on underlying clustering technology, which must be sufficiently acute to "understand" extreme nuance. To get a sense of how well that technology works, run a query on Kartoo in an area where you think you know the subject matter well. Now explore the balls. Are the "hits" clustered correctly? In our tests of Kartoo, we found that more than half the balls contained some useful information. But Kartoo and Groxim still return results that are too coarse to serve an expert in a particular domain.5
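The basic move behind such visual maps can be sketched simply: group hits whose vocabulary overlaps, then draw each group as a "ball." The sketch below is an illustration of the general technique only, not of Kartoo's or Groxim's actual algorithms; the hits, the Jaccard-similarity measure, and the 0.25 threshold are all invented for demonstration.

```python
# Hedged sketch: cluster "hits" by term overlap (Jaccard similarity).
# Real visualization engines use far richer signals; this shows only
# why coarse term overlap produces coarse clusters.

def terms(text):
    """Lowercase word set for a hit's title or snippet."""
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_hits(hits, threshold=0.25):
    clusters = []  # each cluster is a list of hit strings (one "ball")
    for hit in hits:
        t = terms(hit)
        for c in clusters:
            # join the first cluster containing a sufficiently similar hit
            if any(jaccard(t, terms(h)) >= threshold for h in c):
                c.append(hit)
                break
        else:
            clusters.append([hit])
    return clusters

hits = [
    "mereology and formal ontology",
    "formal ontology in information systems",
    "cheap airline tickets and travel deals",
    "airline travel deals and trips",
]
for ball in cluster_hits(hits):
    print(ball)
```

Run against a domain you know well and the weakness shows quickly: stopwords like "and" inflate similarity, and two documents an expert would separate land in the same ball.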

Results: Biased and Sometimes Useless

Search and retrieval is believed to be unbiased. It is not. Virtually all search systems come with knobs and dials that can be adjusted to improve precision or recall. In a commercially successful search engine such as FAST Search & Retrieval (Wellesley, Massachusetts, and Oslo and Trondheim, Norway), the company can adjust the many algorithms that dictate how much or how little each one affects search results. Yahoo!-Inktomi's, Open Text's, and AltaVista's search engines have similar knobs and dials. Getting the settings "just right" is a major part of a software deployment. For intranet search, Verity is the equivalent of the control room of a nuclear submarine.

Once one knows about the knobs and dials, one starts asking questions. For example: Is it possible to set the knobs and dials to bias or weight precision and recall so that when I type "airline," the system displays the link of the company paying the most for that word? The answer is, "Absolutely. Would you like to buy hotel, travel, trip, vacation, rental car?" One can see this type of adjustment at work in Google when the little blue boxes with the green relevancy lines appear on a page of hits. Indeed, the very heart of Google is weighting that emphasizes popular sites, with "popularity" defined by an algorithm that considers the number of links pointing to a site. For a different view of such controls, go to the BBC's Web site [http://www.bbc.co.uk] and enter the word "travel" in the search box. The hits for both the BBC Web site and the "entire Web" are BBC affiliate sites. Coincidence? No search bias?
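The knobs-and-dials idea is easy to make concrete: treat a relevance score as a weighted sum of signals, and turn one weight up. The sketch below is a toy illustration, not any vendor's formula; the two signals, the documents, and the weight values are invented to show how a single dial reorders results.

```python
# Hedged sketch of ranking "knobs and dials": the final score is a
# weighted sum of signals, so raising the popularity (inbound-link)
# weight flips which result comes first. All values are invented.

def score(doc, w_text, w_links):
    return w_text * doc["text_match"] + w_links * doc["inbound_links"]

docs = [
    # strong content match, few inbound links (the "great content, no links" site)
    {"url": "niche-expert-site.example", "text_match": 0.9, "inbound_links": 0.1},
    # weaker content match, heavily linked popular portal
    {"url": "popular-portal.example",    "text_match": 0.4, "inbound_links": 0.9},
]

def ranked(w_text, w_links):
    return [d["url"] for d in
            sorted(docs, key=lambda d: -score(d, w_text, w_links))]

print(ranked(w_text=1.0, w_links=0.1))  # content-weighted setting
print(ranked(w_text=1.0, w_links=1.0))  # popularity-weighted setting
```

With the popularity dial turned up, the heavily linked portal outranks the better content match — which is exactly the bias a searcher never sees.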

The clever reader will ask, "What about sites that have great content, no links pointing in or out, and relatively modest traffic? These sites are handled in an objective manner, aren't they?" Go to Ixquick, a metasearch site that combines link- and traffic-popularity algorithms. Enter the term "mereology." No hits on http://www.mereology.org. No hits for the NuTech Solutions Web site, whose founder brought "mereology" from obscurity to the front lines of advanced numerical analysis. Serious omissions? Absolutely. Such problems are typical among specialist resources for very advanced fields in physics, mathematics, and other disciplines. However, similar problems surface with search tools used on intranet content. The research and development content, as well as most of the data residing in accounting, remains a black hole in many organizations.

For expert searchers, locating the right information pivots on Boolean queries and highly precise terms. This assumes, of course, that the desired content resides in the index at all. Verity's PDF search engine stumbles over PDF files for one good reason.6 The content of PDF files is not designed for search and retrieval. PDF files are designed for page rendering. Textual information runs across columns, not up and down columns. PDF search and retrieval requires deft programming and canny searchers. For intranets, indexing corporate content is somewhat less troublesome than indexing the pages on a public Web server or the billions of pages on the hundreds of thousands of public Web servers, but comprehensive and accurate indexing of even small bodies of content should not be taken for granted.
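The column problem is easy to demonstrate. A two-column page is typically stored in rendering order, line by line across the page, so naive text extraction interleaves the columns and destroys the phrases a searcher queries. The column text below is invented for illustration; it stands in for any two-column PDF page.

```python
# Hedged illustration of why PDF rendering order breaks phrase search:
# extracting a two-column page line by line interleaves the columns.
# The column contents are invented example text.

left  = ["search and retrieval", "requires precise", "phrase matching"]
right = ["page rendering order", "scrambles the", "extracted text"]

# Naive extraction: read each rendered line left-to-right across both columns.
naive = " ".join(l + " " + r for l, r in zip(left, right))

# Column-aware extraction: read the left column fully, then the right.
aware = " ".join(left) + " " + " ".join(right)

phrase = "search and retrieval requires precise"
print(phrase in naive)  # False: the phrase is split across interleaved lines
print(phrase in aware)  # True: reading order restored
```

The "deft programming" the article mentions is essentially the second path: reconstructing reading order before indexing, which many extraction pipelines of the era did not do.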

For a common example of deliberately biased search results, look at the display of for-fee "hits." Companies selling traffic allow a buyer — essentially, an advertiser — to "buy" a word or phrase. When a search involves that word or phrase, the hits feature the Web site of the buyer. Such featured results are usually segregated from the "other results," but searchers may not notice the distinction. Google and Overture are locked in a fierce battle for the pay-per-click market. FindWhat.com is a lesser player. In the U.K., eSpotting.com is a strong contender in the "we will bring you interested clients" arena.

What about sites that offer to index a Web page on a priority basis if the Webmaster pays a submission fee? AltaVista offers a pay-for-favorable-indexing option. Yahoo! offers a similar service as well, even while the company shifts "search" from its directory listings to spidered search results. In fact, most search engines have spawned an ecology of services that provides tricks and tactics for Webmasters who want their pages prominently indexed in a public search engine. Not surprisingly, discussions abound on the use of weighting algorithms in public sector search services as well. For example, an agency might use those "knobs and dials" to ensure that income tax information was pushed to the top of results lists in the month before taxes are due. (I hasten to stress that this is a hypothetical example only.)

Innovation Checklist: The Ideal Search Engine's Functions

Over the last two years, my colleagues and I have compiled a list of what I call yet-to-be-mined gold ore in search. The list includes functions not available in search software on the market today. Some might view this as a checklist for innovators. I view it as a reminder of how much work remains to be done in search and retrieval.

Table 3 on page 50 provides a summary of what search-and-retrieval systems cannot do at this time.


Actions Information Professionals Must Take

What should trained information professionals, expert searchers, and the search engine providers themselves do? This is the equivalent of practicing good hygiene. It may be tilting at windmills, but here are the action items I have identified:

1. Explain, demonstrate, and teach by example the basic principles of thinking critically about information.

2. Emphasize that the source of the information — its provenance — is more important than the convenience of the fact the source provides.

3. Be realistic about what can be indexed within a given budget. Articulate the strengths and weaknesses of a particular approach to search and retrieval. (If you don't know what these strengths and weaknesses are, digging into the underpinnings of the search engine software's technology is the only way to bridge this knowledge gap.)

4. Do not believe that the newest search technology will solve the difficult problems. The performance of the leading search engines is roughly the same. Unproven engines must first demonstrate that they can do better than the engines now in place. This means pilot projects, testing, and analysis. Signing a purchase order for the latest product is expedient, but it usually does not address the underlying problems of search.

5. Debunk search solutions embedded in larger software solutions. Every content management system, every knowledge management system, every XP installation, and every customer relationship management system has a search function. So what? Too often these "baby search systems" are assumed to be mature. They aren't; they won't be; and more importantly, they can't be.

6. Professional information associations must become proactive with regard to content standards for online resources. Within one's own organization, ask questions and speak out for trials and pilot projects.

Search has been a problem and will remain a problem. Professionals must locate information in order to learn and grow. The learning curve is sufficiently steep that neither a few Web Search University sessions nor scanning the most recent issue of SearchEngineWatch will suffice. We have to turn our attention to the instruction in library schools, computer science programs, and information systems courses. Progress comes with putting hard data in front of those interested in the field. Loading a copy of Groxim's software won't do much more than turn most off-point "hits" into an interesting picture. Intellectual integrity deserves more. Let's deliver it.



Footnotes

1 Readers interested in the performance of various search engines should review the results of the Text Retrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST), Information Technology Laboratory [http://trec.nist.gov].

2 The Citeseer Web site provides useful information about the term "corpus." A shorthand definition is "the body of material being examined." See http://citeseer.nj.nec.com/hawking99results.html. Links were updated in 2002.

3 See NuTech Solutions' description of its technology at http://www.nutechsolutions.com. The search product is marketed as Excavio. For a demonstration, go to http://www.excavio.com.

4 I received a round of applause at the Library of Congress during my talk on wireless computing when I said, "Where should information quality decisions be made? In the boardroom of companies like Arthur Andersen and Enron or in management meetings where trained information professionals vet data?"

5 Kartoo's engine is located at http://www.kartoo.com. Groxim requires that the user download a software module and run the program on the user's machine. Groxim's software is located at http://www.groxim.com.

6 PDF is the acronym for Adobe's Portable Document Format. Adobe has placed the PDF specification in the public domain to ensure wide adoption. Like PostScript, the PDF file focuses on rendering a page for rasterization, not search and retrieval.

