Beyond the Spider: The Accidental Thesaurus

Whether it's your favorite search engine on the public Web or an intranet search tool that your organization supports, it's almost a certainty that you rely on a robot to find content every day. Whether Internet-wide or intranet-specific, a search engine, or spider, relies on robotic elements to help you find content. A robotic crawler traverses Web pages, looking for content and for links to more content; a robotic indexer compiles a database of words and URLs; a robotic search component fields user queries, trying to match a user's search with the most relevant pages.

A human search administrator starts the process, telling the spider — our robotic helper — to start at a particular spot, and where to go — or where not to go — in its crawling and indexing. From then on, the robotic components of your favorite search tool proceed with very little human guidance. By and large, end users see hit lists whose content has been determined mainly by choices in algorithms made by the software engineers back at the factory.

A robotic spider can be incredibly powerful and accurate — or it can deliver a hit list full of irrelevant minutiae. Is it possible to offer the very best hits, specifically for those searches performed the most frequently?

Most organizations which take time to do the research learn that a relatively small number of searches are repeated quite frequently by a large number of users. For those most common searches, maybe we can deliver better results by keeping a human editor in the equation. A human editor can augment the results the robotic spider churns out — by finding the very best pages that would most benefit those users and inserting those items in the hit list before the robot offers its best guesses.

A Modest Proposal:
The Accidental Thesaurus

For intranet, online product catalog, newspaper, campus sites.

Build a thesaurus based on what people look for.

Don't even try to be comprehensive.

Use your search logs to find what people look for - and how they actually search.

Fuzzy matching of user searches against thesaurus, à la Ask Jeeves.

Case in point. Soon after AT&T established consumer sales on its Web site, prospective customers began searching for "long distance" and receiving large hit lists. Most of the listed results turned out to be press releases, issued over a period of months, each announcing yet another new wrinkle on a long-distance service plan.

But sometimes customers don't want to see a long hit list. In this case, most potential or actual customers wanted a very short hit list, perhaps one only including items like this:

So spider software and the search engines behind them can indeed be foolish at times. Eventually, AT&T came up with a solution: "AT&T Keywords" (see ATT Search Engine at right). Today, the search box at att.com says: Enter Search Term or AT&T Keyword.

An AT&T Keyword is a word or short phrase that you can type into the "Search" box and that will take you directly to an AT&T Web page. Some examples include:

Just like Web site search terms, you type AT&T Keywords into the "Search" box located on every AT&T Web page. If an exact match to your keyword is found, your browser will automatically display the associated Web page; otherwise, you will see a list of related AT&T pages. If the word you typed matches two or more AT&T Keywords, the results will return a list of matching keywords (with short descriptions of each).

Here, AT&T has interposed an editor's judgment into the equation. A human being analyzes what customers most likely seek when they come to the AT&T site and manually inserts key phrases and matching URLs into a database. When a customer enters a search, the system probes both the manually constructed database as well as the robotically built index. The majority of users see what they want at the very top of the hit list.

AT&T isn't the only Web presence to adopt this editorially chosen keywords approach. Consider ESPN. At their site you can type "Vitale" and go directly to the Web site of their voluble basketball commentator. Or travel across the pond to the BBC's site and do a search for Changing Rooms (the precursor to the popular American TV show Trading Spaces). You'll see exactly what you're looking for as a "Best Link" at the top of the hit list (See the BBC homepage below).

Given the wide breadth of BBC content, a robotic search engine might or might not locate the correct page given the common English words "changing" and "rooms." The BBC wisely has an editor intervene in the search. Why depend on the kindness of strange robots? Why not make the choice editorially?

Perhaps the most famous example of the editorially chosen keywords concept is "AOL Keywords" as defined for America Online and its customers. AOL Keywords serve multiple purposes. The keywords are a shorthand way for customers to get to popularly sought content. Need to report someone who has violated AOL's Terms of Service? Go to Keyword TOS. AOL Keywords provided a navigational "handle" for millions of customers long before they understood the concept of the URL. (Today, a single search box on the AOL service accommodates both AOL Keywords as well as URLs.)

Of course, AOL doesn't just choose its Keywords as a navigational aid or shortcut. People who don't use AOL are mystified by the redundancy when an ad on television says, "Come to sears.com. AOL Keyword: Sears." For AOL, that redundancy is revenue; companies pay to be listed in AOL Keywords.

My own interest in the "disconnect" between what users tend to type and what search engines tend to find goes back to the days of Gopher and Veronica. In 1992, I helped organize a national workshop on the hot tool of the day, the Internet Gopher. Light bulbs went off when Nancy John of the University of Illinois at Chicago provided some "remedial library school for computer people" and showed how analysis of search logs could illuminate what content customers really seek. Hmm, search logs as feedback into decisions on how to present information ... interesting concept.

Library science has long understood ways to map multiple choices of terms into a single concept. The classic example is a bibliographic database that treats Mark Twain and Samuel Clemens as the same author. A thesaurus makes it possible.

If we were to hire a professional thesaurus builder to solve a vocabulary problem, the consultant would probably do a rigorous analysis of the terminology used in the literature of the discipline — manufacturing, physics, management, whatever. We would contract with our information scientist to build a comprehensive thesaurus covering the language of the field.

But the examples we've seen so far are considerably less rigorous than that. No doubt AT&T examined its search logs to inform its choices of keywords to ensure that the searches done by 99 percent of its customers yield pay dirt. AOL chooses its internal Keywords for convenience — and its partner keywords for profit. For practical, everyday solutions, we do not need the rigor of an academic discipline examining the literature.

In a presentation for the Access '98 conference in Calgary, I termed this user-driven, explicitly non-comprehensive approach "the accidental thesaurus" (see A Modest Proposal below).

In the remainder of this article we'll explore other projects to build a smarter spider via the accidental thesaurus approach: my own efforts at Michigan State University and those of information scientists at Bristol-Myers Squibb.

When AltaVista burst onto the scene in 1995, it quickly caught my attention as a new high water mark in Web search technology. Soon after its global Web search went online, I was on the phone with Digital Equipment Corporation suggesting that it market the product for intranet applications. In 1996, Michigan State University beta tested the product and became the first institution of higher education to license it as a campus-wide spider.

Since that time, our msu.edu Web presence has grown dramatically, and, over time, this growth in content coverage has made it increasingly difficult to find popular campus service points on the Web. The classic example: a simple search for

This turns out to be a wonderful challenge for a spider. Both terms in the phrase are common English words. We have numerous personal pages on campus whose owners want to become human resources professionals, and we have an academic program in the area as well. But most users just want the HR department. The university even has more than one unit under that name. In this context, an AltaVista search became increasingly futile: The hit list was too polluted to be useful, especially at the top.

Over time, as campus search administrator, I began to receive more and more complaints from users about the difficulty of finding commonly sought content. At the same time, complaints rose from campus content providers reacting to what they heard from irate customers.

We began analyzing user search logs to see what content users sought the most. The analysis confirmed that popular campus service points such as "human resources" were among the most common. (Note that this was aggregated log analysis; MSU has strict rules against monitoring the searches of any individual user.)

Ironically, AltaVista and its rivals had already developed technology that could solve our problem, ironically to accommodate advertising. Before AT&T Keywords dawned on AT&T's own site, AltaVista would deliver you an AT&T banner ad and hit list link if you typed "long distance" as a search phrase. In the absence of a similar built-in feature in AltaVista's intranet product, I decided we needed to roll our own.

Working with a student programmer, I began fleshing out a design for MSU Keywords. We decided to use Active Server Pages to connect a Web user with a database back-end. Initial prototype work was done with Microsoft Access; the production system uses MS-SQL as the database (see above).

We began the effort with some clear design goals. As the software became more functional and the database grew, those goals evolved somewhat. Here are the basic specifications of MSU Keywords:

As the service became functional I began building a corpus of keywords and matching URLs for the most commonly sought content. For the most part the effort was literally driven by what users searched for the most in the existing AltaVista service: I examined logs to find the most common searches, found the best Web page to match, and entered the keyword (or phrase) and URL into the database. Most entries came from user input, but not all: in some cases I added Keywords for sites in the campus telephone book or items I encountered in brochures or newspaper articles (see image on right).

We quietly launched MSU Keywords early in 2002. We did nothing to highlight the new functionality with end-users, though we did send a letter to all departments urging that content providers submit keywords and corresponding URLs for entry into the database.

Recent features have augmented functionality for both users and content providers:

We continue to learn from log analysis. In the age of Google, users now usually search in a way more likely to yield good results: a key word or two with no extraneous words. The top 30 searches over a recent period of several weeks are illustrated in Table 1.

Some of these terms are peculiar to the institution; "twig" appears not because we have a Department of Forestry, but because that is the moniker of a Web gateway to an e-mail system named "pilot." "Blackboard" is a commercial course management system. Its consistent appearance at the top of the search charts shows its popularity — and perhaps the difficulty in finding the service without doing a search.

Note that these top 30 search phrases represent some 15 percent of the 200,000 searches performed in this period. Thus, if MSU Keywords offers the best Web pages for those 30 searches, a very small database can supply 15 percent of our users with exactly the right content. When you consider that numerous near-match phrases correspond to these same searches, a small editorial investment in MSU Keywords can yield even more benefit to a larger percentage of users. (Although we believe this data is representative, we do not capture all searches in our logs; Google's index of msu.edu is a popular alternative to our service.)

Conversely, the law of diminishing returns applies as well. The question becomes: How far down the list of unique search phrases should our human editor venture?

Just how steep is the curve of diminishing returns? If we ask how many unique searches it takes to account for percentiles of total searches performed, an interesting pattern emerges (see Table 2 on page 74).

The data confirms our supposition: There is high payoff for putting a small number of unique phrases — perhaps several hundred out of over 50,000 — in our thesaurus, after which returns diminish rapidly. We can match 50 percent of users' searches by manually matching fewer than 1,000 unique search phrases — a manageable amount of editorial effort. But if we want to achieve 90 percent coverage, we must include over 30,000 phrases in our thesaurus. Obviously that would entail a huge amount of effort, far beyond a reasonable allocation of staff resources in an environment as volatile as the Web.

The diminishing of returns becomes even more obvious if we look at the distribution graphically (see "Returns Distribution" graph on page 75).

As it happens, scientific literature supports our observations. People often refer to the "80-20 rule." We can thank the Italian economist Vilfredo Pareto who, in 1906, observed that 20 percent of the populace controlled 80 percent of the wealth. Over the years numerous corollaries to the rule have been put forth: 20 percent of your workers provide 80 percent of your output; 20 percent of staff issue 80 percent of employee complaints, etc.

Others have proposed analogous rules. In 1934, Samuel C. Bradford theorized that a small number of scholarly journals contribute the vast bulk of scientific output in any given discipline. Librarians (and journal publishers) still contemplate Bradford's Law of Scattering. George Kingsley Zipf analyzed how frequently each of 29,899 unique words appeared in James Joyce's Ulysses. Zipf was independently wealthy and hired a team of workers to perform these summations. (Fortunately at MSU we use MS-SQL and efficient stored procedures to do our analysis.) Zipf found that a small number of reused words account for a huge percentage of the total. His book, Human Behavior and the Principle of Least Effort, published in 1949, analyzed the notion of outputs disproportionate to inputs. Zipf's distribution has since been applied to other areas, such as the populations of cities within a nation.

The distribution curves of Pareto, Zipf, and Bradford are remarkably similar, even though describing very different things. The distribution of search phrases at MSU follows suit: Almost eerily, our curve mirrors theirs. The core concept is incontrovertible: You can achieve a huge payoff with a small investment of effort, after which you may start wasting your time.

If we look at the tail end of the search logs — search phrases that appear only a handful of times — we continue to observe perfectly reasonable searches, such as:

It is tempting to try to put all "reasonable" queries into the MSU Keywords database, either as unique entries or as aliases to existing ones. Carried to its logical extreme, this would mean that we would analyze every search and hand-enter the best matching URL into the database. But that way lies doom; this would in effect substitute a human for a search engine. If a given search is performed only a few times, far better to let the search engine do its job as best it can, instead concentrating manual efforts on hundreds or thousands of searches. (And far better to use the best search engine for relevancy, arguably Google.) Advice to the accidental thesaurus builder: Avoid scanning raw search logs; only look at the top of the charts.

We have learned a great deal from building MSU Keywords — with some outcomes not intuitive. Feedback from users and content providers indicates far greater happiness with the search experience. Most users and content providers don't know the role MSU Keywords plays in vastly improved search relevancy.

I continue to be astonished by the extent to which users seek not obscure "leaf" pages but major starting points by using the search engine. I shouldn't be surprised: They've been trained by Google. In the first moments after planes hit the World Trade Center, 6,000 people a minute typed "CNN" into the Google search box. People count on search engines to find home pages that ought to be otherwise highly visible.

Our search logs are remarkably stable over time, with some understandable exceptions. When the last school year ended, searches for course information declined, while searches for online grades went up dramatically.

Thus far, our A-Z browsing view (see page 75) sees little use. In part this is because the MSU home page does not link to the A-Z. It's also likely that most users, now more satisfied with the search experience, see no reason to go there.

Originally, we planned for MSU Keywords to behave like AT&T, AOL, and ESPN Keywords: If MSU Keywords found an exact match, the user would be immediately redirected to the matching site. For instance, if a user sought the home page for the Wharton Center for the Performing Arts, the keyword "Wharton," as an exact match, would drive the user straight to the Wharton site, without showing a hit list at all. Although this functionality is in place, it hasn't yet been put in production as a default; MSU searchers always see a hit list, even with an exact match.

As for content providers, in practice it is somewhat difficult to convey what the new service is all about and how to exploit it. Some content providers make it hard for MSU Keywords (or any finding aid) to work well. For instance:

Student programmer Nathan Burnett developed our first search log analyzer. Journeyman student programmer Mathew Shuster designed and developed the MSU Keywords database and the MSU Search Logger. A graduate student, Qin Yu, also worked on the project. Anne Hunt designed the graphics. Dennis Boone and Edward Glowacki are past MSU AltaVista administrators.

Bristol-Myers Squibb (BMS) is a major pharmaceutical and healthcare products company with 40,000 employees working at its New York headquarters and campuses around the world. Information scientists at the company have developed an intranet application based on the accidental thesaurus concept. Their efforts at Bristol-Myers Squibb bear a remarkable resemblance to ours at Michigan State University; the main concepts and goals were almost identical, although the user base and implementation differ substantially.

Mike Rogers, associate director for Information Architecture at the company, says he and his colleagues knew they had a problem with helping company users find commonly sought content on a huge and diverse intranet. After hearing a presentation by Vivian Bliss about her approaches to the same problem for Microsoft's intranet, Rogers saw an opportunity. After additional consultations with Microsoft, Rogers set out to add an editorially chosen keywords component to his search services.

Lydia Bauer, a senior Information Scientist at Bristol-Myers Squibb, says that in analyzing search logs at the company, they noticed that the top 100 search terms were sought quite commonly. Searches related to the company's lines of research were common, as were searches related to human resources. Like many organizations, the company had an intranet that evolved without a governing body, covering everything from scholarly texts on pharmaceuticals to information about the annual picnic. Improving the search experience would help tie things together. (Another approach, installing a company-wide portal product, is also underway.)

Bristol-Myers Squibb actually has two search engines, one based on Verity, and another a corporate Web search engine. Rogers and Bauer sought to add a "Best Bets" service (see above), that would integrate with the other two engines. They consulted with in-house developers and decided upon Cold Fusion as the middleware to connect to their keywords database.

It took several months to build the database back-end and integrate it with the company search tools. Today, when Bristol-Myers Squibb workers search the company intranet, if a matching Best Bet exists, it displays prior to the beginning of the hit lists from the other two engines — a "federated" search in Bauer's words.

Bauer, who is currently finishing work on an MLS degree, expresses surprise that she has found very little discussion of this approach in library or information science literature. "I've found lots of discussion of using search log analysis in augmenting a thesaurus, but not much on building a Top 100 or Best Bets service." Perhaps this is due to history: Only in the era of the Web do we have robotic spiders whose logs can inform the creation of an accidental thesaurus. Bauer also wonders how much editorial effort to invest in the service. She originally intended to stop after the Top 100. However, in surveying users, she finds impressive positive feedback — 87 percent found the service good or great — with the only negative being, "We want more." Currently, the BMS database has some 1,500 terms relating to some 450 sites — interestingly, about the same scale as MSU's.

As with MSU Keywords, Bauer bases her editorial decisions primarily on what users seek the most. She reads the company's daily news reports and adds keywords for events, new product information, etc., as reported each day. The majority of entries in Best Bets are internal company resources — this is, after all, an intranet service — but she has added links to Mapblast, Yahoo!, etc., on the assumption that workers seek these resources for a reason.

Bauer sometimes encounters situations, as I have, in which content that ought to be online — we know this because the logs show that people look for it — simply isn't on the intranet. In those cases she tries to contact the relevant department and suggest it publish a new page.

MSU Keywords and BMS' Best Bets overlap considerably, but there are differences. Each of us has some features the other desires. Bauer gets a daily report of the most-sought keywords not in Best Bets — very useful feedback for the editor. Her service does not yet incorporate birth and expiration dates, nor does it have an A-Z option. Overall, though, it's uncanny how similar the efforts are.

One might expect that, over time, as users learn the effectiveness of searching, they will migrate towards full system searching. In studying logs and talking with users, Bauer has learned that currently about the same number of people browse as search.

Best Bets has been a big win for users of the Bristol-Myers Squibb intranet, just as Bliss' work has been at Microsoft. (We'll examine Microsoft's use of Best Bets in a future article.) But Rogers and Bauer aren't resting on their laurels; they are investigating adding a natural language facility to their search services.

I Never Metadata That Solved a Problem

Someone from the Dublin Core school of thought may have this argument: "Wait a minute — you could solve this problem if only you had good metadata." (The "Dublin Core" is a set of core metadata elements defined at a famous meeting in Dublin, Ohio, the home of OCLC.) The likely claim: If content providers thoroughly describe their own content as they publish it, then your robotic spider can harvest that data as it crawls, providing for superior searching.I have watched for almost a decade as the Web community has struggled with metadata issues while in practice precious little is done in the real world. There is not even a consistent way to determine the last time a given Web page was modified. At my university, we struggle in a constant uphill battle to convince content providers not to do things that hide their sites from spiders altogether, much less to provide good metadata as they publish.

Until authoring tools and Web publishing environments rigorously enforce metadata standards, we won't have good metadata for robots to chew on. I offer the East Lansing Maxim as a response to Dublin Core aficionados: "Everybody talks about metadata, but no one does anything about it." That's not literally true, of course; there are some promising applications of Dublin Core. But as Mike Rogers observes, "It's very difficult to enforce metadata standards across a large enterprise."

Even if Web publishers provided good metadata with their content, that only solves the problem of having good labels to describe the content. It doesn't help decide which pages belong in the best bets service and which ones do not. Again, take the example of maps. A university may have thousands of pages involving maps online, from the holdings of the Map Library to databases in geography to personal pages with Mapquest links. The vast majority of site visitors want a campus map. Content-provider-chosen metadata can't make essential editorial decisions.

Does the Accidental Thesaurus make sense for your organization? I claim that any organization that has a substantial Web presence with a significant user base should incorporate an accidental thesaurus into its search service. Examine your search logs. If a large number of users seeks the same content areas using the same terms, test to see if your robotic spider delivers the goods at the top of the hit list. If it doesn't, time for an editor to step in.

For any given Web presence, whether intranet or global, the top 500 unique search phrases entered by users represent at least 40 percent of the total searches performed.

The examples of Michigan State University and Bristol-Myers Squibb alone cannot prove this theorem, but given our data as well as Pareto and Zipf, I'm confident enough of its validity to challenge people to disprove it. Given a tool like MSU Keywords or Bristol-Myers Squibb's Best Bets, you can vastly improve your users' search experience with a modicum of editorial effort.

You need not build your own software; some search products support editorially chosen "best bets" in the native product. For instance, Inktomi's intranet search product offers a "Quick Links"feature that fills the bill; go to www.nortelnetworks.com and search for "vpn" to see how it works. Some knowledge management products that cost six or seven figures let editors deliver even more sophisticated result sets, essentially offering robot-assisted presentation of frequently sought content, organized by category. Whether you use a million dollar tool or roll your own Best Bets service, by augmenting the spider you can help your users find the most popular content more easily.

Ultimately, my argument is simple: If you help a lot of people find content that they frequently seek, you improve the overall efficiency of the organization. In an IDC White Paper, "The High Cost of Not Finding Information," Susan Feldman and Chris Sherman note that lack of information results in poor decisions, duplicated effort, lost sales, and lost productivity. They estimate that for an organization with 1,000 knowledge workers, the cost of information not found exceeds $2.5 million per year. Even your management should understand those reasons for investing effort in building an accidental thesaurus.