The Future Revisited: What's New with Web Search

Chris Sherman

ONLINE, May 2000
Copyright © 2000 Information Today, Inc.

Last year, in "The Future of Web Search" (ONLINE, May/June 1999, pp. 54-61), I took on the somewhat quixotic task of painting a portrait of the state of Web search in the year 2004. All of the industry leaders I spoke with were amused by the audacity of the idea--after all, on Internet time, most people can't predict what will happen five months from now, let alone five years hence.

Nonetheless, some long-term trends emerged that are definitely shaping the constantly shifting landscape of search. But inevitably, given the startling acceleration in the growth of the Web and players seeking to capture the minds (and dollars) of its users, new ideas, technologies, and projects have sprung up that hold significant promise--and implications--for the future of Web search.

CONVERGENCE: THE EVERYWHERE WEB

The most ballyhooed and visible example of convergence in the past year was the proposed merger between AOL and Time Warner. Widely touted as the killer marriage of a giant content company with the dominant provider of Internet access, it remains to be seen whether the new company will emerge as an online titan or succumb to the sclerotic forces that inevitably tend to bedevil huge companies.

Also widely heralded were new-generation mini-browsers, appearing on everything from palm organizers to cellular phones to (calling Dick Tracy) wrist watches. These devices will soon become pervasive thanks to advances in micromachining--carving out and building up microscopic structures on silicon wafers--that will allow multiple components to be constructed on a single chip at very reasonable cost. Search is a killer app for mini-browsers, though it will predominantly be search for real-world and real-time information--maps and directions, stock quotes, news and sports scores, and so on.

Convergence also arrived in not-so-obvious forms. Scores of Internet-enabled devices were announced at the CES International show in Las Vegas in January. Internet refrigerators, dishwashers, and microwave ovens will soon be commonplace. It's easy to dismiss these new developments as irrelevant for serious searchers. Look at the companies involved in the new generation of Web gizmos, though, and you'll see the major players behind today's Internet infrastructure: Cisco, Sun Microsystems, Microsoft, Intel. You'll also see serious involvement from makers of consumer-entertainment products.

For these new products to be successful, massive amounts of R&D money will be spent, with particular emphasis on making them easy to use. This should drive a huge wave of innovation, particularly in the area of connectivity and user-interface design. These products will also spur (finally) the creation of an effective micropayment system.

The implications of micropayments for both traditional information providers and Web search engines/portals are enormous. Documents will shatter into packets; packets into straightforward "answers" that are true and succinct, not links back to full documents. Micropayment capabilities will once and for all overcome the pervasive but ridiculous notion that "information wants to be free," allowing content owners to slice and dice information in myriad ways, and searchers to become information consumers who won't think twice about paying for what they're getting because their needs will finally be satisfied. While a certain part of the Web population will continue to want full-text search and retrieval, a much larger portion will use search simply to solve problems. When your Web-enabled VCR can automatically answer hard questions like "How do I get the clock to stop blinking 12:00" by querying the U.S. Naval Observatory Master Clock and automatically reprogramming itself, even "user friendly" services like Ask Jeeves may find themselves with countless underutilized answers in search of a questioner.

INFORMATION UNDERLOAD: THE SEARCH ENGINES AWAKEN

July 1999: A study published in Nature by NEC Research Institute scientists Steve Lawrence and Lee Giles estimated that the publicly indexable Web had over 800 million pages, but that no search engine indexed more than 16% of the total. Like a glass of cold water splashed in their collective faces, this seemed to awaken the major search services and spur them to action. Suddenly, the competition to have the biggest index of the Web became intense. Within months of publication of the NEC study, search-engine claims of doubling or tripling index sizes became commonplace.

Then, in early 2000, Inktomi, together with the NEC Research Institute, announced the results of a study showing that the Web had grown to more than 1 billion documents. At this juncture, relative newcomer FAST appeared to be the leader, with 300 million pages in its index. Both AltaVista and Excite claimed numbers in the 250-million-page range. All claimed to have sampled many more pages than they indexed, presumably filtering out duplicate pages or spam.

As impressive as these catch-up numbers are, the engines are still lagging behind the growth rate of the Web. IDG estimates that the Web will grow to over 13 billion pages over the next three years. It's beginning to appear that centralized approaches to creating Web indexes may not scale with the Web's explosive growth. Catching up will likely require adopting some sort of distributed search approach, similar to the approach patented by Infoseek.

Distributed search engines would have much in common with the massive scalable databases used by scientists who are increasingly working with huge datasets. For example, CERN (the birthplace of the Web) is building the world's largest particle accelerator, the Large Hadron Collider (LHC). The LHC will generate at least one petabyte of data per year. To put that in perspective, consider that the entire Library of Congress stores less than a thousandth of that amount (one terabyte).

To cope with this truly staggering amount of information, the LHC will decentralize storage, maintaining databases housed on computers around the world. Any researcher with Web access can query these servers and display results in seconds--from the full, unfiltered dataset. Adapting this technique to search indexes has obvious advantages, and as a technological solution, distributed search appears to have great potential. Whether the business models of the search engines can adapt to embrace the approach is another question altogether.
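As a rough illustration of what a distributed approach to search might look like, here is a minimal scatter-gather sketch: a query is sent to several partial indexes in parallel and the partial results are merged by score. Everything in it is an assumption made for illustration (the toy `SHARDS` data, the example.com-style URLs, the scores, and the function names); it does not describe the Infoseek patent, the LHC databases, or any engine named in this article.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each holds a partial index mapping a term to
# (score, url) pairs. A real system would keep these on separate servers.
SHARDS = [
    {"collider": [(0.9, "http://a.example/lhc"), (0.4, "http://a.example/news")]},
    {"collider": [(0.7, "http://b.example/cern")], "web": [(0.8, "http://b.example/www")]},
    {"web": [(0.6, "http://c.example/history")]},
]

def search_shard(shard, term):
    """Look up one term in a single shard's partial index."""
    return shard.get(term, [])

def distributed_search(term, shards=SHARDS, k=3):
    """Query every shard in parallel, then merge the k best hits by score."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), shards))
    hits = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, hits)

print(distributed_search("collider"))
```

The design point is that no single machine needs to hold the whole index; only the small merge step at the end is centralized.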

POWER TO THE PEOPLE

A surprising development during 1999 was the surging popularity of human-compiled directories of the Web, most notably, the Open Directory Project (ODP). Though it is little more than a year-and-a-half old, the ODP now provides directory data to more than 100 search engines, including AltaVista, AOL Search, Dogpile, HotBot, Lycos, Metacrawler, and Netscape Search. This gives the ODP a reach of total users comparable to that of traffic-leader Yahoo!

However, if automatically spidered search engines can't keep up with the explosive growth of the Web, human-compiled directories don't have a prayer of being comprehensive. This doesn't mean they will be any less popular. What we'll likely see over the next few years is the emergence of hybrid human/machine-compiled directories. These will take two primary forms: machine-compiled directories that will be "edited" by humans for quality control, and systems that apply intelligence to searching an existing human-compiled directory. The Inktomi directory engine is an example of the former, and Oingo is an example of the latter. Oingo currently uses ODP data as its source, though its technology is readily adaptable to any source of directory data.

Rather than doing simple keyword-matching searches, Oingo searches are conducted within what Oingo calls the realm of "semantic space," bringing up categories and documents that are close in meaning to the concepts the searcher is interested in. The key advantage to this approach is that results can be retrieved that would have been missed in a traditional plain-text search. Simply because a certain word does not appear on a page does not preclude its relevance, if the document is conceptually related to the query. Oingo also provides a sophisticated filtering mechanism that allows successively greater degrees of control over search results by specifying the exact meaning of query words, eliminating irrelevant alternate definitions.

SEARCH GETS PERSONAL

The holy grail for search services has always been a system that adapts itself to user needs by observing, learning, and reconfiguring itself to deliver only totally relevant results while screening out the dross. Most systems that attempt personalization apply Artificial Intelligence methods to some degree. AI has made great strides in the past few years, but we're still a long way from truly intelligent agents that essentially become your all-knowing e-librarian.

But there are less ambitious approaches to capturing information about what people find relevant and fine-tuning relevance algorithms to reflect this information. Google's PageRank system does this in part by analyzing page "importance," a form of citation analysis that amounts to a virtual peer-review process for Web pages. Direct Hit takes a different approach to measuring the relative popularity of a page. The system observes which pages are selected from search results, and how long visitors spend reading the pages. Direct Hit has compiled data on more than one billion of these "relevancy records" and continuously updates its user-relevancy rankings based on newly gathered data.
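To make the idea of link-based citation analysis concrete, the sketch below computes page importance with the simplified, publicly described "random surfer" formulation of PageRank. It is a toy under stated assumptions, not Google's production system: the damping value of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the three-page example graph are all chosen here for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance; links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of importance.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it "cites".
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: pages "a" and "b" both link to "c"; "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph the page with the most incoming links comes out on top, and a page cited by an important page outranks one that nobody cites.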

Both of these systems rely on analyzing the behavior of the aggregate population of Web users. New systems are emerging that bring the focus to the level of the individual. Backflip is an interesting and highly promising example of such a system.

Backflip essentially creates a Yahoo-like directory from your bookmarks or Internet favorites. Because it is constructed from pages and sites that you have already vetted by bookmarking them, the directory is almost totally relevant to your needs. And it transcends a simple list of bookmarks, because Backflip captures the full text of all bookmarked pages, and then allows you to run sophisticated keyword queries on just that limited set of pages.
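The general technique of running keyword queries over the captured text of a small, personally vetted set of pages can be sketched with a simple inverted index. This is an illustration only: how Backflip actually stores or searches pages is not described here, and the function names, the sample bookmarks, and the all-terms-must-match query rule are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages maps URL -> captured full text; returns term -> set of URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs whose text contains every query term (an AND query)."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical bookmarks with the text captured from each page.
bookmarks = {
    "http://one.example/": "Search engine index sizes compared",
    "http://two.example/": "Recipes for a quick weeknight dinner",
    "http://three.example/": "How a search engine ranks pages",
}
index = build_index(bookmarks)
print(search(index, "search engine"))
```

Because the index covers only pages the user has already chosen, even a crude matching rule like this one returns few irrelevant results.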

Soon, Backflip will introduce the capability to do keyword searches on every page you've ever visited, whether you've bookmarked it or not. No longer will you need to use a search engine to find a previously visited page--and you won't want to, since the visited search space of even a power user will be orders of magnitude smaller than a smallish general-purpose directory, assuring dramatically relevant results for nearly any query.

Don't expect the search engines to ignore this threat. They could easily offer a similar system by taking advantage of information they already provide to advertising networks in the form of click-stream data. With appropriate permissions, you could allow a search engine to analyze your personal Web-surfing data (perhaps under license from ad servers like DoubleClick or Flycast) and use it as a massive filter on the engine's full index. The privacy implications of this approach are enormous, of course, but so is the potential for dramatically improved, customized search results.

BROWSER-FREE SEARCHING

Searching the Web typically means calling up a search engine or directory in your browser window, entering keywords, and viewing the results. We're now seeing a new class of tools emerge that operate independently from the main browser window. Alexa was one of the first of these utilities, providing links to related sites and other information.

AltaVista offers Discovery, a thin standalone bar that provides AltaVista search and several other useful features. It includes the Hyperbolic Tree by Inxight, which displays Web-page relationships, allowing you to visually navigate sites. Like Backflip, Discovery can search your previously viewed Web pages, and it can also summarize Web pages, highlight keywords, and perform several other useful tasks. Excite's Assistant offers different features, but is also browser-independent.

Two relatively new programs point to further browser-free innovations that we'll likely see more of in the future. These are applications designed to provide ready reference information at your fingertips. GuruNet is a deceptively simple utility that allows you to highlight a word on a Web page and get dictionary and encyclopedia definitions for the word in GuruNet's pop-up window, as well as a translation of the word into the language of your choice. Depending on the word, you may also be offered science- or technology-related definitions.

GuruNet also provides you with relevant RealNames links, and the results of an Oingo search for the word, providing you with direct access to relevant Web pages without the steps involved in calling up a search engine or directory and entering a query. Flyswat is what you might call an "association engine." It scans every word on a Web page and automatically hyperlinks the words for which it has additional information. Clicking the Flyswat link brings up a window with links to focused, targeted information related to the word.

For example, if a page mentions a publicly traded company, Flyswat automatically turns its name into a link. Clicking the link brings up a menu with literally dozens of sources of information about the company, including links to analysis, news, various investment-related sources, the company's home page, and so on. For names of people, Flyswat creates links to biographical information. Cities, countries, or other places get links to maps, country facts, weather, and tourist information. In essence, Flyswat automatically constructs targeted search results using all of the significant keywords on a page.
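The behavior described above, scanning a page for known names and turning them into links, can be approximated with a lookup table and a regular expression. This is a sketch of the general idea only, not Flyswat's implementation: the `KNOWN_TERMS` table, its example.com-style URLs, and the HTML output format are all hypothetical.

```python
import re

# Hypothetical lookup table: known term -> page of further information.
KNOWN_TERMS = {
    "Cisco": "http://info.example/company/cisco",
    "Las Vegas": "http://info.example/place/las-vegas",
}

def link_known_terms(text, known=KNOWN_TERMS):
    """Wrap each known term found in the text in an HTML link."""
    if not known:
        return text
    # Try longer terms first so multi-word names win over shorter overlaps.
    alternatives = sorted(known, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b")

    def make_link(match):
        term = match.group(1)
        return '<a href="%s">%s</a>' % (known[term], term)

    return pattern.sub(make_link, text)

print(link_known_terms("Cisco exhibited in Las Vegas."))
```

A real association engine would need a far larger table and some way to tell a company name from an ordinary word, which simple string matching cannot do.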

WHOLE DOCUMENT QUERIES

Most current search systems are limited to keyword queries. Some systems also allow natural-language queries, but typically these are limited to a single phrase or question.

New systems under development allow searching based on similarity of content rather than similarity of vocabulary. These systems allow users to submit paragraphs or even entire pages of text as a query, then attempt to actually understand what the query is about. They work by evaluating words in context and assigning documents numerical values that can be used for sophisticated relevance calculations.
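One simple way to see how a whole passage can serve as a query, and how documents can be given numerical values for ranking, is cosine similarity over word counts. Note the limits of this sketch: it still depends on shared vocabulary, so it is a much cruder stand-in for the context-aware systems described above, and nothing here reflects how any named product works. The function names and sample documents are assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a passage of any length into a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query_text, documents):
    """Score every document against the whole query passage, best first."""
    query = vectorize(query_text)
    scored = [(cosine(query, vectorize(text)), name) for name, text in documents.items()]
    return sorted(scored, reverse=True)

# Hypothetical two-document collection.
documents = {
    "physics": "the collider smashes particles and records particle collisions",
    "cooking": "simmer the sauce and season the soup",
}
query = "Physicists at the collider study collisions between particles"
print(rank(query, documents)[0][1])
```

A longer query gives the vector more terms to match on, which is one plausible reason whole-document queries can help rather than muddy the results.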

Using huge chunks of text rather than specific keywords may seem like a sure recipe for muddying the clarity of a query, but the approach seems to significantly improve recall and precision. Ejemoni is an example of such a new service that should have been demonstrated in beta format sometime earlier this year. The program was tested at Vanderbilt University in a controlled academic environment on a relatively small database, and achieved impressive scores in both recall (85%+) and precision (86%+). Though it remains to be seen whether the system can scale to the full size of the Web, Ejemoni has attracted some high-power talent with impressive credentials to its executive team and board of directors. Two other promising systems that go beyond traditional keyword searching are Albert and Simpli. (Editor's Note: Dialog's WebTop also allows full-text, cut-and-paste searching. See Mick O'Leary's description in O'LEARY ONLINE on page 91.)
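For readers less familiar with the two measures quoted above: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The short sketch below computes both from sets of document identifiers. The identifiers are made up for illustration and have no connection to the Vanderbilt test.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the system returned
    relevant:  documents a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three judged relevant,
# two of which appear in both sets.
precision, recall = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
print(precision, recall)
```

The two measures usually pull against each other, which is why a system scoring well on both is noteworthy.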

BEYOND HAL

A common archetype for the intelligent computer is HAL, the murderous silicon star of 2001: A Space Odyssey. HAL's flaw, and that of all current search tools, is the lack of a fundamental component of human intelligence: common sense. Lacking the thousands of real-life heuristics we humans have learned from the day of our birth, search engines are still essentially idiot savants, able to perform startling feats of recall without really "knowing" what they are doing.

One of the most interesting attempts at integrating common sense into a computer program is the CYC Knowledge Server. The work of noted AI researcher Doug Lenat, CYC (pronounced "psych") attempts to introduce common sense into the search equation by tapping into a large knowledgebase of facts, rules of thumb, and other tools for reasoning about objects and events of everyday life. The knowledgebase is divided into hundreds of "microtheories" that share common assumptions, but which can also appear to contain contradictory facts when applied to different domains of knowledge or action. Humans, using common sense, can resolve these apparent contradictions based on context. For example, it's not acceptable to whoop like a lunatic when you approve of a business decision at a meeting, but it's equally unacceptable not to go bonkers when your favorite football team scores a touchdown during a big game. The goal with CYC is to dramatically increase tolerance for ambiguity or uncertainty while simultaneously providing a "reasoning engine" that approximates human common sense. CYC works well on semi-structured information, such as the records in the Internet Movie Database. CYC began as a well-funded research project in 1984; one hopes the technology underlying it, and systems like it, becomes more widespread.

THE FUTURE ISN'T WHAT IT USED TO BE

These are just a few of the countless developments and innovations that are shaping and transforming the landscape of Web search as we move into the new millennium. For space reasons, we didn't touch on other areas where notable progress is being made, such as intelligent agents, visualization and interface design, metadata standards, and the development of micromachines and bio-silicon chips that will soon allow people to "jack in" to the Web, directly inhabiting the world of cyberspace as described in novels like William Gibson's Neuromancer and Neal Stephenson's Snow Crash.

This isn't science fiction anymore. Carver Mead and his group at Caltech are working with neuromorphic analog VLSI chips--silicon models that mimic, or capture in some way, the functioning of biological neural systems. It will only be a matter of time until these systems begin to interact with one another, learning, storing, and of course "remembering" on demand (i.e., becoming searchable).

Ray Kurzweil, writing in Scientific American, says that "by 2019, a $1,000 computer will at least match the processing power of the human brain. By 2029, the software for intelligence will have been largely mastered, and the average personal computer will be equivalent to 1,000 brains."

We can only hope that computers with this much intelligence will be well-equipped with common sense and a healthy sense of ethics. We appear to be on the right track. Google co-founder Sergey Brin, speaking at the Search Engine Strategies 1999 conference, offered this optimistic prediction: "In the future, search engines should be as useful as HAL in the movie 2001: A Space Odyssey--but hopefully they won't kill people."

Search Engines & Directories

FAST
http://www.alltheweb.com

Oingo
http://www.oingo.com

Direct Hit
http://www.directhit.com

Open Directory Project
http://www.dmoz.org

Sites Using Open Directory Data
http://dmoz.org/Computers/Internet/WWW/Searching_the_Web/Directories/Open_Directory_Project/Sites_Using_ODP_Data/

Google
http://www.google.com

Search Tools & Utilities

Backflip
http://www.backflip.com

Excite Assistant
http://www.excite.com/assist/html/download/

AltaVista Discovery
http://discovery.altavista.com

GuruNet
http://www.gurunet.com

Flyswat
http://www.flyswat.com

Ejemoni
http://www.ejemoni.com/

Albert
http://www.albert-inc.com/

Simpli
http://www.simpli.com

Other References

Inktomi WebMap
http://www.inktomi.com/webmap/

Infoseek Distributed Search Patent
http://software.infoseek.com/patents/dist_search/

CYC
http://www.cyc.com/products2.html

HAL's Legacy, by Douglas Lenat
http://www.cyc.com/halslegacy.html

The Coming Merging of Mind and Machine, by Ray Kurzweil
http://www.sciam.com/1999/0999bionic/0999kurzweil.html

Carver Mead's Physics of Computation Group
http://www.pcmp.caltech.edu/

Chris Sherman (websearch.guide@about.com or csherman@searchwise.net) is the Web Search Guide for About.com, http://websearch.about.com. He holds an MA from Stanford University in Interactive Educational Technology and has worked in the Internet/Multimedia industry for two decades, currently as President of Searchwise.net, a Web consulting and training firm.

Comments? Email letters to the Editor at editor@infotoday.com.
