


Information Today

Vol. 21 No. 6 — June 2004

CONFERENCE CIRCUIT
The Ninth Annual Search Engine Meeting
By Nancy Garman

In Europe for the first time following a long run in Boston and one diversion to San Francisco, the Search Engine Meeting, held April 19-20 in The Hague, Netherlands, had a distinctly international flavor. The global mix of delegates reflected the event's European location, though at the expense of the usual large American contingent.

Conference organizer Harry Collier said that only a few Americans overcame the twin obstacles of distance and high exchange rates to attend the event. As a consequence, total attendance was down to around 100. Next year's meeting is scheduled for April 11 to 12 in Boston. If Collier can retain the international delegates and bring back the Americans, the 10th annual meeting could rebound successfully.

The small attendee count encouraged participation as researchers and developers exchanged ideas, presented solutions, and discussed current research during sessions and social events. The Search Engine Meeting is a unique conference where the leading edges of search engine research and development, categorization, indexing, natural language processing, and computer science converge. At this event, academics and practitioners talk the same language, and the theoretical foreshadows the operational.

Opening Keynote

In her opening keynote "Quantity Versus Quality," Karen Spärck Jones of Cambridge University asked, "Has 50 years of research about search resulted in anything more than the ability to find tens of thousands of references about Britney Spears?" She contrasted the holy grail of search research—the world of information at our fingertips assisted by intelligent systems—to the reality of Google's 55 million searches in 2003. "What can information and language-processing research do for Web search engines?" she asked.

Jones then reviewed her research into computer support for human indexing. She said that current research efforts are attacking the challenges of the hidden Web and digital libraries. She believes that if you exploit the quantity correctly by using machine intelligence, you get quality.

Jones said that Web search engines were developed by computer scientists independent of the research that she and others conducted from the 1950s to the 1970s. However, over the past 10 years this meeting and the growth of the Web have brought those worlds together. The result is a win-win for researchers, developers, implementers, and end users as well as a partial answer to Jones' call for better connections and more interaction between researchers and search engine developers.

The reality of search and the intersection of intelligent search systems with human invention were aptly illustrated on the second day of the conference in an exchange between Jones and Martin Belam of the U.K.'s BBCi Search. This dynamic and the juxtaposition of intelligent indexing and auto-categorization research with the reality of Web searching were major themes of this meeting. Despite their historically deep involvement in controlled vocabularies and indexing, librarians and information professionals are peripheral to this realm of search research.

Research Meets Reality

In his session "Human Intervention in the Search Process," Belam described BBCi's Best Links, a program in which a team constantly reviews and adjusts search queries so that results match customers' expectations. BBCi monitors the top search terms, checks the results, and puts a directory on top of the spidering to vary terms for context and adjust for misspellings, thus increasing recall and precision. For instance, when the space shuttle Columbia was in the news, the increase in the number of searches for "Colombia" did not indicate a spike in interest about the South American country, but rather a common misspelling. BBCi adjusted its directory during that period so that searches on "Colombia" returned hits about the space shuttle disaster.

During the break following Belam's presentation, he and Jones discussed the whys and why-nots of machine indexing and building directories based on human results monitoring. Jones offered assistance from the research community in automating the Best Links project. Research intersected with the real world as Jones sketched boxes and terms on the back of her conference notes, while Belam allowed that some of BBCi's monitoring might be automated. It's this level of personal networking that makes the Search Engine Meeting a special place for search engine developers, practitioners, and researchers.

Research on Search

On the first morning, Liz Liddy from Syracuse University and Donna Harman from TREC delivered the researchers' perspective on search. Liddy's current research focuses on some of the elusive aspects of textual retrieval: identifying not just the topic of a text but also the opinions and attitudes behind it. She showed some intriguing examples of affect mining that can add value to retrieval. One was CiteSeer (http://www.neci.nj.nec.com/homepages/lawrence/citeseer.html), which groups together the contexts of citations to a given article, allowing researchers to easily see what's being said and why the article was cited. Harman reported on TREC's ongoing research projects sponsored by NIST, DARPA, and ARDA.
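
A rough illustration of the citation-context idea, using made-up papers and a simple sentence splitter rather than CiteSeer's own methods, looks like this:

    import re
    from collections import defaultdict

    # Toy corpus: each citing paper is a string with bracketed citation keys.
    CITING_PAPERS = {
        "paperA": "We build on the ranking model of [Smith99]. Unlike [Jones01], we use phrases.",
        "paperB": "[Smith99] reported poor recall on short queries, which motivates our work.",
    }

    def citation_contexts(corpus, key):
        """Group the sentences that mention a citation key, indexed by citing paper."""
        contexts = defaultdict(list)
        for paper, text in corpus.items():
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if key in sentence:
                    contexts[paper].append(sentence)
        return dict(contexts)

    print(citation_contexts(CITING_PAPERS, "[Smith99]"))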

Delivering the search engine developer's point of view, Prabhakar Raghavan from Verity said that given the vast amount of unstructured data in today's business organizations, there's an imperative to develop tools that create or extract structure and then exploit it. He discussed the challenges of XML querying and suggested that research needs to bring XML querying, text retrieval, and information integration closer together.

Offering a different perspective, Endeca's Peter Bell said that faceted navigation (or guided navigation, as his company calls it) is a multidimensional browse capability that can be more efficient than taxonomies. Suggesting that there's some implicit structure in most types of unstructured business documents, Bell said that facets allow multiple sources of mixed content to coexist. He claimed that Endeca's "search plus browse" approach yields new insights into less-structured heterogeneous content.
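
A minimal sketch of faceted navigation, assuming a toy document set and facet values invented for illustration (not Endeca's implementation), might look like this: each item carries whatever facet values it happens to have, drilling down is just successive filtering, and the remaining facet counts tell the interface what to offer next.

    from collections import Counter

    # Mixed content with whatever facet values each item happens to have.
    DOCS = [
        {"title": "Q3 sales memo",   "facets": {"type": "memo",   "region": "EMEA", "year": 2003}},
        {"title": "Server manual",   "facets": {"type": "manual", "product": "X200"}},
        {"title": "EMEA price list", "facets": {"type": "list",   "region": "EMEA", "year": 2004}},
    ]

    def drill_down(docs, **selected):
        """Keep documents matching every selected facet value (one guided-navigation step)."""
        return [d for d in docs if all(d["facets"].get(k) == v for k, v in selected.items())]

    def facet_counts(docs):
        """Count remaining values per facet, so the UI can show what to narrow by next."""
        counts = {}
        for d in docs:
            for facet, value in d["facets"].items():
                counts.setdefault(facet, Counter())[value] += 1
        return counts

    remaining = drill_down(DOCS, region="EMEA")
    print([d["title"] for d in remaining])
    print(facet_counts(remaining))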

The Multilingual Web

Conference co-chair David Evans kicked off a panel session on CLIR (Cross-Language Information Retrieval) by asking how big the problem is, whether commercial CLIR can work on the Web, and whether we can use the Web itself to improve CLIR. He cited an April 12 Newsweek article about search engine translations in which the author suggested that the song "The Girl from Ipanema" might not have been a hit if songwriter Norman Gimbel had had to depend on the Google machine translator to render the song's original Portuguese lyrics.

The addition of 10 members to the European Union and the meeting's location in The Hague made multilingual search and retrieval an issue of more than just academic interest. Clearly, there's work to be done. Panelists Evans, Gregory Grefenstette from CEA, Joop van Gent and Piek Vossen from Irion, and Wessel Kraaij from TNO addressed various aspects of this challenge.

Grefenstette said that English speakers are now a minority on the Web (35.8 percent). He also discussed the results of his study into whether the Web is still dominated by English-language content. The study used predictors to estimate the frequency of English, Finnish, French, and German usage and found that English remains the predominant language. However, Grefenstette predicted that as broadband access grows, the distribution of Web text across languages will begin to mirror the distribution of the online language populations.
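
The general approach can be sketched as follows; the "predictor" words and corpus below are invented for illustration and are not Grefenstette's actual predictors or figures:

    # Estimate language shares of a text collection by counting words that are
    # highly characteristic of each language, then normalizing the hit counts.
    PREDICTORS = {
        "English": {"the", "and", "with"},
        "French":  {"les", "des", "avec"},
        "German":  {"und", "nicht", "eine"},
        "Finnish": {"ja", "että", "mutta"},
    }

    def language_shares(documents):
        """Return each language's share of predictor-word hits across the documents."""
        hits = {lang: 0 for lang in PREDICTORS}
        for doc in documents:
            tokens = doc.lower().split()
            for lang, words in PREDICTORS.items():
                hits[lang] += sum(1 for t in tokens if t in words)
        total = sum(hits.values()) or 1
        return {lang: round(count / total, 3) for lang, count in hits.items()}

    print(language_shares(["the cat and the dog", "les chats avec des chiens", "und nicht eine"]))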

Van Gent discussed the problems of selling CLIR products. He said that for most organizations, information retrieval, not language, is the first challenge. In addition, European governmental restrictions limit multilingual Web initiatives. Vossen delved into the nitty-gritty of what it takes to develop a cross-lingual retrieval system and CLIR semantics on the Web, and he discussed which strategies might be applicable in different circumstances.

Van Gent then covered Irion's commercial answers for CLIR, which work best in structured environments. These involve training an automatic classification system on a multilingual data set and adding a dialogue model that prompts users to supply additional words or phrases.

Kraaij attacked the language issue from a different angle, discussing how to mine the Web for multilingual information by both finding multiple translations and dealing with them. He concluded that transitive translation is a viable approach to CLIR.
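
Transitive translation simply composes two bilingual dictionaries through a pivot language. The toy Dutch-to-French example below, with invented dictionaries rather than TNO's resources, shows the idea and hints at why real systems also weight and prune the candidates this composition produces:

    # Toy transitive (pivot) translation: Dutch -> English -> French.
    NL_EN = {"fiets": ["bicycle", "bike"], "huis": ["house"]}
    EN_FR = {"bicycle": ["bicyclette", "vélo"], "bike": ["vélo"], "house": ["maison"]}

    def transitive_translate(word, first, second):
        """Compose two bilingual dictionaries through a pivot language."""
        targets = []
        for pivot in first.get(word, []):
            for target in second.get(pivot, []):
                if target not in targets:
                    targets.append(target)
        return targets

    print(transitive_translate("fiets", NL_EN, EN_FR))   # ['bicyclette', 'vélo']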

Search in the Enterprise

Late on the first afternoon, Sue Feldman from IDC foreshadowed the next morning's sessions in her presentation on enterprise search. As she described the information infrastructure she's seeing within organizations, she forecast the emergence of a new infrastructure or middleware layer that contains modules to acquire, manage, analyze, and create access to all kinds of information. She discussed the factors that are driving this trend and the next generation of search. On the horizon, Feldman sees linguistic capabilities embedded in other applications; rules and inference engines; interactive visualizations of information spaces, results, and relationships; and unified access to data plus content.
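
One hypothetical way to picture such a layer is as a chain of acquire, manage, analyze, and access modules. Everything in the sketch below is invented for illustration; it is not drawn from any product Feldman described.

    def acquire(sources):
        """Pull raw text from heterogeneous sources (files, feeds, databases)."""
        return [{"id": i, "text": t, "meta": {}} for i, t in enumerate(sources)]

    def manage(docs):
        """Normalize and deduplicate before analysis."""
        seen, kept = set(), []
        for d in docs:
            key = d["text"].strip().lower()
            if key not in seen:
                seen.add(key)
                kept.append(d)
        return kept

    def analyze(docs):
        """Attach derived metadata; a token count stands in for real linguistic analysis."""
        for d in docs:
            d["meta"]["tokens"] = len(d["text"].split())
        return docs

    def access(docs, query):
        """Unified access point: naive keyword match over the managed content."""
        return [d for d in docs if query.lower() in d["text"].lower()]

    results = access(analyze(manage(acquire(["Sales rose in EMEA", "sales rose in emea", "Server manual"]))), "sales")
    print(results)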

Search Gets Real

Steve Arnold began the second day of the conference with his presentation "Social Software and New Search," an outline of the search engine landscape. He talked about the "big four" in site search: Verity, Hummingbird, Convera, and FAST. Arnold said that no one size fits all, and he discussed newcomers such as Arikus, Delphes, and Odyssey ISYS. He believes that Lextek and dtSearch are companies worth watching.

As Arnold discussed the development platforms, he contrasted the old Sun/Microsoft approach with the Google-influenced perspective: TCP/IP for everything. Picking up on the social networking theme, Arnold said that Google's new mail function moved the company squarely into the realm of social interaction, an area in which Yahoo! has already become a key player.

Later on the second morning, Kasper Vad, IT manager at InfoMedia Huset in Denmark, and Martin Belam addressed search engines on a practical level, much to the relief of the corporate IT managers in the audience who had found the previous day's sessions interesting but academic.

Vad walked the audience through his selection and deployment of an in-house search platform. He talked about how he managed the process, dealt with expectations, and automatically converted 99.5 percent of his existing data. He reported that auto-categorization allowed him to process eight times as many documents, a huge increase in productivity, although at an equally huge cost. Vad said that he had not involved librarians in the search engine selection process, which affirmed Karen Spärck Jones' earlier observation that IR research and search engine development are the domains of computer scientists, not librarians, who carry the perceived baggage of traditional constraints.

Image Search

Ethan Munson (University of Wisconsin), Alan Smeaton (Dublin City University), and Sebastian Gilles (LTU Technologies) wrapped up the last afternoon with discussions of image searching, access to video archives, and visual content analysis, technologies that put a visual face on search-and-retrieval development.

In his studies, Munson confirmed that "image file name," "page title," and "page file name" are the most important fields for accurately retrieving images on the Web. The conference's underlying collaborative-research message was evoked when an audience member asked Munson if his data set would be retained and made available to researchers. Several attendees observed that the information would be valuable to others working in the field.
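
To make the finding concrete, the sketch below scores Web images by query matches in those three fields; the weights and sample data are placeholders invented for illustration, not values from Munson's study.

    # Score images by matches in the three fields Munson found most useful.
    FIELD_WEIGHTS = {"image_file_name": 3.0, "page_title": 2.0, "page_file_name": 1.5}

    IMAGES = [
        {"image_file_name": "eiffel_tower.jpg", "page_title": "Paris landmarks", "page_file_name": "paris.html"},
        {"image_file_name": "img0042.jpg",      "page_title": "Holiday photos",  "page_file_name": "eiffel.html"},
    ]

    def score(image, query_terms):
        """Sum field weights for every query term found in each field."""
        total = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            text = image.get(field, "").lower()
            total += weight * sum(1 for term in query_terms if term in text)
        return total

    query = ["eiffel"]
    for img in sorted(IMAGES, key=lambda i: score(i, query), reverse=True):
        print(img["image_file_name"], score(img, query))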

Smeaton demonstrated streaming media's viability as he described how he uses the Físchlár News Stories site to keep up with newscasts while traveling. He showed how search can identify and personalize the retrieval of video images.

Partnership Opportunities

Several high-level product presentations were interspersed with the conference research papers. These were, in effect, technical reports from vendors conducting significant research and development in search, retrieval, and categorization.

One standout was Nigel Hamilton, CEO and founder of Turbo10.com, who put his new metasearch engine through its paces for the audience and then pointed out several unsolved issues. He invited potential partners to offer their development skills and help resolve these problems. Ask Jeeves, Convera, and Basis Technology also delivered important briefings on their product-development efforts.

Back to Boston in 2005

Google's IPO doesn't mean that search engines have matured and search research has slowed to a standstill. Quite the opposite: The scale of the Web offers new challenges and research topics. Ongoing work on text analytics, retrieval issues, redundancy, reputation engines, faceted navigation, machine translation, and much more suggests that there's no shortage of topics for next year's Search Engine Meeting.


Nancy Garman is Information Today, Inc.'s director of conference program planning. Her e-mail address is ngarman@infotoday.com.