CONFERENCE CIRCUIT

The Ninth Annual Search Engine Meeting
 By Nancy Garman
 
 In Europe for the first time
  following a long run in Boston and one diversion to San Francisco, the Search
Engine Meeting, held April 19-20 in The Hague, Netherlands, had a distinctly
international flavor. The global mix of delegates reflected the event's European
location, although at the expense of the usual large American contingent.
  Conference organizer Harry Collier said that only a few Americans overcame
  the twin obstacles of distance and high exchange rates to attend the event.
  As a consequence, total attendance was down to around 100. Next year's meeting
  is scheduled for April 11 to 12 in Boston. If Collier can retain the international
  delegates and bring back the Americans, the 10th annual meeting could rebound
  successfully.
  The small attendee count encouraged participation as researchers and developers
  exchanged ideas, presented solutions, and discussed current research during
  sessions and social events. The Search Engine Meeting is a unique conference
  where the leading edges of search engine research and development, categorization,
  indexing, natural language, and computer science converge. At this event, academics
and practitioners talk the same language, and the theoretical foreshadows the
  operational.

Opening Keynote

In her opening keynote "Quantity Versus Quality," Karen Spärck Jones
  of Cambridge University asked, "Has 50 years of research about search resulted
  in anything more than the ability to find tens of thousands of references about
Britney Spears?" She contrasted the holy grail of search research, the
world of information at our fingertips assisted by intelligent systems, to
  the reality of Google's 55 million searches in 2003. "What can information
  and language-processing research do for Web search engines?" she asked.
  Jones then reviewed her research into computer support for human indexing.
  She said that current research efforts are attacking the challenges of the
  hidden Web and digital libraries. She believes that if you exploit the quantity
  correctly by using machine intelligence, you get quality.
  Jones said that Web search engines were developed by computer scientists
  independent of the research that she and others conducted from the 1950s to
  the 1970s. However, over the past 10 years this meeting and the growth of the
  Web have brought those worlds together. The result is a win-win for researchers,
  developers, implementers, and end users as well as a partial answer to Jones'
  call for better connections and more interaction between researchers and search
  engine developers.
The reality of search and the intersection of intelligent search systems
with human intervention were aptly illustrated in an exchange on the second day
of the conference between Jones and Martin Belam from the U.K.'s BBCi Search. This dynamic, plus
  the juxtaposition of intelligent indexing and auto-categorization research
  with the reality of Web searching, were major themes of this meeting. Despite
  their historically deep involvement in controlled vocabularies and indexing,
  librarians and information professionals are peripheral to this realm of search
  research.

Research Meets Reality

In his session "Human Intervention in the Search Process," Belam described
  BBCi's Best Links, a program in which a team constantly reviews and adjusts
  search queries so that results match customers' expectations. BBCi monitors
  the top search terms, checks the results, and puts a directory on top of the
  spidering to vary terms for context and adjust for misspellings, thus increasing
  recall and precision. For instance, when the space shuttle Columbia was
  in the news, the increase in the number of searches for "Colombia" did not
  indicate a spike in interest about the South American country, but rather a
  common misspelling. BBCi adjusted its directory during that period so that
  searches on "Colombia" returned hits about the space shuttle disaster.
  During the break following Belam's presentation, he and Jones discussed the
whys and why-nots of machine indexing and building directories based on human
  results monitoring. Jones offered assistance from the research community in
  automating the Best Links project. Research intersected with the real world
  as Jones scratched boxes and terms on the back of her conference notes, while
  Belam allowed that some of BBCi's monitoring might be automated. It's this
  level of personal networking that makes the Search Engine Meeting a special
  place for search engine developers, practitioners, and researchers.

Research on Search

On the first morning, Liz Liddy from Syracuse University and Donna Harman
  from TREC delivered the researchers' perspective on search. Liddy's current
research focuses on some of the more elusive aspects of text retrieval: identifying
not just the topic of a document but also the opinions and attitudes behind it.
  She showed some intriguing examples of affect-mining that can add value to
  retrieval. One was CiteSeer (http://www.neci.nj.nec.com/homepages/lawrence/citeseer.html),
  which groups together the context of citations to a given article. This allows
  researchers to easily see what's being said and why the article was cited.
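A rough sketch of the idea behind that citation-context grouping, assuming a toy corpus in which citations appear as bracketed keys, is shown below; the papers, markers, and pattern are illustrative, not CiteSeer's implementation.

    # Group the sentences that cite a given work, so a reader can see what is
    # being said about it. Corpus and citation format are illustrative only.
    import re
    from collections import defaultdict

    PAPERS = {
        "paperA": ["Our method extends the indexing scheme of [Smith99].",
                   "Unlike [Jones01], we do not rely on manual categories."],
        "paperB": ["The evaluation follows [Smith99], which introduced the test collection."],
    }

    contexts = defaultdict(list)
    for paper, sentences in PAPERS.items():
        for sentence in sentences:
            for cited in re.findall(r"\[(\w+)\]", sentence):
                contexts[cited].append((paper, sentence))

    # Every context in which [Smith99] is cited, grouped together.
    for paper, sentence in contexts["Smith99"]:
        print(paper, "->", sentence)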
  Harman reported on TREC's ongoing research projects sponsored by NIST, DARPA,
  and ARDA.
  Delivering the search engine developer's point of view, Prabhakar Raghavan
  from Verity said that given the vast amount of unstructured data in today's
  business organizations, there's an imperative to develop tools to create or
  extract and then exploit structure. He discussed the challenges of XML querying
and suggested that research needs to advance toward combining XML querying, text
retrieval, and information integration.
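As a simplified illustration of mixing a structural condition with a free-text one over extracted structure, the sketch below queries a small XML fragment with Python's standard library; the document and field names are invented for the example and do not come from Raghavan's talk.

    # Combine a structural condition (element path) with a free-text condition
    # (keyword match) over a toy XML fragment. Data is invented for illustration.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <contracts>
      <contract id="1"><party>Acme</party><clause>Payment due within 30 days.</clause></contract>
      <contract id="2"><party>Globex</party><clause>Termination requires 90 days notice.</clause></contract>
    </contracts>
    """)

    hits = [c.get("id") for c in doc.findall("contract")
            if "termination" in c.findtext("clause", default="").lower()]
    print(hits)  # ['2']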
  Offering a different perspective, Endeca's Peter Bell said that faceted navigation
  (or guided navigation, as his company calls it) is a multidimensional browse
  capability that can be more efficient than taxonomies. Suggesting that there's
  some implicit structure in most types of unstructured business documents, Bell
said that facets allow multiple sources of mixed content to coexist. He claimed
  that Endeca's "search plus browse" approach results in new insights into less-structured
  heterogeneous content.
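To make the idea concrete, here is a minimal sketch of facet filtering and counting over a handful of documents; it illustrates faceted (guided) navigation in general, not Endeca's product, and all field names and records are invented.

    # Minimal faceted-navigation sketch: filter by the facet values chosen so far,
    # then count the remaining values of each facet to guide the next choice.
    from collections import Counter

    DOCS = [
        {"title": "Q3 sales report", "type": "report", "region": "EMEA", "year": 2003},
        {"title": "Q3 sales deck",   "type": "slides", "region": "EMEA", "year": 2003},
        {"title": "Hiring policy",   "type": "policy", "region": "Americas", "year": 2004},
    ]

    def guided_navigation(docs, **selected):
        remaining = [d for d in docs if all(d.get(k) == v for k, v in selected.items())]
        counts = {facet: Counter(d[facet] for d in remaining)
                  for facet in ("type", "region", "year")}
        return remaining, counts

    docs, counts = guided_navigation(DOCS, region="EMEA")
    print([d["title"] for d in docs], counts["type"])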

The Multilingual Web

Conference co-chair David Evans kicked off a panel session on CLIR (Cross-Language
  Information Retrieval) by asking how big the problem is, whether commercial
  CLIR can work on the Web, and whether we can use the Web itself to improve
  CLIR. He cited an April 12 Newsweek article about search engine translations
  in which the author suggested that the song "The Girl from Ipanema" might not
have been a hit if songwriter Norman Gimbel had had to depend on the Google machine
  translator to translate the song's original Portuguese lyrics.
  The addition of 10 members to the European Union and the meeting's location
  in The Hague made multilingual search and retrieval an issue of more than just
  academic interest. Clearly, there's work to be done. Panelists Evans, Gregory
  Grefenstette from CEA, Joop van Gent and Piek Vossen from Irion, and Wessel
  Kraaij from TNO addressed various aspects of this challenge.
  Grefenstette said that English speakers are now a minority on the Web (35.8
  percent). He also discussed the results of his study to find out if the Web
  is still dominated by English-language content. This study used predictors
  to determine the frequency of English, Finnish, French, and German usage and
found that English remains the predominant language. However, Grefenstette
predicted that as broadband access grows, the share of Web text written in each
language will begin to match that language's share of the online population.
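Language identification of this kind can be done with surprisingly simple predictors. The sketch below guesses a page's language by counting very common function words; the word lists are tiny and purely illustrative, not the predictors used in Grefenstette's study.

    # Guess the language of a text by counting common function words per language.
    # The word lists are deliberately tiny and purely illustrative.
    PREDICTORS = {
        "english": {"the", "and", "of", "to"},
        "french":  {"le", "la", "et", "les"},
        "german":  {"der", "die", "und", "das"},
        "finnish": {"ja", "on", "ei", "että"},
    }

    def guess_language(text):
        words = text.lower().split()
        scores = {lang: sum(w in vocab for w in words) for lang, vocab in PREDICTORS.items()}
        return max(scores, key=scores.get)

    print(guess_language("der Hund und die Katze"))  # german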
Van Gent discussed the problems of selling CLIR products. He said that for
  most organizations, information retrieval, not language, is the first challenge.
  In addition, European governmental restrictions limit multilingual Web initiatives.
  Vossen delved into the nitty-gritty of what it takes to develop a cross-lingual
  retrieval system and CLIR semantics on the Web. He also discussed which strategies
  might be applicable in different circumstances.
Van Gent then covered Irion's commercial answers for CLIR, which work best
  in structured environments. They involve training an automatic classification
  system with a multilingual data set and stimulating users to add more words
  or phrases by adding a dialogue model to the classification system.
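The general shape of such a system, training a classifier on multilingual seed terms and letting a dialogue step fold user-supplied words back in, might look something like the sketch below. It is only a schematic illustration; the categories, terms, and function names are invented and do not describe Irion's product.

    # Schematic sketch: a multilingual term-based classifier plus a "dialogue" step
    # that lets users add words or phrases. All data and names are invented.
    from collections import defaultdict

    categories = defaultdict(set)
    categories["billing"].update({"invoice", "factuur", "rechnung"})  # EN / NL / DE seed terms
    categories["support"].update({"help", "hulp", "hilfe"})

    def classify(text):
        words = set(text.lower().split())
        scores = {cat: len(words & terms) for cat, terms in categories.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] else None

    def dialogue_add(category, new_terms):
        # When classification fails, ask the user for terms and add them to the category.
        categories[category].update(t.lower() for t in new_terms)

    print(classify("waar is mijn factuur"))  # billing
    dialogue_add("support", ["storing"])     # user teaches the system a new Dutch term
    print(classify("er is een storing"))     # support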
Kraaij attacked the language issue from a different angle, discussing how to mine
the Web for multilingual information, both finding multiple translations and dealing
with them. He concluded that transitive translation is a viable approach
  to CLIR.
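Transitive translation routes a translation through an intermediate pivot language when no direct resource exists. The sketch below composes two bilingual dictionaries through English; the dictionary entries are illustrative and not drawn from Kraaij's work.

    # Transitive (pivot) translation: compose two bilingual dictionaries through a
    # shared pivot language when no direct dictionary exists. Entries are illustrative.
    DUTCH_TO_ENGLISH = {"huis": ["house", "home"]}
    ENGLISH_TO_FINNISH = {"house": ["talo"], "home": ["koti"]}

    def transitive_translate(word, first_hop, second_hop):
        candidates = []
        for pivot in first_hop.get(word, []):
            candidates.extend(second_hop.get(pivot, []))
        return candidates  # ambiguity accumulates, so real systems weight and prune

    print(transitive_translate("huis", DUTCH_TO_ENGLISH, ENGLISH_TO_FINNISH))  # ['talo', 'koti']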

Search in the Enterprise

Late on the first afternoon, Sue Feldman from IDC foreshadowed the next morning's
  sessions in her presentation on enterprise search. As she described the information
  infrastructure she's seeing within organizations, she forecast the emergence
  of a new infrastructure or middleware layer that contains modules to acquire,
  manage, analyze, and create access to all kinds of information. She discussed
  the factors that are driving this trend and the next generation of search.
  On the horizon, Feldman sees linguistic capabilities embedded in other applications;
  rules and inference engines; interactive visualizations of information spaces,
  results, and relationships; and unified access to data plus content.
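One way to picture such a middleware layer is as a pipeline of separable modules, as in the sketch below; the module names and the toy keyword analysis are purely illustrative, not a description of any product Feldman mentioned.

    # Illustrative pipeline for an information-access middleware layer:
    # acquire raw content, analyze it, then expose a unified access structure.
    def acquire(sources):
        # Pull raw documents from files, feeds, databases, etc.
        return [{"id": i, "text": text} for i, text in enumerate(sources)]

    def analyze(docs):
        # Add metadata; here just a toy keyword extraction.
        for doc in docs:
            doc["keywords"] = sorted({w.lower().strip(".,") for w in doc["text"].split() if len(w) > 5})
        return docs

    def build_access_layer(docs):
        # Unified access: a simple inverted index over the analyzed content.
        index = {}
        for doc in docs:
            for kw in doc["keywords"]:
                index.setdefault(kw, []).append(doc["id"])
        return index

    index = build_access_layer(analyze(acquire(["Quarterly revenue exceeded forecast.",
                                                "Contract renewal requires approval."])))
    print(index.get("contract"))  # [1]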

Search Gets Real

Steve Arnold began the second day of the conference with his presentation "Social
  Software and New Search," an outline of the search engine landscape. He talked
  about the "big four" in site search: Verity, Hummingbird, Convera, and FAST.
  Arnold said that no one size fits all, and he discussed newcomers such as Arikus,
  Delphes, and Odyssey ISYS. He believes that Lextek and dtSearch are companies
  worth watching.
As Arnold discussed the development platforms, he contrasted the old Sun/Microsoft
approach with the Google-influenced perspective: TCP/IP for everything. Picking up
on the social networking theme, Arnold said that Google's new mail function moved
the company squarely into the realm of social interaction, an area in which Yahoo!
has already become a key player.
  Later on the second morning, Kasper Vad, IT manager at InfoMedia Huset in
  Denmark, and Martin Belam addressed search engines on a practical level. This
  was much to the relief of the corporate IT managers in the audience who found
  the previous day's sessions interesting but academic.
  Vad walked the audience through his selection and deployment of an in-house
  search platform. He talked about how he managed the process, dealt with expectations,
  and automatically converted 99.5 percent of his existing data. He reported
that auto-categorization allowed him to process eight times as many documents, a
  huge increase in productivity, although at an equally huge cost. Vad said that
  he had not involved librarians in the search engine selection process. This
  affirmed Karen Spärck Jones' earlier observation that IR research and
  search engine development are the domains of computer scientists, not librarians,
  who carry the perceived baggage of traditional restraints.

Image Search

Ethan Munson (University of Wisconsin), Alan Smeaton (Dublin City University),
  and Sebastian Gilles (LTU Technologies) wrapped up the last afternoon with
discussions about image searching, access to video archives, and visual-content-analysis
  technology that put a visual face on search-and-retrieval development.
  In his studies, Munson confirmed that "image file name," "page title," and "page
  file name" are the most important fields for accurately retrieving images on
  the Web. The conference's underlying collaborative-research message was evoked
  when an audience member asked Munson if his data set would be retained and
  made available to researchers. Several attendees observed that the information
  would be valuable to others working in the field.
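A minimal sketch of how those findings might translate into a ranking heuristic, weighting matches in the image file name most heavily, is shown below; the records, weights, and field names are invented for illustration and are not Munson's data.

    # Rank images by matching query terms against the three fields Munson found
    # most useful. Records, field names, and weights are invented for illustration.
    IMAGES = [
        {"image_file": "eiffel-tower-night.jpg", "page_title": "Paris travel guide", "page_file": "paris.html"},
        {"image_file": "img_0042.jpg", "page_title": "Eiffel Tower history", "page_file": "eiffel.html"},
    ]

    WEIGHTS = {"image_file": 3, "page_title": 2, "page_file": 1}

    def score(record, query):
        terms = query.lower().split()
        return sum(weight for field, weight in WEIGHTS.items()
                   for t in terms if t in record[field].lower())

    query = "eiffel tower"
    for rec in sorted(IMAGES, key=lambda r: score(r, query), reverse=True):
        print(score(rec, query), rec["image_file"])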
  Smeaton's experiences proved streaming media's viability as he described
  how he uses the Físchlár News Stories site to keep up with newscasts
  while traveling. He showed how search could identify and personalize the retrieval
  of video images.

Partnership Opportunities

Several high-level product presentations were interspersed with the conference
  research papers. These were really technical reports from vendors that are
conducting important research and development in the arena of search, retrieval,
and categorization.
  One standout was Nigel Hamilton, CEO and founder of Turbo10.com, who put
  his new metasearch engine through its paces for the audience and then pointed
  out several unsolved issues. He invited potential partners to offer their development
  skills and help resolve these problems. Ask Jeeves, Convera, and Basis Technology
  also delivered important briefings on their product-development efforts.

Back to Boston in 2005

Google's IPO doesn't mean that search engines have matured and search research
  has slowed to a standstill. It's quite the opposite: The scale of the Web offers
  new challenges and research topics. Ongoing work on text analytics, retrieval
  issues, redundancy, reputation engines, faceted navigation, machine translation,
and much more suggests that there's no shortage of topics for next year's Search
  Engine Meeting.

Nancy Garman is Information Today, Inc.'s director of conference program planning.
Her e-mail address is ngarman@infotoday.com.
 