Does Taxonomy Matter in a New World of Search and Discovery?
Suzanne BeDell and Libby Trudell
In a Google world, where people are accustomed to entering a few keywords into a web search box and retrieving relevant answers, even information professionals wonder if the traditional library information sources’ reliance on controlled vocabularies remains a viable, worthwhile, and cost-effective strategy.
Similar concerns arise with enterprise search. As many organizations begin, or extend, the process of building or choosing new platforms and applications to support search across the enterprise, information professionals assess the role of controlled vocabularies. How important are they in this world of keyword search?
At the SLA 2010 Annual Conference, a panel discussion that I led took on the pivotal question, “Does Taxonomy Matter in a New World of Search and Discovery?” Panelists (Jabe Wilson, Elsevier; Tim Mohler, Lexalytics; and Tyron Stading, Innography) considered how the industry is evolving and presented their opinions on the value of investing in the creation of structured data. Do taxonomies still add value when keyword searching seems sufficient to many end users?
Information professionals and librarians rely on classification and controlled vocabularies to aid precision search; abstract and index (A&I) publishers make investments in indexing and thesauri to add value to their products. Given the costs of quality indexing, is an alternative technology available to provide similar value for achieving greater precision in search results? Many organizations are experimenting with semantic technologies in hopes of automatically extracting the meaning inherent in documents and supplementing, or even replacing, the human editorial process.
Although information professionals and many publishers believe in the power of indexing, most end users remain satisfied with simple text search and don’t recognize a need for controlled vocabularies. This article will look at the current state of the industry and will explore the outlook for blending taxonomy and other tools in the new world of search.
INDUSTRY GROWTH DRIVERS
Years ago, we used to speak of primary, secondary, and tertiary publishing (primary research journals, abstracting and indexing databases, and aggregated online search services) as driving the core of the information publishing industry.
All three elements remain important, but increasingly, growth comes from a new layer of analysis and data mining. This covers many applications. At its core, however, lies the use of current and past content, in conjunction with statistical, structural, or other analytics models and intelligence mining methods. Examples include Collexis (acquired by Elsevier in June 2010), Innography, LexisNexis TotalPatent, and Thomson Reuters’ ProfSoft. These hot new tools are driven by structured content.
Where does keyword search fit into this landscape? It’s the most common form of text search on the web. At its simplest, a keyword is any word on a webpage that tells a user something about the subject and content of the page.
Unless the author of a web document specifies the keywords, usually with metatags, it’s up to the search engine to determine them. To accomplish this, search engines pull out and index words that appear to be significant. Because keyword search does not require human indexing or mapping of vocabularies, it is one way to search across databases that have individualized taxonomies.
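To make the idea of a search engine pulling out "significant" words concrete, here is a minimal sketch in Python. It assumes the simplest possible notion of significance, raw term frequency after common words are dropped; production search engines use far more sophisticated weighting. The stopword list, the function name extract_keywords, and the sample page text are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real engines use much larger ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with", "that"}

def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    # Count every word that is not a stopword and is longer than two letters.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(top_n)]

# Made-up page text for illustration.
page = ("Taxonomy and search: a taxonomy helps search engines, "
        "and a good taxonomy improves discovery.")
print(extract_keywords(page, top_n=2))  # ['taxonomy', 'search']
```

Note that nothing here requires a human indexer or a shared vocabulary, which is why this style of indexing can run across databases that each have their own taxonomy.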
The next-generation Dialog beta version search engine puts filters on the left-hand side. One list shows keywords automatically extracted from the text of the documents in the set by the search engine software. The other comprises subject headings that were added through an editorial or indexing process. Although there is some overlap, there are different words on each list, and these can work together to provide a better search result than either list would have done alone.
Keyword search works well on unstructured content. Analytics can be used to identify nuggets of knowledge buried in unstructured content. This is particularly useful for content found both on the open web and behind enterprise firewalls. In the first step of an analytics process, a document is scanned using natural language processing (NLP) to identify meaningful terms and, if an ontology is being used, related concepts. According to Outsell (David Bousfield, “From ‘Text Mine!’ to Text Mining: STM Text Analytics Comes of Age,” Insights, May 10, 2010; for sale at www.outsellinc.com/store/insights/11189), the lack of suitable lexicons is one of the main factors limiting the expansion of this sector.
Conversely, the better defined the vocabulary, the more specific the queries can be. This is why we see many examples of NLP technology appearing in the STM publishing world. Linguamatics, for example, is offering solutions for healthcare and life sciences that can support very specific queries such as, “Which biological targets does enzyme xyz interact with?”
Publishers such as Elsevier, Nature Publishing Group, and the Royal Society of Chemistry use entity extraction to enhance digital articles. Nature Chemistry is using Temis’ technology to identify chemicals and then link to sources of molecular information stored by PubChem and ChemSpider.
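Entity extraction against a well-defined vocabulary can be sketched as a dictionary lookup. The Python example below is only an illustration under stated assumptions: the lexicon, its made-up "chem:" identifiers, and the tag_entities function are hypothetical, and it does not represent how Temis, Linguamatics, or any publisher actually implements the technique. It shows the basic idea that different surface forms in the text resolve to one preferred term that can then be linked to an external record.

```python
import re

# Hypothetical lexicon: surface form -> (preferred term, made-up identifier).
LEXICON = {
    "aspirin": ("Aspirin", "chem:0001"),
    "acetylsalicylic acid": ("Aspirin", "chem:0001"),
    "caffeine": ("Caffeine", "chem:0002"),
}

def tag_entities(text):
    """Return sorted (start, matched text, preferred term, identifier) tuples."""
    hits = []
    for surface, (preferred, ident) in LEXICON.items():
        # Match the surface form as a whole word or phrase, ignoring case.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), preferred, ident))
    return sorted(hits)

sentence = "Acetylsalicylic acid, sold as aspirin, is often combined with caffeine."
for start, matched, preferred, ident in tag_entities(sentence):
    print(start, matched, "->", preferred, ident)
```

Both "Acetylsalicylic acid" and "aspirin" resolve to the same preferred term, which is the kind of normalization a keyword index alone does not provide.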
Steps for NLP involve recognizing parts of speech and taking into account contextual information such as proximity. Increasingly, this relationship information is captured as a Resource Description Framework (RDF) triple. RDF extends the linking structure of the web by using uniform resource identifiers (URIs) to name the relationship between things as well as the two ends of the link. Yahoo! and Google now include triples in their indexing.
For example, markup can state explicitly that joesmith.org is Joe’s homepage. Instead of inferring this, the search engine knows for sure that joesmith.org is Joe’s homepage, that the image on the page is a picture of Joe, and what rights are associated with these elements.
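The Joe example can be written out as subject-predicate-object triples. The following Python sketch uses plain tuples rather than an RDF library, and every URI other than joesmith.org is an assumption made for illustration: the example.org vocabulary, the URI standing for Joe himself, the image filename, and the license address are all hypothetical. A real page would more likely reuse a published vocabulary.

```python
# Hypothetical vocabulary namespace; a real deployment would reuse a published one.
EX = "http://example.org/vocab#"
JOE = "http://joesmith.org/#joe"            # assumed URI standing for the person
PHOTO = "http://joesmith.org/joe.jpg"       # assumed image location

# Each triple names both ends of a link and the relationship between them.
triples = [
    (JOE, EX + "homepage", "http://joesmith.org/"),
    (JOE, EX + "depiction", PHOTO),
    (PHOTO, EX + "license", "http://example.org/licenses/by"),
]

def objects(subject, predicate):
    """Return every object linked to subject by predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(JOE, EX + "homepage"))  # ['http://joesmith.org/']
```

Because the relationship itself has a name, a search engine can answer "what is Joe's homepage?" by lookup rather than by guessing from the page text.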
This brings us to semantic technology. As noted in a recent report from Outsell (Marc Strohlein, “2010—The Year of Reckoning: Five Crucial Technologies for Information Publishing,” Outsell CEO Topics, Feb. 2, 2010; for sale at www.outsellinc.com/store/products/908), semantic technology goes beyond descriptive tagging and “whatness” to encoding meaning extracted from content to infer “aboutness.” This is done using a variety of tools including entity extraction, classification, and categorization. Each of these elements is enhanced with the availability of appropriate ontologies.
These concepts are captured in the accompanying ontology spectrum graphic, which shows how increasing the metadata increases the search capability along a horizontal axis beginning at recovery, or the simplest form of indexing, to discovery, intelligence, answers, and reasoning. Many core concepts for this graphic were derived from presentations and articles by Leo Obrst, MITRE Corp., when he discusses the ontology spectrum and semantic models.
As you follow the spectrum, you see that integral to the concept of the semantic web is an ontology. Ontology techniques and tools, such as the semantic web standard OWL language, enable relationships between vocabulary terms to be formally expressed and computed in ways that are valuable for efficiency and accuracy in metadata use. This spectrum argues that to go beyond simple recovery to intelligence and, ultimately, answers, controlled vocabularies are necessary. The mapping of these vocabularies will allow us to move beyond conceptual models to the semantic web.
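One small way to see what "computing" over term relationships means is query expansion across a hierarchy. The Python sketch below assumes a toy thesaurus with only broader/narrower links, which is far simpler than what OWL can express; the terms, the NARROWER table, and the expand function are hypothetical and serve only to illustrate the principle.

```python
# Hypothetical thesaurus fragment: each term maps to its narrower terms.
NARROWER = {
    "enzyme": ["kinase", "protease"],
    "kinase": ["tyrosine kinase"],
}

def expand(term):
    """Return term plus all of its narrower terms, walking the hierarchy breadth-first."""
    expanded, queue = [], [term]
    while queue:
        current = queue.pop(0)
        if current not in expanded:
            expanded.append(current)
            # Follow the links one level further down.
            queue.extend(NARROWER.get(current, []))
    return expanded

print(expand("enzyme"))  # ['enzyme', 'kinase', 'protease', 'tyrosine kinase']
```

A search for "enzyme" expanded this way also retrieves documents indexed only under "tyrosine kinase," a connection that keyword matching alone would miss.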
To continue to bring value to our customers, we’ve committed to moving beyond retrieval to providing answers at Dialog and ProQuest. We’re seeking tools and technologies that link and leverage the controlled vocabularies provided by our information publishers to find hidden connections between companies, people, and technologies. We’re also looking for the best ways to extract keywords from natural language to enable access paths that make sense to a wide variety of searchers. It’s exciting to visualize a future in which both structured and unstructured content can be more valuable and more useful.
The panel discussion at SLA surfaced many important issues surrounding the value of taxonomies and ontologies. Each panelist had a slightly different take. But overall, the answer about whether taxonomies matter in this new world of search and discovery was a resounding yes.