Online Magazine

Vol. 31 No. 3 — May/June 2007

Information Professionals in the Text Mine
By Kathryn A. Lavengood and Pam Kiser

Text mining is growing rapidly as a technology for analyzing large volumes of unstructured textual documents. With the “information explosion” an accepted fact of living in the 21st century, those charged with maintaining awareness of developments in a given area of expertise risk overlooking critical data. Everyone, not just information professionals, has been affected by the exponentially increasing volumes of information available, as well as by changing attitudes toward electronic information.

With access to more information freely available, and with the help of Google and other search engines, patrons have become “information consumers” with very high expectations (access to multiple sources with a Google-like search box). As information delivery via email, blogs, wikis, and alerts expands, information consumers turn to the information professional with different demands. Today, they do not want to be pointed to a reference or an article; they expect highly relevant information specific to their interests that has been analyzed, filtered, and summarized for them personally. They need to analyze much larger sets of data, do trend analyses, and visualize the results in new ways.

In response, a number of vendors have developed sophisticated commercially available text-mining applications. Many companies are beginning to look at ways to apply text mining to current business problems and to increase productivity. Yet text mining is frequently viewed as an information technology tool—build it, and people will use it. While there is a definite need for a solid technical infrastructure to support these tools, there is also a need for a semantic infrastructure that focuses on information quality and decision support. The information professional is ideally suited for this latter role.


One of the most promising technologies being developed to address the information problems of today is text mining. Text mining grew out of computational linguistics (also known as NLP, or natural language processing). The ultimate goal of computational linguistics is to develop mechanical language analysis tools that use statistical techniques to interpret and assign meaning to parts of the text. This gives researchers the opportunity to review larger and larger volumes of text from multiple sources without having to read each and every line, yet still “discover knowledge.”

The interpretation of text is just the first step in making the information usable. Another key part is then organizing the resulting “text pieces” into some form of usable network. This is addressed by building taxonomies and ontologies that can be navigated to explore specific topics of interest. Finally, the results must be output in a format that can be interpreted and lead to knowledge discovery.


We generally think of text mining as being used in one of three ways:

1. Seeing the big picture (clustering)
2. Finding answers to very specific questions (question answering)
3. Hypothesis generation (concept linkages)

1. Clustering

“Seeing the big picture” refers to the possibilities of viewing large amounts of information from a high-level perspective—seeing the forest rather than the trees. Looking at large amounts of information in the aggregate can reveal interesting trends or arrangements of data that can point the researcher in a new direction.

An example of this can be shown by using a clustering visualization that appears below. Knowing nothing about Martha Stewart, a researcher could quickly look at a diagram and use the clusters to focus in on each of the aspects related to her.
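To make the idea of clustering concrete, here is a minimal sketch in Python. It is not how any particular commercial tool works. It assumes each document is reduced to a set of words and groups documents whose word overlap (Jaccard similarity) exceeds a threshold. The function names, the threshold, and the four sample snippets are all invented for illustration.

```python
def jaccard(a, b):
    """Share of terms two documents have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering: a document joins the first cluster
    whose seed document is similar enough, otherwise it starts a new one."""
    clusters = []  # each cluster is a list of (doc_id, term_set) pairs
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        for members in clusters:
            seed_terms = members[0][1]
            if jaccard(terms, seed_terms) >= threshold:
                members.append((doc_id, terms))
                break
        else:
            # No existing cluster was close enough.
            clusters.append([(doc_id, terms)])
    return [[doc_id for doc_id, _ in members] for members in clusters]


# Invented snippets, for illustration only.
docs = {
    "d1": "recipe for holiday cookies and cake",
    "d2": "holiday cake recipe with frosting",
    "d3": "company stock price fell sharply",
    "d4": "stock price of the company rose",
}
groups = cluster(docs)
print(groups)
```

Here the two cooking snippets land in one group and the two stock snippets in another. Real tools use far richer linguistic features and display the groups visually, but the principle of grouping by shared content is the same.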

2. Question Answering (Sophisticated Search)

Another potential application of text mining is investigating very specific questions at a level of detail not possible with standard text-searching techniques. Such a question might be “Which media companies have made acquisitions larger than $500 million in the past 10 years?” Even the best information-retrieval techniques (Boolean operators, MeSH descriptors) cannot get to the sentence or phrase level and/or identify a specific verbal connection (acquisition) between two companies or rank by acquisition price.

An example of this appears in the table below. While this search could have been run in a traditional Boolean search system, the searcher would need to enter the names of the companies manually, indicate company type = media, and manually remove acquisitions with purchase prices under $500 million. Finally, the results would have to be organized manually. Using a text-mining approach, media companies could be tagged as such, and the results could be output into a spreadsheet that is easily browsed for the information of interest.
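Once a text-mining tool has tagged companies by type and extracted acquisition facts, answering the question becomes a simple filter over structured records. The following Python sketch assumes such records already exist. The `Acquisition` fields, the company names, and the figures are all invented for illustration, and the cutoff year reflects "the past 10 years" as of this writing.

```python
from dataclasses import dataclass


@dataclass
class Acquisition:
    acquirer: str
    acquirer_type: str         # entity tag assigned during text mining
    target: str
    price_usd_millions: float
    year: int


# Invented records, for illustration only; these are not real deals.
extracted = [
    Acquisition("Acme Media", "media", "Beta Studios", 750.0, 2003),
    Acquisition("Acme Media", "media", "Gamma Radio", 120.0, 2005),
    Acquisition("Delta Pharma", "pharmaceutical", "Epsilon Labs", 900.0, 2004),
    Acquisition("Zeta Broadcasting", "media", "Eta Press", 1300.0, 1994),
]


def large_media_deals(records, min_price=500.0, since_year=1997):
    """Keep media-company acquisitions above the price floor, largest first."""
    hits = [
        r for r in records
        if r.acquirer_type == "media"
        and r.price_usd_millions > min_price
        and r.year >= since_year
    ]
    return sorted(hits, key=lambda r: r.price_usd_millions, reverse=True)


for deal in large_media_deals(extracted):
    print(deal.acquirer, deal.target, deal.price_usd_millions, deal.year)
```

The hard part, of course, is producing the tagged records in the first place. That is the work the text-mining tool does and a Boolean search cannot.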

3. Concept Linkages

The most powerful use of text mining is for hypothesis generation. The classic example of hypothesis generation came from Don Swanson’s approach to looking at the medical literature in the 1980s (“Fish Oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge,” Perspectives in Biology and Medicine, V. 30, No. 1, 1986: pp. 7–18). Swanson’s process is based on the idea of concept linking, and text-mining tools are now available to help automate it. Concept linkage connects related documents by identifying shared concepts between two otherwise unrelated data sets. The idea is that unexpected connections may become apparent by examining these shared concepts. Such connections may generate new hypotheses to be investigated in more depth and may lead to new knowledge.

This approach is most often described with the following ABC diagram displayed on page 19:

The relationships visualization shown in the diagram illustrates the following reasoning: If there is a known relationship between A and B, and there is a separate but known relationship between A and C, then might there be a relationship between B and C (the hypothesis)?

This methodology for hypothesis generation can be applied using more complex visual maps that display networks of relationships among a group of concepts in a set of documents. Working with such information visualization offers the possibility of discovering previously unknown and unconsidered relationships. Text mining provides a more efficient approach for applying Swanson’s manual methodologies.

The relationships visualization shown on page 19 illustrates how Swanson’s theories can be applied using text mining. From the literature, Marc Weeber and his colleagues (“Generating Hypotheses by Discovering Implicit Associations in the Literature: A Case Report of a Search for New Potential Therapeutic Uses for Thalidomide,” Journal of the American Medical Informatics Association, V. 10, No. 3, May/June 2003: pp. 252–259) found a direct link between the compound thalidomide and an immunologic factor called interleukin 12. Separately, they found a direct link between interleukin 12 and the disease hepatitis C. Therefore, it would be reasonable to hypothesize that thalidomide has a direct relationship to hepatitis C. (More involved experimental work would be needed to test this hypothesis, as it may or may not be true. But this illustrates a method for generating hypotheses by visualizing indirect relationships among concepts.)
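The ABC reasoning can be sketched in a few lines of Python. This is only an illustration of the logic, not Swanson’s or Weeber’s actual procedure: it assumes the known relationships have already been extracted as simple pairs, and the function name `hypotheses` is our own.

```python
from collections import defaultdict
from itertools import combinations


def hypotheses(known_links):
    """Propose B-C pairs that share an intermediate concept A but have
    no direct link of their own in the input."""
    neighbors = defaultdict(set)
    for x, y in known_links:
        neighbors[x].add(y)
        neighbors[y].add(x)
    neighbors = dict(neighbors)  # freeze the keys before iterating

    proposed = defaultdict(set)  # (B, C) -> set of shared concepts A
    for a, linked in neighbors.items():
        for b, c in combinations(sorted(linked), 2):
            if c not in neighbors[b]:
                proposed[(b, c)].add(a)
    return dict(proposed)


# The two links described in the thalidomide example above.
known = [
    ("thalidomide", "interleukin 12"),
    ("interleukin 12", "hepatitis C"),
]
print(hypotheses(known))
```

Run on the two links above, the sketch proposes a single candidate pair, hepatitis C and thalidomide, connected through interleukin 12. As the text notes, such a pair is only a hypothesis to be tested, not a finding.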

Don’t enter the mine without an Information Professional

Information professionals are natural partners for text mining because of their existing skill sets. At the top of the list is their knowledge and experience with the “information highway” and their ability to place information tools in context. Furthermore, they are knowledgeable about available products and information-retrieval techniques. Good information professionals have a blend of analytical and creative skills, are adept at problem solving, and excel at dealing with ambiguity. Finally, information professionals have developed excellent consultative and listening skills and the ability to adapt and try different approaches to problems. All of these skills are critical in text-mining efforts.

Specific roles for information professionals in text-mining projects can include the following:

Facilitate conversations between internal teams and vendors. Our experience clearly illustrated that scientists and vendors speak very different languages. Someone needs to negotiate and articulate the various needs and desires and get everyone in agreement.

Place the text-mining tool in context of other information sources. It can be difficult to understand what a text-mining tool offers versus a familiar search tool. Customers need help understanding what they should expect and how it will be different from other search outputs.

Advise vendors and customers on source selection. From the customer viewpoint, it can be difficult to understand why certain commercial database information cannot be included in the text-mining effort. Licensing and copyright issues are best addressed by an information professional. Vendors likely won’t have the expertise in sources for your specific area of interest. The sources used as input can make a big difference in the output.

Advise on search strategies to retrieve the content set. Even if the vendor is going to use a content source that is familiar to all, such as PubMed, the search strategies used to retrieve the corpus are of critical importance. We have either required documentation of the exact search strategies used by the vendor or we have provided the search strategies to be used.

Consult on appropriate taxonomies and ontologies. Again, the vendor may not be familiar with taxonomies specific to your area of interest. The categorization and the organization of the text can be useful (or not) in manipulating results: Be sure the taxonomies will be useful for your data. In some very specialized areas of focus, it may be necessary to create and provide the vendor with some or all of a taxonomy. In one case, although we were using the MeSH taxonomy, we built out a specific area of interest in much greater detail, as that was the focus of our research.

Evaluate what’s “under the hood.” The information professional has a key responsibility to do some quality control around the text-mining tool. Your scientists will be interested in the output and what they can learn from the project. It is up to you to actually look at the structure and the underpinnings of the text-mining tool. Remember, you don’t see the actual “text mining” applied, but you do see the results—in the taxonomy, in the references, and in the organization of the output.

Identify application areas for text mining, and set appropriate expectations. Not all questions are appropriate for text mining. At times, it is difficult to determine that up front. However, because information professionals are so familiar with searching and can truly grasp what sets a text-mining tool apart, they can advise and help with selecting and phrasing questions while setting reasonable expectations about potential results.

Help customers evaluate and manipulate results. Most scientists are already so overloaded with job responsibilities that they don’t have the time or the inclination to invest in learning to use a new tool. It is important for the information professional to facilitate the usability and to help them gain value from the output. It may be that the information professional will have to act as an intermediary—using the tool and producing output to which the scientist can then react.

Text-mining technologies and applications are gaining ground and are growing rapidly. While there is a need for a technical infrastructure—connections between data sources and usable interfaces—there is also a need for someone to oversee and manage the semantic infrastructure. Aspects of this include guidance on content selection and information quality control; communication, education, and interaction with end users; building bridges between otherwise siloed organizations (chemistry and biology, for example); and thinking about and creating new applications for the technology. This is really just an extension of what information professionals have always done and why our role is critical as these technologies develop and are implemented.


Linguistics for Librarians

The first thing you encounter when entering the world of text mining is a very intimidating and seemingly enormous wall of jargon. While the language of linguistics can be confusing, don’t be intimidated! All you really need for your role as an information professional in text mining is a basic understanding of the key concepts.

As an information professional, you need to understand just enough about linguistics to be able to do the following:

• Understand the underlying methodology used by the tool.
• Influence the initial build structure of the specific project.
• Evaluate the output and help interpret results of a text-mining project.

Under the Hood of a Text-Mining System

1. Breaking down text into “parts” (parsing)
2. Information extraction (tagging)
3. Organization of the “parts” (taxonomies and ontologies)

Breaking Down Text into ‘Parts’

Segmentation into wordlike units (“tokenizing”)—Recognizes that <icecream>, <ice-cream>, and <ice cream> are all the same

Variant forms of the same term (“morphology & stemming”)—<to be>, <will be>, <was>, and <were> essentially represent the same term; relates singular and plural versions of the same term: <mouse> and <mice>

Understanding sentence structure (“syntax”)—recognizes that <Al hit Bill> and <Bill hit Al> convey entirely different meanings; recognizes the different meanings of these three phrases: <memory bank>, <food bank>, and <blood bank>

Clarifying context (“disambiguating”)—recognizes the correct sense of a word in context—that <cool> can mean “hip” or “neat” or “not warm,” or reflect agreement as in “that’s cool with me”; recognizes that <heart attack> and <myocardial infarction> and even, in some contexts, <MI> all have the same meaning
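For readers who like to see the mechanics, the first two parsing steps can be sketched in Python. This is a deliberately tiny illustration and not the method of any particular tool. The three lookup tables are hypothetical stand-ins for the much larger dictionaries a real system would use, and the sketch does not attempt syntax analysis or context-dependent disambiguation (for example, deciding when <MI> means myocardial infarction).

```python
import re

# Hypothetical lookup tables; a real tool would ship far larger ones.
COMPOUNDS = {"icecream": "ice cream", "ice-cream": "ice cream"}
LEMMAS = {"was": "be", "were": "be", "is": "be", "mice": "mouse"}
SYNONYMS = {"heart attack": "myocardial infarction"}


def normalize(text):
    """Lowercase, unify compound spellings and synonyms, then tokenize
    and map each token to its base form."""
    text = text.lower()
    for variant, canonical in {**COMPOUNDS, **SYNONYMS}.items():
        text = re.sub(r"\b" + re.escape(variant) + r"\b", canonical, text)
    tokens = re.findall(r"[a-z]+", text)
    return [LEMMAS.get(tok, tok) for tok in tokens]


print(normalize("The mice were fed ice-cream."))
print(normalize("Icecream was served."))
```

After normalization, both sentences contain the same <ice>, <cream>, and <be> tokens, so a search or a count treats them alike. This is why the quality of these underlying dictionaries matters so much to the final output.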

Information Extraction

Labeling grammatical roles (“POS tagging”)—This entails tagging each word with its “part of speech” (noun, verb, adjective, and so on), also known as POS.

Identify entity types (“semantic tagging”)—<Lilly>, if used in the context of referring to Eli Lilly and Co., should be tagged as a Corporate Name.

Identify relationships between entities (“relationship tagging”)—The goal here is to home in on the specific relationship described between entities. For example, in the sentence <Strep is a common cause of sore throats.>, the relationship between sore throats and strep is causal. Often, the verb is accompanied by phrases that indicate only the potential for or possibility of causation, which conveys a far less confirmed relationship. Such contexts are equally important in understanding meaning.
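Semantic tagging and relationship tagging can be illustrated with a small Python sketch. It assumes a hand-built entity dictionary and a single sentence pattern of the form "X is (or may be) a cause of Y". Real tools rely on full linguistic analysis rather than one regular expression. The entity types "Organism" and "Symptom" and the certainty labels are our own hypothetical choices.

```python
import re

# Hypothetical entity dictionary: surface form -> entity type.
ENTITY_TYPES = {
    "strep": "Organism",
    "sore throats": "Symptom",
    "lilly": "Corporate Name",
}

# Pattern for "<X> is / may be a (common) cause of <Y>".
CAUSE_PATTERN = re.compile(
    r"(?P<cause>.+?)\s+(?P<verb>is|may be|might be|could be)\s+"
    r"an?\s+(?:\w+\s+)?cause of\s+(?P<effect>.+?)\.?$",
    re.IGNORECASE,
)


def extract_relationship(sentence):
    """Return a tagged causal relationship, or None if the pattern fails."""
    match = CAUSE_PATTERN.match(sentence.strip())
    if not match:
        return None
    cause, effect = match["cause"].lower(), match["effect"].lower()
    return {
        "cause": (cause, ENTITY_TYPES.get(cause, "Unknown")),
        "effect": (effect, ENTITY_TYPES.get(effect, "Unknown")),
        "relation": "causes",
        # Hedged verbs signal a less confirmed relationship.
        "certainty": "asserted" if match["verb"].lower() == "is" else "possible",
    }


print(extract_relationship("Strep is a common cause of sore throats."))
print(extract_relationship("Strep may be a cause of sore throats."))
```

Both sentences yield the same cause and effect, but the second is marked as only a possible relationship. When you evaluate a tool’s output, check whether it preserves that distinction.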

Organization of the ‘Parts’ (‘Data Integration’)

After the text has been broken down into useful parts, these parts must be organized in some structure that will be useful for navigation. How well these structures represent the underlying subject areas will heavily influence the usefulness of the network.

The level to which data integration is done is dependent on the text-mining tool you choose. Structure begins with straightforward lists of items, such as lists of compounds, body parts, or proteins. Items within a list will likely have associated synonyms, and this you will recognize as a thesaurus. The thesaurus lists can then be organized into a hierarchical categorization that displays relationships between and among similar items: a taxonomy.

For very powerful text-mining opportunities, an even more complex structure is then built, which is called an ontology. Ontologies are built to display relationships between different categories, as opposed to the taxonomy that displays relationships within a category. Once this distinction is understood, it is easy to see how very complex and difficult such structures are to build and to maintain—and how labor-intensive they can be.
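The progression from thesaurus to taxonomy to ontology can be pictured as three increasingly rich data structures. The Python sketch below is a toy model under our own assumptions. The entries, the category names, and the relation names (`may_treat`, `associated_with`) are hypothetical and are not drawn from MeSH or any vendor’s product.

```python
# Thesaurus: preferred term -> synonyms.
THESAURUS = {
    "myocardial infarction": {"heart attack", "MI"},
    "interleukin 12": {"IL-12"},
}

# Taxonomy: child -> parent, i.e. relationships within one category.
TAXONOMY = {
    "myocardial infarction": "heart disease",
    "heart disease": "disease",
    "hepatitis C": "disease",
    "thalidomide": "compound",
}

# Ontology: typed relationships between different categories.
ONTOLOGY = [
    ("compound", "may_treat", "disease"),
    ("immunologic factor", "associated_with", "disease"),
]


def preferred_term(term):
    """Map a synonym to its preferred thesaurus term (case-insensitive)."""
    wanted = term.lower()
    for preferred, synonyms in THESAURUS.items():
        if wanted == preferred.lower() or wanted in {s.lower() for s in synonyms}:
            return preferred
    return term


def ancestors(term):
    """Walk up the taxonomy from a term to its top-level category."""
    path = []
    while term in TAXONOMY:
        term = TAXONOMY[term]
        path.append(term)
    return path


def allowed_relations(a, b):
    """Relations the ontology permits between the categories of two terms."""
    cat_a = (ancestors(a) or [a])[-1]
    cat_b = (ancestors(b) or [b])[-1]
    return [rel for s, rel, o in ONTOLOGY if s == cat_a and o == cat_b]


print(preferred_term("Heart attack"))
print(ancestors("myocardial infarction"))
print(allowed_relations("thalidomide", "hepatitis C"))
```

Even in this toy form, the maintenance burden is visible: every new term needs synonyms, a place in the hierarchy, and a check against the cross-category relationships.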

Kathryn A. Lavengood is associate information consultant at Eli Lilly and Co., and Pam Kiser is associate information consultant at Eli Lilly and Co.


Cohen, A. M.; Hersh, W. R. “A survey of current work in biomedical text mining.” Briefings in Bioinformatics, 6(1):57–71, March 2005.

Davies, K. “Search and Deploy,” Bio-IT World, V. 5 No. 8, October 2006, pp. 24–33.

Dickman, S. “Tough Mining: The challenges of searching the scientific literature.” PLoS Biology, 1(2):144–147, November 2003.

Feldman, S. “The high cost of not finding information.” KMWorld, Posted March 1, 2004.

Fickett, J.; Hayes, W. “Text mining for drug discovery.” European Pharmaceutical Contractor, Autumn 2004.

Gardner, S. P. “Ontologies and semantic data integration.” Drug Discovery Today, 10(14):1001–1007, July 2005.

Grimes, S. “The Developing text mining market.” A white paper prepared for Text Mining Summit 2005.

Guernsey, L. “Digging for nuggets of wisdom.” The New York Times, Oct. 16, 2003.

Hayes, H. “The role of libraries in the knowledge economy.” Serials, 17(3):231–238, November 2004.

Hearst, M. “What Is Text Mining?” Essay written Oct. 17, 2003.

Hersh, W. “Evaluation of biomedical text-mining systems: Lessons learned from information retrieval.” Briefings in Bioinformatics, 6(4):344–356, December 2005.

Jensen, L. J.; Saric, J.; Bork, P. “Literature mining for the biologist: from information retrieval to biological discovery.” Nature Reviews Genetics, 7:119–129, February 2006.

Krallinger, M.; Alonso-Allende Erhardt, R.; Valencia, A. “Text-mining approaches in molecular biology and biomedicine.” Drug Discovery Today, 10(6): 439–445, March 2005.

Mack, R.; Mukherjea, S.; Soffer, A.; et al. “Text analytics for life science using the Unstructured Information Management Architecture.” IBM Systems Journal, 43(3):490–515, 2004.

McFadden, T. “Text analytics: The promise, the proofs, and the pitfalls.” Presentation at 2005 Information Intelligence Summit.

Roberts, P. M.; Hayes, W. S. “Advances in text analytics for drug discovery.” Current Opinion in Drug Discovery & Development, 8(3):323–328, 2005.

Searls, D. B. “Data integration: challenges for drug discovery.” Nature Reviews Drug Discovery, 4:45–58, January 2005.

Soldatova, L. N.; King, R. D. “Are the current ontologies in biology good ontologies?” Nature Biotechnology, 23(9):1095–1098, September 2005.