Information Professionals in the Text Mine
By Kathryn A. Lavengood and Pam Kiser
Text mining is growing rapidly as a technology for analyzing large volumes of unstructured textual documents. With the “information explosion” an accepted fact of living in the 21st century, those charged with maintaining awareness of developments in a given area of expertise risk overlooking critical data. Everyone, not just information professionals, has been affected by the exponentially increasing volumes of information available, as well as by changing attitudes toward electronic information.
With access to more information freely available, and with the help of Google and other search engines, patrons have become “information consumers” with very high expectations (access to multiple sources with a Google-like search box). As information delivery via email, blogs, wikis, and alerts expands, information consumers turn to the information professional with different demands. Today, they no longer want to be pointed to a reference or an article; they expect highly relevant information specific to their interests that has been analyzed, filtered, and summarized for them personally. They need to analyze much larger sets of data, perform trend analyses, and visualize the results in new ways.
In response, a number of vendors have developed sophisticated commercially available text-mining applications. Many companies are beginning to look at ways to apply text mining to current business problems and to increase productivity. Yet text mining is frequently viewed as an information technology tool—build it, and people will use it. While there is a definite need for a solid technical infrastructure to support these tools, there is also a need for a semantic infrastructure that focuses on information quality and decision support. The information professional is ideally suited for this latter role.
MINING TEXTUAL INFORMATION
One of the most promising technologies being developed to address the information problems of today is text mining. Text mining was envisioned and developed from computational linguistics (also known as NLP or natural language processing). The ultimate goal of computational linguistics is to develop mechanical language analysis tools that use statistical techniques to interpret and assign meaning to parts of the text. This provides researchers with the opportunity to review larger and larger volumes of text from multiple sources without having to read each and every line, yet still “discover knowledge.”
The interpretation of text is just the first step in making the information usable. Another key part is then organizing the resulting “text pieces” into some form of usable network. This is addressed by building taxonomies and ontologies that can be navigated to explore specific topics of interest. Finally, the results must be output in a format that can be interpreted and lead to knowledge discovery.
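To make the organizing step concrete, here is a minimal sketch in Python of how tagged text can be rolled up into a navigable taxonomy. It is an illustration under stated assumptions, not how any commercial tool works: the three-entry `BROADER` hierarchy is a toy invented for this example (it is not MeSH), the sample sentences are made up, and the tagging is a crude substring match standing in for real linguistic analysis.

```python
from collections import defaultdict

# Toy taxonomy (hypothetical, not MeSH): narrower term -> broader term.
BROADER = {
    "interleukin 12": "cytokines",
    "cytokines": "immunologic factors",
    "hepatitis c": "viral diseases",
}

# Made-up sample sentences standing in for a document collection.
DOCS = [
    "Interleukin 12 levels rose after treatment.",
    "Patients with hepatitis C were enrolled.",
    "Cytokines regulate the immune response.",
]

def ancestors(term):
    """Walk up the taxonomy and return every broader term, nearest first."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain

def tag(text):
    """Crude tagger: report each taxonomy term that appears verbatim in the text."""
    lowered = text.lower()
    vocabulary = set(BROADER) | set(BROADER.values())
    return {term for term in vocabulary if term in lowered}

def build_index(docs):
    """Index each document under its tagged terms and all of their broader terms."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tag(text):
            index[term].add(doc_id)
            for broader in ancestors(term):
                index[broader].add(doc_id)
    return index

index = build_index(DOCS)
print(sorted(index["immunologic factors"]))
```

Browsing the broad node "immunologic factors" retrieves both the document that mentions cytokines and the one that mentions only interleukin 12, which is the kind of navigation a taxonomy is meant to support.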
WHAT YOU CAN DO WITH TEXT MINING
We generally think of text mining as being used in one of three ways:
1. Seeing the big picture (clustering)
2. Finding answers to very specific questions (question answering)
3. Hypothesis generation (concept linkages)
1. Clustering
“Seeing the big picture” refers to the possibilities of viewing large amounts of information from a high-level perspective: seeing the forest rather than the trees. Looking at large amounts of information in the aggregate can reveal interesting trends or arrangements of data that can point the researcher in a new direction.
An example of this can be shown by using a clustering visualization that appears below. Knowing nothing about Martha Stewart, a researcher could quickly look at a diagram and use the clusters to focus in on each of the aspects related to her.
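For readers curious about what sits behind such a display, the following Python sketch groups short documents by shared vocabulary. It is only one simple way to cluster text, assuming bag-of-words vectors, cosine similarity, and a greedy single-pass grouping; commercial tools are likely to use more sophisticated methods. The sample snippets, the stopword list, and the 0.3 threshold are all invented for illustration.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "with"}

# Invented snippets standing in for a retrieved document set.
DOCS = [
    "new cookbook recipes for holiday baking",
    "holiday baking recipes and kitchen tips",
    "company stock price falls after earnings report",
    "earnings report lifts company stock price",
    "television show renewed for another season",
]

def vectorize(text):
    """Turn a document into a bag-of-words term-count vector."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is similar enough."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # lists of document indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # nothing similar enough, so start a new cluster
    return clusters

print(cluster(DOCS))  # [[0, 1], [2, 3], [4]]
```

Each resulting group corresponds to one "aspect" of the topic, which a visualization tool would then draw as a labeled region of the diagram.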
2. Question Answering (Sophisticated Search)
Another potential application of text mining is investigating very specific questions at a level of detail not possible with standard text-searching techniques. Such a question might be “Which media companies have made acquisitions larger than $500 million in the past 10 years?” Even the best information-retrieval techniques (Boolean operators, MeSH descriptors) cannot get to the sentence or phrase level and/or identify a specific verbal connection (acquisition) between two companies or rank by acquisition price.
An example of this appears in the table below. While this search could have been run in a traditional Boolean search system, the searcher would need to manually enter the names of the companies, indicate company type = media, and manually remove acquisitions with purchase prices under $500 million. Finally, the results would have to be organized by hand. Using a text-mining approach, media companies can be tagged as such, and the results can be output into a spreadsheet that is easily browsed for the information of interest.
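The Python sketch below shows the general shape of this kind of sentence-level extraction. Everything in it is a simplifying assumption: the company names and sentences are invented, the `COMPANY_TYPES` lookup stands in for a tool's entity tagging, and a single regular expression covering one phrasing stands in for real linguistic analysis. The `since_year` cutoff is a parameter because "the past 10 years" depends on when the question is asked.

```python
import csv
import io
import re

# Hypothetical entity tags, standing in for a text-mining tool's company tagging.
COMPANY_TYPES = {
    "Alpha Media": "media",
    "Beta Broadcasting": "media",
    "Gamma Steel": "manufacturing",
}

# Invented sentences; none describe real transactions.
SENTENCES = [
    "Alpha Media acquired Delta Press for $750 million in 2003.",
    "Gamma Steel acquired Epsilon Mills for $2 billion in 2001.",
    "Beta Broadcasting acquired Zeta Radio for $1.2 billion in 2005.",
    "Alpha Media acquired Eta Films for $120 million in 2004.",
    "Beta Broadcasting reported higher revenue in 2006.",
]

# Matches one narrow phrasing: "<Buyer> acquired <Target> for $<n> million|billion in <year>".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z]\w+(?: [A-Z]\w+)*) acquired (?P<target>[A-Z]\w+(?: [A-Z]\w+)*) "
    r"for \$(?P<amount>[\d.]+) (?P<unit>million|billion) in (?P<year>\d{4})"
)

def extract_acquisitions(sentences):
    """Pull (buyer, target, price, year) facts out of sentences that match the pattern."""
    facts = []
    for sentence in sentences:
        m = ACQUISITION.search(sentence)
        if m:
            scale = 1000 if m["unit"] == "billion" else 1
            facts.append({
                "buyer": m["buyer"],
                "target": m["target"],
                "price_musd": float(m["amount"]) * scale,  # price in millions of dollars
                "year": int(m["year"]),
            })
    return facts

def answer(sentences, since_year, company_type="media", min_price_musd=500):
    """Keep acquisitions by the given company type above the price floor, largest first."""
    hits = [
        f for f in extract_acquisitions(sentences)
        if COMPANY_TYPES.get(f["buyer"]) == company_type
        and f["price_musd"] > min_price_musd
        and f["year"] >= since_year
    ]
    return sorted(hits, key=lambda f: f["price_musd"], reverse=True)

def to_csv(rows):
    """Render the answer as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["buyer", "target", "price_musd", "year"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_csv(answer(SENTENCES, since_year=2000)))
```

The point of the sketch is the division of labor: the verbal connection (acquisition), the company type, and the price are captured as structured fields, so filtering and ranking become simple operations on a table rather than manual reading.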
3. Concept Linkages
The most powerful use of text mining is for hypothesis generation. The classic example of hypothesis generation came from Don Swanson’s approach to looking at the medical literature in the 1980s (“Fish Oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge,” Perspectives in Biology and Medicine, V. 30, No. 1, 1986: pp. 7–18). Text-mining tools are available to help automate Swanson’s process. Swanson’s process is based on the idea of concept linking. Concept linkage connects related documents by identifying shared concepts between two unrelated data sets. The idea is that unexpected connections may become apparent by examining these shared concepts. Such connections may generate new hypotheses to be investigated in more depth and may lead to new knowledge.
This approach is most often described with the following ABC diagram displayed on page 19:
The relationships visualization shown in the diagram illustrates the reasoning: If there is a known relationship between A and B, and there is a separate but known relationship between A and C, then might there also be a relationship between B and C (the hypothesis)?
This methodology for hypothesis generation can be applied using more complex visual maps that display networks of relationships among a group of concepts in a set of documents. Working with such information visualization offers the possibility of discovering previously unknown and unconsidered relationships. Text mining provides a more efficient approach for applying Swanson’s manual methodologies.
The relationships visualization shown on page 19 illustrates how Swanson’s theories can be applied using text mining. From the literature, Marc Weeber and his colleagues (“Generating Hypotheses by Discovering Implicit Associations in the Literature: A Case Report of a Search for New Potential Therapeutic Uses for Thalidomide,” Journal of the American Medical Informatics Association, V. 10, No. 3, May/June 2003: pp. 252–259) found a direct link between the compound thalidomide and an immunologic factor called interleukin 12. Separately, they found a direct link between interleukin 12 and the disease hepatitis C. Therefore, it would be reasonable to hypothesize that thalidomide has a direct relationship to hepatitis C. (More involved experimental work would be needed to test this hypothesis, as it may or may not be true. But this illustrates a method for generating hypotheses by visualizing indirect relationships among concepts.)
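The linking logic itself is simple enough to sketch in a few lines of Python. The version below assumes that each document has already been reduced to a set of tagged concepts; the two concept sets are hand-built stand-ins for the two literature links described above, not the output of a real tool, and treating co-occurrence in one document as a "known relationship" is a deliberate simplification.

```python
from collections import defaultdict
from itertools import combinations

# Hand-tagged stand-ins for the two direct links described in the text.
TAGGED_DOCS = [
    {"thalidomide", "interleukin 12"},
    {"interleukin 12", "hepatitis C"},
]

def build_links(tagged_docs):
    """Map each concept to the concepts it co-occurs with in at least one document."""
    links = defaultdict(set)
    for concepts in tagged_docs:
        for first, second in combinations(sorted(concepts), 2):
            links[first].add(second)
            links[second].add(first)
    return links

def indirect_links(tagged_docs, start):
    """Return {candidate: shared concepts} for concepts never co-mentioned with `start`."""
    links = build_links(tagged_docs)
    candidates = defaultdict(set)
    for shared in links[start]:
        for other in links[shared]:
            # Skip the starting concept itself and anything already directly linked to it.
            if other != start and other not in links[start]:
                candidates[other].add(shared)
    return dict(candidates)

print(indirect_links(TAGGED_DOCS, "thalidomide"))
# {'hepatitis C': {'interleukin 12'}}
```

Each entry in the result is a candidate hypothesis together with the shared concept that suggested it. On a real corpus the list would be long and noisy, so ranking and expert review of the candidates would still be needed.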
DON’T ENTER THE MINE WITHOUT AN INFORMATION PROFESSIONAL
Information professionals are natural partners for text mining because of their existing skill sets. At the top of the list is their knowledge and experience with the “information highway” and their ability to place information tools in context. Furthermore, they are knowledgeable about available products and information-retrieval techniques. Good information professionals have a blend of analytical and creative skills, are adept at problem solving, and excel at dealing with ambiguity. Finally, information professionals have developed excellent consultative and listening skills and the ability to adapt and try different approaches to problems. All of these skills are critical in text-mining efforts.
Specific roles for information professionals in text-mining projects can include the following:
• Facilitate conversations between internal teams and vendors. Our experience clearly illustrated that scientists and vendors speak very different languages. Someone needs to negotiate and articulate the various needs and desires and get everyone in agreement.
• Place the text-mining tool in context of other information sources. It can be difficult to understand what a text-mining tool offers versus a familiar search tool. Customers need help understanding what they should expect and how it will be different from other search outputs.
• Advise vendors and customers on source selection. From the customer viewpoint, it can be difficult to understand why certain commercial database information cannot be included in the text-mining effort. Licensing and copyright issues are best addressed by an information professional. Vendors likely won’t have the expertise in sources for your specific area of interest. The sources used as input can make a big difference in the output.
• Advise on search strategies to retrieve the content set. Even if the vendor is going to use a content source that is familiar to all, such as PubMed, the search strategies used to retrieve the corpus are of critical importance. We have either required documentation of the exact search strategies used by the vendor or we have provided the search strategies to be used.
• Consult on appropriate taxonomies and ontologies. Again, the vendor may not be familiar with taxonomies specific to your area of interest. The categorization and the organization of the text can be useful (or not) in manipulating results: Be sure the taxonomies will be useful for your data. In some very specialized areas of focus, it may be necessary to create and provide the vendor with some or all of a taxonomy. In one case, although we were using the MeSH taxonomy, we built out a specific area of interest in much greater detail, as that was the focus of our research.
• Evaluate what’s “under the hood.” The information professional has a key responsibility to do some quality control around the text-mining tool. Your scientists will be interested in the output and what they can learn from the project. It is up to you to actually look at the structure and the underpinnings of the text-mining tool. Remember, you don’t see the actual “text mining” applied, but you do see the results—in the taxonomy, in the references, and in the organization of the output.
• Identify application areas for text mining, and set appropriate expectations. Not all questions are appropriate for text mining. At times, it is difficult to determine that up front. However, because information professionals are so familiar with searching and can truly grasp how a text-mining tool differs from a search tool, they can advise and help with selecting and phrasing questions while setting reasonable expectations about potential results.
• Help customers evaluate and manipulate results. Most scientists are already so overloaded with job responsibilities that they don’t have the time or the inclination to invest in learning to use a new tool. It is important for the information professional to facilitate the usability and to help them gain value from the output. It may be that the information professional will have to act as an intermediary—using the tool and producing output to which the scientist can then react.
Text-mining technologies and applications are gaining ground and are growing rapidly. While there is a need for a technical infrastructure—connections between data sources and usable interfaces—there is also a need for someone to oversee and manage the semantic infrastructure. Aspects of this include guidance on content selection and information quality control; communication, education, and interaction with end users; building bridges between otherwise siloed organizations (chemistry and biology, for example); and thinking about and creating new applications for the technology. This is really just an extension of what information professionals have always done and why our role is critical as these technologies develop and are implemented.