Searcher, Vol. 12 No. 7, July/August 2004

FEATURE
Text Mining for Reputations: SCOUG Spring Workshop 2004
by Amelia Kassel

The annual, day-long SCOUG (Southern California Online Users Group, http://www.scougweb.org) meeting on April 16th, "The New Gold Rush: Text Mining Finds the Motherlodes," propelled wide-eyed information professionals into the exciting future of new and emerging technologies. At the beautiful Kellogg West Conference Center on the Cal Poly Pomona campus in southern California, set amidst fields of Arabian horses (and chickens), attendees saw previews of some of the latest technologies, which have been in development for some years but are now surfacing with applications of particular interest to information professionals. Linnea Christiani, SCOUG's Mistress of Ceremonies, moderated.

When initially invited to participate on the Reactors Panel, I didn't know what this new world was all about, but I said yes in anticipation of learning more about the next chapter of the online world. The program did not let me down.

Before I recount the lessons learned from a day focused on text mining, let's look at pertinent definitions. According to the Wikipedia Encyclopedia:

Text mining, also known as knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and nontrivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field, which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80 percent) is stored as text, text mining is believed to have a high commercial potential value.

From http://en.wikipedia.org/wiki/Text_mining

Loretta Auvil and Duane Searsmith (Automated Learning Group, National Center for Supercomputing Applications, University of Illinois) suggest that the term "text mining" has several definitions and add:

Text mining is an exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge. Strictly speaking, previously unknown information is information that not even the writer knows, whereas a lenient definition is that it rediscovers the information that the author encoded in the text.

Using Text Mining for Spam Filtering,
Supercomputing 2003,
http://algdocs.ncsa.uiuc.edu/PR-20031116-3.ppt

But what are the implications for librarians and info pros with regard to text-mining applications? Randy Marcinko, one of two keynote speakers, succinctly and simply set the context when he explained text mining as another form of search and retrieval during his afternoon presentation, "Text Mining: The New Tools Behind the Screen." Randy described software and hardware advances that are creating a new generation of "answer products." His presentation used the term data mining, leading me to question the differences between text mining and data mining. Data mining is defined as:

An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

From http://www.twocrows.com/glossary.htm

For a scholarly discussion of text mining and data mining, too much to sort out here just now, see "Untangling Text Data Mining" by Marti A. Hearst, School of Information Management & Systems, University of California, Berkeley. This paper is part of the Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999 [http://acl.ldc.upenn.edu//P/P99/P99-1001.pdf]. For a description and examples of text- and data-mining applications, see "Combine Text Mining With Data Mining to Turn Your Information into Business Intelligence" from statistical software vendor SPSS [http://www.spss.com/pdfs/TMCLMINS-0702.pdf].

Randy marched us through a bit of history with examples outlining how "slow and cumbersome Boolean (commercial vendors) begot fast and powerful pseudo-Boolean (Google), but quality of answers remained the same." Randy pointed out, however, that now there is hope, because tools for data mining are becoming available, made possible by:

• automated indexing/categorization

• automated classification

• entity extraction

• query by example = "More like this"

• linguistic-based search enhancement
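To make the "more like this" idea concrete, here is a minimal sketch of query by example in Python. It assumes nothing about how any vendor at the workshop implements the feature: it simply counts words in each document and ranks documents by cosine similarity to an example text. The function names and the sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (a plain bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(example, documents):
    """Return the documents ordered from most to least similar to the example."""
    query = term_vector(example)
    return sorted(documents, key=lambda d: cosine(query, term_vector(d)), reverse=True)


# Hypothetical sample documents, invented for illustration.
docs = [
    "Quarterly earnings rose on strong retail sales",
    "Rumors about the product recall spread on message boards",
    "Message boards discuss a rumored product recall",
]
ranked = more_like_this("product recall rumors on message boards", docs)
```

A production system would likely add stemming, stop-word removal, and term weighting; this sketch leaves those out to keep the core comparison visible.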

At this juncture, however, advanced data mining, he says, is still only the Holy Grail. Kevin Mann, a panelist from IBM's WebFountain, said that at some point we will find answers, not documents. As a reactor, I thought this a fine idea, but it left me with a lingering concern as to whether the answers would include sources. Will we know where the information comes from so we can judge its quality? Without documents or complete citations, how could we apply quality criteria? How this all comes together remains to be seen.

Barbie Keiser, our second keynote speaker, described specific text-mining applications in her presentation about "Reputation Monitors," content services that should appeal to information professionals and corporate librarians, in particular. Karin Borchert (chief product officer at Factiva), Julie Stock (CEO and president of Nexcerpt Inc.), and Scott Larson (eWatch product specialist, eWatch/PR Newswire) comprised a vendor panel that described in lively fashion some of the new reputation monitor services offered by their firms.

In her introduction preceding the panel, Barbie suggested that tools which can automatically monitor people, companies, products, and activities across both traditional and Web-only media are gaining momentum. Results are used in public relations, sales and marketing, competitive intelligence, and to discover emerging trends. Various online services survey and track key sources across the Web, including e-zines, newspapers, blogs, and trade press. Players in the field of reputation monitoring include QuickBrowse, CyberAlert, Tracerlock, Nexcerpt, and eWatch. In a major development, Factiva recently partnered with IBM's WebFountain to create Reputation Manager, a high-end service for corporate settings. The minimum entry level cost is $150,000, but Clare Hart, Factiva's CEO, says the service will make its way down to the individual user. ["Protection Money: Clare Hart Claims That Her Online Cuttings Service Has the Power to Stop Damaging Rumours Before They Start — But You'll Have to Pay for the Privilege," Kate Bulkley, The Guardian, April 26, 2004].

Barbie reminded us how important a company's reputation is in the business world and how vulnerable a reputation can be: here today but gone tomorrow. Some external forces that affect a company's reputation come from stakeholders, lobby groups, labor unions, customers, media, and analysts. Companies must track what's said about them and who's listening, as well as discover emerging trends in their areas of interest. In the past, doing this often involved using commercial database aggregators such as Factiva, Dialog, and LexisNexis. Gathering data from the Web has become equally, if not more, important. Companies that need to scan Web news sources, message boards, blogs, key Web sites, etc., can now use some of the newer-technology products such as Factiva's new Reputation Manager, Nexcerpt, and eWatch from PR Newswire.

Factiva's Reputation Manager combines searches from 9,000 sources in 22 languages and 118 countries and also tracks information on blogs and bulletin boards. It promises to identify quickly any damaging rumors that may first appear on the Net, e.g., in chat rooms. Text analytics solutions are built on IBM's WebFountain platform. Factiva is the first company to license the WebFountain technology, a Web-scale mining and discovery platform that extracts trends, patterns, and relationships from massive amounts of unstructured and semi-structured text. In this first application, the Factiva tool tracks corporate reputation:

...offering an external view of a company's reputation by analyzing information from a comprehensive collection of Factiva sources, Internet pages, and newsgroups. The resulting analysis is presented in a report that clearly shows the information in context, providing a view on relevant business issues, showing new industry trends, and exposing relationships.

—"WebFountain Rewrites the Rules of Business,"
San Jose, Calif., Business Wire,
September 18, 2003

Nexcerpt offers more than the average clipping service at affordable prices ($200 a month for up to 10 search profiles) and monitors 5,400 selected Web-based sources tailored to user requirements. It "intelligently extracts relevant excerpts, capturing context, along with links to the original Web sources." Using Nexcerpt, users may also select and annotate excerpts, contribute the benefit of their expertise, and then e-mail or publish the enhanced results directly to any audience. You can add comments and "brand" your work and e-mail to multiple users, publish to a Web site or intranet, or send as an XML feed.
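As a rough illustration of what extracting "relevant excerpts, capturing context" could look like, the Python sketch below pulls each sentence that mentions a search-profile term, together with its neighboring sentences. This is not Nexcerpt's actual method, which was not described in detail; the function name, the `context` parameter, and the sample text are assumptions made for the example.

```python
import re


def extract_excerpts(text, terms, context=1):
    """Return each sentence that mentions a term, plus `context` sentences on either side."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lowered_terms = [t.lower() for t in terms]
    excerpts = []
    for i, sentence in enumerate(sentences):
        if any(t in sentence.lower() for t in lowered_terms):
            start = max(0, i - context)
            end = min(len(sentences), i + context + 1)
            excerpts.append(" ".join(sentences[start:end]))
    return excerpts


# Hypothetical page text and search profile, invented for illustration.
page_text = (
    "Acme opened a new plant. Analysts praised the move. "
    "Weather was mild. Sports scores followed."
)
excerpts = extract_excerpts(page_text, ["Acme"], context=1)
```

A real monitoring service would also keep the link to the original Web source alongside each excerpt, as the Nexcerpt description notes.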

eWatch, an "Internet monitoring service" from PR Newswire, pioneered this kind of Net monitoring. The company began operation in 1994 and was later purchased by PR Newswire. It monitors 8,000 message boards, news sources, and media sites for rumors, reactions to company announcements, and more. eWatch continually scans Web publications and hundreds of thousands of online publications, Web pages, bulletin boards, and e-mail discussion groups in over 30 countries, based upon customized search criteria. The newest version has an in-box feature that enables users to sort links to articles by subject, client, date, search term matches, or readership data. Customers can flag links to share with their clients and colleagues and, with one click, e-mail individual article links or an entire page of article links with highlighted search terms. eWatch also allows customers to measure the number of potential viewers and possible impact.

Afternoon vendor panelists were Kevin Mann (marketing strategist, WebFountain project, IBM Research), Dr. Christian Toelg (director, business development, NEC Laboratories America, Inc.), and Andrew McKay (senior vice-president of technical sales, FastSearch.com). All delivered fascinating presentations about underlying technologies related to information infrastructures and retrieval. I described the WebFountain project with Factiva above.

CiteSeer software [http://www.neci.nj.nec.com/homepages/lawrence/citeseer.html] from NEC Laboratories America [http://www.nec-labs.com] is a "scientific digital library system that implements Autonomous Citation Indexing (ACI)" and is freely available. ACI automates the construction of citation indexes (similar to the Science Citation Index) and groups a collection of citations to a given article, allowing researchers to easily see what is being said and why the article was cited. Because ACI is completely automatic, it requires no manual effort, which in turn should result in lower cost and wider availability. CiteSeer locates papers on the Web using search engines, heuristics, and Web crawling. Other means of locating papers include the indexing of existing archives through agreements with publishers and user submission. Once papers are located, CiteSeer extracts individual citations. CiteSeer and Thomson ISI [http://www.isinet.com] are collaborating to "create a comprehensive, multidisciplinary citation index for Web-based scholarly resources, due out in early 2005." [For additional information, see the Information Today Inc. Newsbreak, "Thomson ISI to Track Web-Based Scholarship with NEC's CiteSeer," by Barbara Quint, March 1, 2004, https://www.infotoday.com/newsbreaks/nb040301-1.shtml.]
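The grouping step in citation indexing can be pictured with a small Python sketch. It is only an illustration of the general idea, not CiteSeer's algorithm: it assumes citations have already been extracted as (citing paper, cited title, surrounding text) triples, and it matches variant forms of a title by crude normalization. The function names and the sample records ("Paper A", "Paper B") are hypothetical.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase a title and keep only letters and digits, so minor variants match."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def build_citation_index(citations):
    """Group (citing_paper, context) pairs under the normalized title of the cited work."""
    index = defaultdict(list)
    for citing_paper, cited_title, context in citations:
        index[normalize_title(cited_title)].append((citing_paper, context))
    return dict(index)


# Hypothetical extracted citations, invented for illustration.
citations = [
    ("Paper A", "Untangling Text Data Mining", "Hearst distinguishes text mining from retrieval."),
    ("Paper B", "Untangling text data-mining.", "We follow the definition given by Hearst."),
    ("Paper B", "Some Other Article", "An unrelated reference."),
]
index = build_citation_index(citations)
```

Looking up `index["untangling text data mining"]` then returns both citing papers with the text surrounding each citation, which is the "what is being said and why" view described above.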

FastSearch.com develops and licenses Fast ESP, an enterprise search platform. For an excellent overview and description, see the IDC White Paper "A 360 Degree View of Enterprise Information" [http://fastsearch.com/us/verticals/more], by Susan Feldman and Chris Sherman, April 2004. Fast ESP incorporates the following:

• Search and query

• Data and text mining and analysis

• Exploration and static reporting

• Content repositories and data warehouses

What Does It All Mean to Info Pros?

At the end of the day, a Reactors Panel, "What Does All This Mean?" composed of Barbie Keiser, Randy Marcinko, Karin Borchert of Factiva, and yours truly, discussed the implications for our profession. New technology and tools undoubtedly signify the coming of a new era and change for information professionals. Ideally, these technologies will create cost- and time-effective products that revolutionize the way we search and gather information. Although much of this technology is currently geared to the enterprise, we are already seeing affordable and pioneering products, such as Nexcerpt, reach the broader markets.

The gap between free and fee may narrow for some applications. At the same time, corporate environments will acquire better tools for storing and retrieving information that promote greater overall organizational success. Specialization is leading to partnering, exemplified by the Factiva and WebFountain collaboration, as well as that of CiteSeer and Thomson ISI.

At the high end, the products promise to include answers to questions the requester did not even know to ask, but doubts remain about how we will judge the truth and relevance of the answers. In past experiences with other new and emerging technologies, I've often been first dazzled, then disappointed. The brainpower displayed at the SCOUG meeting and the cooperative endeavors reflected by speakers, however, are rays of hope, as traditional companies such as Factiva, IBM, NEC, and Thomson fuse resources and technology and new-age companies such as FastSearch.com and Nexcerpt construct novel and advanced products. We can look to them to offer working tools that provide superior information and knowledge, increase productivity, and lower costs.

Exhibitors

Exhibitors, listed below, included many of the same companies always present at online shows. Each displayed a new face with fresh products or services. Altogether, the SCOUG meeting presented many exciting possibilities for the future of our profession and I was fortunate to have the opportunity to attend and participate.

Advanced Information Management

Association of Independent Information Professionals

Dialog

EBSCO

Factiva

Library Associates

Mergent

Nexcerpt, Inc.

Swets

H. W. Wilson

When You're Hot, You're Hot

In the weeks following the workshop, it seemed like I could see reputation monitors and text-mining applications springing up everywhere. Perhaps the session had just made me more aware. Or could it be that the workshop planners had truly reached the "bleeding edge"?

• Intelliseek launches BlogPulse [http://www.blogpulse.com], a free monitoring tool to track trends, names, phrases, links, etc., within the network of key blogs. (For a description, read the press release, http://www.intelliseek.com/releases2.asp?id=104.)

• Outsell Inc. announced a "BrainGains" meeting in San Mateo on May 27th devoted to "Text Mining and Visualization Tools: Science or Science Fiction" [http://www.outsellinc.com/braingain/textmining.htm].

• Inxight sponsors a KMWorld live e-Broadcast event on May 18th on how to "discover the true value of information," devoted to understanding the power of text mining and data visualization for event analysis and discovery issues.

• ComputerWorld runs a major article on reputation monitors by Alan Earls in the April 5th issue: "Winning the Name Game" and "Sidebar: Reputations Caught in the Web." [See http://www.computerworld.com/developmenttopics/websitemgmt/story/0,10801,91840,00.html and http://www.computerworld.com/softwaretopics/software/story/0,10801,91865,00.html, respectively.]
