Text Mining for Reputations: SCOUG
Spring Workshop 2004
by Amelia Kassel
The annual, day-long SCOUG (Southern California Online
Users Group, http://www.scougweb.org) meeting on April
16th, "The New Gold Rush: Text Mining Finds the Motherlodes," propelled
wide-eyed information professionals into the exciting
future of new and emerging technologies. The meeting was held
amidst fields of Arabian horses and chickens at
the beautiful Kellogg West Conference Center on the
Cal Poly Pomona campus in southern California. Attendees
saw previews of some of the latest technologies, in
development for some years but now surfacing with
applications of particular interest to information
professionals. Linnea Christiani, SCOUG's Mistress
of Ceremonies, moderated.
When initially invited to participate on the Reactors
Panel, I didn't know what this new world was all about,
but I said yes in anticipation of learning more about
the next chapter of the online world. The program
did not let me down.
Before I recount the lessons learned from a day focused
on text mining, let's look at pertinent definitions.
According to the Wikipedia Encyclopedia:
Text mining, also known as knowledge-discovery in
text (KDT), refers generally to the process of extracting
interesting and nontrivial information and knowledge
from unstructured text. Text mining is a young interdisciplinary
field, which draws on information retrieval, data mining,
machine learning, statistics, and computational linguistics.
As most information (over 80 percent) is stored as
text, text mining is believed to have a high commercial
Loretta Auvil and Duane Searsmith (Automated Learning
Group, National Center for Supercomputing Applications,
University of Illinois) suggest that the term "text
mining" has several definitions and add:
Text mining is an exploration and analysis of textual
(natural-language) data by automatic and semi-automatic
means to discover new knowledge. Strictly speaking,
previously unknown information is information that
not even the writer knows, whereas a lenient definition
is that it rediscovers the information that the author
encoded in the text.
Using Text Mining for Spam Filtering,
But what are the implications for librarians and
info pros with regard to text-mining applications?
Randy Marcinko, one of two keynote speakers, succinctly
and simply set the context when he explained text mining
as another form of search and retrieval during his
afternoon presentation, "Text Mining: The New Tools
Behind the Screen." Randy described software and hardware
advances that are creating a new generation of "answer
products." His presentation used the term data mining,
leading me to question the differences between text
mining and data mining. Data mining is defined as:
An information extraction activity whose goal is
to discover hidden facts contained in databases. Using
a combination of machine learning, statistical analysis,
modeling techniques, and database technology, data
mining finds patterns and subtle relationships in data
and infers rules that allow the prediction of future
results. Typical applications include market segmentation,
customer profiling, fraud detection, evaluation of
retail promotions, and credit risk analysis.
For a scholarly discussion of text mining and data
mining, too much to sort out here just now, see "Untangling
Text Data Mining" by Marti A. Hearst, School of Information
Management & Systems, University of California,
Berkeley. This paper is part of the Proceedings
of the 37th Annual Meeting of the Association for Computational
Linguistics, 1999 [http://acl.ldc.upenn.edu//P/P99/P99-1001.pdf].
For a description and examples of text- and data-mining
applications, see Combine Text Mining With Data
Mining to Turn Your Information into Business Intelligence from
statistical software vendor SPSS [http://www.spss.com/pdfs/TMCLMINS-0702.pdf].
Randy marched us through a bit of history with examples
outlining how "slow and cumbersome Boolean (commercial
vendors) begot fast and powerful pseudo-Boolean (Google),
but quality of answers remained the same." Randy pointed
out, however, that now there is hope, because tools
for data mining are becoming available, made possible
query by example = "More like this"
At this juncture, however, advanced data mining,
he says, is still only the Holy Grail. Kevin Mann,
a panelist from IBM's WebFountain, said that at some
point we will find answers, not documents. As a reactor,
I thought this a fine idea, but it left me with a lingering
concern as to whether the answers would include sources.
Will we know where the information comes from so we can
judge its quality? Without documents or complete citations,
how could we apply quality criteria? How this all comes
together remains to be seen.
Barbie Keiser, our second keynote speaker, described
specific text-mining applications in her presentation
about "Reputation Monitors," content services that
should appeal to information professionals and corporate
librarians, in particular. Karin Borchert (chief product
officer at Factiva), Julie Stock (CEO and president
of Nexcerpt Inc.), and Scott Larson (eWatch product
specialist, eWatch/PR Newswire) comprised a vendor
panel that described in lively fashion some of the
new reputation monitor services offered by their firms.
In her introduction preceding the panel, Barbie suggested
that tools which can automatically monitor people,
companies, products, and activities across both traditional
and Web-only media are gaining momentum. Results are
used in public relations, sales and marketing, competitive
intelligence, and to discover emerging trends. Various
online services survey and track key sources across
the Web, including e-zines, newspapers, blogs, and
trade press. Players in the field of reputation monitoring
include QuickBrowse, CyberAlert, Tracerlock, Nexcerpt,
and eWatch. In a major development, Factiva recently
partnered with IBM's WebFountain to create Reputation
Manager, a high-end service for corporate settings.
The minimum entry level cost is $150,000, but Clare
Hart, Factiva's CEO, says the service will make its
way down to the individual user. ["Protection Money:
Clare Hart Claims That Her Online Cuttings Service
Has the Power to Stop Damaging Rumours Before They
Start But You'll Have to Pay for the Privilege," Kate
Bulkley, The Guardian, April 26, 2004].
Barbie reminded us how important a company's reputation
is in the business world and how vulnerable it is: a reputation
can be here today and gone tomorrow. Some external
forces that affect a company's reputation come from
stakeholders, lobby groups, labor unions, customers,
media, and analysts. Companies must track what's said
about them and who's listening, as well as discover
emerging trends in their areas of interest. In the
past, doing this often involved using commercial database
aggregators such as Factiva, Dialog, and LexisNexis.
Gathering data from the Web has become equally, if
not more, important. Companies that need to scan Web
news sources, message boards, blogs, key Web sites,
etc., can now use some of the newer-technology products
such as Factiva's new Reputation Manager, Nexcerpt,
and eWatch from PR Newswire.
Factiva's Reputation Manager combines searches from
9,000 sources in 22 languages and 118 countries and
also tracks information on blogs and bulletin boards.
It promises to identify quickly any damaging rumors
that may first appear on the Net, e.g., in chat rooms.
Text analytics solutions are built on IBM's WebFountain
platform. Factiva is the first company to license the
WebFountain technology, a Web-scale mining and discovery
platform that extracts trends, patterns, and relationships
from massive amounts of unstructured and semi-structured
text. In this first application, the Factiva tool tracks
...offering an external view of a company's reputation
by analyzing information from a comprehensive collection
of Factiva sources, Internet pages and newsgroups.
The resulting analysis is presented in a report that
clearly shows the information in context, providing
a view on relevant business issues, showing new industry
trends, and exposing relationships.
—"WebFountain Rewrites the Rules of Business,"
San Jose, Calif., Business Wire,
September 18, 2003
Nexcerpt offers more than the average clipping service
at affordable prices ($200 a month for up to 10 search
profiles) and monitors 5,400 selected Web-based sources
tailored to user requirements. It "intelligently extracts
relevant excerpts, capturing context, along with links
to the original Web sources." Using Nexcerpt, users
may also select and annotate excerpts, contribute the
benefit of their expertise, and then e-mail or publish
the enhanced results directly to any audience. You
can add comments and "brand" your work and e-mail to
multiple users, publish to a Web site or intranet,
or send as an XML feed.
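
To make the idea of extracting excerpts "capturing context" a little more concrete, here is a minimal Python sketch of keyword-in-context extraction. It is only an illustration and not Nexcerpt's actual method, which the panel did not detail. The function name extract_excerpts, the window parameter, and the sample text (including the company name "Acme Corp") are all hypothetical.

```python
import re


def extract_excerpts(text, term, window=40):
    """Return snippets of `text` around each case-insensitive match of `term`.

    Hypothetical helper for illustration only. `window` is the number of
    characters of surrounding context kept on each side of a match.
    """
    excerpts = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        excerpts.append(text[start:end].strip())
    return excerpts


# Invented sample text; "Acme Corp" is a placeholder company name.
sample = (
    "Analysts praised Acme Corp for its quarterly results. "
    "On a message board, one poster claimed acme corp would recall a product."
)

for snippet in extract_excerpts(sample, "Acme Corp", window=20):
    print(snippet)
```

A real service would presumably cut excerpts at sentence boundaries and keep a link back to the source page; the fixed character window here is simply the shortest way to show the principle.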
eWatch, an "Internet monitoring service" from PR
Newswire, pioneered this kind of Net monitoring. The
company began operation in 1994 and was later purchased
by PR Newswire. It monitors 8,000 message boards, news
sources, and media sites for rumors, reactions to company
announcements, and more. eWatch continually scans Web
publications and hundreds of thousands of online publications,
Web pages, bulletin boards, and e-mail discussion groups
in over 30 countries, based upon customized search
criteria. The newest version has an in-box feature
that enables users to sort links to articles by subject,
client, date, search term matches, or readership data.
Customers can flag links to share with their clients
and colleagues and, with one click, e-mail individual
article links or an entire page of article links with
highlighted search terms. eWatch also allows customers
to measure the number of potential viewers and possible
Afternoon vendor panelists were Kevin Mann (marketing
strategist, WebFountain project, IBM Research), Dr.
Christian Toelg (director, business development, NEC
Laboratories America, Inc.), and Andrew McKay (senior
vice-president of technical sales, FastSearch.com).
All delivered fascinating presentations about underlying
technologies related to information infrastructures
and retrieval. I described the WebFountain project
with Factiva above. CiteSeer software [http://www.neci.nj.nec.com/homepages/lawrence/citeseer.html] from NEC Laboratories America [http://www.nec-labs.com] is a "scientific digital library system that implements
Autonomous Citation Indexing (ACI)" and is freely available.
ACI automates the construction of citation indexes
(similar to the Science Citation Index) and groups
a collection of citations to a given article, allowing
researchers to easily see what is being said and why
the article was cited. The fact that ACI is completely
automatic means that it requires no manual effort,
which in turn should result in lower cost and wider
availability. CiteSeer locates papers on the Web using
search engines, heuristics, and Web crawling. Other
means of locating papers include the indexing of existing
archives through agreements with publishers and user
submission. Once located, CiteSeer extracts individual
citations. CiteSeer and Thomson ISI [http://www.isinet.com] are collaborating to "create a comprehensive, multidisciplinary
citation index for Web-based scholarly resources, due
out in early 2005." [For additional information, see
the Information Today Inc. Newsbreak, "Thomson ISI
to Track Web-Based Scholarship with NEC's CiteSeer," by
Barbara Quint, March 1, 2004, https://www.infotoday.com/newsbreaks/nb040301-1.shtml.]
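
The grouping step at the heart of citation indexing, collecting differently formatted citations that point to the same article, can be sketched in a few lines of Python. This is a deliberately crude illustration and not CiteSeer's algorithm: it assumes each citation string starts with the first author's surname and contains a four-digit year, and it groups on that pair alone. The function names and the sample strings are hypothetical (the first two are variant renderings of the Hearst paper mentioned earlier; the third is a made-up placeholder).

```python
import re
from collections import defaultdict


def citation_key(citation):
    """Build a crude grouping key: (first word lowercased, first 4-digit year).

    Assumes the citation begins with the first author's surname.
    Returns None when either part cannot be found.
    """
    surname = re.match(r"\s*([A-Za-z'-]+)", citation)
    year = re.search(r"\b(?:19|20)\d{2}\b", citation)
    if not surname or not year:
        return None
    return (surname.group(1).lower(), year.group(0))


def group_citations(citations):
    """Group citation strings that share the same (surname, year) key."""
    groups = defaultdict(list)
    for citation in citations:
        groups[citation_key(citation)].append(citation)
    return dict(groups)


# Invented sample strings for illustration.
samples = [
    "Hearst, M. A. (1999). Untangling Text Data Mining. Proc. ACL.",
    "Hearst M. Untangling text data mining. In ACL, 1999.",
    "Doe, J. (2001). A Placeholder Paper.",
]

groups = group_citations(samples)
for key, members in groups.items():
    print(key, len(members))
```

Matching on surname and year alone would wrongly merge two different papers by the same author in the same year, so a production system would also need to compare titles and tolerate typos. The sketch is meant only to show why automating the grouping removes the manual effort described above.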
FastSearch.com develops and licenses Fast ESP, an
enterprise search platform. For an excellent
overview and description, see the IDC White Paper "A
360 Degree View of Enterprise Information" [http://fastsearch.com/us/verticals/more],
by Susan Feldman and Chris Sherman, April 2004. Fast
ESP incorporates the following:
Search and query
Data and text mining
Exploration and static reporting
Content repositories and
What Does It All Mean to Info Pros?
At the end of the day, a Reactors Panel, "What Does
All This Mean?" composed of Barbie Keiser, Randy Marcinko,
Karin Borchert of Factiva, and yours truly, discussed
what this means to us. New technology and tools undoubtedly
signify the coming of a new era and change for information
professionals. Ideally, these technologies will create
cost- and time-effective products that revolutionize
the way we search and gather information. Although
much of this technology is currently geared to the
enterprise, we are already seeing affordable and pioneering
products, such as Nexcerpt, reach the broader markets.
The divide between free and fee may narrow
for some applications. At the same time, corporate
environments will acquire better tools for storing
and retrieving information that promote greater overall
organizational success. Specialization is leading to
partnering, exemplified by the Factiva and WebFountain
collaboration, as well as that of CiteSeer and Thomson ISI.
At the high end, the products promise to include
answers to questions the requester did not even know
to ask, but doubts still remain on how we will judge
the truth and relevance of the answers. Remembering
past experiences with other new and emerging technologies,
I've often been first dazzled, then disappointed. The
brainpower displayed at the SCOUG meeting and the cooperative
endeavors reflected by speakers, however, are rays
of hope, as traditional companies such as Factiva,
IBM, NEC, and Thomson fuse resources and technology
and new-age companies such as FastSearch.com and Nexcerpt
construct novel and advanced products. We can look
to them to offer working tools that provide superior
information and knowledge, increase productivity, and
Exhibitors, listed below, included many of
the same companies always present at online shows.
Each displayed a new face with fresh products
or services. Altogether, the SCOUG meeting presented
many exciting possibilities for the future of
our profession and I was fortunate to have the
opportunity to attend and participate.
Advanced Information Management
Association of Independent Information Professionals
H. W. Wilson
When You're Hot, You're Hot
In the weeks following the workshop, it seemed
like I could see reputation monitors and text-mining
applications springing up everywhere. Perhaps
the session had just made me more aware. Or could
it be that the workshop planners had truly reached
the "bleeding edge"?
Intelliseek launches BlogPulse [http://www.blogpulse.com],
a free monitoring tool to track trends,
names, phrases, links, etc., within the network
of key blogs.
(For a description, read the press release,
Outsell Inc. announced a "BrainGains" meeting
in San Mateo on May 27th devoted to "Text Mining
and Visualization Tools: Science or Science Fiction"
Inxight sponsors a KMWorld live
e-Broadcast event on May 18th on how to "discover the true value of information," devoted
to understanding the power of text mining and data visualization for event
analysis and discovery issues.
ComputerWorld runs a major article on
reputation monitors by Alan Earls in the April
5th issue: "Winning the Name Game" and "Sidebar:
Reputations Caught in the Web."
websitemgmt/story/0,10801,91840,00.html and http://www.computerworld.com/softwaretopics/