Last year in Searcher I wrote an article that provided an overview of the
Web search scene as of Fall 2001 ["Web Search Engine FAQs: Questions, Answers,
and Issues," October 2001, http://www.infotoday.com/searcher/oct01/price.htm].
I'm thrilled that so many of you read and used the article.
This year it's
time to break away from general search tools and talk about specialized
search resources. However, in this fast-moving field where information
can go out of date so quickly, this article will start a series of discussions
scheduled for future issues of Searcher. Each article will cover
news and information about many specialized search tools, how to best use
them, and background on the people who create them.
General search engines like Google, AllTheWeb, and Teoma remain absolutely necessary,
as do fee-based databases and print. Web search, like research in general,
is never a one-tool concept. A good researcher considers myriad resources
(Web and non-Web) to provide the best results possible.
Just as librarians
have specialized reference tools on their reference shelves along with
general encyclopedias, newspapers, and periodicals, the same is true here.
It's simply the right tool at the right time for the job at hand. Nothing
more, nothing less.
Defining the Specialized Search
Here we go again!
And defining the specialized search engine becomes even more challenging
in the dynamic field of Web searching, not to mention the demanding expectations
of anything called "special."
In many cases,
though not all, the data and potential answers contained in these databases
meet the standards of what we and our patrons care the most about — accuracy,
currency, authoritativeness. Between general Web search engines and specialized
tools, including those that allow you to access material on the Invisible
or Hidden Web, you could spend all your time defining and explaining and
not searching. People love things that fit snugly into categories. Unfortunately,
snug categories and Web search don't often go together.
Key issues such as the value of the content, the ease of access, the authority
of the information, and the possibility of more precise and current results
are much more important than issues of categories and definitions. However,
this article — and those that follow — will categorize "specialized search"
in five ways. Of course, the reader must allow us the freedom to change
these definitions and categories at will.
Type One. A
"chunk" of a general Web engine focusing on a certain subject area or domain,
with content often accessible through a specialized interface.
Type Two. A
focused or targeted Web crawler that looks for and provides access to material
on a specific subject or domain from material found on the open Web.
Type Three. Search engines that provide access to material in general Web directories,
such as the Librarians' Index to the Internet [http://www.lii.org]
and INFOMINE [http://infomine.ucr.edu].
Often, these databases provide the user with search capabilities not found
elsewhere. For example, the LII allows you to search entries by subject
using Library of Congress Subject Headings.
Type Four. Search
engines that provide access to material for a specific site or tool. In
many ways identical to Type Three. An example? The Internet Movie Database.
Yes, many of these pages are now indexed in Google and AllTheWeb, but the
advanced search interface direct from the IMDB provides many specialized
search options [http://us.imdb.com/list].
Type Five. As a co-author with Chris Sherman of a book on the topic, I have to include
the Invisible or Deep Web in the world of specialized search. Again, let's
not get bogged down in definitions, but one narrow definition of the Invisible
Web consists of material that you cannot access directly from a general
search engine. Think of the hundreds of interactive databases out there.
Sites where you fill in a form, pull down a menu, and then click-search.
Very often, the data that you see then comes from an unformatted database
and only exists on a Web page for the time it takes you to look at it.
Once you close the page, the page no longer exists.
So why do we need
to know about specialized engines? Why waste time with this "stuff" when,
in many cases, some or all of this material appears in general search tools?
Here are a few of what could be hundreds of reasons.
In some cases, using a "specialized" database focusing on a specific topic,
type of material, etc., can help lower recall and improve precision. It's
that simple. Instead of searching everything, you've begun to limit
your search just in the selection of your source. Think of it as picking
a OneSearch category in Dialog or a Library in LexisNexis.
Much of the content accessible from specialized engines is also available from general
engines like Google or AltaVista or AllTheWeb, but using specialized tools
can help limit your search. Of course, in some cases, advanced searchers
can turn a general search tool into a "pseudo" specialized tool by limiting
to a specific domain, country code, or file format.
Remember that no search engine contains everything available. Utilizing specialized
search tools follows the wise practice of using more than a single resource.
One final point:
Using multiple tools can help you identify more potentially useful content.
Even if every search engine contained the identical data (which isn't the
case), you would still see different results, even using the same search
strategy. Why? Each search engine uses a different page-ranking algorithm.
Therefore, unless you scrolled through an entire list of results, seeing
the same base content presented and organized in a different manner can
expand a search's utility.
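
To make the ranking point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the three-page "index," the URLs, the inlink counts, and both scoring rules are made up for illustration and do not describe how any real engine ranks pages. It only shows that identical content, scored two different ways, comes back in a different order.

```python
# Hypothetical three-page index shared by both "engines".
PAGES = {
    "a.example/guide": {"text": "search engine guide to search tools", "inlinks": 2},
    "b.example/news": {"text": "search news", "inlinks": 9},
    "c.example/faq": {"text": "engine faq search search search", "inlinks": 5},
}


def rank_by_term_count(query, pages):
    """Order matching pages by how often the query term occurs in the text."""
    hits = [(url, page["text"].split().count(query)) for url, page in pages.items()]
    hits = [(url, count) for url, count in hits if count > 0]
    return [url for url, _ in sorted(hits, key=lambda h: (-h[1], h[0]))]


def rank_by_inlinks(query, pages):
    """Order the same matching pages by a simple link-popularity count."""
    hits = [
        (url, page["inlinks"])
        for url, page in pages.items()
        if query in page["text"].split()
    ]
    return [url for url, _ in sorted(hits, key=lambda h: (-h[1], h[0]))]


if __name__ == "__main__":
    print(rank_by_term_count("search", PAGES))
    print(rank_by_inlinks("search", PAGES))
```

Both functions return the same three pages, but a searcher who reads only the top of each list sees a different page first.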
These tools can support the needs of users who do not have the time or
interest to learn how to use the general engines at an advanced level.
You know who I mean — the "type in two search terms, click search, and
hope for the best" searcher. Many users who are not information professionals
want a quick, best place to go. Nothing more. For them, starting at a high-precision
search engine site may work best.
Specialized tools, particularly in the world of news, can provide rapid
re-crawl (the crawler returns in some cases every few minutes), providing
new content to the database that you search. General search engines are
often weeks behind in crawling and making newly identified content accessible
to searchers. When it comes to the Invisible Web, the content is only accessible
by using a specialized interface to the material.
Finally, specialized search tools provide extra usability features. For
example, the database might provide special ways of limiting the dataset,
specific to the information in that database. The same might be true with
various sorting options.
These are a
few of the reasons why specialized search engines, just like specialized
reference books, will continue to be vital. If "everything" were in
a single database, how easy would it be to find anything? As Chris Sherman
and I say in our book, you wouldn't start searching for a person's telephone
number in the Encyclopaedia Britannica.
So where do we
start? With the usual suspects? For this first article, let's look at the
many specialized resources that the major search engines offer. Google,
AllTheWeb, and AltaVista all provide searchers with specialized resources
that present specific types of information in a rapid manner.
For example, speaking
of phone numbers, you can now find certain U.S. telephone numbers and addresses
by entering a person's or business' name, plus some extra info, into the
Google search box. However, at the time of writing, Google only offered
this service for U.S. numbers and only queried one online telephone directory.
In some cases, a specialized interface accessed directly from an online
telephone directory could provide more useful answers. We all know that
you often have to query several of these sources to find what you want.
Specialized Web Search Resources Offered by Major Web Search Companies
Each general engine crawls the Web in its own manner. Therefore, each database
holds material that you cannot access from the others. Besides going directly
to the URLs provided, searchers can access some of the tools and specialized
interfaces listed here via tabs on the search engine's primary page. However,
you'll have to go directly to the resource to access any advanced search
FAST Search and Transfer provides access to several specialized tools.
In addition to providing customized interfaces for each of them, AllTheWeb
also integrates results from several of these "catalogs" into the results
page of a simple Web search.
Limits: Domain, Path, Size
Comment: Searches the contents of FTP servers. However, you cannot search keywords found
in these files, only file names and types.
Limits: Domain (e.g., CNN.Com), Time (e.g., only articles indexed in the past 12
hours), and Type of news source (U.S., International, Sports, etc.)
Comment: The AllTheWeb
News spider crawls 3,000 news sites from around the globe (English and non-English
language content) continuously. The database is built from a separate "news
only" crawl of the Web. URLs remain in the database for 5 days. I wish ATW would
add the ability to limit to news sources from specific states and countries.
Access to approximately
118 million images, videos, and sound files. Make sure to apprise yourself
of any and all copyright issues that might apply to using this material.
Material located in these special databases is not directly accessible
from AllTheWeb's primary search result sets. ATW uses a specialized crawl
of the Web to build them.
Limits: File Format (jpg, gif, bmp), Type (color, b&w, line art), Background
Limits: Format (Real, QuickTime, AVI, etc.) and Stream/Download
Comment: Image search tools do NOT find words embedded in the image. The topic is
determined by various factors, including image file name, words surrounding
the image on the Web page, etc.
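
As a rough illustration of that last point, the Python sketch below derives index terms for an image from its file name and the words around it on the page. The function name, the example URL, and the simple "letters only" tokenizing are my own assumptions for illustration, not a description of how AllTheWeb or any other engine actually builds its image index.

```python
import re
from urllib.parse import urlparse


def image_index_terms(image_url, surrounding_text):
    """Collect candidate index terms for an image without looking at its pixels."""
    # Take the file name from the URL path and drop the extension.
    filename = urlparse(image_url).path.rsplit("/", 1)[-1]
    stem = filename.rsplit(".", 1)[0]
    # Split the file name and the nearby page text into lowercase words.
    name_terms = re.findall(r"[a-z]+", stem.lower())
    text_terms = re.findall(r"[a-z]+", surrounding_text.lower())
    return sorted(set(name_terms) | set(text_terms))


if __name__ == "__main__":
    print(image_index_terms(
        "http://www.example.com/img/golden_gate-bridge01.jpg",
        "Fog over the bridge",
    ))
```

A word that appears only inside the picture itself, such as lettering on a sign, never reaches the term list, which is why a search for that word will not find the image.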
Limits: Not Available
Comment: A large
amount of MP3 material is not kept on Web servers these days but is accessible
on peer-to-peer networks like Kazaa and Morpheus.
In the past few
months, AltaVista has begun to work towards returning to its one-time prominence
on the Web search scene. It recently announced a faster re-crawl/update
rate and the AltaVista Prisma term suggestion tool. Let's hope the improvements
Comment: AV News
receives its content from Moreover, a well-known news aggregator. New content
from over 3,000 sources updates every 15 minutes. Results can be sorted
by either relevance or date. Through a pull-down menu, searchers can limit
searches to specific news categories. AltaVista's advanced searching syntax
will not work with this database. Multiple terms in the search box receive
an implied "AND." Links remain in this database for about 2 weeks.
Access to approximately
118 million images, videos, and sound files. Again, stay on top of
all copyright issues before using this material. Most of the material found
in these databases is not directly accessible via the primary AV interface.
Some of AV's advanced syntax will work with the multimedia engines. In
addition, AltaVista has a few paying partners who provide a direct feed
into the database. For example, a search of the video database will find
new video content from MSNBC.
Limits: Type (color, b&w, banners)
Limits: Format (mp3, wav, etc.) and Stream/Download, Duration (Less or Greater
than 1 Minute)
Limits: Format (Avi, Quicktime, MPEG, etc.) and Stream/Download, Duration (Less
or Greater than 1 Minute)
Comment: If you
want video content from events in the news, use the Advanced Video Search
interface and limit your search to only MSNBC material.
Comment: The Directory
can be searched and browsed. Content comes from LookSmart. Some of AV's
advanced syntax will work with a directory search.
Along with the
well-known primary Google Interface [http://www.google.com], the folks
at Google provide many specialized tools and interfaces, some of which
focus on a specific subject. A few of these specialized tools can be accessed
via tabs on the primary search page; others are only accessible through
a specific URL.
Limits: Type of Merchandise, Date (Current or Older Catalogs)
Comment: Search and access the full text from over 4,500 mail-order catalogs from U.S.
companies. Material is browsable via a list of available catalogs. What
makes this beta unique is its use of optical character recognition (OCR),
the first time we've seen this technology used by a major public search
tool. How does it work? Instead of ripping the catalogs apart and reentering
all the text on the catalog page, Google scans each page and creates an
image file. Then, OCR technology finds keyword search terms embedded in
the scanned image files. Not only interesting but fun!
Comment: The content
in the Google Directory comes from the Open Directory Project (ODP), which
allows anyone with an interest to "edit" a section of the directory. ODP
data is the basis for many Web directories. Here is a list of some of them:
The Google Directory
enhances the basic data with several features only available from Google.
These features include the option to view pages in alphabetical or Google
PageRank [http://www.google.com/technology/index.html]
order. Finally, content from the Google Directory appears incorporated
into result sets from the main database. However, searching and browsing
the directory itself might still help when you want to build your own collection
of sites or study any possible relationship between categories and subcategories.
It also allows you to limit your search to within a specific category.
Limits: Date, Subject, Language
Comment: Access to over 20 years and 700 million messages and USENET postings. Material
can also be browsed by beginning with the USENET group name. The content
of this database does NOT appear in the primary Google database. While
we do not recommend limiting by date for general Web searching, limiting
by date with USENET material works well, since each posting receives a
date-stamp when posted. Messages are accessible via Google Groups anywhere
from 1 to 5 hours after posting. You can also post to any of these groups
using Google. Free registration is required.
Limits: Site/Domain, Size, File types, Coloration
Comment: Access to over 390,000,000 images. Watch those copyright issues! As
with AllTheWeb, you do NOT find words embedded in the image. The topic
is determined by various factors, including image file name, words surrounding
the image on the Web page, etc. Note that the images in this database are
not collected as Google's spider, named Googlebot, crawls the Web. This
primary crawl only accesses HTML material. Instead, the Google spider retrieves
the URLs to images that appear on the HTML pages in a second crawl.
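
The two-crawl idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class and function names are hypothetical, the "pages" are hard-coded strings rather than anything fetched from the Web, and nothing here reflects Googlebot's actual code. The first pass reads HTML only and writes down where the images live; a second pass would then fetch those URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """First pass: read HTML only and record the URLs of any <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page that holds the image.
                self.image_urls.append(urljoin(self.base_url, src))


def first_pass(pages):
    """Return the de-duplicated queue of image URLs left for a second crawl."""
    queue = []
    for page_url, html in pages.items():
        collector = ImageLinkCollector(page_url)
        collector.feed(html)
        for image_url in collector.image_urls:
            if image_url not in queue:
                queue.append(image_url)
    return queue


if __name__ == "__main__":
    sample_pages = {
        "http://example.com/a/index.html": '<p>Hi</p><img src="pic.gif"><img src="/logo.jpg">',
        "http://example.com/b.html": '<img src="/logo.jpg">',
    }
    print(first_pass(sample_pages))
```

The queue that comes out of the first pass is the work list for the second, image-only crawl.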
re-crawl and refresh of about 100 English-language news sources. This
index refreshes about once an hour with new material from a specialized
news crawl of the Web. URLs archive for 1 week and then disappear from
the news database. Although no limits are available, searchers can limit
to a specific news site (if available) by using Google's syntax and entering
site:<domain>. You can also limit to words in the title or headline by using
the syntax intitle:<search terms>.
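
For searchers who build queries in a script or a bookmark, the two operators combine as shown in this small Python sketch. The helper name and its parameters are hypothetical; only the site: and intitle: syntax comes from the description above, and the same approach works for the domain limits mentioned earlier for general engines.

```python
def build_news_query(terms, site=None, title_terms=None):
    """Assemble a query string using the site: and intitle: operators."""
    parts = list(terms)
    if site:
        # Restrict results to a single news source.
        parts.append(f"site:{site}")
    for word in title_terms or []:
        # Require the word to appear in the title or headline.
        parts.append(f"intitle:{word}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_news_query(["election"], site="cnn.com", title_terms=["senate"]))
```

The resulting string goes into the search box exactly as a hand-typed query would.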
By the time you read this article, Google News may have changed significantly
from the description above. We hear that it may add more news sources,
speed up its refresh rate (new material indexed about once every 15 minutes),
and provide a new user interface.
In addition to
these special engines, Google provides access to several "restricted"
specialized tools that search on a specific "chunk" of the Google database
devoted to a domain or topic.
Limits your search
to .gov and .mil sites, and some state material.
search to a specific school Web site." Restricted searches to Web domains
for hundreds of universities are available.
specialized engines do not restrict to a specific domain but use restrictions
based on word lists and other factors. For example, the Apple Macintosh
search looks for Apple trademarked terms, other Apple Computer mentions,
etc. on a Web page. If the page meets these and other criteria, it's added
to the restricted search database.
"Search for all
sites using Google."
"Search Web pages
about the BSD operating system."
"Search all penguin-friendly
[Linux related] pages."
Teoma, part of AskJeeves.Com, does not offer any specialized search databases
and interfaces like the others mentioned in the column. However, because
of Teoma's unique page ranking algorithm, I consider it worthy of mention.
Unlike other Web search products, Teoma's relevancy-ranking algorithm includes
a measure of a given Web page versus similar pages on the SAME topic. Teoma
bills this as "Subject-Specific Popularity."
In a nutshell,
instead of ranking potentially useful pages against all other pages in
the database, Teoma ranks results against other pages in the same "community"
of pages. These "communities" are built dynamically, in real time.
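
To show the general idea, and only the general idea, here is a toy Python sketch. It is not Teoma's algorithm, which has not been published in this form: the link graph, the page names, and the simple "count the inbound links" scoring are all my own assumptions. The community is also handed to the code as a fixed set, whereas Teoma is described as building its communities dynamically. The sketch ranks the same pages twice, once counting links from every page and once counting only links from pages in the same community.

```python
# Hypothetical link graph: source page -> pages it links to.
LINKS = {
    "jazz1": ["jazz2", "jazz3"],
    "jazz2": ["jazz3"],
    "jazz3": ["jazz2"],
    "portal": ["jazz1", "shop"],
    "blog": ["jazz1", "shop"],
    "shop": ["jazz1"],
}


def inlink_counts(links, allowed_sources):
    """Count inbound links, considering only links that start at allowed pages."""
    counts = {}
    for source, targets in links.items():
        if source not in allowed_sources:
            continue
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def rank(candidates, links, allowed_sources):
    """Sort candidates by inbound links from allowed pages, then by name."""
    counts = inlink_counts(links, allowed_sources)
    return sorted(candidates, key=lambda page: (-counts.get(page, 0), page))


if __name__ == "__main__":
    community = {"jazz1", "jazz2", "jazz3"}  # pages assumed to share one topic
    print(rank(community, LINKS, set(LINKS)))  # links from every page count
    print(rank(community, LINKS, community))   # only same-topic links count
```

In this toy graph, the page that is popular with off-topic sites drops to the bottom once only links from its own community are counted.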
I've also found
Teoma's "Resources" feature useful when trying to locate pages with large
amounts of quality links on a specific subject. Finally, I suggest you
start with a broad search and then use Teoma's "Refine" feature to help
focus your search. Teoma continues to be a product worthy of close attention.
By the way, AskJeeves results labeled "You may find my search results
helpful," use Teoma's page ranking algorithm.
We've only scratched
the surface. Stay tuned.
Five Reasons to Consider Specialized Web Search Engines
1. Instead of searching
the entire Web, you search a smaller portion of Web space limited
to a specific subject, domain, or format. This should increase precision and
2. Specialized engines can be worthwhile tools to share with less sophisticated searchers
or those who do not have the time to learn to use advanced searching options.
3. In some cases,
the content updates more frequently than general engines.
4. Specialized engines often provide access to content not crawled by general search engines.
Remember, no search engine is complete. Using specialized tools helps reinforce
the need to use more than a single search engine.
5. Specialized engines drill down to material not directly accessible via a general search
engine, material commonly referred to as the Invisible Web.