Specialized Search Engine FAQs: More Questions, Answers and Issues

Vol. 10 No. 9 — October 2002

• FEATURE •
by Gary D. Price, MLIS • Librarian, Gary Price Library Research and Internet Consulting

Last year in Searcher I wrote an article that provided an overview of the Web search scene as of Fall 2001 ["Web Search Engine FAQs: Questions, Answers, and Issues," October 2001, https://www.infotoday.com/searcher/oct01/price.htm]. I'm thrilled that so many of you read and used the article.

This year it's time to break away from general search tools and talk about specialized search resources. However, in this fast-moving field where information can go out of date so quickly, this article will start a series of discussions scheduled for future issues of Searcher. Each article will cover news and information about many specialized search tools, how to best use them, and background on the people who create them.

General search engines like Google, AllTheWeb, and Teoma remain absolutely necessary, as do fee-based databases and print. Web search, like research in general, is never a one-tool concept. A good researcher considers myriad resources (Web and non-Web) to provide the best results possible.

Just as librarians have specialized reference tools on their reference shelves along with general encyclopedias, newspapers, and periodicals, the same is true here. It's simply the right tool at the right time for the job at hand. Nothing more, nothing less.

Defining the Specialized Search Engine
Here we go again! And defining the specialized search engine becomes even more challenging in the dynamic field of Web searching, not to mention the demanding expectations of anything called "special."

In many cases, though not all, the data and potential answers contained in these databases meet the standards of what we and our patrons care the most about: accuracy, currency, and authoritativeness. Between general Web search engines and specialized tools, including those that allow you to access material on the Invisible or Hidden Web, you could spend all your time defining and explaining and not searching. People love things that fit snugly into categories. Unfortunately, snug categories and Web search don't often go together.

For searchers, key issues such as the value of the content, the ease of access, the authority of the information, and the possibility of more precise and current results are much more important than issues of categories and definitions. However, this article — and those that follow — will categorize "specialized search" in five ways. Of course, the reader must allow us the freedom to change these definitions and categories at will.

Type One. A "chunk" of a general Web engine focusing on a certain subject area or domain, with content often accessible through a specialized interface.

Type Two. A focused or targeted Web crawler that looks for and provides access to material on a specific subject or domain from material found on the open Web.

Type Three. Search engines that provide access to material in general Web directories, such as the Librarians' Index to the Internet [http://www.lii.org] and INFOMINE [http://infomine.ucr.edu]. Often, these databases provide the user with search capabilities not found elsewhere. For example, the LII allows you to search entries by subject using Library of Congress Subject Headings.

Type Four. Search engines that provide access to material for a specific site or tool. In many ways identical to type three. An example? The Internet Movie Database. Yes, many of these pages are now indexed in Google and AllTheWeb, but the advanced search interface direct from the IMDB provides many specialized search options [http://us.imdb.com/list].

Type Five. As a co-author with Chris Sherman of a book on the topic, I have to include the Invisible or Deep Web in the world of specialized search. Again, let's not get bogged down in definitions, but one narrow definition of the Invisible Web consists of material that you cannot access directly from a general search engine. Think of the hundreds of interactive databases out there. Sites where you fill in a form, pull down a menu, and then click-search. Very often, the data that you see then comes from an unformatted database and only exists on a Web page for the time it takes you to look at it. Once you close the page, the page no longer exists.

Why Care?
So why do we need to know about specialized engines? Why waste time with this "stuff" when, in many cases, some or all of this material appears in general search tools? Here are a few of what could be hundreds of reasons.

Reason One. In some cases, using a "specialized" database focusing on a specific topic, type of material, etc., can help lower recall and improve precision. It's that simple. Instead of searching everything, you've begun to limit your search just in the selection of your source. Think of it as picking a OneSearch category in Dialog or a Library in LexisNexis.
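To make the precision and recall trade-off concrete, here is a minimal Python sketch. The document counts are invented purely for illustration (they do not come from any real search engine), and the function names are my own. It only shows how choosing a narrower source can raise precision while giving up some recall.

```python
def precision(retrieved, relevant):
    """Share of the retrieved documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Share of all relevant documents that the search managed to retrieve."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical topic: assume exactly 10 relevant documents exist.
relevant = {f"rel{i}" for i in range(10)}

# Hypothetical general engine: finds 8 of them, buried among 32 irrelevant hits.
general = {f"rel{i}" for i in range(8)} | {f"junk{i}" for i in range(32)}

# Hypothetical specialized engine: finds only 6, but with just 2 irrelevant hits.
specialized = {f"rel{i}" for i in range(6)} | {f"junk{i}" for i in range(2)}

general_precision = precision(general, relevant)          # 8 of 40 hits
general_recall = recall(general, relevant)                # 8 of 10 relevant
specialized_precision = precision(specialized, relevant)  # 6 of 8 hits
specialized_recall = recall(specialized, relevant)        # 6 of 10 relevant

print(f"general:     precision={general_precision:.2f} recall={general_recall:.2f}")
print(f"specialized: precision={specialized_precision:.2f} recall={specialized_recall:.2f}")
```

With these made-up numbers the specialized source returns fewer of the relevant documents overall, but a far larger share of what it returns is useful.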

Remember, some content accessible from specialized engines is also available from general engines like Google or AltaVista or AllTheWeb, but using specialized tools can help limit your search. Of course, in some cases, advanced searchers can turn a general search tool into a "pseudo" specialized tool by limiting to a specific domain, country code, or file format.

Always remember that no search engine contains everything available. Utilizing specialized search tools follows the wise practice of using more than a single resource.

One final point: Using multiple tools can help you identify more potentially useful content. Even if every search engine contained the identical data (which isn't the case), you would still see different results, even using the same search strategy. Why? Each search engine uses a different page-ranking algorithm. Therefore, unless you scrolled through an entire list of results, seeing the same base content presented and organized in a different manner can expand a search's utility.
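A small sketch of that last point, assuming two invented result lists (the URLs and the function name are hypothetical): comparing only the top few results from each engine shows which items you would have missed by stopping at one tool.

```python
def top_n_overlap(results_a, results_b, n=10):
    """Compare the first n results of two ranked lists.

    Returns (shared, only_in_b): URLs both engines place in their top n,
    and URLs that only the second engine places there.
    """
    top_a = results_a[:n]
    top_b = results_b[:n]
    shared = [url for url in top_a if url in top_b]
    only_in_b = [url for url in top_b if url not in top_a]
    return shared, only_in_b


# Hypothetical ranked results for the same query from two different engines.
engine_a = ["site1.example/a", "site2.example/b", "site3.example/c", "site4.example/d"]
engine_b = ["site3.example/c", "site5.example/e", "site1.example/a", "site6.example/f"]

shared, only_in_b = top_n_overlap(engine_a, engine_b, n=3)
print("In both top-3 lists:", shared)
print("Only in engine B's top 3:", only_in_b)
```

Even where the two lists share content, the second engine surfaces an item in its top three that a searcher who looked only at the first engine's top three would not have seen.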

Reason Two. These tools can support the needs of users who do not have the time or interest to learn how to use the general engines at an advanced level. You know who I mean — the "type in two search terms, click search, and hope for the best" searcher. Many users who are not information professionals want a quick, best place to go. Nothing more. For them, starting at a high-precision search engine site may work best.

Reason Three. Specialized tools, particularly in the world of news, can provide rapid re-crawl (the crawler returns in some cases every few minutes), providing new content to the database that you search. General search engines are often weeks behind in crawling and making newly identified content accessible to searchers. When it comes to the Invisible Web, the content is only accessible by using a specialized interface to the material.

Reason Four. Finally, specialized search tools provide extra usability features. For example, the database might provide special ways of limiting the dataset, specific to the information in that database. The same might be true with various sorting options.

These are a few of the reasons why specialized search engines, just like specialized reference books, will continue to be vital. If "everything" were in a single database, how easy would it be to find anything? As Chris Sherman and I say in our book, you wouldn't start searching for a person's telephone number in the Encyclopaedia Britannica.

Getting Started
So where do we start? With the usual suspects? For this first article, let's look at the many specialized resources that the major search engines offer. Google, AllTheWeb, and AltaVista all provide searchers with specialized resources that present specific types of information in a rapid manner.

For example, speaking of phone numbers, you can now find certain U.S. telephone numbers and addresses by entering a person's or business' name, plus some extra info, into the Google search box. However, at the time of writing, Google only offered this service for U.S. numbers and only queried one online telephone directory. In some cases, a specialized interface accessed directly from an online telephone directory could provide more useful answers. We all know that you often have to query several of these sources to find what you want.

Specialized Web Search Resources Offered by Major Web Search Companies
Remember, each general engine crawls the Web in its own manner. Therefore, each database holds material that you cannot access from the others. Besides going directly to the URLs provided, searchers can access some of the tools and specialized interfaces listed here via tabs on the search engine's primary page. However, you'll have to go directly to the resource to access any advanced search capabilities.

AllTheWeb
AllTheWeb from FAST Search and Transfer provides access to several specialized tools. In addition to providing customized interfaces for each of them, AllTheWeb also integrates results from several of these "catalogs" into the results page of a simple Web search.

AllTheWeb FTP Search
http://www.alltheweb.com/?cat=ftp&cs=utf-8&l=any&q=
Advanced interface: http://www.alltheweb.com/advanced?c=ftp&l=any&cs=utf-8
Limits include: Domain, Path, Size
Comment: Search the contents of FTP servers. However, you cannot search keywords found in these files, only file names and types.

AllTheWeb News
http://www.alltheweb.com/?cat=news&cs=utf-8&l=any&q=
Advanced interface: http://www.alltheweb.com/advanced?c=news&l=any&cs=utf-8
Limits include: Domain (e.g., CNN.Com), Time (e.g., only articles indexed in the past 12 hours), and Type of news source (U.S., International, Sports, etc.)
Comment: The AllTheWeb News spider continuously crawls 3,000 news sites from around the globe (English and non-English language content). The database is built from a separate "news only" crawl of the Web. URLs remain in the database for 5 days. I wish ATW would add the ability to limit to news sources from specific states and countries.

AllTheWeb's Multimedia Catalogs
Access to approximately 118 million images, videos, and sound files. Make sure to apprise yourself of any and all copyright issues that might apply to using this material. Material located in these special databases is not directly accessible from AllTheWeb's primary search result sets. ATW uses a specialized crawl of the Web to build them.

Pictures
http://www.alltheweb.com/?cat=img&cs=utf-8&l=any&q=
Advanced interface: http://www.alltheweb.com/advanced?c=img&l=any&cs=utf-8
Limits include: File Format (jpg, gif, bmp), Type (color, b&w, line art), Background (Transparent, Non-Transparent)

Videos
http://www.alltheweb.com/?c=vid&cs=utf-8
Advanced interface: http://www.alltheweb.com/advanced?c=vid&l=any&cs=utf-8
Limits include: Format (Real, QuickTime, AVI, etc.) and Stream/Download
Comment: Most image search tools do NOT find words embedded in the image. The topic is determined by various factors, including image file name, words surrounding the image on the Web page, etc.

MP3 Files
http://www.alltheweb.com/?cat=mp3&cs=utf-8&l=any&q=
Advanced interface: Not Available
Limits: Not Available
Comment: A large amount of MP3 material is not kept on Web servers these days but is accessible on peer-to-peer networks like Kazaa and Morpheus.

AltaVista
In the past few months, AltaVista has begun to work towards returning to its one-time prominence on the Web search scene. It recently announced a faster re-crawl/update rate and the AltaVista Prisma term suggestion tool. Let's hope the improvements continue.

AltaVista News
http://news.altavista.com/default
Advanced interface: Not Available
Limits include: Not Available
Comment: AV News receives its content from Moreover, a well-known news aggregator. New content from over 3,000 sources updates every 15 minutes. Results can be sorted by either relevance or date. Through a pull-down menu, searchers can limit searches to specific news categories. AltaVista's advanced searching syntax will not work with this database. Multiple terms in the search box receive an implied "AND." Links remain in this database for about 2 weeks.

AltaVista Multimedia Search
Access to approximately 118 million images, videos, and sound files. Again, stay on top of all copyright issues before using this material. Most of the material found in these databases is not directly accessible via the primary AV interface. Some of AV's advanced syntax will work with the multimedia engines. In addition, AltaVista has a few paying partners who provide a direct feed into the database. For example, a search of the video database will find new video content from MSNBC.

AltaVista Images
http://www.altavista.com/sites/search/simage
Advanced interface: http://www.altavista.com/sites/search/advimg
Limits include: Type (color, b&w, banners)

AltaVista Audio
http://www.altavista.com/sites/search/saudio
Advanced interface: http://www.altavista.com/sites/search/advaud
Limits include: Format (mp3, wav, etc.) and Stream/Download, Duration (Less or Greater than 1 Minute)

AltaVista Video
http://www.altavista.com/sites/search/svideo
Advanced interface: http://www.altavista.com/sites/search/advvid
Limits include: Format (Avi, Quicktime, MPEG, etc.) and Stream/Download, Duration (Less or Greater than 1 Minute)
Comment: If you want video content from events in the news, use the Advanced Video Search interface and limit your search to only MSNBC material.

Directory
http://dir.altavista.com
Advanced interface: Not Available
Limits include: Not Available
Comment: The Directory can be searched and browsed. Content comes from LookSmart. Some of AV's advanced syntax will work with a directory search.

Google
Along with the well-known primary Google interface [http://www.google.com], the folks at Google provide many specialized tools and interfaces, some of which focus on a specific subject. A few of these specialized tools can be accessed via tabs on the primary search page; others are only accessible through a specific URL.

Google Catalogs (Beta)
http://catalogs.google.com
Advanced interface: http://images.google.com/advanced_image_search?hl=en
Limits include: Type of Merchandise, Date (Current or Older Catalogs)
Comment: Search and access the full text of over 4,500 mail-order catalogs from U.S. companies. Material is browsable via a list of available catalogs. What makes this beta unique is its use of optical character recognition (OCR), the first time we've seen this technology used by a major public search tool. How does it work? Instead of ripping the catalogs apart and reentering all the text on the catalog page, Google scans each page and creates an image file. Then, OCR technology finds keyword search terms embedded in the scanned image files. Not only interesting but fun!

Google Directory
http://directory.google.com
Advanced interface: Not Available
Limits include: Not Available
Comment: The content in the Google Directory comes from the Open Directory Project (ODP), which allows anyone with an interest to "edit" a section of the directory. ODP data is the basis for many Web directories. Here is a list of some of them: http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Sites_Using_ODP_Data/.

The Google Directory enhances the basic data with several features only available from Google. These features include the option to view pages in alphabetical or Google PageRank [http://www.google.com/technology/index.html] order. In addition, content from the Google Directory is incorporated into result sets from the main database. However, searching and browsing the directory itself might still help when you want to build your own collection of sites or study any possible relationship between categories and subcategories. It also allows you to limit your search to within a specific category.

Google Groups
http://groups.google.com
Advanced interface: http://groups.google.com/advanced_group_search?hl=en
Limits include: Date, Subject, Language
Comment: Access to more than 20 years of USENET postings, over 700 million messages in all. Material can also be browsed by beginning with the USENET group name. The content of this database does NOT appear in the primary Google database. While we do not recommend limiting by date for general Web searching, limiting by date with USENET material works well, since each posting receives a date-stamp when posted. Messages become accessible via Google Groups between 1 and 5 hours after posting. You can also post to any of these groups using Google. Free registration is required.

Google Images
http://images.google.com
Advanced interface: http://images.google.com/advanced_image_search?hl=en
Limits include: Site/Domain, Size, File types, Coloration
Comment: Access to over 390,000,000 images. Watch those copyright issues! As with AllTheWeb, the search does NOT find words embedded in the image. The topic is determined by various factors, including image file name, words surrounding the image on the Web page, etc. Note that the images in this database are not collected as Google's spider, named Googlebot, crawls the Web. This primary crawl only accesses HTML material. Instead, in a second crawl, the Google spider retrieves the image URLs that appear on those HTML pages.
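As a rough illustration of how an image can be indexed without reading the picture itself, the sketch below derives search terms from an image's file name and from nearby page text. This is a toy under my own assumptions, not a description of how Google or AllTheWeb actually build their image databases; the function name, the example URL, and the caption are all hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse


def image_index_terms(image_url, surrounding_text):
    """Collect lowercase index terms from the file name and nearby text.

    Nothing inside the image is examined, which is why words that appear
    only within the picture itself would never be found.
    """
    filename = posixpath.basename(urlparse(image_url).path)
    stem, _extension = posixpath.splitext(filename)
    name_terms = re.findall(r"[a-z0-9]+", stem.lower())
    text_terms = re.findall(r"[a-z0-9]+", surrounding_text.lower())
    return set(name_terms) | set(text_terms)


# Hypothetical image URL and the caption text found next to it on the page.
terms = image_index_terms(
    "http://www.example.com/photos/golden_gate-bridge.jpg",
    "Fog rolling over San Francisco Bay",
)
print(sorted(terms))
```

A search for "bridge" or "fog" would match this image, even though neither word was read from the picture.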

Google News
http://news.google.com
Advanced interface: Not Available
Comment: Rapid re-crawl and refresh of about 100 English-language news sources. This index refreshes about once an hour with new material from a specialized news crawl of the Web. URLs archive for 1 week and then disappear from the news database. Although no limits are available, searchers can limit to a specific news site (if available) by using Google's syntax and entering site:<domain>. You can limit to words in the title or headline by using the syntax intitle:<search terms>.
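The site: and intitle: syntax can be combined in a single query. The sketch below assembles such a query string and URL-encodes it. The helper name and the example terms are my own, and I am assuming that intitle: should be repeated in front of each title word; how the encoded string is attached to a particular search URL is deliberately left out.

```python
from urllib.parse import quote_plus


def build_query(terms, site=None, title_words=()):
    """Assemble a query string using the site: and intitle: operators.

    Assumption: intitle: is placed before each title word individually.
    """
    parts = [terms] if terms else []
    parts += [f"intitle:{word}" for word in title_words]
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


# Hypothetical search: stories from one news site with "senate" in the headline.
query = build_query("budget", site="cnn.com", title_words=["senate"])
encoded = quote_plus(query)  # form suitable for use in a URL query parameter

print(query)
print(encoded)
```

The same helper can turn a general engine into the "pseudo" specialized tool described earlier, simply by supplying a domain to the site argument.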

Note: By the time you read this article, Google News may have changed significantly from the description above. We hear that it may add more news sources, speed up its refresh rate (new material indexed about once every 15 minutes), and provide a new user interface.

In addition to these special engines, Google provides access to several "restricted" specialized tools that search a specific "chunk" of the Google database devoted to a domain or topic.

Uncle Sam
http://www.google.com/unclesam
Limits your search to .gov and .mil sites, plus some state material.

Google University Search
http://www.google.com/options/universities.html
"...narrow your search to a specific school Web site." Restricted searches to Web domains for hundreds of universities are available.

Beyond Domains
The following specialized engines do not restrict to a specific domain but use restrictions based on word lists and other factors. For example, the Apple Macintosh search looks for Apple trademarked terms, other Apple Computer mentions, etc. on a Web page. If the page meets these and other criteria, it's added to the restricted search database.

Apple Macintosh
http://www.google.com/mac.html
"Search for all things Mac."

Microsoft Search
http://www.google.com/microsoft.html
"Search Microsoft-related sites using Google."

BSD Unix
http://www.google.com/bsd
"Search Web pages about the BSD operating system."

Linux
http://www.google.com/linux
"Search all penguin-friendly [Linux related] pages."

Teoma
Teoma [http://www.teoma.com], part of AskJeeves.Com, does not offer any specialized search databases and interfaces like the others mentioned in the column. However, because of Teoma's unique page ranking algorithm, I consider it worthy of mention. Unlike other Web search products, Teoma's relevancy-ranking algorithm includes a measure of a given Web page versus similar pages on the SAME topic. Teoma bills this as "Subject-Specific Popularity."

In a nutshell, instead of ranking potentially useful pages against all other pages in the database, Teoma ranks results against other pages in the same "community" of pages. These "communities" are built dynamically, in real time.
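The difference between ranking against the whole database and ranking within a topical "community" can be shown with a toy example. To be clear, this is not Teoma's algorithm, which is not described here in any detail; it is a simplified sketch that assumes popularity means a count of inbound links, and the pages, links, and function name are invented.

```python
def rank_by_inlinks(candidates, links, voters=None):
    """Rank candidate pages by inbound link count (ties broken by name).

    If voters is given, only links coming from those pages are counted,
    which mimics judging a page by its own topical community.
    """
    counts = {page: 0 for page in candidates}
    for source, target in links:
        if target in counts and (voters is None or source in voters):
            counts[target] += 1
    return sorted(counts, key=lambda page: (-counts[page], page))


# Hypothetical link graph as (source page, target page) pairs.
links = [
    ("jazz1", "jazz2"), ("jazz3", "jazz2"), ("jazz1", "jazz3"),
    ("news1", "portal"), ("news2", "portal"), ("shop1", "portal"),
    ("jazz1", "portal"),
]

candidates = ["jazz2", "jazz3", "portal"]            # pages matching a jazz query
community = {"jazz1", "jazz2", "jazz3", "portal"}    # pages on the same topic

global_rank = rank_by_inlinks(candidates, links)
community_rank = rank_by_inlinks(candidates, links, voters=community)

print("Against all pages:    ", global_rank)
print("Within the community: ", community_rank)
```

In this made-up graph the general-interest portal wins when every link counts, but the page that other jazz pages point to moves to the top once only links from the same community are considered.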

I've also found Teoma's "Resources" feature useful when trying to locate pages with large numbers of quality links on a specific subject. Finally, I suggest you start with a broad search and then use Teoma's "Refine" feature to help focus your search. Teoma continues to be a product worthy of close attention. By the way, AskJeeves results labeled "You may find my search results helpful" use Teoma's page ranking algorithm.

"NEXT"
We've only scratched the surface. Stay tuned.

The Price List:
Five Reasons to Consider Specialized Web Search Engines
1. Instead of searching the entire Web, you search a smaller portion of Web space limited to a specific subject, domain, or format. This should increase precision and lower recall.
2. Specialized engines can be worthwhile tools to share with less sophisticated searchers or those who do not have the time to learn to use advanced searching options.
3. In some cases, the content updates more frequently than general engines.
4. Specialized engines often provide access to content not crawled by general search engines. Remember, no search engine is complete. Using specialized tools helps reinforce the need to use more than a single search engine.
5. Specialized engines drill down to material not directly accessible via a general search engine, material commonly referred to as the Invisible Web.

Gary Price's email address is: gary@freepint.com.
