
ONLINE SEARCHER: Information Discovery, Technology, Strategies

Search Strategies for Large Document Searching
By Greg R. Notess
November/December 2016 Issue

Phrase Search Pitfalls

In long documents, and even in shorter ones in certain formats, several issues can cause problems for the standard convention of putting quotes around a phrase to run a phrase search. Consider a PDF in which, for formatting reasons, some words are hyphenated. For example, if the word “international” is hyphenated at the end of a line as ‘inter-national,’ then a phrase search for "international searching authorities" likely will not find that document unless the unbroken phrase also appears elsewhere in it. In addition, older documents that have been scanned and digitized may contain one or many letters that were not properly recognized. Thus, a search on "inter national searching uthorities" at Google finds one PDF from a 1981 WIPO (World Intellectual Property Organization) meeting that contains the hyphenated word along with some inaccurate digitization.
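
To see why a hyphenated line ending can defeat a phrase search, here is a minimal Python sketch. It is only an illustration. The sample text is made up (it is not the WIPO document), the helper names contains_phrase and dehyphenate are hypothetical, and the naive matcher is an assumption about how exact-phrase matching behaves, not a description of how Google indexes PDFs.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse all runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contains_phrase(text: str, phrase: str) -> bool:
    """Naive exact-phrase match (hypothetical helper, not Google's algorithm)."""
    return normalize(phrase) in normalize(text)


def dehyphenate(text: str) -> str:
    """Rejoin words split by a hyphen at a line end.

    Caveat: this also joins genuine compounds that happen to break
    at a line end (e.g. "well-" / "known" becomes "wellknown").
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


# Made-up sample of text extracted from a PDF with a hyphenated line ending.
extracted = "reports from the inter-\nnational searching authorities were reviewed"

phrase = "international searching authorities"
print(contains_phrase(extracted, phrase))               # raw extracted text
print(contains_phrase(dehyphenate(extracted), phrase))  # after rejoining the word
```

In this sketch the raw text does not match the full phrase, while the rejoined text does. A searcher cannot repair the search engine's index, so the practical workaround remains the one described above: search on the fragments as they actually appear.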

Punctuation and numbers can also cause problems. One mystifying example comes from a phrase taken directly from Google’s own cached copy of a webpage. Usually the cache (especially the text-only version) is the basis for Google indexing, and a search phrase extracted from the cache should be able to find at least that one document again. For this test, I grabbed a phrase from near the bottom of an UpToDate article (from our subscription database). I was curious if Google would find illegally shared copies of the article.

Searching a mostly textual phrase found two versions in Google. One redirected back to the Wolters Kluwer UpToDate site, which prompted for authentication to see the whole article, while the other failed to load. However, the second link had a cached copy of the page containing the full text. Even so, when I copied another text string from a line above the first phrase, that phrase search failed to find any results in Google, presumably because of the numbers or punctuation. Here is the exact phrase shown in the cached copy that I searched:

"placebo, was 0.13 (95% CI, 0.04 to 0.38) with the 75-mg daily dose, and 0.32 (95% CI, 0.17 to 0.59)"

Yet that found no results. Trying again and removing punctuation also gave zero results, as did just trying a shorter version.

In comparing phrases taken from books, similar problems emerged with line breaks and page breaks. This short textual phrase that goes across a line break in a PDF (a resource management plan from the National Park Service) finds no results at Google:

"experiences for visitors to the park is to view and photograph"

However, break the longer phrase into two phrases just where the line break occurs in the PDF, and Google can then find the document:

"experiences for visitors"
"to the park is to view and photograph"

Of course, the difficulty for the searcher is how to know—or guess—where line breaks may occur or which numeric or highly punctuated content to skip. If the search is based on a photocopy or screenshot of a page (where someone has a copy and is trying to find the full source), then look for appropriate phrases without a line break. If no copy is available, try multiple versions of the phrase and try breaking it at different points.
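
Trying a phrase broken at different points can be tedious by hand. The following is a small Python sketch that lists every two-part version of a phrase, each part quoted separately. The function name split_phrase_queries and the min_words parameter are hypothetical, and the sketch only builds query strings; it does not contact any search engine.

```python
def split_phrase_queries(phrase: str, min_words: int = 2) -> list[str]:
    """Return two-part quoted queries, one for each possible break point.

    min_words is the smallest number of words allowed on either side
    of the break (an arbitrary choice for this sketch).
    """
    words = phrase.split()
    queries = []
    for i in range(min_words, len(words) - min_words + 1):
        left = " ".join(words[:i])
        right = " ".join(words[i:])
        queries.append(f'"{left}" "{right}"')
    return queries


phrase = "experiences for visitors to the park is to view and photograph"
for query in split_phrase_queries(phrase):
    print(query)
```

One of the generated queries is the split that worked above, "experiences for visitors" followed by "to the park is to view and photograph". The others are candidates to try when the location of the line break is unknown.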

Just remember not to expect consistency in text search results from Google, particularly given the increasing emphasis on RankBrain. In this example, for no logical reason, this shorter dual phrase search did not find the document:

"experiences for visitors" "to the park is to view"

In general, when using phrase searches to find a document, try to find short phrases with unusual words or an unusual combination of words. Not only might lengthier phrase searches fail due to page breaks, hyphenated words, typographical errors, and intervening special formatting, but Google also limits searches to 32 words. Here’s another strategy: Try various phrases, adding or subtracting just a few words, to see if the results change.
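
As a rough aid for producing short candidate phrases, the Python sketch below slides a fixed-size window across a passage and quotes each window. The name phrase_windows and its parameters are hypothetical. The 32-word constant comes from the limit mentioned above, and picking the windows with the most unusual words is still left to the searcher.

```python
MAX_QUERY_WORDS = 32  # Google query limit as stated in the article


def phrase_windows(passage: str, size: int = 5, step: int = 1) -> list[str]:
    """Return quoted phrases of `size` consecutive words taken from passage."""
    if not 1 <= size <= MAX_QUERY_WORDS:
        raise ValueError(f"size must be between 1 and {MAX_QUERY_WORDS}")
    words = passage.split()
    return [
        '"' + " ".join(words[i:i + size]) + '"'
        for i in range(0, len(words) - size + 1, step)
    ]


passage = "experiences for visitors to the park is to view and photograph"
for candidate in phrase_windows(passage, size=5):
    print(candidate)
```

Changing size by one or two words gives the "adding or subtracting just a few words" variations to compare.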



Greg R. Notess (www.notess.com) retired from being reference team leader at Montana State University and founder of SearchEngineShowdown.com.

 

Comments? Contact the editors at editors@onlinesearcher.net
