Search Strategies for Large Document Searching
Volume 40, Number 6 - November/December 2016

Forgotten Indexing

Some sections of Google’s own content are not indexed in Google’s “All” results. For example, years ago, Google Groups included 800 million messages from Usenet (a global discussion network), with some going back as far as 1981. Usenet was known for helping news get out from China during the 1989 Tiananmen Square massacre, among other things. Google even includes that post in its lists of “Memorable Usenet moments.”

You can click on that link to explore the content of that historic message, but then try searching a phrase from that post. For example, a Google web search on "repressive communist regimes these people" finds no results as a phrase search. Trying the same search directly in Google Groups now finds that post, and yet a few days previously, it did not.

Some Google Groups content can be found with a Google web search. For more recent posts, adding a site: restriction for Google Groups to the other search terms can help limit results to those discussions, allowing the searcher to use Google’s search tools and advanced filters. Because Google Groups no longer has an advanced search and does not offer search tools, something like date limiting must be done in a regular Google search. Unfortunately, it seems that many of the posts from 1981–1989 have not been indexed by the Google crawler and are available only through Groups. You can sort by date in Groups, but you then have to scroll a long way down to reach the older content.

Just out of curiosity, I then tried the search in Bing. Surprisingly, even though Google could not find some of its own content, Bing did. However, for Bing searches, remember to put a + in front of phrase searches to try to force exact phrase matching. Otherwise, Bing will claim many results for a search such as "repressive communist regimes, these people"; adding the plus to make it + "repressive communist regimes, these people" finds just that one post.
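The plus-prefixed phrase search described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the Bing search address is an assumption rather than something taken from this column, and the plus sign is attached directly to the opening quote, which is also an assumption about how Bing expects it.

```python
from urllib.parse import quote_plus

# Assumed search address; not taken from the column.
BING_SEARCH = "https://www.bing.com/search"


def as_phrase(text):
    """Wrap text in double quotes so it is sent as a phrase search."""
    return '"' + text.strip().strip('"') + '"'


def bing_exact_phrase(text):
    """Prefix the quoted phrase with + to ask for exact phrase matching."""
    return "+" + as_phrase(text)


def search_url(base, query):
    """Build a search URL, percent-encoding the plus sign and the quotes."""
    return base + "?q=" + quote_plus(query)


query = bing_exact_phrase("repressive communist regimes, these people")
print(query)
print(search_url(BING_SEARCH, query))
```

The plus sign has to be percent-encoded in the URL. Left unencoded, a + in a query string is normally read as a space, and the operator would be lost.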

Similar to the forgotten indexing of some of the older Groups content, I still occasionally find some content only if I go directly into Google Books or Google News. At least with Books and News, Google (mostly) makes it easy for the searcher to transfer a search from a Google web results page (labeled as All) to results from Books or News.

See a “no results” statement from a Google search? Just click on the Books or News database in the top bar (or possibly hiding under More) to transfer the results to those other databases. The transfer is only “mostly” easy because if Google All finds no results for a phrase search, when transferring that search to Google Books, the quotes for the phrase search are stripped out. Searchers must add them back in to try the phrase search.
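For searchers who work from saved or bookmarked search URLs, putting the stripped quotes back can be sketched as follows. This is a hedged illustration, not a description of how Google builds its links: the function name is hypothetical, and both the q parameter and the tbm=bks value in the sample URL are assumptions about the address format.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def requote_query(url, param="q"):
    """Return the URL with the search parameter wrapped in double quotes.

    A query that is already quoted from start to end is left unchanged.
    """
    parts = urlsplit(url)
    pairs = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        already_quoted = value.startswith('"') and value.endswith('"')
        if key == param and not already_quoted:
            # Drop any stray quotes, then wrap the whole query as one phrase.
            value = '"' + value.replace('"', "") + '"'
        pairs.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Assumed shape of a transferred Books search with the quotes stripped out.
stripped = "https://www.google.com/search?tbm=bks&q=not+a+syllable+passed+aloud"
print(requote_query(stripped))
```

Running the function on its own output returns the same URL, so it is safe to apply to a search that kept its quotes.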

Book Issues

Re-creating a search I had done 5 years ago, for a niche publisher with a unique name, still does not find the Google Books reference that turned out to be helpful. Barocuckoo Bassoon Publications shows up in WorldCat with a few records. Searching barocuckoo in an “All” Google search finds only a mention in a compilation of saved tweets that then points to a tweet from one of my talks. (That actual tweet is not shown by Google, although it can be found by searching Twitter directly.) Searching barocuckoo in Google Books finds an old address for the publisher (and the name of the individual) in a book from 1970. Although available only in snippet view, the result had the needed information yet would not be found from a regular Google search.

Book searching produces some other search oddities. Especially for well-known, old, out-of-copyright works, many versions can be found online, and in many different file formats. A passage from a Project Gutenberg text, for example, may also be available at Gutenberg’s many mirror sites and is likely to have been republished elsewhere.

Search a phrase from a popular work such as Jane Austen’s Sense and Sensibility and many results should be found. However, testing this at Google with "not a syllable passed aloud" from near the end of the novel, Google at first reports only 10 results. Nine are from Google Books and none from Project Gutenberg.

To get more diverse results, a searcher must scroll to the note at the bottom of the results that mentions “we have omitted some entries” and then click the link to include the omitted results. However, while Google said it then had about 772 results to display, I could only see 28, none of which included Project Gutenberg or its mirrors, even though all include a digital version of the book with that phrase. To get Google to admit it had indexed Gutenberg, I had to search adding the site: operator:

"not a syllable passed aloud"

This succeeded in getting results from the actual Project Gutenberg site. To try to get more of the mirror records, I then searched "not a syllable passed aloud" inurl:Gutenberg, which did find results from some, but not all, of the mirrors.
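The three query variants used in this test (the bare phrase, the phrase with a site: limit, and the phrase with an inurl: limit) can be generated with a short Python sketch. The helper names are hypothetical, and gutenberg.org is an assumed value for the site: operator, since the exact domain is not shown in the search above.

```python
def as_phrase(text):
    """Wrap text in double quotes so it is sent as a phrase search."""
    return '"' + text.strip().strip('"') + '"'


def restrict(text, operator, value):
    """Append a site: or inurl: restriction to a phrase search."""
    if operator not in ("site", "inurl"):
        raise ValueError("unsupported operator: " + operator)
    return as_phrase(text) + " " + operator + ":" + value


PHRASE = "not a syllable passed aloud"

queries = [
    as_phrase(PHRASE),                          # plain phrase search
    restrict(PHRASE, "site", "gutenberg.org"),  # assumed domain for the main site
    restrict(PHRASE, "inurl", "Gutenberg"),     # mirrors with Gutenberg in the URL
]

for q in queries:
    print(q)
```

Keeping the variants in one list makes it easy to run each in turn and compare which sources appear in which results set.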

Running similar tests with other, less-well-known Gutenberg titles, I came across other strange results. Sometimes one mirror would be displayed in the results set, but not the main Project Gutenberg site. Trying the same technique of adding the site: or inurl: parameters also failed to find the others, even though the mirrors did indeed have the book.

Another missing source is Amazon. While titles of products and even descriptions and reviews can show up in a search engine result, the ability of Amazon to search deeply within books does not translate over to web search engines. Thus, when searching for something that might be buried deep within book content, in addition to web searches, Google Books, and any local searchable commercial ebook collections, also remember to try Amazon Books.

The search does not usually work in a regular Amazon search, but if you first limit to books and try again, you will have better luck. Searching directly in Amazon Books for the Austen quote found other books that mention the Sense and Sensibility phrase, while a search for "little Erard piano was carried each day" finds various reprints of Owen Wister’s The Jimmyjohn Boss, and Other Stories.

Why does Google not display all the known sources for quotes from the freely accessible Project Gutenberg or other large collections such as Amazon? With so many copies of large, book-length documents, my guess is that Google is trying to limit the number of duplicates retrieved and to provide the searcher with a smaller variety of choices to make the results seem less overwhelming.

It is certainly not just Google that has such inconsistencies. I frequently encounter similar problems in commercial library databases, locally developed digital collections, discovery systems, other search engines, and even on retail shopping sites. Actively managed search systems may see frequent tweaking of results and of search capabilities in an attempt to give ever more relevant results. Sometimes the efforts backfire, and search results get worse. When searching the web for large documents and for text within big files, knowing some of the idiosyncrasies of how they are indexed can help expert searchers track these documents down more reliably.

