on the net
Freshness Issue and Complexities with Web Search Engines
ONLINE, November 2001
...no algorithm can predict perfectly which pages will have new content today and which will remain static into perpetuity.
Pity the poor search engines. They crawl this seething, bubbling maelstrom we call the Web, indexing the text from hundreds of millions of pages, all of which can change at a moment's notice. For the past few years, most search engines claimed to refresh their entire database once a month or so. Yet older records in their databases showed that the actual refresh interval was often longer than claimed.
Some users still think that when they enter a query into a search engine's search box, the engine goes out and searches the live Web rather than an older database built by crawling the Web over some range of dates in the past. Given the technology the search engines use, they will always give searchers a picture of the Web of the past. While much of that snapshot will still exist on the Web of today, other portions will be dead, changed, or inaccessible.
Understanding the currentness (or freshness) patterns of the search engines' databases helps the searcher to better understand the results and what might not be found. Several recent search engine initiatives are changing the freshness of their databases, but the impact on searchers varies.
Why is the freshness of the search engine important? Certainly, many Web sites have posted content on pages that do not change, may never be updated, and may always stay in the same location at the exact same URL. It might be surprising to know exactly how many pages are like this. For these kinds of pages, a single crawl of their content should be sufficient.
Yet plenty of other Web sites make significant changes to their content on a frequent basis. New pages are added. Old pages are moved to new URLs. New content is added to old pages. New sites spring up while others disappear. The longer the time lag since the search engine last crawled, the less of this new content will be searchable. In addition, older content that was crawled last time but is no longer available will still be found by the search engine, resulting in dead links, pages that no longer match the query, or other errors.
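The effect of crawl lag described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`fingerprint`, `compare_snapshot`), the URLs, and the page text are all invented here, and real search engines compare far larger collections by other means.

```python
import hashlib


def fingerprint(text):
    """Return a content hash used to detect changed pages."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compare_snapshot(indexed, live):
    """Classify URLs by comparing an old crawl with the live Web.

    Both arguments map URL -> page text. The result maps each
    category name to a sorted list of URLs.
    """
    report = {"unchanged": [], "changed": [], "dead": [], "unindexed": []}
    for url, old_text in indexed.items():
        if url not in live:
            report["dead"].append(url)        # index points at a missing page
        elif fingerprint(old_text) != fingerprint(live[url]):
            report["changed"].append(url)     # index holds stale text
        else:
            report["unchanged"].append(url)
    for url in live:
        if url not in indexed:
            report["unindexed"].append(url)   # new page, not yet searchable
    return {kind: sorted(urls) for kind, urls in report.items()}


# Hypothetical data: these URLs and texts are invented for illustration.
indexed = {
    "http://example.com/about": "About us",
    "http://example.com/news": "Headlines for last month",
    "http://example.com/old": "Retired page",
}
live = {
    "http://example.com/about": "About us",
    "http://example.com/news": "Headlines for today",
    "http://example.com/new": "Brand new page",
}
report = compare_snapshot(indexed, live)
print(report)
```

The longer the gap between the two snapshots, the more URLs would be expected to move out of the "unchanged" list and into the other three.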
Consequently, the fresher the search engine database, the fewer errors, misdirections, and dead pages will be included in the results. And there will be more new pages, updated sites, and revised content that will be searchable and retrievable. For most searches, and for most searchers, the fresher and more up-to-date search engines will provide better and more useful results.
One way in which the search engines are beginning to address the issue of freshness is through their paid inclusion programs. For many years, Webmasters of large sites have observed that when the search engine spiders visit their sites, they do not crawl every available page. The search engines trained their spiders to find new sites and unindexed domains first rather than crawl every single page on one site. This is frustrating for Webmasters who would like to have all of their pages indexed and accessible via the search engines.
Several search engines, which are also seeking to become profitable companies, have recently introduced paid inclusion programs that help both the bottom line and the freshness and completeness of their databases. At AltaVista, Inktomi, and Fast, Web sites can pay a fee to be sure that all the pages on their site are indexed and that they are reindexed more frequently. In some cases, all the pages are refreshed every one or two days; other participants get even more frequent updates for some pages, and still others choose a longer refresh period. The difference is that the Web site can set the refresh rate to match how often its content changes.
It is important to note that participation in a paid inclusion program does not affect the ranking of the pages by the search engine; it just makes sure that they are included and recrawled more frequently.
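A site-chosen refresh rate of the kind described above amounts to a per-URL recrawl schedule. The sketch below is a minimal illustration, not any search engine's actual scheduler: the function `pages_due`, the URLs, and the intervals are hypothetical, and the 30-day default simply mirrors the "once a month or so" refresh that engines claimed for ordinary pages.

```python
from datetime import datetime, timedelta


def pages_due(last_crawled, refresh_days, now, default_days=30):
    """Return URLs whose refresh interval has elapsed, stalest first.

    last_crawled maps URL -> datetime of the last crawl.
    refresh_days maps URL -> interval in days chosen by the site;
    URLs without an entry fall back to default_days.
    """
    due = []
    for url, crawled_at in last_crawled.items():
        interval = timedelta(days=refresh_days.get(url, default_days))
        if now - crawled_at >= interval:
            due.append((crawled_at, url))
    return [url for _, url in sorted(due)]


# Hypothetical data: URLs, dates, and intervals are invented.
now = datetime(2001, 8, 13)
last_crawled = {
    "http://example.com/daily": datetime(2001, 8, 11),
    "http://example.com/weekly": datetime(2001, 8, 10),
    "http://example.com/unpaid": datetime(2001, 7, 1),
}
refresh_days = {
    "http://example.com/daily": 1,    # site asked for a daily refresh
    "http://example.com/weekly": 7,   # site asked for a weekly refresh
}
due = pages_due(last_crawled, refresh_days, now)
print(due)
```

Note that the schedule only decides when a page is recrawled; nothing in it touches ranking, which matches the separation the programs promise.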
These search engines are developing automated ways of identifying sites with frequently changing content.
Google is indexing some of these sites on a daily basis. Fast announced a nine-day refresh period on a portion of its database that includes such sites. However, even these improvements and well-intentioned efforts fail to truly keep current. Many news sites change and update content so frequently that the search engines will still always have an older version indexed than the one currently on view at the site.
For example, even with daily indexing, Google's cached copies of several news sites were still several days old, while Fast's were about a week old. Either way, today's content is not indexed and available until it has moved off the front page. Thus, a specialty news search engine is still far more effective for searching news on the Web.
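The article does not say how the engines identify frequently changing sites. One simple heuristic, offered here purely as an assumption, is to shorten a page's revisit interval each time a crawl finds changed content and lengthen it each time the page is unchanged. The names `next_interval` and `simulate`, the starting interval, and the bounds are all hypothetical.

```python
def next_interval(current_days, changed, min_days=1.0, max_days=30.0):
    """Halve the revisit interval after a change, double it otherwise.

    The result is clamped to the range [min_days, max_days].
    """
    proposed = current_days / 2 if changed else current_days * 2
    return max(min_days, min(max_days, proposed))


def simulate(observations, start_days=8.0):
    """Apply next_interval over a sequence of crawl outcomes.

    observations is a list of booleans: True if the page had changed
    when revisited. Returns the interval chosen after each visit.
    """
    interval = start_days
    history = []
    for changed in observations:
        interval = next_interval(interval, changed)
        history.append(interval)
    return history


# Hypothetical crawl outcomes for two kinds of site.
news_site = simulate([True, True, True, True])   # changed on every visit
static_site = simulate([False, False, False])    # never changed
print(news_site)
print(static_site)
```

Even under this scheme a page that changes several times a day bottoms out at the minimum interval, so the indexed copy still trails the live one, which is the limitation noted above for news sites.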
The comparison was run on August 13, 2001. On any other day, including today, the results would likely be quite different, so this is just one snapshot in time of the lag time of the search engines. Only Fast, Google, and Inktomi even found all 12 of the sites, and many had duplicate entries for one or more of the sample sites. Google actually had 43 separate URLs for a major media site. All the URLs were redirected to the exact same location, but the dates reported on those 43 ranged from June 22 to August 11.
Running this comparison demonstrated the problems with maintaining a fresh index. All the pages used for the study should have had the August 13 date. Instead, the oldest results found, by search engine, are as follows:
February: Wisenut, Northern Light
March: Teoma, Excite, AltaVista, MSN, HotBot
By contrast, the most recent page dates were:
July 21: Northern Light
However, Google's August 12 hit was from a site in India, so it would probably have been dated August 11 had the site been in a North American time zone. These results show decent freshness for Google, Inktomi (used by MSN and HotBot), and Fast, but each had only a few results that recent. While the oldest and most recent dates give some sense of the range, a rough median gives a better sense of the general date of their database:
Such results show the wide range of dates, with some very old content. While the freshest search engine is likely to change over time and from day to day, these findings reinforce the advice to use more than one search engine, especially when trying to find recent information.
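The three measures used above (oldest date, most recent date, and a rough median) are easy to compute for any one engine once the page dates are collected. The sketch below assumes a hypothetical helper, `freshness_summary`, and uses invented dates; they are not the results of the comparison reported in this article.

```python
from datetime import date


def freshness_summary(page_dates, check_date):
    """Summarize the dates an engine reports for a set of pages.

    Returns the oldest and newest dates, the lower median, and how
    many days the median lags behind check_date.
    """
    ordered = sorted(page_dates)
    middle = ordered[(len(ordered) - 1) // 2]   # lower median, no averaging
    return {
        "oldest": ordered[0],
        "newest": ordered[-1],
        "median": middle,
        "median_lag_days": (check_date - middle).days,
    }


# Hypothetical dates for one engine, NOT data from the study.
sample_dates = [
    date(2001, 3, 5),
    date(2001, 8, 9),
    date(2001, 6, 20),
    date(2001, 5, 14),
    date(2001, 7, 30),
]
summary = freshness_summary(sample_dates, date(2001, 8, 13))
print(summary)
```

Taking the lower median keeps the result an actual observed date rather than an average of two, which is in keeping with a "rough" median.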
The search engines may be getting better at keeping their databases more current, but there are times when a fresher database is not necessarily better. This is especially true at Google, with its cached copies of pages that show what the page contained when it was crawled. If you are looking for an older copy of a page and Google just updated the cached copy yesterday, you will not have access to an older version. If the site was alive last week, and Google has checked several times since then and found that it has been unavailable, will the cache continue to have the old version or will it just be purged?
Yet in most cases, more attention to keeping their databases up-to-date and frequently refreshed will make things easier for searcher and Webmaster alike. Several search engines have major initiatives underway to improve their freshness. In the meantime, freshness will vary, depending on the day and time, which cluster of the search engine's computers your search hits, and the specific pages that have matching information. Just remember that search engines provide a searchable picture of the Web of the past. And maybe it will soon become the more recent past.
Greg R. Notess (firstname.lastname@example.org; http://www.notess.com/) is a Reference Librarian at Montana State University and founder of SearchEngineShowdown.com.
Copyright © 2001, Information Today, Inc. All rights reserved.