
Counting Angels on a Pinhead:
Critically Interpreting Web Size Estimates

Michael Dahn

ONLINE, January 2000
Copyright © 2000 Information Today, Inc.


"The size of the Web is 800 million pages, and the biggest search engine only covers about 16% of it."

Please don't say that. Please don't write it. It's incorrect, and as information professionals, we should know better. If you have been tempted to speak or write such a statement (or worse, actually made such a statement), it is probably because you have read the excellent study by Lawrence and Giles in the July 8, 1999, issue of Nature, "Accessibility of Information on the Web" [1], or you have read or heard about that study in the popular media.

If you have merely relied on the reporting of this study by the media, I urge you to read the actual study. If you have read the study--or will read it--I urge you to read it carefully. The study is filled with many qualifying terms and concepts, and these allow for a number of potential interpretation pitfalls--many of which have caused the media to misquote, misinterpret, or generally misreport the findings of the study.

Let's look at what the study by Lawrence and Giles actually tells us:

As of February 1999, the publicly indexable World Wide Web contained about 800 million pages.

As of February 1999, the search engine with the largest index, Northern Light, indexed roughly 16% of the publicly indexable World Wide Web.

As of February 1999, the combined index of 11 large search engines covered (very) roughly 42% of the publicly indexable World Wide Web.

POTENTIAL INTERPRETATION PITFALLS

The study is chock-full of other measurements and estimates, but the preceding conclusions are likely to be quoted (or misquoted) widely by librarians and Internet trainers, and an analysis of them will provide a foundation for a similar analysis for all other such statistics.

In each of the three conclusions, there are three terms/concepts that will jump out at the critical thinker:

  1. "Publicly Indexable"--What does that mean?

  2. "February 1999"--Results are tied to a time period. Are they still accurate today?

  3. "About" and "Roughly"--HmmÉjust how "rough" are these figures?

Publicly Indexable

If we were to rely only on the plain meaning of the words involved in this term, we might say that "indexable" refers to Web pages that are capable of being indexed, whether they are public or private. "Publicly" would mean that the pages are capable of being indexed by the public at large. (For instance, Web pages that have no static, published links to them, and are only accessible from a search form, cannot be indexed by the big "public" search engines using their current methods of indexing. However, these pages are certainly indexable by the organization publishing them. In fact, they probably are indexed if the only access to them is through a search form. These would be "privately" indexable pages.)

This definition, however, is murky. If "indexable" means that it is merely possible to index a Web page, and "publicly" implies that the Web page must be capable of being indexed by the public at large, then almost the entire World Wide Web would be "publicly indexable." This would include pages that are generally not considered publicly indexable, such as pages at password-restricted sites, pages accessible only through a search form, and pages in directories identified by the robots exclusion standard--because it is possible for the public to access these pages, copy them, and then index them. However, nobody does this because it is not practical.

Therefore, a more practical definition of "publicly indexable" might be: Web pages that can, as a practical matter, be indexed by the major public search engines using their current methods of indexing.

Lawrence and Giles defined "publicly indexable" as follows:

The publicly indexable Web excludes pages that are not normally considered for indexing by Web search engines.

They define the publicly indexable Web by what it is not--in this case, all Web pages other than those "not normally considered for indexing" by Web search engines. This definition leaves open the possibility that there are Web pages that are "not normally considered by" but could be indexed by the major public search engines using their current methods. For instance, Kris Carpenter, director of search products and services for Excite, has said that Excite purposely ignores a large part of the Web--not for technological reasons, but for consumer-oriented reasons [2]. Clearly then, there are "publicly indexable" pages (practical definition) that are "not normally considered" by Excite--and these would be excluded from the more narrow definition of the indexable Web by Lawrence and Giles (if the pages were similarly "not normally considered" by other search engines as well).

Perhaps I am splitting hairs. The point is that any of these definitions makes one thing very clear: a large portion of the publicly accessible Web is excluded from the 800 million page estimate in the Lawrence and Giles study. In other words, "publicly indexable" is not synonymous with "publicly accessible."

The following types of Web pages are generally "publicly accessible," but they are generally not "publicly indexable":

Pages Accessible Only Through a Search Form--Since search engine robots use hyperlinks to find Web pages to index (see the crawler sketch following these examples), a page that does not have a precise, published link to it will not be indexed by a public search engine unless the precise URL to the page has been submitted directly to the search engine. Pages that can be accessed only through a search form are such pages, and they are not part of the indexable Web--yet many of them can be accessed freely by the public. Examples include the thousands of SEC filings behind the search forms at Edgar (http://www.sec.gov/cgi-bin/srch-edgar) and Edgar Online (http://www.edgar-online.com), GAO Comptroller General Decisions (http://www.access.gpo.gov/su_docs/aces/aces170.shtml), and the thousands of patent and trademark files at the U.S. Patent & Trademark Office Web site (http://www.uspto.gov).

Password Protected Pages--Public search engines cannot index password protected pages. Pages are often password protected to limit access to paid subscribers of a Web site. However, many such sites offer free access to their password protected pages and simply require that users register to obtain a password. Probably the most famous example of this is the New York Times Web site (http://www.nytimes.com). Another example is the Latin American Resource Center from Alston & Bird (http://www.alston.com/ablar/login/login.cfm).

Pages in Directories Identified by the Robots Exclusion Standard--Web site administrators can "tell" search engine robots that certain Web pages and directories are off limits and should not be indexed. This is usually done with a "robots.txt" file, which specifies the pages and directories that are off limits, among other things. Any reputable search engine abides by this standard and does not index the specified pages. These pages, however, are usually just off limits to the search engine robots--not to the Web-browsing public. An example is the 'Lectric Law Library, where portions of its site are off limits to search engines, but freely browsable for the public. See their robots.txt file for the off limits files and directories (http://www.lectlaw.com/robots.txt), then browse a few of them to see what the search engines are missing.

Poorly-Designed Framed Pages--Because search engine robots use links to find pages to index, they can sometimes get tripped up and hit a dead-end when trying to index a site that has poorly-designed framed pages. This same poor design, however, will not prevent the public from freely accessing the framed pages. Examples of this are difficult to find, but they certainly exist. For the technical details of this, see the explanation at Search Engine Watch (http://www.searchenginewatch.com/webmasters/frames.html).

Many Non-HTML or Plain Text Pages--Currently, the big search engines primarily index pages on the Web that are in HTML format or plain text format. Of course, there are hundreds of thousands of documents on the Web that are not in one of these two formats, and many of these are freely accessible by the public with free browser plug-ins. Here we are talking about Word, WordPerfect, Flash, and PDF files, among many others. And while the publicly accessible textual content of Word, WordPerfect, and Flash files on the Web is an appreciable omission from search engine indexes, most significant is the absence of documents in Adobe's Portable Document Format. For examples of the rich (and abundant) content available to the public in this format, but missing from most search engines, see the Statistical Abstract of the United States (http://www.census.gov/prod/3/98pubs/98statab/cc98stab.htm), U.S. NLRB Decisions (http://www.nlrb.gov/decision.html), Dialog's Training Papers (http://training.dialog.com/sem_info/courses/pdf_sem/), and the huge number of PDF documents at FedWorld (http://www.fedworld.gov/pub/--browse through any folder there).
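The common thread running through these examples is how crawler-based indexing works: the engines discover pages by following static hyperlinks, they honor the robots exclusion standard, and (at the time of the study) they largely confined themselves to HTML and plain text. The following is a minimal, illustrative sketch of that discovery process in Python; it is not any engine's actual crawler, and the page limit and wildcard user agent are arbitrary choices made for the example. Notice that pages reachable only through a search form, pages excluded by robots.txt, and non-HTML documents never make it into the list of indexed pages.

import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=25):
    """Breadth-first discovery from start_url, honoring robots.txt."""
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()

    seen, queue, indexed = {start_url}, deque([start_url]), []
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # off limits under the robots exclusion standard
        try:
            with urllib.request.urlopen(url) as response:
                if "text/html" not in response.headers.get("Content-Type", ""):
                    continue  # PDF, Word, etc. are skipped, as most engines did in 1999
                page = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # dead link or unreachable server
        indexed.append(url)

        extractor = LinkExtractor()
        extractor.feed(page)
        for href in extractor.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return indexed

Running something like crawl("http://www.lectlaw.com/", max_pages=10) would return only the pages a public engine could find and fetch from that site; everything the robots.txt file walls off, and everything reachable only by filling in a form, stays invisible to it.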

So, what conclusions can we draw from looking at these examples from the publicly accessible, but not publicly indexable, Web? Two conclusions that are readily apparent are that this portion of the Web is both large and significant. It is difficult to estimate how large this portion of the Web is, but one study found it to be roughly half of the Web (see https://www.infotoday.com/newsbreaks/nb0712-1.htm). If we are to believe this, then the publicly accessible Web would have about 1.6 billion pages as of February 1999. More conservative estimates would still put the publicly accessible Web over one billion pages in February 1999. This, of course, would make any estimates of what percentage of "the Web" or "the publicly accessible Web" that the search engines covered even smaller than those reported in the Lawrence and Giles study. Yet, here's what the popular media had to say soon after that study was published:

None of the 11 major Internet search engines covers even one-sixth of the Web's estimated 800 million publicly accessible pages.

--The Boston Globe (7/8/99) [3]
--The Chicago Tribune (7/19/99) [4]

The 800 million accessible Web pages were dominated by business interests, the survey concluded.

--The Washington Post (7/8/99) [5]

The study found that the Web, excluding private sites, had about 800 million pages in February.

--USA Today (7/8/99) [6]

The most comprehensive search engine today is aware of no more than 16% of the estimated 800 million pages on the Web.

--The Los Angeles Times (7/8/99) [7]

Most search engines cover only a small percentage of the 800 million Web pages that are open to the public.

--The New York Times (7/8/99) [8]

In many cases, the media either ignored the term "publicly indexable," or they translated it to mean "publicly accessible." Don't make that same mistake.

The Ides of February

The Web is an explosion in progress. Trying to estimate the number of pages on the Web is like trying to guess the number of molecules spewing from a violently erupting volcano. With anything as dynamic and rapidly expanding as the Web, any measurement is a snapshot frozen in time.

In December 1997, Lawrence and Giles estimated the number of pages on the publicly indexable Web at 320 million [9]. In April 1998, their estimate was published in the journal Science [10]. Then in February 1999, they estimated the number of pages on the publicly indexable Web at 800 million [11]. And in March 1999, I heard a speaker at a national conference say "there are 320 million pages on the World Wide Web."

The rate at which the Web is exploding is difficult to estimate, but it cannot be ignored. For a conservative rough estimate, we could look at the change in the number of pages on the publicly indexable Web from the December 1997 estimate (320 million) to the February 1999 estimate (800 million), which equates to a rate of increase of about 4.25% per month. (This is a very rough estimate, as the Web may have increased by 10% one month, then 2% the next--and as the rate of increase is very likely increasing with the popularity of the Web, 4.25% is probably too low a percentage to apply to the current Web.)

So if the publicly indexable Web had about 800 million pages in February 1999, and the publicly accessible Web had roughly 1 - 1.6 billion pages, an increase of 4.25% per month would result in significantly larger numbers by the time the study was reported in July 1999, and even larger numbers by the time you read this. Accounting for this increase, a rough estimate of the publicly indexable Web and the publicly accessible Web in July 1999 would be 985 million and 1.23 - 1.97 billion, respectively. Yet, in July, much of the media was reporting the results by stating that the size of the Web is 800 million pages, when clearly it was much larger. By November 1999, conservative page estimates would have swelled to 1.16 billion (publicly indexable) and 1.45 - 2.33 billion (publicly accessible).
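To make the arithmetic concrete, here is a short Python sketch that takes the article's 4.25%-per-month figure as given (the study itself supplies no growth rate) and compounds the February 1999 estimates forward, reproducing the July and November 1999 projections above.

# Compound the February 1999 page estimates forward at the article's
# conservative rate of 4.25% per month. The base figures are the study's
# 800 million page estimate and this article's rough 1 - 1.6 billion page
# range for the publicly accessible Web.
MONTHLY_GROWTH = 0.0425

def project(pages_feb_1999, months):
    """Estimated page count `months` months after February 1999."""
    return pages_feb_1999 * (1 + MONTHLY_GROWTH) ** months

for label, base in [("publicly indexable", 800e6),
                    ("publicly accessible (low)", 1.0e9),
                    ("publicly accessible (high)", 1.6e9)]:
    july = project(base, 5)        # February 1999 to July 1999
    november = project(base, 9)    # February 1999 to November 1999
    print(f"{label}: {july / 1e9:.2f} billion by July, "
          f"{november / 1e9:.2f} billion by November 1999")

# publicly indexable: 0.99 billion by July, 1.16 billion by November 1999
# publicly accessible (low): 1.23 billion by July, 1.45 billion by November 1999
# publicly accessible (high): 1.97 billion by July, 2.33 billion by November 1999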

Approximations

The nice round figure of 800 million, combined with the small sample size used in the study (less than one tenth of one percent) and the fact that the measurements were taken over an entire month, suggests that this is a ball-park figure rather than a precise page count. In fact, Lawrence and Giles specifically state that "Many sites have few pages, and a few sites have vast numbers of pages, which limits the accuracy of the estimate. The true value could be higher because of very rare sites that have millions of pages ... or because some sites could not be crawled completely because of errors."

If we take a close look at the search engine percentage estimates, the numbers get even murkier.

When search engines index the Web, they are essentially copying a portion of it, and when their copies match what actually exists on the Web, we can say that the index represents a percentage of the current Web.

But what about when the index does not match what exists on the Web? Sometimes a search engine will index a Web page, and then at a later time, the page will cease to exist on the Web (because the author decided to take it down, or some other reason). Even more often, a search engine will index a Web page, and then a portion of (or all of) the content on the page will change on the Web. In both cases, the information about the page that exists in the search engine's index no longer matches what currently exists on the Web.

Taking this into account, only a portion of any search engine's index is actually an "index of the (current) Web." The remaining portion can best be described as an index of the past Web, or a pseudo-archive of the Web.

The Lawrence and Giles study reported that Northern Light indexed roughly 16% of the publicly indexable World Wide Web (as of February 1999), but they also reported that about 9.8% of Northern Light's index consisted of bad links, and an analysis of their numbers shows that they did not take this into account when reporting the 16% figure. Since 9.8% of Northern Light's index referred to pages that no longer existed on the Web, it would be more accurate to say that Northern Light's index represented 14.4%, rather than 16%, of the 800 million page estimate.

The study also estimated that the combined indexes of the 11 search engines tested covered 42% of the 800 million page estimate. If we adjust this for the percentage of dead links found in all of the search engines, the number would be significantly lower. If we were to further adjust the percentages for the amount of page content that exists in the search engine indexes, but no longer exists on the Web (and for the fact that many search engines do not index extremely long pages in their entirety), the percent of coverage would be even lower. Now calculate those percentages as a portion of the publicly accessible Web instead of the publicly indexable Web, and the coverage is lower still.
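The Northern Light adjustment, at least, is simple arithmetic, sketched here in Python. The 128 million page index and the 9.8% dead-link rate come from the study; the 1 - 1.6 billion page range for the publicly accessible Web is this article's rough estimate, not a measured figure.

# Adjust Northern Light's reported coverage for dead links, then recompute
# it against the (estimated) publicly accessible Web rather than the
# publicly indexable Web.
indexable_web = 800e6      # publicly indexable pages, February 1999 (study)
nl_index = 128e6           # Northern Light's index size (study)
dead_links = 0.098         # share of Northern Light's index that was bad links (study)

live_pages = nl_index * (1 - dead_links)
print(f"Reported coverage:  {nl_index / indexable_web:.1%}")    # 16.0%
print(f"Dead-link adjusted: {live_pages / indexable_web:.1%}")  # 14.4%

# Against the rough 1 - 1.6 billion page publicly accessible Web:
for accessible_web in (1.0e9, 1.6e9):
    print(f"Of ~{accessible_web / 1e9:.1f} billion accessible pages: "
          f"{live_pages / accessible_web:.1%}")
# prints roughly 11.5% and 7.2%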

To make matters worse, the search engine indexes fluctuate at least as wildly as--if not more wildly than--the Web itself. Lawrence and Giles found that Northern Light had the largest index, with 128 million pages, in February 1999. However, when the study was reported in July 1999, Northern Light had an index of over 150 million pages, and Fast Search (http://www.alltheweb.com) had an index around 200 million. Excite expects to have an index of over 250 million pages in the fall of 1999, and most of the big search engines are likely to increase the size of their indexes appreciably [12].

WHAT DOES IT ALL MEAN?

At this point, you may be wondering why Lawrence and Giles measured the publicly indexable Web instead of the more relevant publicly accessible Web, or why they did not account for broken links and outdated content when calculating search engine percentages. The answer is simple: It was easier.

Now, to be fair to Lawrence and Giles, measuring the publicly accessible Web would be extremely expensive and time-consuming, if not impossible. The same is true of trying to compare the content of search engine indexes with the actual content they are trying to represent from the Web.

Lawrence and Giles have done high-quality, ground-breaking work, and their studies teach us important lessons about how to find information on the Web. Attempting to measure the publicly accessible Web or further analyze search engine indexes would probably not help us understand the Web or the search engines much better. It would not teach us more about finding information on the Web than their current methods do--at least not enough to justify the heavy costs.

Their studies are important because they teach us that:

  1. No single search engine indexes more than a fraction of the publicly indexable Web, and a large, significant portion of the publicly accessible Web is not indexed by any of them.

  2. The indexes of the major search engines overlap only partially, so their combined coverage is far greater than that of any single engine.

  3. Search engine indexes lag behind the current Web, containing dead links and outdated page content.

From this, we learn that it is important to use more than one search engine or to use metasearch engines (but know their limitations), and that we need to use other tools to find Web information as well, such as searchable directories, vertical portals, specific databases, and Usenet and electronic mailing lists. These are the things that we want to be telling our audiences, our bosses, and our library customers--not that there are X number of pages on the Web or that search engine Y only indexes Z percent of it. These statistics can be misleading unless they are heavily qualified, and even when they are, their primary value lies in the generalities that can be gleaned from them.

When Lawrence and Giles conducted their study in February 1999, the publicly accessible Web had far more than 800 million pages, and the percent covered by search engines was significantly less than their numbers would lead you to believe. When the media got a hold of the study in July 1999, the Web was even bigger, and the percent covered by search engines was anybody's guess. Saying that the Web consists of 800 million pages based on the Lawrence and Giles study is like saying that you are three feet tall based on the fact that when you were nine years of age the distance from your knees to your nose was three feet. To be correct, you could qualify the measurement of three feet by explaining that it only covered the distance from your knees to your nose, and that you were nine when the measurement was taken, and that you have grown a lot since then ... but how valuable would that precise information be?

It is important to read reports of Web size estimates carefully, and to understand how search engines work generally. To keep up with what is going on with the major search engines, become a frequent visitor to the two best sites on the Web for search engine information, Search Engine Watch (http://www.searchenginewatch.com) and the Search Engine Showdown (http://www.notess.com/search/).

I hope that this analysis will encourage you to carefully interpret future statistics about the Web's size or the search engines' coverage of it--from Lawrence and Giles or anyone else--and to translate that data into meaningful information for your library users and others. The media may "dumb down" a lot of this sort of information for the general public, but as information professionals, we lose too much credibility if we do that. We are supposed to be the experts. We must take this type of information and make it useful--not merely quote it or misquote it. Read carefully, think carefully, and make the information useful to your end-users.

So how many angels can dance on the head of a pin? 800 million. Tell everyone you know.

REFERENCES

[1] Lawrence, Steve, and Giles, C. Lee. "Accessibility of Information on the Web." Nature 400, No. 6740 (July 8, 1999): pp. 107-109. See http://www.wwwmetrics.com.

[2] Dunn, Ashley. "Most of Web Beyond Scope of Search Sites." Los Angeles Times (July 8, 1999): p. A1.

[3] Davis, Ryan. "Study: Search engines can't keep up with expanding Net; Researchers estimate only fraction of 800 million Web pages are covered." The Boston Globe (July 8, 1999): p. C1.

[4] Davis, Ryan. "Study Indicates Search Engines Fail to Keep up with Web." The Chicago Tribune (July 19, 1999): p. 10, Zone N.

[5] Behr, Peter. "Data Basics: In Search Engines, Number 1." The Washington Post (July 8, 1999): p. E05.

[6] Snider, Mike. "Web Growth Outpacing Search Power." USA Today (July 8, 1999): p. 1A.

[7] See note 2.

[8] Guernsey, Lisa. "Seek--but on the Web, You Might Not Find." The New York Times (July 8, 1999): p. G3.

[9] Lawrence, Steve, and Giles, C. Lee. "Searching the World Wide Web." Science 280, No. 5360 (April 3, 1998): pp. 98-100. See http://www.neci.nj.nec.com/~lawrence/science98.html.

[10] See note 9.

[11] See note 1.

[12] Sullivan, Danny. "The Search Engine Report." No. 33 (August 2, 1999). http://www.searchenginewatch.com/sereport/bydate.html.


Michael Dahn (mdahn@carltonfields.com) is the Manager of Intranet Development and Library Services at the law firm of Carlton Fields in Tampa, Florida.
