[ONLINE] feature

Who Goes There?
Measuring Library Web Site Usage

Kathleen Bauer

ONLINE, January 2000
Copyright © 2000 Information Today, Inc.

After all the work, time, and money invested in building and maintaining the library Web site, you and your staff will most likely want to know who, if anyone, is using it. Additionally, what features and resources do visitors use most often? Are the people accessing the site the same people who come into the library? How do people find the Web site? Do they use a search engine?

These are usage questions, and librarians already have experience in gathering usage data. For example, librarians count the number of questions asked at a reference desk as a way of measuring its use. Like the reference desk, the library Web site represents a service point. The Web site service point, however, is electronic, and it requires new methods of measuring usage.

Understanding the basics of Web server technology and the data servers record is a good start in developing usage measurement techniques. After that, you can explore the software that exists to help you make sense of Web site statistics, and find the right software for your system.

WEB SERVER LOG FILES

Every transaction on the Internet consists of a request from a browser client and a corresponding action from the computer server. Each individual client/server transaction is recorded on the server in what is called a server log file. Its most basic form is called the common log file. [Note: In the examples that follow, log entries typically appear as a single line of text.]

The Common Log File

The common log file format is the standard set by the World Wide Web Consortium. The syntax of an entry in a common log file looks like the following:
remotehost rfc931 authuser [date] "request" status bytes

Broken out, each component of a common log file entry has its own meaning:

remotehost: the hostname of the computer making the request, or its IP address if the name is unavailable
rfc931: the remote user's login name, as reported by an identification protocol; rarely available, so usually a dash
authuser: the username supplied if the server required authentication; otherwise a dash
[date]: the date and time of the request
"request": the request line exactly as it came from the client
status: the HTTP status code the server returned
bytes: the number of bytes in the file transferred

A typical log entry might look something like:

gateway.iso.com - - [10/May/1999:00:10:30 -0000] "GET /class.html HTTP/1.1" 200 10000

In this example, the remote host is gateway.iso.com. The next two fields, rfc931 and authuser, are blank (represented by dashes). The request was made on May 10, 1999, at 10 minutes after midnight. The file requested was class.html. The status code 200 (OK) was returned, and the file requested was 10,000 bytes in size.
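
To make the format concrete, here is a minimal Python sketch that splits such an entry into its named fields with a regular expression (the field names follow the syntax above, and the sample line is the one just shown):

import re

# One pattern group per common-log-format field; names match the syntax
# given earlier: remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

line = 'gateway.iso.com - - [10/May/1999:00:10:30 -0000] "GET /class.html HTTP/1.1" 200 10000'
match = CLF_PATTERN.match(line)
if match:
    entry = match.groupdict()
    print(entry['remotehost'])  # gateway.iso.com
    print(entry['request'])     # GET /class.html HTTP/1.1
    print(entry['status'])      # 200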

The common log file format may be the standard, but variations of log files exist. Additional information may be stored in referrer and agent logs.

Referrer Log File

Many servers record information about the referrer site, or the URL a visitor came from immediately before making a request for a page at the current Web site.
08/02/99, 12:02:35, http://ink.yahoo.com/bin/query?p="sample+log+file"&b=21&hc=0&hs=0, 999.999.999.99, jaz.med.yale.edu

In this example, the referring page was a search engine, ink.yahoo.com, and the search used to find the requested page was "sample log file." (Many Web designers and marketers are interested in the search words that lead users to their sites.) Note that the IP address of the computer making the request, 999.999.999.99, is also recorded here.
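
Extracting those search terms is straightforward once the referrer URL is parsed. Here is a minimal Python sketch using the Yahoo referrer above; note that the name of the query parameter holding the terms (p in this case) varies from engine to engine, so real analyzers keep a table of known search engines:

from urllib.parse import urlsplit, parse_qs

# Pull the search terms out of a referrer URL recorded in the log
referrer = 'http://ink.yahoo.com/bin/query?p="sample+log+file"&b=21&hc=0&hs=0'

parts = urlsplit(referrer)
query = parse_qs(parts.query)
print(parts.hostname)      # ink.yahoo.com
print(query.get('p', []))  # ['"sample log file"']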

Agent Log File

A third type of recording is the agent log, which records the browser and operating system used by a visitor. It will also record the names of spiders or robots that probe your Web site. An example of a hit from the Northern Light search engine's crawler, recorded in an agent log, might look like:
07/09/99, 13:59:24, , 999.999.99.99, scooby.northernlight.com, crawler@northernlight.com, Gulliver/1.2

In addition to the standard information about the date, time, and IP address, crawler@northernlight.com tells you that this hit came from a crawler. A hit from a Web browser would instead reveal the browser name and version, such as Mozilla/4.0, which probably means the visitor's browser was Netscape version 4.0. (Mozilla was the code name for Netscape and is still used by browsers compliant with the open-source Netscape code.) Browser information, however, is not always considered reliable.
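
Agent-log analyzers typically separate robots from browsers by matching known signatures in the agent string. A minimal sketch of that sorting, with an illustrative (and far from exhaustive) signature list:

# Agents matching a known robot signature are counted separately from browsers
ROBOT_SIGNATURES = ('crawler', 'spider', 'robot', 'gulliver')

def classify_agent(agent):
    lowered = agent.lower()
    if any(sig in lowered for sig in ROBOT_SIGNATURES):
        return 'robot'
    return 'browser'

print(classify_agent('Gulliver/1.2'))  # robot (Northern Light's crawler)
print(classify_agent('Mozilla/4.0'))   # browser (probably Netscape 4.0)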

Common log files, referrer logs, and agent logs are sometimes combined into one log. Whatever format your Web server uses, the first thing you will need to do is determine what type of log file is being generated. The person responsible for the server should be able to tell you what format is used. In addition, there may be options in the log file that determine what data is recorded, and you may be able to use these options to increase or decrease the data collected, depending on your needs.

To sum up, some of the things you can learn from your Web server's log files are:

- which files are requested, and how often
- the dates and times when requests are made
- the domain names or IP addresses of the computers making requests
- the status codes returned and the number of bytes transferred
- the referring pages visitors came from, including search engines and the search terms used
- the browsers and operating systems of visitors, and hits from spiders and robots

LOG FILE LIMITATIONS

Log files are designed to help server administrators gauge the demands on a server, and they are very good at this. They are not designed, however, to describe how people use a site. You cannot always tell, for example, whether three requests for a file came from three different people or from one person requesting the file three times. This is largely because many people reach the Internet through dial-up connections to an ISP. In a practice called dynamic addressing, the ISP assigns an IP address to the user while online, but reassigns that IP to a new user when the first user disconnects. An individual cannot be identified by an IP address unless the computer has a permanent, direct Internet connection (called static addressing). Some log analysis software tries to identify a single visitor by grouping requests from the same IP address over a short period of time, but it is still not possible to tell when the same person returns the next day with a different IP.
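
As a rough illustration, here is a sketch in Python of that grouping heuristic. The 30-minute timeout and the sample requests are assumptions for illustration, not a description of any particular package:

from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # gap that ends a "visit"; an assumed value

requests = [  # (ip, timestamp) pairs, assumed already parsed from the log
    ('192.168.1.5', datetime(1999, 5, 10, 0, 10)),
    ('192.168.1.5', datetime(1999, 5, 10, 0, 20)),  # same visit
    ('192.168.1.5', datetime(1999, 5, 10, 9, 0)),   # long gap: a new visit
]

visits = 0
last_seen = {}
for ip, when in sorted(requests, key=lambda r: r[1]):
    if ip not in last_seen or when - last_seen[ip] > TIMEOUT:
        visits += 1  # first request from this IP, or the gap was too long
    last_seen[ip] = when

print(visits)  # 2 -- but the same person back tomorrow on a new IP is invisible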

There are also concerns about how caching affects log files. Caching occurs when you visit a Web page and your browser stores that page in memory. The next time you request the same URL, your browser will search its memory for the URL. If it is cached, it will pull the page from memory, and the server will never receive the request. You are using the site, but that will not be recorded in the server's log files. ISPs also utilize caching, which exacerbates the problem.

A good rule to remember is that the log file measures requests for specific files on a server, not exact usage. The number of requests does not translate into number of unique visitors, and the numbers may not reflect all usage because of caching. Measuring usage requires extrapolating from what the log file tells us and entails some level of error. To gain more exact knowledge about Web site usage, other means of investigation, such as questionnaires or cookies, must be used.

PRIVACY CONCERNS

The good news about log file limitations is that they offer at least some protection of user privacy. A log file never records a user's name, home address, or email address. Such information can only be recorded if the Web site asks a user to register and then requires a login for each subsequent visit. A site that does this can link its log files to a database of user profiles and generate reports about individual usage. Log files analyzed by themselves can only provide data about groups of users.

Also remember that dynamic addressing masks some individual users because they are not associated with a unique IP address. Anyone, however, who connects directly to the Internet will have a unique, unchanging IP address. Even though a name is not recorded, access to an individual's IP address can reveal their actions on a Web site.

There are currently no laws covering how to handle the information contained in a log file, but because log files can contain information about individual IP addresses, they should be considered confidential, much as circulation records are confidential. Any data the library makes public from its log files should mask individual IP addresses. Data can always be presented at the level of usage by large groups (such as users from a particular country or in-house versus outside users).
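
Masking is easy to automate before any log data is released. A minimal Python sketch, with an invented sample entry; zeroing the last octet of each dotted-quad address is an assumed convention here, and any consistent masking scheme would serve:

import re

# Replace the host portion of every IP address so individuals cannot be
# singled out, while network-level grouping is preserved
IP_RE = re.compile(r'\b(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\b')

def mask_ip(text):
    return IP_RE.sub(r'\1.0', text)

print(mask_ip('130.132.21.67 - - [10/May/1999:00:10:30 -0000] "GET /class.html HTTP/1.1" 200 10000'))
# 130.132.21.0 - - [10/May/1999:00:10:30 -0000] "GET /class.html HTTP/1.1" 200 10000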

However your library decides to analyze log files, the library Web site should carry a complete statement of what data is collected, who can see the data, and how that data is used.

CHOOSING LOG ANALYSIS SOFTWARE

Log files consist of thousands of lines of text, so it is impossible to extract useful information by simply reading them. One approach is to load their contents into a spreadsheet or into a database designed specifically for your site, though creating such an application is time-consuming and difficult. A far more common approach is to use software designed to manipulate and analyze the data log files contain.
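
For the database route, here is a brief sketch: once entries are parsed (for example, with a regular expression like the one shown earlier), they can be loaded into SQLite and queried ad hoc. The schema and sample rows are purely illustrative:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE hits (host TEXT, page TEXT, status INTEGER)')
conn.executemany('INSERT INTO hits VALUES (?, ?, ?)', [
    ('gateway.iso.com', '/class.html', 200),
    ('jaz.med.yale.edu', '/class.html', 200),
    ('jaz.med.yale.edu', '/index.html', 304),
])

# Most requested pages, most to least
for page, count in conn.execute(
        'SELECT page, COUNT(*) FROM hits GROUP BY page ORDER BY COUNT(*) DESC'):
    print(page, count)  # /class.html 2, then /index.html 1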

Before you consider analysis software, make sure you understand what you really want to know about your library's Web site usage. Free and commercial software are both available for log analysis, and each has advantages and disadvantages. In general, commercial software offers more features, enhanced graphics, and some level of customer support. If your needs are not too complex, a simpler, less expensive alternative may suit you as well as, or better than, the most full-featured analysis packages.

The following are not reviews of software. They are quick snapshots of some of the features of free and commercial software to acquaint you with what is available and the price range. No single software package is right for everyone. Performance of individual software will be affected by the types of log files your server produces, so you need to test your own system using your own log files to evaluate what works best in your environment. As you examine software options, keep one key point in mind. Log analysis software can aid in gathering, distilling, and displaying information from log files, but no matter how sophisticated the software, it cannot add to or improve on what is already available in the log file. The contents of the log file are the ultimate limiting factor in what log analysis software can do for you.

FREE LOG ANALYSIS SOFTWARE

TITLE: Analog Version 3.31
PRODUCER: Stephen Turner, University of Cambridge Statistical Laboratory
URL: http://www.statslab.cam.ac.uk/~sret1/analog/
PRICE: Free
CUSTOMER SUPPORT: No
LOG FILE FORMAT: Configurable to support many formats
PLATFORM: Windows, Macintosh, UNIX, and others

Analog is a very popular, freely available log analysis program developed by Stephen Turner. It produces a standard report, which can be configured to the user's specifications, and opens with a General Summary of requests to the Web server.

An important feature is the Request Report. This report displays the most requested Web pages on the Web site, from most to least. The Request Report lists the number of requests, the last date when the file was requested, and the file name.
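
The tally behind a report like this is simple to picture. Here is a toy Python version; the entries are assumed to be date/filename pairs already parsed from the log, and this is a sketch of the idea, not Analog's own code:

from collections import Counter

entries = [
    ('10/May/1999', '/class.html'),
    ('10/May/1999', '/index.html'),
    ('11/May/1999', '/class.html'),
]

counts = Counter(name for _, name in entries)
last_date = {name: date for date, name in entries}  # later entries overwrite earlier ones

# Number of requests, last request date, and file name, most to least
for name, total in counts.most_common():
    print(total, last_date[name], name)
# 2 11/May/1999 /class.html
# 1 10/May/1999 /index.html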

In addition to the General Summary and the Request Report, Analog will display monthly, daily, and hourly summaries. This can help to identify the busiest month, day of the week, and hour of the day. Also, Analog can show the most common domain names of computers where requests for the server's pages originated. This can tell you, for example, that 35% of requests came from academic sites in the U.S. Analog makes no attempt to try to identify the number of visitors to a site.
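
A domain breakdown of that kind can be sketched in a few lines of Python (the hostnames are illustrative):

from collections import Counter

hosts = ['jaz.med.yale.edu', 'lib.umich.edu', 'gateway.iso.com', 'dialup7.bigisp.net']

# Group requesting hosts by top-level domain and report percentages
tlds = Counter(h.rsplit('.', 1)[-1] for h in hosts)
total = sum(tlds.values())
for tld, n in tlds.most_common():
    print(f'.{tld}: {100 * n / total:.0f}%')
# .edu: 50%  .com: 25%  .net: 25%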

Analog is widely used. It runs on a variety of platforms and can recognize many log file formats, though it does not offer advanced graphics capabilities. The two titles below are also free and can be downloaded from the Internet.

TITLE: wwwstat
URL: http://www.ics.uci.edu/pub/Websoft/wwwstat/
PRODUCER: Roy Fielding
PRICE: Free
CUSTOMER SUPPORT: No
LOG FILE FORMAT: Common log file format
PLATFORM: UNIX

TITLE: http-Analyze 2.01
URL: http://www.netstore.de/Supply/http-analyze/default.htm
PRODUCER: RENT-A-GURU
PRICE: Free for educational or individual use
CUSTOMER SUPPORT: No
LOG FILE FORMAT: Common log file format, some extended log file formats
PLATFORM: UNIX

COMMERCIAL LOG ANALYSIS SOFTWARE

TITLE: WebTrends Log Analyzer
URL: http://www.webtrends.com/
PRODUCER: WebTrends Corporation
PRICE: $399; more expensive products also available
CUSTOMER SUPPORT: Free technical support by phone, email, and the Web; online FAQ and documentation
TRIAL: Free 14-day trial
LOG FILE FORMATS: Recognizes 30 log file formats
PLATFORM: Windows 95/98/NT

WebTrends is a powerful software package that attempts to simplify the process of log analysis. Log profiles and reports are created and edited in menu-driven systems, with wizards and online help available to ease the process. WebTrends lets you manage multiple log files across several servers.

Generating a customized report is done easily through the Report Wizard. In the report creation module, you may elect to generate tables and graphics from General Statistics, Resources Accessed, Visitors & Demographics, Activity Statistics, Technical Statistics, Referrers & Keywords, and Browsers & Platforms. Including a table or graph is as easy as checking a box in the wizard process. Graphs can be further customized as pie charts, or bar or line graphs. Reports can be generated as HTML, Microsoft Word, or Microsoft Excel documents.

Some of the WebTrends reports are similar to what the free software offers. For example, WebTrends will generate a report of the most requested pages on the Web site, but it adds a graph and identifies file addresses by their page titles.

WebTrends has more reports available than the free software. In the area of Resources Accessed alone, WebTrends generates tables and graphs for entry pages, exit pages, paths through the site, downloaded files, and forms. The other report sections are also full of enhanced capabilities: Referrers & Keywords, for example, presents the top search engines sending hits to your site, along with the search terms used to find it.

WebTrends reports can be filtered to exclude or include particular data. For example, you can choose to exclude requests generated by library employees by filtering those IP addresses out of the report. Other filters can present data for only one page, for a particular day of the week or hour of day, or for a particular referrer page. This feature is helpful in controlling the amount of data presented and aids in more finely targeting your reports to a particular subject.
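
The idea behind such a filter is easy to sketch in Python. The staff address range below is hypothetical, and this shows the general technique, not WebTrends' implementation:

STAFF_PREFIX = '130.132.21.'  # hypothetical library staff address range

log_entries = [  # assumed already parsed from the log
    {'ip': '130.132.21.14', 'page': '/class.html'},  # staff machine
    {'ip': '128.36.2.9', 'page': '/class.html'},     # outside user
]

# Keep only requests that did not come from staff machines
public_entries = [e for e in log_entries if not e['ip'].startswith(STAFF_PREFIX)]
print(len(public_entries))  # 1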

WebTrends uses mathematical algorithms to try to estimate the number of visitors to your site. Given the difficulty of determining unique users from log files, this information may not be credible. WebTrends itself states that the only way to identify a unique visitor is to use authentication (i.e., logons and passwords). For sites that do require authentication, WebTrends offers the ability to link user profile information in databases to visitor activity on the site.

WebTrends offers many easy-to-use features. In some ways, it's a bridge between low-cost or free utilities and very high-end software packages, which can cost from $7,000-$10,000. Some of the advanced capabilities in WebTrends might be more than your library requires. The following two software packages present many of the same capabilities as WebTrends, such as predefined and customizable reports, data filtering, graphics, and friendly user interfaces.

TITLE: Netintellect V.4.0
URL: http://www.webmanage.com/
PRODUCER: WebManage
PRICE: $295
CUSTOMER SUPPORT: Support by phone, email, and online, as well as an online tutorial and documentation
TRIAL: Free 15-day trial
LOG FILE FORMAT: Recognizes 45 log file formats
PLATFORM: Windows 95/98/NT

TITLE: FastStats
URL: http://www.mach5.com/fast/
PRICE: $99.95
TRIAL: Free 25-day trial
CUSTOMER SUPPORT: Free technical support via email, no phone support available
PLATFORM: Windows 95/98/NT

CONCLUSION

Library Web sites will only become more important. They represent vital service points and investments of money and staff time. If a library wishes to measure usage of its Web site, server log analysis is one tool that should be employed. Libraries that wish to gain more in-depth knowledge about usage should investigate other means of data gathering, such as questionnaires and cookies.

Server logs were designed to measure traffic and demand loads on a computer server, and they work well for this purpose. When server log files are used to try to measure how people use a site, they don't work quite as well. They can, however, give you useful information about the relative usage of pages on your Web site, other sites that refer visitors to your site, and how search engines help people find your site, among other important data.

Although log analysis isn't perfect, few measures of usage are. For example, when we count people who come through the doors of our library, we don't know if they are there to read books or magazines, or just use the bathroom. When we circulate a book, we don't know why it was selected or even if it is read. Server log file analysis can be viewed in the same light, as a flawed but necessary measure of usage. The important thing is to educate yourself about the abilities and limitations of log file analysis so that you can make educated use of the data it produces.

FURTHER READING

1. Goldberg, Jeff. "Why Web Usage Statistics Are (Worse Than) Meaningless." Available at http://www.cranfield.ac.uk/docs/stats3.

2. Stehle, Tim. "Getting Real About Usage Statistics." Available at http://www.wprc.com/wpl/stats.html.

3. Stout, Rick. "Web Site Stats: Tracking Hits and Analyzing Traffic." Osborne/McGraw-Hill, Berkeley, CA, 1997.

4. "Web Site Analysis Tools." PC Magazine Online, March 10, 1998. Available at http://www.zdnet.com/pcmag/features/Webanalysis2/default.htm.


Kathleen Bauer (kathleen.bauer@yale.edu) is an Informatics Librarian at the Yale School of Medicine Library.

Comments? Email letters to the Editor at editor@infotoday.com.

Copyright © 2000, Information Today, Inc. All rights reserved.