FEATURE

Using Web Server Logs to Track Users Through the Electronic Forest
 by Karen A. Coombs
 
 As the electronic services librarian at SUNY Cortland,
                          I am responsible for the electronic presence of the
                          library and for providing cohesion between its various
                          Web-based systems and electronic information. As part
                          of my responsibilities, I work with the library's Webmaster
                          and other librarians to create a unified Web-based
                          presence for the library. Along with our main site,
                          we have Web sites for our ILL system, proxy server,
                          and Web-based survey host. When the university wanted
                          to know more about who was using our Web pages and
                          how they were doing it, I was asked to collect some
                          statistics. After using three methods to assess our
site, I chose to concentrate on Web server logs, and
by following our users' paths through them I discovered
a wealth of useful data.
Welcome to the Virtual Electronic Forest

Currently, the library Web site at SUNY Cortland
                          has approximately 270 pages spread across two different
                          servers. These pages include database-driven Web pages,
                          XML, and Web-based forms. The site offers a variety
                          of Web-based services including online interlibrary
                          loan, electronic resources, research guides, tutorials,
                          and e-mail reference. Since my arrival at SUNY Cortland
                          almost 4 years ago, the significance of the Web site
                          as a library service has grown considerably. The number
                          of users visiting the site seems to be increasing,
                          and the library has been providing more Web-based services
                          and resources. In contrast, the last 4 years have seen
                          a steady decline in the number of users visiting the
                          physical library. In my opinion, SUNY Cortland is seeing
                          a shift in user behavior toward the virtual library;
                          this is a trend being noticed by libraries throughout
the country.

Although our library's Web site is not a "physical" space
                          per se, patrons utilize it in ways that are comparable
                          to the physical library. Since the library currently
                          collects and analyzes a variety of statistics on user
                          behavior within the physical library, I've been asked
                          to do the same for the Web site. Some questions that
                          I am interested in answering are:  
• What resources, services, and pages are patrons utilizing?
• Do people find the Web site navigable and useful?
• How do users learn about the library's Web site, and why are they motivated to visit?

Answering these questions has become part of our
                          plan for continuous assessment and improvement. In
                          particular, I have been asked to assess the effectiveness
                          of the library's Web site. While working at SUNY Cortland,
                          I've used three different methods to do this. My first
                          foray into the assessment arena was at the start of
                          my second year here. I had just finished redesigning
                          the Web site and wanted to get feedback from our users.
                          I decided to use a Web-based survey to do this. The
                          survey provided me with some very subjective data about
                          how patrons felt about the site. However, I received
                          a limited number of responses, which didn't tell me
much about what portions of the site were being used.

Second, I turned to our Web server logs to track
                          which parts of the library's Web site were being used.
                          At the time, I was receiving a report from the campus
                          Webmaster that listed our most frequently used pages.
                          This information was helpful in assessing the effectiveness
                          of different Web site pages, but it didn't answer all
                          of my questions. One question of particular interest
at that time was how many visitors used Netscape 4.7
                          when accessing our site. In spring 2001, we were implementing
                          a new Web catalog that did not work well with Netscape
                          4.7; we needed to know how many of our users this would
                          adversely affect. To accomplish this goal, we sent
                          several months' worth of Web server logs to a consultant,
                          who ran them through a log analysis tool. The results
                          were that less than 4 percent of our users accessed
                          the site via Netscape 4.7. After this project was completed,
                          I contracted the consultant to analyze our server logs
                          on a monthly basis. This arrangement continued from
                          the spring of 2001 until the library successfully implemented
its own Web log analysis software in August 2004.

During my third assessment, the Webmaster and I studied
                          our site using task-based usability testing. This is
                          where users are observed as they perform particular
                          tasks (or answer particular questions). They're encouraged
                          to "think out loud." It is often difficult to do this
                          type of testing without disrupting the user. In addition,
                          the observer often only gets snippets of behavior.
                          The task-based usability testing we conducted was very
                          successful and allowed us to make significant changes
                          to the library site. However, the process of building
                          and conducting the testing made me realize that we
needed to collect and analyze the data from our Web
server logs more efficiently and consistently. These
                          logs could provide the foundation for our analysis
                          because they provide the most continuous and complete
                          data about our library site. Additionally, by analyzing
                          these log files, the Webmaster and I would be able
                          to focus on specific areas of the site to improve.
                          Having come to this conclusion, I began to investigate
how to collect and analyze our logs.

Seeking Electronic Tracks

Our first step was getting the server to generate
                          logs. Almost all Web servers have the capability to
                          do this; the function just needs to be turned on. The
                          crucial piece for us was making sure the data we wanted
                          to analyze was in the log files. After doing some research,
                          I learned that most log analysis tools want very similar
                          information; you can configure the server to collect
specific information in your log files.

Additionally, many servers let you control whether
                          the log files are collected daily, weekly, monthly,
                          and so on. Based on my reading, I set up our server
                          to collect log files on a daily basis. Each daily log
file contains many lines of information. Each line
                          represents a user's interaction with the library's
                          Web site. A line in the log file typically contains
                          information like the date and time, the IP address
                          of the user, the page being accessed, and the browser
                          and operating system being used by the person visiting
the site.
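For example, a single entry in a daily log might look something like the line below (the example follows the W3C extended log format; the date, address, page, and browser shown here are made up purely for illustration):

    2004-09-15 14:32:07 192.0.2.45 GET /library/index.html 200 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)

Reading left to right, that line records the date and time of the request, the visitor's IP address, the method and page requested, the status code returned, and the browser and operating system the visitor was using. Exactly which fields appear, and in what order, depends on how the server's logging is configured.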
Recognizing Users' Electronic Footprints

Simply collecting Web server logs wasn't enough.
In order to follow our virtual visitors' footprints,
                          we needed a Web log analysis tool. While it is possible
                          to look for patterns in the server log files yourself,
                          analysis tools make this task much easier. These tools
                          take the raw data in the log files and look for patterns
                          such as which pages are most or least visited. The
                          end result is an aggregation of data transformed into
                          a useful set of reports about what is going on in your
virtual forest.

Today, there are a variety of log analysis tools
                          available. These tools have many similarities but can
                          range in features and price. Since we had been receiving
                          log analysis reports from a consultant, I understood
                          the types of information these tools provide about
                          a Web site. Also, I knew that at the very least the
                          library would need the most basic analysis data, like
                          number of hits and visitors, referring sites, referring
                          search engines, and search key phrases. I found an
                          array of Web analysis tools by searching Google. Since
                          cost was a factor, I specifically searched for open
                          source tools and compared these with commercial solutions.
                          In addition, I spoke with several other Web managers
                          and the consultant who had been providing us with log
                          analysis data. This gave me insight into the reports
                          provided, cost, technical requirements, difficulty
                          of installation, configuration, maintenance, and available
features.

My research revealed that almost all log analysis
                          tools provide the basic set of data I was hunting for.
                          There is a great range in prices of log analysis packages.
                          Some (like WebTrends) cost several hundred dollars,
                          while others (such as Analog) are free. The pricing
models also differ from one tool to another; some
                          packages are priced by the number of servers being
                          analyzed, others by the volume of traffic (number of
                          page views) analyzed. Certain tools allow the Webmaster
                          to access and run them remotely. Different tools produce
reports in HTML, PDF, or other formats.

WebTrends is probably the best-known log analysis
                          tool for businesses. However, it costs close to $500
                          for a small-business license. SurfStats and Sawmill
                          are other commercial packages that cost about $100
                          for a single-server license. In addition, I found three
                          open source log analysis tools that seemed worth investigating:
AWStats, Analog, and Webalizer.

Choosing a Tool for Tracking and Analysis

When it came to choosing the tracking tool to analyze
                          our Web logs, price was an important factor. I had
                          wanted to perform this analysis for some time; however,
                          funding it had never been an organizational priority.
                          So, software was never acquired, and our log files
                          had never been consistently analyzed. Many Web managers
                          recommended WebTrends as a solution. However, while
                          WebTrends provides extremely in-depth information,
                          I could not justify the cost for the types of data
                          the library managers were interested in. The problem
                          was not only one of initial cost but of upgrades. Other
                          Web managers told me that WebTrends and many other
                          log analysis tools are upgraded frequently. This would
                          mean a software investment every other year (if I chose
                          to skip a version or two). While SurfStats and Sawmill
                          provided a lower-cost alternative, the upgrade cost
                          was still a factor. In addition, these products were
                          licensed per Web site, meaning we would need to purchase
four licenses to cover our four sites.

As a result, my search for a log analysis tool turned
                          to open source solutions. Currently, there are at least
                          a dozen available. In selecting a tool for Memorial
                          Library, I looked at Analog, Webalizer, and AWStats.  
• Analog is a C-based analysis tool that can be run from a Web page or the command line. It is probably the most popular of these three open source Web analysis products. Its features are comparable to AWStats', but it provides no information about visits. Additionally, it does not archive the data gathered from the server logs in a format that can be analyzed by another product such as a database.

• Webalizer is a C-based analysis tool that has to be run from the command line. While it provides the same basic data as the other two tools, it doesn't report users' operating systems or the search engines they may have used to find the site.

• AWStats is a Perl-based log analysis tool that can be run from a Web page or the command line. It has more reports about "visits" than the other two tools, including tracking where users entered and exited the site.

Based on this comparison, I decided to use AWStats because of its versatility and ability to be extended.

Setting Up Tracking Gear

Next, I needed to implement AWStats on our Web servers.
                          The first step in this process was to download the
                          program and the accompanying documentation from the
                          Web (http://awstats.sourceforge.net).
                          After reading the documentation, I became a little
                          concerned that I might lose data during installation.
                          Therefore, I contacted the consultant I work with and
                          asked him if he could help me get the software installed
                          and properly configured. Together we decided the next
                          step was to configure the server logs to match the
                          format preferred by AWStats. This meant configuring
                          the server to collect the following data within the
                          log files:  
• date
• time
• c-ip (client IP address)
• cs-username (authenticated user name, if any)
• s-ip (IP address of the Web server)
• cs-method (method, i.e., GET or POST)
• cs-uri-stem (the path to the file accessed)
• cs-uri-query (the query string, if any)
• sc-status (the status code sent back by the Web server, i.e., 404-file not found or 200-OK)
• sc-bytes (bytes sent)
• cs-bytes (bytes received)
• time-taken (time taken to serve the request)
• cs-version (protocol version used)
• cs(User-Agent) (operating system and Web browser the patron used to access the site)
• cs(Cookie)
• cs(Referer) (the page the user came from when accessing the current page)
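On a server that writes its logs in the W3C extended format (as IIS does), these selections appear as a #Fields directive at the top of each log file. The line below is only meant to illustrate the general form; the exact order of the fields depends on the server configuration:

    #Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)

Each entry in the log then lists these values, separated by spaces, with a hyphen standing in for any field that has no value.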
Next we decided to test AWStats on a sample set of
log files from my server without actually installing
                          the software on our Web server. When this test was
                          successful, the consultant installed AWStats on a test
                          server. During this process, we could not make the
                          Web-based interface for updating the statistics work.
                          However, I was not sure I needed "real-time" data,
                          so we journeyed on. The final step in the process was
                          installing and configuring AWStats on our Web server.
                          The three-step installation process for AWStats was
                          relatively simple: 1) Install Perl if necessary (because
                          AWStats is a Perl-based program); 2) install AWStats;
                          3) configure AWStats. This required altering several
                          values in the configuration file that controls AWStats.
                          Once this was done, we were ready to start analyzing
log files.
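To give a sense of what that last step involves, the handful of values we had to edit in the configuration file are of this general kind. The directive names below are AWStats'; the paths, log file name, and domain are hypothetical placeholders that would have to be adjusted for any local installation:

    # Illustrative values only; adjust paths and host names for the local install
    LogFile="D:\Logs\W3SVC1\ex040915.log"
    LogType=W
    # LogFormat=2 tells AWStats to expect W3C extended (IIS-style) log files
    LogFormat=2
    SiteDomain="library.cortland.edu"
    DirData="D:\awstats\data"

Once a configuration file exists, the statistics database can be updated from the command line with a call along the lines of perl awstats.pl -config=library -update, where "library" names the configuration file.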
Because AWStats is run from the command line, the
consultant and I developed a couple of batch files
                          that made interactions with the program easier. One
                          batch file allows us to control the date range for
which the log files are analyzed. A second batch file
                          allows the analysis statistics to be automatically
                          updated. I have chosen to check statistics on a monthly
                          basis. In order to do this, a batch file is used in
conjunction with Windows Scheduled Tasks; it is
                          set to run at 11:59 p.m. every day. When the batch
                          file runs, it updates the statistics and reports for
                          the current month and places the reports in a folder
                          for that month. This allows SUNY Cortland to have up-to-date
                          statistics with no human intervention. Additionally,
the statistics can be updated on demand by running the batch file manually if necessary.
                          All of the Web log analysis reports are automatically
                          made available via the library's intranet; this allows
                          me, the Webmaster, and the director to access different
                          pieces of information about the Web site's usage when
necessary.
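For anyone setting up a similar routine, the scheduled batch file is conceptually along the lines of the sketch below. This is not our exact script; the paths, the configuration name "library," and the output folder are placeholders, and the date parsing assumes the U.S.-style MM/DD/YYYY output of the %date% variable:

    @echo off
    rem Update the AWStats statistics database from the latest server logs
    perl C:\awstats\cgi-bin\awstats.pl -config=library -update

    rem Build a folder name for the current month (assumes %date% looks like "Wed 09/15/2004")
    set OUTDIR=D:\intranet\stats\%date:~-4%-%date:~4,2%
    if not exist %OUTDIR% mkdir %OUTDIR%

    rem Write the current month's HTML report into that folder
    perl C:\awstats\cgi-bin\awstats.pl -config=library -output -staticlinks > %OUTDIR%\awstats.library.html

Scheduling something like this with Windows Scheduled Tasks at 11:59 p.m. each night keeps the month's report current without any manual intervention.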
Realizing the Benefits of Doing Our Own Tracking

Having our own Web log analysis statistics has had
distinct benefits. First and foremost, we now have
                          statistics about our Web site available on demand.
                          As a result, I am able to more easily answer questions
                          about visitors to our Web site. Since we are no longer
                          reliant on an external source for our data, we are
                          able to gather the data we want in the format we want.
                          Moreover, we can change the data we are gathering as
                          needed. Another advantage is the fact that we have
                          a more complete record of how our Web site is being
used. Prior to implementing AWStats, we were only analyzing
log files for our main Web site. Now we are able to
                          run usage statistics for all of our sites. This provides
                          us with greater insight into the overall patterns of
user behavior across the library's electronic forest.

As a result of analyzing our server logs, we have
                          learned several interesting things about our users'
                          behavior. There are a few pages in our site that people
                          utilize more than others. The most surprising of these
                          is a page that lists the library's periodical holdings.
                          The heavy use of this page has emphasized the importance
                          of creating complete holdings for our journals in the
                          Web catalog. Additionally, users prefer the alphabetical
listing of the library's databases to a list of full-text
                          databases or a list of databases by subject. Data collected
                          from the server logs also revealed that most users
                          access our site while on campus. This is interesting,
                          considering a significant number of students live off
                          campus. Another important discovery is that most users
                          come to the library's site directly rather than through
                          an external source like a search engine or a link on
another site.

The information I've obtained from analyzing the
                          server logs has taught us many intriguing things about
                          our users and has created as many questions as it has
                          answered. Nonetheless, the data has been invaluable
                          in making decisions about Web-based services. We have
                          found many practical applications of Web server log
                          data, including designing future usability studies.
                          All of these endeavors have helped us to improve the
                          overall quality of our Web site. However, none of this
                          would have been possible without AWStats. This demonstrates
                          that there are low-cost solutions that can yield big
results for small and medium-sized libraries.

Further Reading
Bailey, Dorothy (2000). "Web Server Log Analysis" (http://slis-two.lis.fsu.edu/~log).

Fichter, Darlene (2003). "Server Logs: Making Sense of the Cyber Tracks," ONLINE 27 (5): 47-55.

Haigh, Susan and Megarity, Janette (1998). "Measuring Web Site Usage: Log File Analysis," Network Notes 57 (http://www.collectionscanada.ca/9/1/p1-256-e.html).

Kerner, Sean Michael (2003). "Handle Log Analysis with AWStats," Builder.com (http://builder.com.com/5100-6371-5054860.html).

Open Directory Project: http://dmoz.org/Computers/Software/Internet/Site_Management/Log_Analysis/Freeware_and_Open_Source.

Rubin, Jeffrey (2004). "Log Analysis Pays Off," Network Computing 15 (18): 76-79.

Karen A. Coombs is the electronic services librarian
                        at SUNY Cortland in N.Y. She holds an M.L.S. and M.S.
                        in information management from Syracuse University in
                        N.Y. In addition to developing and maintaining the library's
                        Web applications (SFX, ILLiad, and OPAC), she is responsible
                        for implementing and maintaining the library's electronic
                        resources. Coombs is the author of the Library Web Chic
                        Weblog (http://www.librarywebchic.net) and
                        has published articles in Computers in Libraries and Journal
                        of Academic Librarianship. Her e-mail address is coombsk@cortland.edu.