FEATURE

Using Web Server Logs to Track Users Through the Electronic Forest
 by Karen A. Coombs
 
 As the electronic services librarian at SUNY Cortland,
                          I am responsible for the electronic presence of the
                          library and for providing cohesion between its various
                          Web-based systems and electronic information. As part
                          of my responsibilities, I work with the library's Webmaster
                          and other librarians to create a unified Web-based
                          presence for the library. Along with our main site,
                          we have Web sites for our ILL system, proxy server,
                          and Web-based survey host. When the university wanted
                          to know more about who was using our Web pages and
                          how they were doing it, I was asked to collect some
                          statistics. After using three methods to assess our
site, I chose to concentrate on Web server logs, and
by following our users' paths through them I discovered
a wealth of useful data.
Welcome to the Virtual Electronic Forest

Currently, the library Web site at SUNY Cortland
                          has approximately 270 pages spread across two different
                          servers. These pages include database-driven Web pages,
                          XML, and Web-based forms. The site offers a variety
                          of Web-based services including online interlibrary
                          loan, electronic resources, research guides, tutorials,
                          and e-mail reference. Since my arrival at SUNY Cortland
                          almost 4 years ago, the significance of the Web site
                          as a library service has grown considerably. The number
                          of users visiting the site seems to be increasing,
                          and the library has been providing more Web-based services
                          and resources. In contrast, the last 4 years have seen
                          a steady decline in the number of users visiting the
                          physical library. In my opinion, SUNY Cortland is seeing
                          a shift in user behavior toward the virtual library;
                          this is a trend being noticed by libraries throughout
the country.

Although our library's Web site is not a "physical" space
                          per se, patrons utilize it in ways that are comparable
                          to the physical library. Since the library currently
                          collects and analyzes a variety of statistics on user
                          behavior within the physical library, I've been asked
                          to do the same for the Web site. Some questions that
                          I am interested in answering are:  
• What resources, services, and pages are patrons utilizing?
• Do people find the Web site navigable and useful?
• How do users learn about the library's Web site, and why are they motivated to visit?

Answering these questions has become part of our
                          plan for continuous assessment and improvement. In
                          particular, I have been asked to assess the effectiveness
                          of the library's Web site. While working at SUNY Cortland,
                          I've used three different methods to do this. My first
                          foray into the assessment arena was at the start of
                          my second year here. I had just finished redesigning
                          the Web site and wanted to get feedback from our users.
                          I decided to use a Web-based survey to do this. The
                          survey provided me with some very subjective data about
                          how patrons felt about the site. However, I received
                          a limited number of responses, which didn't tell me
much about what portions of the site were being used.

Second, I turned to our Web server logs to track
                          which parts of the library's Web site were being used.
                          At the time, I was receiving a report from the campus
                          Webmaster that listed our most frequently used pages.
                          This information was helpful in assessing the effectiveness
                          of different Web site pages, but it didn't answer all
                          of my questions. One question of particular interest
at that time was how many visitors used Netscape 4.7
                          when accessing our site. In spring 2001, we were implementing
                          a new Web catalog that did not work well with Netscape
                          4.7; we needed to know how many of our users this would
                          adversely affect. To accomplish this goal, we sent
                          several months' worth of Web server logs to a consultant,
                          who ran them through a log analysis tool. The results
                          were that less than 4 percent of our users accessed
                          the site via Netscape 4.7. After this project was completed,
                          I contracted the consultant to analyze our server logs
                          on a monthly basis. This arrangement continued from
                          the spring of 2001 until the library successfully implemented
its own Web log analysis software in August 2004.

During my third assessment, the Webmaster and I studied
                          our site using task-based usability testing. This is
                          where users are observed as they perform particular
                          tasks (or answer particular questions). They're encouraged
                          to "think out loud." It is often difficult to do this
                          type of testing without disrupting the user. In addition,
                          the observer often only gets snippets of behavior.
                          The task-based usability testing we conducted was very
                          successful and allowed us to make significant changes
                          to the library site. However, the process of building
                          and conducting the testing made me realize that we
needed to collect and analyze the data from our Web
server logs more efficiently and consistently. These
                          logs could provide the foundation for our analysis
                          because they provide the most continuous and complete
                          data about our library site. Additionally, by analyzing
                          these log files, the Webmaster and I would be able
                          to focus on specific areas of the site to improve.
                          Having come to this conclusion, I began to investigate
how to collect and analyze our logs.

Seeking Electronic Tracks

Our first step was getting the server to generate
                          logs. Almost all Web servers have the capability to
                          do this; the function just needs to be turned on. The
                          crucial piece for us was making sure the data we wanted
                          to analyze was in the log files. After doing some research,
                          I learned that most log analysis tools want very similar
                          information; you can configure the server to collect
specific information in your log files.

Additionally, many servers let you control whether
                          the log files are collected daily, weekly, monthly,
                          and so on. Based on my reading, I set up our server
                          to collect log files on a daily basis. Each daily log
file contains many lines of information. Each line
                          represents a user's interaction with the library's
                          Web site. A line in the log file typically contains
                          information like the date and time, the IP address
                          of the user, the page being accessed, and the browser
                          and operating system being used by the person visiting
the site.
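For example, a single entry in a daily log might look something like the line below (the example follows the W3C extended log format; the date, address, page, and browser shown here are made up purely for illustration):

    2004-09-15 14:32:07 192.0.2.45 GET /library/index.html 200 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)

Reading left to right, that line records the date and time of the request, the visitor's IP address, the method and page requested, the status code returned, and the browser and operating system the visitor was using. Exactly which fields appear, and in what order, depends on how the server's logging is configured.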
Recognizing Users' Electronic Footprints

Simply collecting Web server logs wasn't enough.
In order to follow our virtual visitors' footprints,
                          we needed a Web log analysis tool. While it is possible
                          to look for patterns in the server log files yourself,
                          analysis tools make this task much easier. These tools
                          take the raw data in the log files and look for patterns
                          such as which pages are most or least visited. The
                          end result is an aggregation of data transformed into
                          a useful set of reports about what is going on in your
virtual forest.

Today, there are a variety of log analysis tools
                          available. These tools have many similarities but can
                          range in features and price. Since we had been receiving
                          log analysis reports from a consultant, I understood
                          the types of information these tools provide about
                          a Web site. Also, I knew that at the very least the
                          library would need the most basic analysis data, like
                          number of hits and visitors, referring sites, referring
                          search engines, and search key phrases. I found an
                          array of Web analysis tools by searching Google. Since
                          cost was a factor, I specifically searched for open
                          source tools and compared these with commercial solutions.
                          In addition, I spoke with several other Web managers
                          and the consultant who had been providing us with log
                          analysis data. This gave me insight into the reports
                          provided, cost, technical requirements, difficulty
                          of installation, configuration, maintenance, and available
features.

My research revealed that almost all log analysis
                          tools provide the basic set of data I was hunting for.
                          There is a great range in prices of log analysis packages.
                          Some (like WebTrends) cost several hundred dollars,
                          while others (such as Analog) are free. The pricing
models also differ from one tool to another; some
                          packages are priced by the number of servers being
                          analyzed, others by the volume of traffic (number of
                          page views) analyzed. Certain tools allow the Webmaster
                          to access and run them remotely. Different tools produce
reports in HTML, PDF, or other formats.

WebTrends is probably the best-known log analysis
                          tool for businesses. However, it costs close to $500
                          for a small-business license. SurfStats and Sawmill
                          are other commercial packages that cost about $100
                          for a single-server license. In addition, I found three
                          open source log analysis tools that seemed worth investigating:
AWStats, Analog, and Webalizer.

Choosing a Tool for Tracking and Analysis

When it came to choosing the tracking tool to analyze
                          our Web logs, price was an important factor. I had
                          wanted to perform this analysis for some time; however,
                          funding it had never been an organizational priority.
                          So, software was never acquired, and our log files
                          had never been consistently analyzed. Many Web managers
                          recommended WebTrends as a solution. However, while
                          WebTrends provides extremely in-depth information,
                          I could not justify the cost for the types of data
                          the library managers were interested in. The problem
                          was not only one of initial cost but of upgrades. Other
                          Web managers told me that WebTrends and many other
                          log analysis tools are upgraded frequently. This would
                          mean a software investment every other year (if I chose
                          to skip a version or two). While SurfStats and Sawmill
                          provided a lower-cost alternative, the upgrade cost
                          was still a factor. In addition, these products were
                          licensed per Web site, meaning we would need to purchase
four licenses to cover our four sites.

As a result, my search for a log analysis tool turned
                          to open source solutions. Currently, there are at least
                          a dozen available. In selecting a tool for Memorial
                          Library, I looked at Analog, Webalizer, and AWStats.  
• Analog is a C-based analysis tool that can be run from a Web page or the command line. It is probably the most popular of these three open source Web analysis products. Its features are comparable to AWStats', but it provides no information about visits. Additionally, it does not archive the data gathered from the server logs in a format that can be analyzed by another product such as a database.

• Webalizer is a C-based analysis tool that has to be run from the command line. While it provides the same basic data as the other two tools, it doesn't report users' operating systems or the search engines they may have used to find the site.

• AWStats is a Perl-based log analysis tool that can be run from a Web page or the command line. It has more reports about "visits" than the other two tools, including tracking where users entered and exited the site.

Based on this comparison, I decided to use AWStats because of its versatility and ability to be extended.

Setting Up Tracking Gear

Next, I needed to implement AWStats on our Web servers.
                          The first step in this process was to download the
                          program and the accompanying documentation from the
                          Web (http://awstats.sourceforge.net).
                          After reading the documentation, I became a little
                          concerned that I might lose data during installation.
                          Therefore, I contacted the consultant I work with and
                          asked him if he could help me get the software installed
                          and properly configured. Together we decided the next
                          step was to configure the server logs to match the
                          format preferred by AWStats. This meant configuring
                          the server to collect the following data within the
                          log files:  
• date
• time
• c-ip (client IP address)
• cs-username (authenticated user name, if any)
• s-ip (IP address of the Web server)
• cs-method (method, i.e., GET or POST)
• cs-uri-stem (the path to the file accessed)
• cs-uri-query (the query string, if any)
• sc-status (the status code sent back by the Web server, i.e., 404-file not found or 200-OK)
• sc-bytes (bytes sent)
• cs-bytes (bytes received)
• time-taken (time taken to serve the request)
• cs-version (protocol version used)
• cs(User-Agent) (operating system and Web browser the patron used to access the site)
• cs(Cookie)
• cs(Referer) (the page the user came from when accessing the current page)
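On a server that writes its logs in the W3C extended format (as IIS does), these selections appear as a #Fields directive at the top of each log file. The line below is only meant to illustrate the general form; the exact order of the fields depends on the server configuration:

    #Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)

Each entry in the log then lists these values, separated by spaces, with a hyphen standing in for any field that has no value.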
Next we decided to test AWStats on a sample set of
log files from my server without actually installing
                          the software on our Web server. When this test was
                          successful, the consultant installed AWStats on a test
                          server. During this process, we could not make the
                          Web-based interface for updating the statistics work.
                          However, I was not sure I needed "real-time" data,
                          so we journeyed on. The final step in the process was
                          installing and configuring AWStats on our Web server.
                          The three-step installation process for AWStats was
                          relatively simple: 1) Install Perl if necessary (because
                          AWStats is a Perl-based program); 2) install AWStats;
                          3) configure AWStats. This required altering several
                          values in the configuration file that controls AWStats.
                          Once this was done, we were ready to start analyzing
log files.
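To give a sense of what that last step involves, the handful of values we had to edit in the configuration file are of this general kind. The directive names below are AWStats'; the paths, log file name, and domain are hypothetical placeholders that would have to be adjusted for any local installation:

    # Illustrative values only; adjust paths and host names for the local install
    LogFile="D:\Logs\W3SVC1\ex040915.log"
    LogType=W
    # LogFormat=2 tells AWStats to expect W3C extended (IIS-style) log files
    LogFormat=2
    SiteDomain="library.cortland.edu"
    DirData="D:\awstats\data"

Once a configuration file exists, the statistics database can be updated from the command line with a call along the lines of perl awstats.pl -config=library -update, where "library" names the configuration file.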
Because AWStats is run from the command line, the
consultant and I developed a couple of batch files
                          that made interactions with the program easier. One
                          batch file allows us to control the date range for
which the log files are analyzed. A second batch file
                          allows the analysis statistics to be automatically
                          updated. I have chosen to check statistics on a monthly
                          basis. In order to do this, a batch file is used in
conjunction with Windows Scheduled Tasks; it is
                          set to run at 11:59 p.m. every day. When the batch
                          file runs, it updates the statistics and reports for
                          the current month and places the reports in a folder
                          for that month. This allows SUNY Cortland to have up-to-date
                          statistics with no human intervention. Additionally,
the statistics can be updated on demand by running the batch file manually if necessary.
                          All of the Web log analysis reports are automatically
                          made available via the library's intranet; this allows
                          me, the Webmaster, and the director to access different
                          pieces of information about the Web site's usage when
necessary.
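For anyone setting up a similar routine, the scheduled batch file is conceptually along the lines of the sketch below. This is not our exact script; the paths, the configuration name "library," and the output folder are placeholders, and the date parsing assumes the U.S.-style MM/DD/YYYY output of the %date% variable:

    @echo off
    rem Update the AWStats statistics database from the latest server logs
    perl C:\awstats\cgi-bin\awstats.pl -config=library -update

    rem Build a folder name for the current month (assumes %date% looks like "Wed 09/15/2004")
    set OUTDIR=D:\intranet\stats\%date:~-4%-%date:~4,2%
    if not exist %OUTDIR% mkdir %OUTDIR%

    rem Write the current month's HTML report into that folder
    perl C:\awstats\cgi-bin\awstats.pl -config=library -output -staticlinks > %OUTDIR%\awstats.library.html

Scheduling something like this with Windows Scheduled Tasks at 11:59 p.m. each night keeps the month's report current without any manual intervention.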
Realizing the Benefits of Doing Our Own Tracking

Having our own Web log analysis statistics has had
distinct benefits. First and foremost, we now have
                          statistics about our Web site available on demand.
                          As a result, I am able to more easily answer questions
                          about visitors to our Web site. Since we are no longer
                          reliant on an external source for our data, we are
                          able to gather the data we want in the format we want.
                          Moreover, we can change the data we are gathering as
                          needed. Another advantage is the fact that we have
                          a more complete record of how our Web site is being
used. Prior to implementing AWStats, we were only analyzing
log files for our main Web site. Now we are able to
                          run usage statistics for all of our sites. This provides
                          us with greater insight into the overall patterns of
user behavior across the library's electronic forest.

As a result of analyzing our server logs, we have
                          learned several interesting things about our users'
                          behavior. There are a few pages in our site that people
                          utilize more than others. The most surprising of these
                          is a page that lists the library's periodical holdings.
                          The heavy use of this page has emphasized the importance
                          of creating complete holdings for our journals in the
                          Web catalog. Additionally, users prefer the alphabetical
listing of the library's databases to a list of full-text
                          databases or a list of databases by subject. Data collected
                          from the server logs also revealed that most users
                          access our site while on campus. This is interesting,
                          considering a significant number of students live off
                          campus. Another important discovery is that most users
                          come to the library's site directly rather than through
                          an external source like a search engine or a link on
another site.

The information I've obtained from analyzing the
                          server logs has taught us many intriguing things about
                          our users and has created as many questions as it has
                          answered. Nonetheless, the data has been invaluable
                          in making decisions about Web-based services. We have
                          found many practical applications of Web server log
                          data, including designing future usability studies.
                          All of these endeavors have helped us to improve the
                          overall quality of our Web site. However, none of this
                          would have been possible without AWStats. This demonstrates
                          that there are low-cost solutions that can yield big
results for small and medium-sized libraries.

Further Reading
Bailey, Dorothy (2000). "Web Server Log Analysis" (http://slis-two.lis.fsu.edu/~log).

Fichter, Darlene (2003). "Server Logs: Making Sense of the Cyber Tracks," ONLINE 27 (5): 47-55.

Haigh, Susan and Megarity, Janette (1998). "Measuring Web Site Usage: Log File Analysis," Network Notes 57 (http://www.collectionscanada.ca/9/1/p1-256-e.html).

Kerner, Sean Michael (2003). "Handle Log Analysis with AWStats," Builder.com (http://builder.com.com/5100-6371-5054860.html).

Open Directory Project: http://dmoz.org/Computers/Software/Internet/Site_Management/Log_Analysis/Freeware_and_Open_Source.

Rubin, Jeffrey (2004). "Log Analysis Pays Off," Network Computing 15 (18): 76-79.

Karen A. Coombs is the electronic services librarian
                        at SUNY Cortland in N.Y. She holds an M.L.S. and M.S.
                        in information management from Syracuse University in
                        N.Y. In addition to developing and maintaining the library's
                        Web applications (SFX, ILLiad, and OPAC), she is responsible
                        for implementing and maintaining the library's electronic
                        resources. Coombs is the author of the Library Web Chic
                        Weblog (http://www.librarywebchic.net) and
                        has published articles in Computers in Libraries and Journal
                        of Academic Librarianship. Her e-mail address is coombsk@cortland.edu.