A Checkup With Open Source Software Revitalizes an Early Electronic Resource Portal
Stephan Spitzer and Stephen Brown
The Uniformed Services University of the Health Sciences, located on the National Naval Medical Center’s campus in Bethesda, Md., is a medical education and research facility for the nation’s military and public health community. In order to support our approximately 7,500 globally distributed users, the university’s James A. Zimble Learning Resource Center provides Web-based access to more than 8,000 electronic resources through one of the first electronic library portals that was ever on the Internet. But after 9 years of service, our portal “patient” was starting to show its age.
We provide the IT development and support for the various systems within our library. Stephen Brown, project architect and director of our group, and Stephan Spitzer, lead developer and database administrator, worked with our development team of two others to revitalize this application suite. While we made many innovations and improvements to our system, our discussion will focus on the performance enhancements that used open source software and various programming techniques.
A Physical Exam Begins With the Patient’s History
First deployed in 1997, Remote Computer Services (RCS) was a Web-based collection of electronic library resources, including full-text journals and books, indexes to the published literature, medical databases, and other medical education resources. (See Figure 1.) Given that our user population worked under less-than-optimal network conditions (sometimes in combat situations), we designed RCS to function under low-bandwidth constraints. It offered a single sign-on for seamless access to all resources. Because our patrons were largely unfamiliar with the Web in 1997, we provided extensive help and guidance for all of our eclectic resources and eventually made inroads with linking full text to search engines.
Our formula was successful. With proper protections in place for our publishers, RCS began to function as a consortium tool, modularizing and providing custom interfaces to a number of related medical and research institutions, even outside of the Department of Defense. At its peak, RCS had amassed more than 8,000 registered users in all areas of health and medicine.
RCS evolved, growing like other Web applications, from static HTML to dynamic, database-driven content. This content was pieced together with Apple’s FileMaker database manager and Blue World Software’s Lasso middleware (now owned by LassoSoft) and was wrapped in Perl CGI. At various stages, we employed proprietary client/server applications such as Aries Systems’ Knowledge Finder/Server to accompany our Web interface.
As our system became increasingly popular, its performance began to degrade. Our users’ mantra was, “I love your system, but it’s too slow.” For simplicity’s sake, our screens were largely primitive and utilitarian. Administration was getting cumbersome, and new development was taking conspicuously too long. Faced with the potential of additional consortium partners, it was time for a redesign.
Our Initial Diagnosis
By listening to our users, staff, and administrators and heeding the lessons that we had learned from RCS, we set out with the following design goals:
- High performance and scalability
- Replacement of proprietary development tools with open source alternatives
- Streamlined account processing
- Logical content management
- To put users in control of their account management and preferences, including server-side bookmarking, search history, and automated password recovery
- Intelligent site searching and multiple indexing strategies to resources, including support for synonyms and abbreviations
- Integrated services for material requests (interlibrary loan), e-reserve, and training registration
- Printer-friendly formats for site documents
- Enhanced system alerts, improved user communication, and a universal news instrument
- An engaging, intuitive, and usable design based on CSS (cascading style sheets)
Behind the scenes, we defined the following new methods:
- To replace the easily defeated HTTP_REFERER authentication with a digital ticket so we could seamlessly bind our proxies to our portal
- To streamline our account-processing procedures and offer more flexible forms of authentication by discarding our duplicative Microsoft Active Directory server
- To adopt a session model for user credentials
- To exploit the new technology so we could refine staff tools for content management, version control, error checking, and universal news administration
But as important and innovative as these changes are, our primary design goal remained improving our performance.
Plan of Treatment: Prescribing PHP and MySQL
The project began in summer 2005, with our then-lead-developer producing a working straw man application. There was then a hiatus during personnel changes. Stephan Spitzer became the new lead developer in fall 2005. The project team comprised one full-time (Stephan) and two half-time programmers (Raji Annamalai and Marisa Stoolmiller). Work proceeded from the original straw man, and the first release was essentially complete by fall 2006. We offered the new system in parallel with the old one until we gained enough confidence to stop the original. In order to distinguish the completely new system, we named it Electronic Resources (ER). The old system retained the name RCS. ER became our primary electronic resource portal in January 2007. (See Figure 2.)
Our initial plan was to use PHP as the programming language and continue to use FileMaker as our database server. PHP has many advantages. It was created just to support dynamic HTML page development. It’s fast to execute and easy to learn. Since PHP is a scripting language (as opposed to a compiled programming language), it’s cross-platform. In addition, it’s open source and free, so there are many readily available resources on the Web, including lots of documentation and examples. FileMaker, too, has its advantages, such as a very easy user interface and lots of staff experience with the software.
We soon found, however, that FileMaker was not going to provide the performance necessary for our application. We believe this was due to a convoluted interface to PHP. We made the decision to change database servers to MySQL, an open source (and free) product. MySQL is a high-performance server that has a much more efficient interface to PHP. In addition, the software has many available configuration tuning parameters, which we’ll discuss later. By just changing the database server, we were able to reduce our access time for 10,000 items from approximately 30 to 90 seconds to the 5- to 10-second range.
The trade-off is that our design became more complicated. In order to leverage our investment and experience, we kept FileMaker as our internal interface to the database. The patron application access is from MySQL via the Web. We purchased FmPro Migrator, an inexpensive product that allows us to transfer data between the FileMaker and MySQL database tables using automatically generated Perl scripts. We use the Open Database Connectivity standard (ODBC) to extract the FileMaker data and use Structured Query Language (SQL) to load the data to MySQL. While we had to manually tweak the migration scripts a bit in the beginning, this system has worked very well in production.
Of course, it isn’t just the application design that became more complicated. Nor does “free” software come cost-free. We now have two complete database systems that we have to host, support, and maintain. But we considered this acceptable in order to meet our performance criteria.
phpMyAdmin completes the open software solution. It allows us to administer MySQL databases using a browser. While you do need to learn some SQL in order to effectively use MySQL, many standard SQL commands have a simple graphical invocation. We also took advantage of the system run time and variable reports when we had to tune the system.
Adjustments to the Meds
As development proceeded, we discovered ways in which we were able to significantly boost the system performance.
One major enhancement was to make use of PHP’s output buffering. Our results listings are produced by using multiple echo() functions, which is a very time-consuming operation. With buffering on, only one actual output process is executed, and the application runs much faster.
MySQL offers many opportunities for tuning performance. Disk input/output (I/O) is much slower than any memory access. A standard rule of thumb is to keep as much of the database in memory as you can and avoid disk I/O whenever possible. Because our database table design is not optimized for MySQL, we have many table joins that do not use keys efficiently. The result is that many of our SQL queries produce temporary tables. We increased the temporary table size and were able to keep these tables in memory.
Another area we wanted to improve was the query cache. Since most of our patrons are searching more or less for the same items, they can realize major performance improvements if previous queries can stay resident in memory. We were able to adjust the query cache size so that most of our queries are available directly from memory. An unexpected phenomenon is database “warm-up.” This occurs as users query a fresh database. Since there’s nothing in the cache, each new query has to go to the database files for results. As each subsequent user repeats the same query, the results can be returned directly from the database cache. So the longer a database stays active, the faster the query responses become.
We did find one unexpected advantage to our new database design. By segregating the update functions in one database and the user queries into another, we now have an application database that is read-only. This means that the tables stay optimized without becoming fragmented from record updates. The result is that database accesses are fast and consistent.
One last major element in any Web application is the Web server. We are using Internet Information Server (IIS) version 6.0 from Microsoft, running under Windows Server 2003. We were able to get another performance increase by converting our PHP interface from the default Common Gateway Interface (CGI) to Internet Server Application Programming Interface (ISAPI). In short, CGI causes the PHP application to be loaded each time a script is executed, which is a slow process. The ISAPI is customized for IIS, so PHP can be loaded once and stay resident in the server’s memory.
We were able to squeeze one more improvement from IIS by means of HTTP compression. With this option set, if the requesting browser is compression-enabled, the server will compress any data sent. While there’s some cost for the compression and subsequent decompression, this is minimal compared to data transmission times.
After applying all of the above changes, we were able to reduce most MySQL database queries to sub-second responses from the original 5- to 10-second range.
The Treatment’s Side Effects
Once coding was complete and we started to prepare the system for patron testing, we were forced to really look at our electronic holdings database. It was a great opportunity to clean up the data as we checked for accuracy. The result is that our users have a much more accurate list of electronic resources available to them.
We have also tightened up our application error reporting to alert our support staff of errors before the patron knows about them. We implemented PHP so that any programming errors are not displayed to the user. Instead, we set it up to automatically save them to an internal log. A simple Perl script runs in the background and monitors this log every 10 minutes. When it finds that the log has been updated, it sends out an email so that we can begin problem determination very quickly.
We are using Google Analytics, a great product for keeping tabs on our Web site. It’s a free service that provides a tremendous amount of Web traffic statistics for your site. We have found the information that’s provided to be very helpful for understanding our usage patterns.
Along the way, we made many other improvements. Here are some highlights:
- We have made ER more central to our library by integrating an RSS viewer, OPAC print resources, training and user registration, and affiliate self-service usage reports into the application.
- We added many new options for searching and browsing our data.
- We customized application views based on user affiliation.
- We tightened security by sanitizing user data and running PHP in safe mode, which limits possible run-time damage.
What We Learned After We Released the Patient
As part of this project, we normalized our database tables. Normalization is the process of reducing data redundancy and usually involves separating the database into more tables. While this is good for data integrity and maintenance, executing table joins is costly during run time. In retrospect, we could have left our tables flatter and enjoyed faster performance.
The bookmark feature has proven to be very popular with our patrons. Users can directly access up to 10 favorite sites with just one click. But an even larger benefit is that each bookmark use is one less expensive search of the database, thus saving internal resources.
The biggest surprise in our project rollout was our longtime users’ resistance to change. We had created a system that would allow our patrons to do multiple searches, click directly through to a provider, and download a full-text article in the time that the old system would have still been waiting for the initial search to complete. Yet we had to beg them to use it. In fact, we had holdouts until the day we shut the old system down. The lesson we learned is that patrons have their own professional concerns. And as technically impressive as the new system is, users just want to get their data in the most familiar and simplest way possible for them, even if it’s slower.
Possible Future Therapies
Improvements are in the works for both PHP and MySQL. Microsoft has released FastCGI, a new PHP interface that promises higher performance. We tried to use this tool with IIS 6 but were not successful. We’ll try again once we migrate to IIS 7. MySQL has released new code for PHP called MySQL Native Driver for PHP (mysqlnd). We’ll try it once it has had time to mature.
One technique that promised a very big return is PHP caching. Since PHP is a scripting language, each time a script is executed it needs to be compiled into code that’s appropriate to the platform. This is a relatively slow process. In PHP caching, the compiled version of the script is saved in memory and re-executed each time that script is called. We tried to implement the Zend Platform for Windows, a commercial product, but were not successful. There are many open source PHP caching products available, and this is an area we’d like to revisit.
Our Assessment of the Patient’s Outcome
We have more than 6 months of experience using the new system, and we are very pleased with its performance. We are averaging approximately 1,600 visits a day, of which 700 are from unique users. During this period, the database response has remained consistent, even as we have increased the size of our electronic holdings. We believe that our current configuration can handle significant growth without jeopardizing user response.
Based on our patron feedback and user statistics, we feel that this project has been very successful. While there was some initial reluctance to undertake a relatively large development project using noncommercial software, we have been quite happy with our open source solution. Open source software has proven to be of high quality, providing both good performance and easy installation and maintenance. The lack of paid vendor support is more than compensated by the wealth of Internet resources available, including the open source provider’s Web site and other user postings. We now have a more vibrant application. It’s more aesthetically pleasing, is easily modified, and will allow us to grow in the future.
Stephan Spitzer has been database administrator and lead developer at the James A. Zimble Learning Resource Center’s Applied Medical Informatics Division and the Uniformed Services University of the Health Sciences in Bethesda, Md., since 2005. He holds an M.S. in library service from Columbia University, an M.S. in computer science from New York University’s Courant Institute, and a Post-Master’s certificate in computer science from Johns Hopkins University. His email address is email@example.com.
Stephen Brown is the director of applied medical informatics at the Uniformed Services University of the Health Sciences’ James A. Zimble Learning Resource Center and is an assistant professor of biomedical informatics. He has been actively engaged in bringing library services to the digital desktop since 1987. He holds a B.S. in computer science from the University of Maryland and a B.A. in biological sciences from the University of Delaware. His email address is firstname.lastname@example.org.
Authors’ Note: The views expressed are those of the authors and do not reflect the official policy or position of the Uniformed Services University of the Health Sciences, the U.S. Department of Defense, or the U.S. government.