A Plain Text Metamorphosis: Converting Search Results to HTML

Winfred Ark and Sue Park

ONLINE, November 2000
Copyright © 2000 Information Today, Inc.

"Change is inevitable, except from vending machines."
–Anonymous

As information specialists, we often focus on the quality of the information we deliver. We select the most appropriate databases, formulate optimal search strategies, and edit results to retain the most relevant citations. What is often overlooked, however, is the format in which this information is delivered and the impact it can have on the overall utility of the information sent. With the development of the Web and the exploding use of the Internet and corporate intranets to deliver information, information specialists have exciting new opportunities to enhance the value of the results they deliver. Rather than static, ASCII-based text, the Web provides the ability to deliver more dynamic, interactive information. The Genentech Library developed a project to transform typical text-based online search results downloaded from popular vendors, such as Dialog, STN, and DataStar, into HTML format and enhance them using currently available Web technology. Results are sent to end-users in a format they can view and manipulate within a standard Web browser like Netscape Navigator.

GENENTECH LIBRARY

The Genentech Library serves the staff of Genentech, Inc., a leading biotechnology company involved in the research, manufacture, and sale of pharmaceuticals. The online search group within the Genentech Library is composed of seven full-time online searchers working in the areas of business, patents, chemistry, biology, and medicine. In a typical year, 1,500 or more literature searches are run at the request of research scientists, pharmacists, clinicians, patent attorneys, marketing staff, and business development groups. The searches are run on a variety of systems, including Dialog, DataStar, STN, LEXIS-NEXIS, Ovid, and others. The computer systems used at Genentech are diverse and include mainframes using the UNIX operating system, plus a mix of desktop Macintoshes and PCs.

In addition to the online search group, we have five full-time staff members in technical services whose jobs include working to fulfill the potentially overwhelming number of document requests that are generated by these searches. The Library is also fortunate to have a computer programmer on staff, a contributor to this article, Sue Park, who willingly facilitates all the wild projects we dream up for her.

Prior to the development of our Web-based delivery method, online search results were either sent directly to the patron in the body of an email or sent as a Microsoft Word email attachment. In a smaller proportion of requests, the search results were printed and sent via interoffice mail or picked up by the patron in the Library. These online search results typically were composed of citations and abstracts downloaded from one of the online vendors, such as Dialog, and saved as a text file.

THE EXPERIMENT

In late 1996, we began experimenting by taking the standard text-based downloads from the vendors we were using, converting them into HTML-based Web documents, and storing them on a central server. Instead of sending a Word document, we sent patrons an email containing a URL. Using their Web browser, users linked from the URL to the HTML-based search results. The main advantage of this method of delivery was that it provided links that automatically directed requested citations to the Library's in-house document delivery service. Our patrons could then request copies of the full text of articles from their online search results just by clicking on citations of interest. These citations were automatically entered into a database of Library document orders and printed out to be filled by our in-house document delivery service.

Prior to the development of this system, ordering multiple articles meant a patron had to either type by hand or cut and paste citations from a Word or other text document into a Web-based order form on the Library's home page. This worked well for orders requiring only a few citations, but for very large document orders, the process of cutting and pasting became tedious and time-consuming. Automated ordering also reduced the delays resulting from typos and other citation errors that occurred when patrons entered the data into the ordering page by hand.

GETTING WITH THE PROGRAM

Once an outline for the idea was in place, the next big step was to write the program that would handle all of these tasks. Fortunately, our computer programmer understood the project's requirements and was fluent in today's tools for the Web: HTML, JavaScript, and Perl, a popular scripting language often used to write the CGI (Common Gateway Interface) scripts that make Web pages interactive. Working closely with our programmer and going back and forth through various prototypes was critical during the development phase of this project. We also brought in selected high-usage Library patrons for a focus group to provide feedback on what the final output of the program should look like.

Parsing All Data

The resulting program, which we call CONVERT SEARCH, took about a year to develop, and we continue to make improvements to it. The program was written mainly in Perl. Given a text file of journal citations, it first parses the file, breaking it down into manageable chunks. It identifies the beginning and end of each record, which is composed of a single journal citation, plus abstract if available, and assigns each record to an array (a type of organized arrangement) in computer memory. It then breaks down each record into a sub-array that identifies the title, journal, author, and other fields that make up the record. Once the records are held in an array in computer memory, the program can manipulate them in ways that are not possible with a simple, flat text file.

One of the keys to parsing the data was the use of tagged records. In Dialog, this format is obtained by simply typing the word "tag" after a normal type command. Records from STN and DataStar come with two-letter tags by default. The computer program, for instance, can identify the location of the title by looking for the two-letter TI tag with a blank space before and after it. Other tags, such as AN for the accession number, identify where one record begins and another ends. During development of the program, the search staff checked all of the Library's most commonly used databases on the major vendors to determine which two-letter tags the program would need to recognize.
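
The parsing step described above can be sketched in a few lines. This is an illustration in Python, not the Library's Perl code, and it rests on assumptions: each field is taken to start on its own line with a two-letter tag followed by a blank, the AN tag is taken to open every record, and untagged lines are treated as continuations of the previous field. The names TAG_LINE and parse_records are hypothetical.

```python
import re

# Assumed line shape: optional leading blank, two-letter tag, blank(s), value.
TAG_LINE = re.compile(r"^ ?([A-Z]{2}) +(.*)$")


def parse_records(text, start_tag="AN"):
    """Split tagged text into a list of records, each a dict of tag -> value."""
    records = []
    current = None    # record currently being built
    last_tag = None   # field that untagged continuation lines belong to
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            if tag == start_tag:
                # The start tag (assumed to be AN) opens a new record.
                current = {}
                records.append(current)
            if current is None:
                continue  # ignore anything before the first record
            current[tag] = value.strip()
            last_tag = tag
        elif line.strip() and current is not None and last_tag is not None:
            # Untagged, non-empty line: append to the previous field.
            current[last_tag] += " " + line.strip()
    return records


SAMPLE = """\
 AN 0001
 TI First sample title
    continued on a second line
 AU Doe, J.
 AN 0002
 TI Second sample title
"""

records = parse_records(SAMPLE)
```

Real vendor output varies in its tag layout, so the regular expression would need adjusting for each source. A continuation line that happens to begin with two capital letters and a blank would also be misread as a tag by this simple pattern.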

Creating HTML Files for Viewing

After parsing the data, the program takes the citation information and rewrites it to a new file, adding all the necessary HTML tags that enable it to be viewed in a Web browser. Actually, several HTML files are written. One file contains just a list of titles that allows for a special browsing feature. Another file contains the full citations that are viewed by the patron, and another contains just the citation information needed by our document delivery service, omitting abstracts, indexing, and other parts of the record not needed for ordering. A special file called a "frame file" ties these separate files together and specifies that the file containing the titles will go at the top of the Web page in an upper frame and the file of full journal citations at the bottom in the lower frame. The program assigns a unique URL to each of these files; it is the URL of the special frame file that is eventually sent to the patron.
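
As a rough illustration of how a titles file, a citations file, and a frame file might fit together, here is a minimal Python sketch. It is not the CONVERT SEARCH code: the file names, the 30/70 frame split, the anchor names, and the build_pages function are all assumptions, and the separate ordering file is left out for brevity.

```python
from html import escape


def build_pages(records):
    """Return {filename: html} for a titles page, a citations page, and a frame file."""
    title_items = []
    citation_items = []
    for i, rec in enumerate(records, start=1):
        title = escape(rec.get("TI", "(no title)"))
        # Each title links to the matching anchor in the lower frame.
        title_items.append(
            f'<li><a href="citations.html#rec{i}" target="citations">{title}</a></li>'
        )
        # Show every field of the record as a tag/value pair.
        fields = "".join(
            f"<dt>{escape(tag)}</dt><dd>{escape(value)}</dd>"
            for tag, value in rec.items()
        )
        citation_items.append(f'<a name="rec{i}"></a><dl>{fields}</dl><hr>')

    titles = "<html><body><ol>" + "".join(title_items) + "</ol></body></html>"
    citations = "<html><body>" + "".join(citation_items) + "</body></html>"
    # The frame file places titles in the upper frame and citations in the lower one.
    frames = (
        '<html><frameset rows="30%,70%">'
        '<frame src="titles.html" name="titles">'
        '<frame src="citations.html" name="citations">'
        "</frameset></html>"
    )
    return {"titles.html": titles, "citations.html": citations, "index.html": frames}


pages = build_pages([{"AN": "0001", "TI": "Proteins & peptides"}])
```

Returning the pages as strings keeps the sketch easy to check. A real program would write each one to the server and send the patron the URL of the frame file.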

The files are stored on the server for a period of six months, giving the patron more than adequate time to view, print, and/or save their results. After six months, the files are deleted and the URLs are no longer valid. The program can also automatically create and email a separate Microsoft Word document containing a duplicate copy of the online search results to the patron. This was done in response to some users who still wanted to have a Word document they could archive for later reference. By receiving results in both HTML and Word formats, the patron has complete flexibility in terms of viewing, ordering, printing, and archiving. The use of a tagged format also facilitates the use of bibliographic citation management software, such as EndNote.
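
A six-month retention rule like the one described above could be enforced by a small housekeeping routine. The Python sketch below is only an illustration: it assumes that file age is judged by modification time, that 183 days approximates six months, and that all result files sit in one directory. The function names are hypothetical.

```python
import time
from pathlib import Path

SIX_MONTHS_SECONDS = 183 * 24 * 60 * 60  # assumed approximation of "six months"


def find_expired(directory, max_age=SIX_MONTHS_SECONDS, now=None):
    """Return the files in `directory` last modified more than max_age seconds ago."""
    now = time.time() if now is None else now
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and now - path.stat().st_mtime > max_age
    )


def purge_expired(directory, **kwargs):
    """Delete expired files and return how many were removed."""
    expired = find_expired(directory, **kwargs)
    for path in expired:
        path.unlink()
    return len(expired)
```

Separating the search from the deletion allows the list of expired files to be reviewed before anything is removed.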

BETTER THAN WEB-BASED SEARCHING

Although Web-based interfaces to online vendors have been developed since we started this project, allowing a searcher to download results directly in HTML format, the online search group has not widely adopted this method of searching. First, converting results from ASCII to HTML is really just a starting point for enhancing online search results. As discussed earlier, our program does additional parsing and manipulation of the file, so downloading results in HTML would not, in and of itself, be of great advantage. It's the added ordering links and browsing features that give the program real value. Second, despite improvements in the reliability and speed of the Web, the Web-based methods of searching do not match classic command language for speed and performance. Third, online searchers, like anyone, can be creatures of habit and simply prefer the command language they have used for years. The program we have created allows searchers to continue to use traditional text-based classic command language, but still send results to patrons in a more useful Web format.

WHAT YOU SEE

When the patron opens the URL sent to them, the browser displays a window with three frames. The top frame acts as a control panel, with buttons that facilitate the order process. The middle frame contains the reference titles and was added after some research scientists mentioned they liked to scan through the titles first as a method of prescreening their results before reading the abstracts.

Clicking on a title link in the middle frame jumps the user down to the location of the associated citation and abstract in the bottom frame. There are also scroll bars alongside each frame, allowing a user to move quickly to any location within the frame.

To further increase user readability, the program reorganizes the citation prior to delivery, placing the subject-rich title and abstract at the beginning and specific journal reference information at the bottom. The program also gives the online searcher the ability to put in special highlighted notes at the top of each citation. A cover letter describing the contents of the search, the databases that were searched, the strategy, and any other relevant details or analysis can also be added at the very beginning of the search by the online searcher. All of this is designed to help the reader quickly process the search in a highly efficient manner.
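
The reordering described above amounts to moving a few fields to the front of each record and a few to the end. Here is a minimal Python sketch of that idea, assuming each record is held as a mapping of two-letter tags to values. The choice of TI and AB as the leading tags and AU and SO as the trailing tags is an assumption for illustration, as is the function name.

```python
LEAD_TAGS = ("TI", "AB")  # assumed tags for title and abstract
TAIL_TAGS = ("AU", "SO")  # assumed tags for author and journal source


def reorder_fields(record, lead=LEAD_TAGS, tail=TAIL_TAGS):
    """Return (tag, value) pairs with lead tags first and tail tags last."""
    lead_pairs = [(tag, record[tag]) for tag in lead if tag in record]
    tail_pairs = [(tag, record[tag]) for tag in tail if tag in record]
    # Everything else keeps its original relative order in the middle.
    middle = [
        (tag, value)
        for tag, value in record.items()
        if tag not in lead and tag not in tail
    ]
    return lead_pairs + middle + tail_pairs


example = {"AN": "0001", "AU": "Doe, J.", "TI": "A title", "SO": "A journal", "AB": "An abstract"}
ordered = reorder_fields(example)
```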

WHAT YOU GET

Patrons can request photocopies of single or multiple articles by clicking the appropriate boxes next to each title in the middle frame. When the order button in the top control panel is clicked, they are asked to enter their company email login name and password. Logging in enables a program to automatically retrieve all of the necessary identification information of the patron, including their name, phone extension, and company mailstop from a central employee database. Again, this saves the patron time by eliminating the need to fill out a Web-based form. A combination of JavaScript and Perl is used to enter patron identification and all the necessary citation data into an in-house database that stores all of our document request information. The new document orders are printed twice daily and either fulfilled in-house if we have the material requested or sent to an outside document delivery service. When the system receives an order, a confirmation email is automatically sent to the patron indicating which citations were ordered.
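
To make the order flow more concrete, the sketch below records one row per selected citation and builds the text of a confirmation message. It is an illustration only: the article's system used JavaScript and Perl with an in-house database and a central employee lookup, whereas this Python version assumes an SQLite table with a made-up schema, takes the patron details as a plain dictionary, and stops short of sending any email.

```python
import sqlite3


def open_order_db(path=":memory:"):
    """Open (or create) a hypothetical orders database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id INTEGER PRIMARY KEY, login TEXT, mailstop TEXT, citation TEXT)"
    )
    return conn


def record_order(conn, patron, citations):
    """Store one row per selected citation and return a confirmation message body."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO orders (login, mailstop, citation) VALUES (?, ?, ?)",
            [(patron["login"], patron["mailstop"], c) for c in citations],
        )
    lines = [f"Order received for {patron['login']}:"]
    lines += [f"  {n}. {c}" for n, c in enumerate(citations, start=1)]
    return "\n".join(lines)


conn = open_order_db()
message = record_order(
    conn,
    {"login": "jdoe", "mailstop": "MS-1"},
    ["Sample citation A", "Sample citation B"],
)
```

The parameterized INSERT statement keeps citation text with quotes or other punctuation from breaking the query.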

LINKS TO FULL TEXT

As more full-text electronic journals became available, an additional button was added to the control panel, allowing users to get to the Library's list of available e-journals containing direct links to their Web sites. We have not yet linked the individual citations directly to full-text PDF documents, but clearly that is the next logical step we would like to take. We're examining software products, such as MDL's LitLink, which links citations to their full-text equivalents on publishers' Web sites. At that point, our researchers would, in effect, send clients a virtual stack of full-text documents rather than a list of citations that would require additional work to obtain the full text.

IMPACT

We launched CONVERT SEARCH in late 1997, and it has become tremendously popular among patrons as a format for receiving search results. In 1998, the first full year of its availability, over one-third of the searches run in the scientific literature were sent out in this new Web-based format by request. This number has steadily increased each year since its introduction. Currently, over 50% of our scientific literature search results are sent via CONVERT SEARCH. These include large, exhaustive searches containing hundreds of citations used to support FDA filings, searches used to respond to outside requests for information by medical doctors working with our marketed products, and searches for scientists preparing scientific publications. The ability to scan through large numbers of citations and rapidly order those of interest from the Library makes it an ideal format for lengthy searches that will eventually result in large numbers of requests for photocopies.

Patrons comment on a number of advantages to getting online search results through the new program. Most often mentioned is the ease of ordering documents online. Also mentioned are the ease of reading, the ability to scan a list of titles, and the ability to quickly navigate through a set of records. Ease of opening is another advantage, since opening a Word document with different versions of Word on different platforms can create formatting problems.

Would the new service result in rampant document requests? We weren't overly concerned about uncontrolled usage since the Library charges back the cost of article orders to the requestor's cost center.

What we found, though, was that there was a significant increase in document orders in 1998, the first full year the system was available. Document orders went from an average of just under 2,500 a month to over 3,000 during the second and third quarters of 1998 and remained fairly high through early 1999. Whether that increase was due to the introduction of CONVERT SEARCH is difficult to say, since factors such as ongoing company research projects, litigation, and other events can create heavy demands for documents. Since that time, however, monthly document requests have declined somewhat.

What caused this? Assuming the initial increase was due to the introduction of the new ordering system, we could speculate that the increase in the ability to get full-text journal articles in PDF has recently offset any increases in document copying requests. One other possibility is that the novelty of rapidly ordering documents wore off once people realized they did not have enough time to read them all!

FUTURE POSSIBILITIES

Scientific literature searches using databases, such as MEDLINE, BIOSIS, and EMBASE, have been the easiest searches to adapt to the HTML method of delivery because their citation format is consistent across databases and vendors. Search results that we do not currently send as Web documents, but are now working on, include patent searches run in patent files, such as Derwent's World Patent Index and Inpadoc. The complicated nature of patent families, application numbers, and the inconsistent way these are handled between the various patent databases make this a tricky area in which to automate ordering. The tagged format for patents that the program needs in order to decipher the citation can be particularly messy and difficult to read. Despite these difficulties, we plan to make patent searches available in this format in the future, with the thought of eventually linking to the full-text patents now readily available via the Web.

Business searches run in files, such as PROMT and ABI/Inform, which often are already downloaded in full text from online vendors, are another area that has made selective use of this format. Searches containing chemical structure diagrams and graphs, charts, or tables have not yet been adapted, though this is in development as well. As a whole, we currently send out the results of nearly 25% of all online searches in this format. This number has been increasing and is expected to continue to increase as we develop the system further to handle various types of searches.

So what do we have planned for the future of CONVERT SEARCH? Entering the online results into an Oracle database would allow greater real-time manipulation of the data by the end-user. Sorting or ranking citations could be done by the end-user even after results have been downloaded from the vendor. Visualization tools, such as graphs and tables, could be created as needed. As bandwidth on the Web increases and full-motion video becomes common, annotations with video and sound from the researcher might be added to help explain various portions of the search. Currently, additional URL references that are live links to Web sites are being added to our search results. Online search results of the future will likely become an interlinked document of in-house resources, citations from traditional online databases linked to full-text documents, and Internet resources rich in multimedia.

FINAL THOUGHTS

The key to the Web is interactivity. We can use it to give patrons the power and ease to do many things for themselves. It is not simply taking what we did in the text-based world and slapping it into an HTML page, but taking advantage of the full power this new medium has to offer. Despite the proliferation of end-user searching tools, the information professional still has a valuable role to play. The ultimate vision is that we use our expertise to give our customers high-quality, targeted information from the best sources and then package this information in ways that make the information useful, informative, and fun and engaging to use. It's kind of like the cherry on top of an ice cream sundae or the bow on top of a present. It completes the package we deliver. All of the various possibilities take us a long way from the plain text format documents we were limited to sending in the past, but of course, there is still a long way we can go.

Winfred Ark (win@gene.com) and Sue Park (spark@gene.com) are, respectively, Information Specialist and Senior Programmer Analyst in Library and Information Services at Genentech, Inc.

Comments? Email letters to the Editor at editor@infotoday.com.
