[ONLINE] feature

Automating Enhanced Discovery and Delivery: The OpenURL Possibilities

David Stern

ONLINE, March 2001
Copyright © 2001 Information Today, Inc.


Imagine a search result page containing links to hundreds of related items, regardless of location, media format, or language. (Well, OK, maybe not language quite yet.) Your initial item is cross-referenced via hypertext links to other materials using descriptors such as:

Full-text materials (free or via document delivery)
Author searches (of books and/or articles)
Journal article subject searches (by controlled subject terms or "find similar" techniques)
OPAC book catalog searches
Citation analyses
ISI related records (citation cluster analysis)
WWW sites clustered by subject (Inference Find), by media type (SearchLight), or by semantic analysis (Oingo)
Visualizations (subject cartographies and concept lines)
Raw datasets (census info, GIS data)


How can you find distributed data from around the world in a variety of separate and non-associated databases? By running data mining processes (smart agents) against an OpenURL resource identification system. The core element is the OpenURL, a standard syntax that includes a pointer to an item-specific "location resolver" that identifies the appropriate host site, plus a consistent holder for item identification and/or item-description metadata.

This OpenURL syntax allows for interoperability by providing a simple and consistent way to identify where any item is found and how any item is described. In their technical explanation of the OpenURL, Herbert Van de Sompel, Patrick Hochstenbach, and Oren Beit-Arie ("OpenURL Syntax Description"; http://sfxit.exlibris-usa.com/openurl/openurl.html) describe the OpenURL as an HTTP GET request composed of an origin-description and an object-description, the latter containing a global-identifier-zone, an object-metadata-zone, and a local-identifier-zone. SFX is a software product developed as a PhD project by Van de Sompel at the University of Ghent, and acquired by Ex Libris, developer of integrated library systems. It facilitates a fully interlinked environment for scholarly information using context-sensitive linking techniques. OpenURL is the generic, public syntax used by SFX.
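As a rough illustration of the idea, an OpenURL can be assembled as an ordinary HTTP GET query string: a resolver address, an origin description, and the object-description metadata. This is only a sketch; the resolver hostname and the particular field names below are invented for the example, not taken from the specification.

```python
from urllib.parse import urlencode

def build_openurl(resolver_base, origin, metadata):
    """Assemble an OpenURL-style HTTP GET request: a resolver base URL
    plus an origin-description and object-description metadata pairs."""
    pairs = [("sid", origin)]           # origin-description: who sent this
    pairs += list(metadata.items())     # object-description: what is wanted
    return resolver_base + "?" + urlencode(pairs)

# Hypothetical resolver and citation metadata:
url = build_openurl(
    "http://resolver.example.edu/openurl",
    "ONLINE:demo",
    {"genre": "article", "issn": "1234-5678", "volume": "25", "spage": "42"},
)
print(url)
```

Because the whole request is just a URL, it can be embedded as a hypertext link in any search result page, which is what makes the syntax so portable.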


The "where" portion of the OpenURL syntax is accomplished by contacting an identified local resolver machine that maintains a validation system and a pointer to the appropriate location (and version) of your desired item. This determination is accomplished using customized look-up table templates based upon local access rules and user profiles.

Known resolver/known item –> local resolver –> host machine –> item delivery

Known Item Example OpenURL: http://www.pointerjournals.yale.edu?id=DOI%123-45-7654

The look-up table within the resolver "pointerjournals" would say that any DOI (Digital Object Identifier) starting with 123 should be sent to a specific full-text server (such as JSTOR). Examples of complex decision processes might include identifying the one "appropriate copy" among a set of possible versions: the PDF versus the enveloped HTML copy, for instance, or the copy on a standalone publisher host as opposed to an aggregator copy.
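A minimal sketch of such a look-up table follows; the DOI prefixes and host names here are invented stand-ins for real registrations.

```python
# Hypothetical look-up table: DOI prefix -> full-text host.
PREFIX_TO_HOST = {
    "123": "http://fulltext.example.org",    # e.g., a JSTOR-like archive
    "456": "http://aggregator.example.com",  # an aggregator copy
}

def resolve(doi):
    """Route a DOI to the appropriate host based on its prefix,
    mimicking the resolver's "appropriate copy" decision."""
    prefix = doi.split("-", 1)[0]
    host = PREFIX_TO_HOST.get(prefix)
    if host is None:
        raise LookupError(f"no host registered for DOI prefix {prefix!r}")
    return f"{host}/doi/{doi}"

print(resolve("123-45-7654"))  # sent to the full-text server
```

A production resolver would also consult local access rules and user profiles before picking a host, but the core mechanism is this kind of keyed table.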


The "description" is held in either a standard, known protocol (examples are DOI, SICI, and PII) or a series of standardized metadata fields with descriptors (values). The OpenURL extensible metadata syntax creates a standard format to store and query data, while at the same time providing a wide range of industry-specific metadata possibilities. Metadata syntax might include:

Material Type    RDF Format     Value Elements
Books            Dublin Core    LC Subject Headings
Visual Images    VRA Core       Name Authority
GIS Data         FGDC           Standard for

Two examples of metadata content layout would be:

  1. RDF metadata standards for journal citation elements, along with their associated properties and values:

    Journal Citation Metadata Descriptor Example OpenURL:

    http://www.pointerjourmet.yale.edu?author=smith, joyce&issn=1234-5678&title=lost_and_found



  2. Metadata values within descriptor fields, providing a basis of searching for items using their contained field-dependent variables:

    Image Item Metadata Descriptor Example OpenURL:

    http://www.pointermetaimage.yale.edu?creator=smith joyce&topic=hat&topic=blue


Search agents can quickly format the appropriate strategies to search across these metadata indexes on demand.
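Pulling the descriptor/value pairs back out of such an OpenURL is the receiving resolver's first task; repeated descriptors, such as the two "topic" fields in the image example, must be collected together. A sketch using that example (with the spaces URL-encoded as "+"):

```python
from urllib.parse import urlsplit, parse_qs

def extract_metadata(openurl):
    """Split an OpenURL into its resolver address and its
    descriptor/value metadata, keeping repeated descriptors."""
    parts = urlsplit(openurl)
    resolver = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return resolver, parse_qs(parts.query)

resolver, meta = extract_metadata(
    "http://www.pointermetaimage.yale.edu?creator=smith+joyce&topic=hat&topic=blue"
)
print(resolver)       # the resolver to contact
print(meta["topic"])  # ['hat', 'blue']
```

The resolver address tells the agent where to send the request; the parsed descriptor fields tell it which metadata indexes are worth searching.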

The identified local resolver would use the available metadata fields to determine which local metadata index or indexes should be searched. A search against each local metadata repository would find matches to the elements.


In the journal citation example mentioned earlier, the resolver "pointerjourmet" would either search its own local metadata indexes directly or hand the metadata off to other servers. In some cases, the metadata (and possibly additional associated metadata from a local server) will be captured and searched against other remote indexes (e.g., the Internet Movie Database or HotBot).


In the final (complex) example, the look-up table within the resolver "pointermetaimage" would say that additional metadata is found at server "imagedata", and that this total package of metadata should be sent to a specific image search server.

Known resolver/known data elements –> local metadata repository (capture metadata) –> local resolver (determine appropriate indexes) –> search index machine(s) –> local/remote resolver to find item hosts –> host machine –> item delivery

The search and retrieval of information across these complex and interacting indexes, resolvers, and host machines can occur over a variety of infrastructures. You can search across only local machines with a standard set of possible communication protocols. Alternatively, an information network can provide searching across a hybrid of local and remote servers, with only the OpenURL as the standardizing agent. Some search engines will provide direct links to items, while others may only provide additional metadata leading to other network resources before finding your final item(s).

The process would include the following steps:

  1. Perform an initial search on any search site.

  2. Limit your desired domain of results.

    First determine the important concepts, their synonyms, and the appropriate relational operators between the terms. This critical thinking step is the most important intellectual activity in the entire search process in terms of both precision and scalable resource utilization. The use of well-considered limitations to narrow the initial search to relevant materials will save a great deal of computer processing and reader review time. In addition to simple limitations such as media type, language, and year, other possibilities include discipline hierarchies, peer-reviewed material, and relevance ranking.

    A challenge to this approach will be the integration of interactive feedback processing into this scenario; perhaps that will be best accomplished by simply allowing users to make real-time connections to certain hosts during the process. For example, linking on the term "ISI" in the first search screen in this article will make a connection to ISI rather than clicking on the "223" to see the actual citations.

  3. Contact Service Providers (search indexes) and retrieve associated metadata or skip to source resolver (step #5) if requesting a known-item ID.

    Imagine a search agent that would start with the metadata from your journal citation (e.g., Dublin Core subject term elements "population-U.S.-Arkansas") and link this item to all related records in a GIS data repository index (FGDC elements "Arkansas-census-population").

    This ability to search for metadata terms across variant indexing schema provides for powerful linkage opportunities. The ability to map subject headings across different subject thesauri or hierarchies would provide even more powerful ways to perform sophisticated interdisciplinary searches.

  4. Present options (with embedded OpenURLs). This is the Related Links Result screen as shown in the Search Strategy table.

  5. Contact source resolver. This step will match the item identifier with the "appropriate copy" host.

  6. Contact host site. This sends a retrieve command to the identified host.

  7. Deliver item. The item is delivered to the end-user workspace.
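The seven steps above can be strung together as a rough pipeline sketch. Every function here is a hypothetical stand-in, with canned data in place of real search sites, service providers, resolvers, and hosts; it shows only the shape of the flow, including the step #5 shortcut for a known-item ID.

```python
# Hypothetical stand-ins for the services in steps 1-7.
def initial_search(query):             # step 1: any search site
    return [{"id": "doi:123-45-7654", "year": 2000, "title": query}]

def apply_limits(hits, min_year):      # step 2: narrow the result domain
    return [h for h in hits if h["year"] >= min_year]

def gather_metadata(hits):             # step 3: contact service providers
    return [dict(h, source="demo-index") for h in hits]

def contact_resolver(item_id):         # step 5: match ID to "appropriate copy" host
    return "http://host.example.org", item_id

def retrieve(host, item_id):           # steps 6-7: contact host, deliver item
    return f"{host}/deliver/{item_id}"

def discover_and_deliver(query, known_item_id=None, min_year=1999):
    if known_item_id is None:
        hits = apply_limits(initial_search(query), min_year)
        options = gather_metadata(hits)    # step 4 would present these options
        known_item_id = options[0]["id"]   # here we just take the first match
    host, item = contact_resolver(known_item_id)
    return retrieve(host, item)

print(discover_and_deliver("lost and found"))
```

Passing a known-item ID skips straight from step #2 to step #5, exactly as the text describes.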


The Service Providers featured in step #3 include search engines and associated databases that are used to locate matching metadata and item descriptors. Any search process is only as powerful as the data and search structure in the background. Some files may be simple flat files of data while other databases may be composed of complex elements and relational linkages.

Some advanced search and analysis search engines are now able to create visual maps of the concepts within large sets of data. In these enhanced search option cases, you would more likely link directly to the remote search engine for interactive searching, and then return to the Related Links page for further exploration when you were finished using the specialized remote search interface.


In addition to the smart agents that process the search and retrieve queries, among the most important elements of the OpenURL scenario are the local resolvers mentioned in step #5. These databases use sophisticated algorithms and look-up tables to link information resource data, user profiles, and local permissions. "Appropriate copy" issues are handled by populating pre-created templates with if/then pointers.

For example, once the appropriate hosts have been located, the following is run:

IF item = "123" and desired format = "HTML" THEN host "XX" sends "item123h"
IF item = "123" and desired format = "XML" THEN host "YY" sends "item123x"
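Translated into a runnable sketch, that if/then template is simply a keyed look-up table; the host names and item IDs below are the invented ones from the example.

```python
# Pre-created template: (item, format) -> (host, deliverable item ID).
APPROPRIATE_COPY = {
    ("123", "HTML"): ("XX", "item123h"),
    ("123", "XML"):  ("YY", "item123x"),
}

def appropriate_copy(item, fmt):
    """Pick the host and the format-specific copy for a requested item."""
    try:
        return APPROPRIATE_COPY[(item, fmt)]
    except KeyError:
        raise LookupError(f"no {fmt} copy of item {item} registered")

print(appropriate_copy("123", "HTML"))  # ('XX', 'item123h')
```

Populating such templates per institution is how local permissions and holdings shape which copy a given user actually receives.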


Data sets and items may be held in, and retrieved from, repositories located anywhere in the world, thanks to Internet networking protocols. In addition to storing and distributing data on demand, these servers perform user verification, item authentication, and possibly payment tracking operations. In some instances, even these host machines may determine appropriate copies of materials (i.e. formats) based upon local user configurations (i.e. cookies) or specific embedded requests.


Given universal adoption of the OpenURL syntax as a standard for data delivery, it is possible to network a variety of search engines, local resolvers, and data repositories to provide interoperability across both local and remote computers. The next significant steps will be the development of the many smart search agents, the local resolver databases, and the metadata required to create truly enhanced one-stop discovery and delivery systems.

Open Archives Initiative

The Open Archives Initiative (OAI) (http://www.openarchives.org) is facilitating the coordination of distributed eprint archives. It provides metadata tagging standards to ensure cross-archive interoperability. CogPrints (http://cogprints.soton.ac.uk), a centralized Cognitive Science archive, is one of the first registered OAI-compliant archives. The archive-creation software used is generic, so OAI-compliant Eprints archives can now be mounted, registered, and filled by any institution (http://www.eprints.org). Both CogPrints and Eprints accept papers in a choice of formats, such as HTML, PDF, TeX, PS, ASCII, and more.

The ARC search service (http://arc.cs.odu.edu/) uses the OAI conventions to search an interdisciplinary set of distributed, refereed Eprint archives. The coverage includes areas of Physics, Mathematics, and Computer Science, and limited coverage of the Cognitive Sciences (Psychology, Neuroscience, Behavioral Biology, Linguistics).


The "jake" or Jointly Administered Knowledge Environment (http://jake.med.yale.edu/docs/about.html) is a database and resolver created at Yale University, and used as an open-source tool in many libraries. Greg Notess discussed jake in his ON THE NET column in the October/November 2000 issue of EContent (http://www.ecmag.net). The database has many metadata possibilities; among them is a reference source finder for finding, managing, and linking online journals and journal articles to A&I databases. jake can also link known item citations directly to full-text materials.

Additional Reading/Resources

Van de Sompel, Herbert and Patrick Hochstenbach, "Reference Linking in a Hybrid Library Environment: Part 1: Frameworks for Linking and Part 2: SFX, a Generic Linking Solution." D-Lib Magazine 5, No. 4 (April 1999): (http://www.dlib.org/dlib/april99/van_de_sompel/04van_de_sompel-pt1.html and http://www.dlib.org/dlib/april99/van_de_sompel/04van_de_sompel-pt2.html)

Van de Sompel, Herbert and Patrick Hochstenbach, "Reference Linking in a Hybrid Library Environment: Part 3: Generalizing the SFX Solution in the "SFX@Ghent and SFX@LANL" Experiment." D-Lib Magazine 5, No. 10 (October 1999): (http://www.dlib.org/dlib/october99/van_de_sompel/10van_de_sompel.html)

A full project description of SFX is online (http://www.sfxit.com/sfx2.html).

"Digital Library Projects: Focus on Improving Access to Information Users." Session at Special Libraries Association Global 2000 Conference, October 16-19, 2000, Brighton, U.K. (http://dli.grainger.uiuc.edu/sla2000/).

This session included a number of speakers providing updates on linking technologies and trials. Especially related is the presentation, "The SFX-Framework & the OpenURL" by Herbert Van de Sompel. The excellent graphics and real-world examples from his current project are posted at the University of Illinois at Urbana-Champaign's Web site (http://dli.grainger.uiuc.edu/sla2000/sla2000_hvds/sld008.htm). There are plans to link the SFX approach with the materials within the engineering testbed at the UIUC Library.

For a good overview of smart agent technologies in libraries, see "Library Agents: Library Applications of Intelligent Software Agents" by Gerry McKiernan, Curator, CyberStacks, Iowa State University, Ames, IA (http://www.public.iastate.edu/~CYBERSTACKS/Agents.htm).


NESSTAR (http://www.nesstar.org/index.shtml) is an infrastructure for data dissemination via the Internet. NESSTAR Explorer offers an end-user interface for searching, analyzing, and downloading data and documentation available via the Internet. There is a demonstration available that allows for locating and manipulating raw Social Sciences data from a number of distributed data set repositories.

David Stern (david.e.stern@yale.edu) is director of science libraries at Yale University.

Comments? Email letters to the Editor at marydee@infotoday.com.

