XML and the Resource Description Framework: The Great Web Hope

Norm Medeiros

ONLINE, September 2000
Copyright © 2000 Information Today, Inc.

There exists in the information community a vision of metadata solving Web searching chaos, despite the fact that some leading search engines ignore metatags altogether.
What incentive do search engine companies have for altering their indexing and rating algorithms? Moreover, what motivation do Web content providers have for implementing an intricate metadata standard if such efforts do nothing to improve retrieval or page ranking? The World Wide Web Consortium's (W3C) Resource Description Framework (RDF) and Extensible Markup Language (XML) offer a potential means of enhancing resource discovery on the Web. But will they work?

METADATA

The concept of metadata is not a new one. Well before the first HTML page graced the Web, millions of digital metadata records existed. Created in a semantic scheme known as the Anglo-American Cataloging Rules, and stored in a framework called the MARC format, these records described the world's documented knowledge. These external metadata records, or surrogates, existed separately from the resources they described. This traditional form of metadata deployment often gives the creator of the surrogate the ability to customize the agent parsing the metadata. A common case is the cataloger who not only creates the bibliographic record, but can also adjust the way the library's OPAC renders it.

Embedded metadata is a more recent development. Shortly after commercial search engines became popular, users of the Web discovered the poor quality of spider-derived, full-text indexing. At the same time, content providers discovered a means to increase access to their works. By embedding terms within metatags in their HTML documents, Web developers improved their chances of being indexed and ranked higher by search engines. However, many of the early metadata implementations were of a dubious nature. Spamming, the overloading and repeating of terms within metatags, caused many search engine companies to change the way their systems treated metadata. Search Engine Watch, a Mecklermedia Internet site (http://www.searchenginewatch.com), reports that a few search engines still index meta keywords. These include AltaVista, Go, HotBot, and Inktomi. However, the majority of engines ignore metatags altogether. These include Excite, FAST, Google, Lycos, and NorthernLight. Moreover, only Go and Inktomi offer a boost in ranking to pages carrying metadata. The following describes Lycos' indexing system, which is common to many commercial search engines and unkind to metadata:

When a Web page is submitted to Lycos, the spider examines the full text of the page and determines relevant keywords based on its composition. The spider will pay close attention to the components of the URL, the TITLE tag, headings and subheadings, frequency of word use, location of words on the Web page, and the distance between words. Descriptions consist of the first 100 characters on a Web page, including the ALT tags. The Lycos spider ignores META tags. [1]
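
The description rule in the Lycos message can be sketched in a few lines of Python. This is only an illustration, not Lycos' actual code: the class PageText, the function describe, and the sample page are all hypothetical, and the sketch assumes that "the first 100 characters on a Web page" means visible text inside <BODY> plus image ALT text.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collects body text and image ALT text; META tags are never read."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case
        if tag == "body":
            self.in_body = True
        elif tag == "img" and self.in_body:
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if self.in_body and text:
            self.chunks.append(text)


def describe(html_text, limit=100):
    """Return the first `limit` characters of body text and ALT text."""
    parser = PageText()
    parser.feed(html_text)
    return " ".join(parser.chunks)[:limit]


# Hypothetical page with a spammed keywords metatag.
PAGE = """<HTML><HEAD>
<TITLE>Staff page</TITLE>
<META name="keywords" content="libraries libraries libraries">
</HEAD>
<BODY>
<H1>Staff training</H1>
<IMG src="logo.gif" alt="Library logo">
<P>Core competencies for staff.</P>
</BODY></HTML>"""

description = describe(PAGE)
print(description)
```

Whatever a content provider places in the keywords metatag never reaches the description, which is the behavior the message describes.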

A MATTER OF TRUST

The original catalyst for metadata application was as a means to better describe, and therefore provide more precise access to, Web-based information. Clearly, designers of the <META> element could not have anticipated the way it would be prostituted by some content providers. The current search engine architecture no longer supports metadata on a wide scale largely due to a lack of trust. If there were assurances that metadata would be deployed in a thoughtful, discretionary manner, search engine proprietors would likely index it, and incorporate it into their ranking schemes.

To this end, efforts are underway to port the use of digital signatures to the metadata community. Digital signatures convey personal identity on the Internet [2]. The technology, coined "electronic fingerprints," involves three elements: a mathematical algorithm, encryption, and certification (for more detail, see "Digital Signatures: Secure Transactions or Standards Mess?" by Danielle Borasky in the July/August 1999 issue of ONLINE). The World Wide Web Consortium is presently examining the viability of XML-based digital signatures as a means of authenticating information, and information providers, on the Web [3]. This development holds promise in terms of reinstating a much-needed trust factor in metadata deployment.

THE SEMANTIC WEB

In his recent book, Tim Berners-Lee, founder of the World Wide Web Consortium and inventor of the Web, details his hope for a Web populated with rich metadata that is machine-readable, semantically flexible, and derived from trusted sources [4]. He refers to this vision as the Semantic Web. Under his guidance, the W3C has developed tools that, if adopted, could render the Semantic Web a reality. In his words:

The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help. One of the major obstacles to this has been the fact that most information on the Web is designed for human consumption, and even if it was derived from a database with well-defined meanings (in at least some terms) for its columns, that the structure of the data is not evident to a robot browsing the Web. Leaving aside the artificial intelligence problem of training machines to behave like people, the Semantic Web approach instead develops languages for expressing information in a machine-processable form. [5]

In short, what's needed is a trusted, structured mechanism to parse context relationships across all semantic schemes. With such an agent in place, search engine companies would be better equipped to handle metadata, and eager to exploit this technology in order to provide enhanced retrieval options for their users. Spamming would cease to be a concern in this "Web of trust," since valid sources of metadata would be authenticated through digital signatures [6]. Moreover, search engines could target metadata and thus support "efficient searching of the Web as though it were one giant database, rather than one giant book" [7].

RESOURCE DESCRIPTION FRAMEWORK

The Resource Description Framework was developed by the World Wide Web Consortium in early 1999 as a universally accepted carrier for metadata. RDF serves as a structure within which any metadata semantic scheme (such as Dublin Core) can operate. It supports semantic interoperability; that is, semantic elements can be "mixed and matched" within its framework while supporting the automated parsing of nonrelated schemes [8]. It accomplishes this feat through the use of the XML namespace facility. A namespace points to a Uniform Resource Identifier (URI) which provides the metadata agent with the definition of the elements in use. Through use of this mechanism, semantic vocabularies can be used without fear of conflicting element names or meanings. Lines 1 and 2 in the XML Namespace box demonstrate the XML namespace feature as it's used to express RDF and Dublin Core URIs.

Line 3 contains the RDF element <rdf:Description> with the attribute "rdf:about". These properties can be parsed using data in the default RDF URI (line 1). Lines 4 through 7 contain Dublin Core fields, as indicated by the prefix "dc:" that precedes the title, creator, and subject fields. Machine-readable information to parse these elements can be located using the Dublin Core URI which appears in line 2. Line 8 closes the RDF element <rdf:Description> and line 9 closes the RDF expression. Element termination and nesting, the proper opening and closing of elements, are mandated in the XML specification. Failure to follow these procedures will result in an invalid RDF statement.
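
As a rough illustration of how a metadata agent might read such a record, the following Python sketch parses the text of the XML Namespace box with the standard library's xml.etree.ElementTree. The function names read_record and is_well_formed are hypothetical, and splitting dc:subject on semicolons is an assumption based on how the sample record is written, not something RDF or Dublin Core requires.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.0/"

# Text of the XML Namespace box.
RECORD = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description rdf:about="http://library.med.nyu.edu/training/staff.html">
<dc:title> Core competencies for staff </dc:title>
<dc:creator> Colleen Cuddy </dc:creator>
<dc:creator> Trisha Stevenson </dc:creator>
<dc:subject> Libraries; Staff development; Training </dc:subject>
</rdf:Description>
</rdf:RDF>
"""


def read_record(xml_text):
    """Pull the Dublin Core fields out of a single rdf:Description."""
    ns = {"rdf": RDF_NS, "dc": DC_NS}
    root = ET.fromstring(xml_text)
    desc = root.find("rdf:Description", ns)
    subject = desc.findtext("dc:subject", namespaces=ns)
    return {
        # Attribute names are stored in expanded "{uri}local" form
        "about": desc.get(f"{{{RDF_NS}}}about"),
        "title": desc.findtext("dc:title", namespaces=ns).strip(),
        "creators": [c.text.strip() for c in desc.findall("dc:creator", ns)],
        "subjects": [s.strip() for s in subject.split(";")],
    }


def is_well_formed(xml_text):
    """Report whether the parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


record = read_record(RECORD)
print(record["title"], record["creators"], record["subjects"])

# Dropping the closing </rdf:Description> breaks element nesting.
broken = RECORD.replace("</rdf:Description>\n", "")
print(is_well_formed(RECORD), is_well_formed(broken))
```

The parser matches elements by namespace URI rather than by the literal "dc:" or "rdf:" prefix, so a record that bound the same URIs to different prefixes would be read identically.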

ABBREVIATED RDF

The current generation of browsers cannot support embedding full-level RDF records within HTML files. This higher level of RDF, as displayed in the XML Namespace box, must be kept external to the HTML document it is describing. This is accomplished through use of the <LINK> element. In short, the resource being described in the RDF record must include the <LINK> tag to reference its associated metadata [9]. (In the "Core competencies for staff" example, I would place <LINK rel="meta" href="staff.html.rdf"> within the <HEAD> element in the source document of http://library.med.nyu.edu/training/staff.html, and name the RDF file staff.html.rdf, resulting in a relationship between the HTML page and its descriptive metadata.)
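
A harvester could follow that <LINK> reference along the lines of the Python sketch below. It is a minimal illustration under stated assumptions: the names MetaLinkFinder and find_metadata_urls are hypothetical, the sample page is invented around the <LINK> tag quoted above, and the sketch only locates the RDF file's address without fetching it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaLinkFinder(HTMLParser):
    """Records the href of every <LINK rel="meta"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if rel == "meta" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_metadata_urls(page_url, html_text):
    """Resolve each rel="meta" href against the address of the page."""
    finder = MetaLinkFinder()
    finder.feed(html_text)
    return [urljoin(page_url, href) for href in finder.hrefs]


PAGE_URL = "http://library.med.nyu.edu/training/staff.html"

# Hypothetical source for the page, carrying the <LINK> tag from the text.
PAGE = """<HTML><HEAD>
<TITLE>Core competencies for staff</TITLE>
<LINK rel="meta" href="staff.html.rdf">
</HEAD><BODY><P>Page content goes here.</P></BODY></HTML>"""

urls = find_metadata_urls(PAGE_URL, PAGE)
print(urls)
```

Because the href is relative, the RDF file is looked for in the same directory as the page it describes.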

A less formal manifestation of RDF has been developed, however, designed for use with existing browser technology. Abbreviated RDF, or RDF-ABBREV, supports embedding of RDF instances within HTML <HEAD> tags. In this abbreviated syntax, all elements, regardless of semantic, fall within the <rdf:Description> element, such that the XML namespace example would be rewritten according to the example outlined in the Abbreviated RDF box.

The abbreviated format is tighter, yet still relies on element termination and nesting principles. Line 7 highlights the XML syntax for an empty element, which ends with a space, slash, and angle bracket (" />") and so eliminates the need for line 8. Some harvesters, including UKOLN's DC-DOT Generator, hosted at the University of Bath, export records in this browser-suitable format.
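
In the abbreviated form the Dublin Core values travel as attributes rather than child elements, so an agent reads them from the attribute list of <rdf:Description>. The Python sketch below shows one way to do that. The function name read_abbreviated is hypothetical, and the sample string is an assumption modeled on the Abbreviated RDF box, with the two creators joined in a single dc:creator attribute because an XML element cannot carry the same attribute name twice.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.0/"

# Sample modeled on the Abbreviated RDF box (creators joined in one attribute).
ABBREVIATED = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description
rdf:about="http://library.med.nyu.edu/training/staff.html"
dc:title="Core competencies for staff"
dc:creator="Colleen Cuddy; Trisha Stevenson"
dc:subject="Libraries; Staff development; Training" />
</rdf:RDF>
"""


def read_abbreviated(xml_text):
    """Return the Dublin Core attributes of rdf:Description by local name."""
    root = ET.fromstring(xml_text)
    desc = root.find(f"{{{RDF_NS}}}Description")
    prefix = f"{{{DC_NS}}}"
    # Attribute names arrive in expanded "{uri}local" form; keep only the
    # ones from the Dublin Core namespace and strip the URI part.
    return {
        name[len(prefix):]: value
        for name, value in desc.attrib.items()
        if name.startswith(prefix)
    }


fields = read_abbreviated(ABBREVIATED)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The rdf:about attribute is filtered out because it belongs to the RDF namespace, not the Dublin Core one.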

XML'S ROLE

XML, Extensible Markup Language, is the syntax for RDF. Not surprisingly, XML was also developed by the World Wide Web Consortium and shares similarities with its metadata sibling. Both XML and RDF are concerned with information contexts. RDF focuses on this concern in regard to metadata, and uses the XML namespace facility to parse element sets. XML is poised to dethrone HTML as the language of the Web since it is extensible (not restricted to a limited number of elements), and supportive of automated data exchange (able to create contextual relationships that are machine-processable). In particular, XML's document type definition (DTD) and Extensible Stylesheet Language (XSL) separate the context from the display of information--a division HTML is incapable of achieving.

THE FUTURE

Although RDF, XML, and a number of semantic standards are available to the metadata community, the question remains: Will they work? Clearly, locally-defined metadata projects will benefit from the W3C's commitment to the Semantic Web. OCLC's Cooperative Online Resource Catalog (CORC) and UKOLN's DC-DOT Generator are just two examples of projects utilizing RDF in an attempt to populate the Internet with highly descriptive, machine-readable metadata. If digital signature technology can operate within the framework of RDF, perhaps commercial search engines will once again trust content providers to incorporate and associate appropriate metadata within their works. Will searching the Web one day be like searching an OPAC? Perhaps. It can't get any worse than it is...can it?

XML Namespace

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description rdf:about="http://library.med.nyu.edu/training/staff.html">
<dc:title> Core competencies for staff </dc:title>
<dc:creator> Colleen Cuddy </dc:creator>
<dc:creator> Trisha Stevenson </dc:creator>
<dc:subject> Libraries; Staff development; Training </dc:subject>
</rdf:Description>
</rdf:RDF>

A SIMPLE EXAMPLE OF RDF METADATA WITH DUBLIN CORE SEMANTICS

Abbreviated RDF

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description
rdf:about="http://library.med.nyu.edu/training/staff.html"
dc:title="Core competencies for staff"
dc:creator="Colleen Cuddy; Trisha Stevenson"
dc:subject="Libraries; Staff development; Training" />
</rdf:RDF>

ABBREVIATED RDF, SUITABLE FOR INCLUSION WITHIN AN HTML DOCUMENT

REFERENCES

[1] Email message from Webmaster@lycos.com to the author regarding treatment of metatags in the Lycos search engine. Dated January 3, 2000.

[2] Borasky, Danielle V. "Digital Signatures: Secure Transactions or Standards Mess?" ONLINE, v. 18, no. 4 (July/August 1999).

[3] Lambert, Paul A. "Validation and Semantics of XML Digital Signatures." Available on the Internet at http://www.w3.org/Dsig/signed-XML99/pp/certicom.html.

[4] Berners-Lee, Tim. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. San Francisco: HarperSanFrancisco, 1999.

[5] Berners-Lee, Tim. "Semantic Web Road Map." Available on the Internet at http://www.w3.org/DesignIssues/Semantic.html.

[6] Lassila, Ora. "Introduction to RDF Metadata." Available on the Internet at http://www.w3.org/TR/NOTE-rdf-simple-intro/.

[7] Berners-Lee, Tim. "Semantic Web Road Map." Available on the Internet at http://www.w3.org/DesignIssues/Semantic.html.

[8] "W3C Metadata Activity Statement." Available on the Internet at http://www.w3.org/Metadata/Activity.html.

[9] Miller, Eric, Paul Miller, and Dan Brickley. "Guidance on Expressing the Dublin Core within the Resource Description Framework (RDF)." Available on the Internet at http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/.

Norm Medeiros (medeiros@library.med.nyu.edu) is Technical Services Librarian at New York University's School of Medicine.

Comments? Email letters to the Editor at editor@infotoday.com.
