THE SYSTEMS LIBRARIAN
Linked Data: The Next Big Wave or Another Tech Fad?
by Marshall Breeding
Independent Consultant and Founder of Library Technology Guides
Semantic web. Open linked data. These concepts dominate conference presentations and technology conversations almost as much as Web 2.0 did a few years ago, with the promise of taking the web beyond its current limitations of manually coded hyperlinks to a system based on exploring paths of related resources through meaningful associations encoded in storehouses of content. Today, both on the general web and in most library environments, the discovery of resources takes place mostly through the harvesting and indexing of the content of pages. Such search and retrieval services provide very effective ways for persons to find items of interest from within very large bodies of content, such as the entirety of the web or representations of the print and electronic materials that comprise library collections. But even with the most sophisticated relevancy algorithms, index-based search and retrieval lacks the ability to lead users to potentially related content. Semantic web technologies, in conjunction with repositories of open linked data, promise to deliver significant new capabilities in exploring and exploiting information resources on the web.
For those not already familiar with semantic web technologies and open linked data, there are many sources that can better explain them. The basic building blocks include encoding content as RDF (Resource Description Framework) triples, which are typically held in databases called triplestores. These RDF triples express relationships among units of information through Subject – Predicate – Object statements (“Herman Melville” -> “author” -> “Moby Dick”). Webpages delivered through HTML provide a medium with which humans can interact and navigate. Documents delivered through the semantic web allow computers to parse the relationships among data objects, thus forming the basis for more powerful tools for the discovery, visualization, and exploitation of information. Linked data uses sets of RDF triples to make a body of data intelligible within a semantic web platform. When these data are available without proprietary access restrictions, such as through the Creative Commons public domain dedication (CC0), they can be considered open linked data.
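To make the Subject – Predicate – Object idea concrete, here is a minimal sketch in plain Python. It does not use a real RDF library or real URIs; the tuple layout, the match helper, and the second and third statements are assumptions added purely for illustration.

```python
# A toy model of RDF-style statements as (subject, predicate, object) tuples.
# Real RDF identifies things with URIs; plain strings are used here for brevity.
triples = [
    ("Herman Melville", "author", "Moby Dick"),
    ("Herman Melville", "author", "Billy Budd"),  # illustrative addition
    ("Moby Dick", "subject", "Whaling"),          # illustrative addition
]


def match(store, subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the given parts; None matches anything."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Follow a path of relationships: works by Melville, then the topics of those works.
works = [o for (_, _, o) in match(triples, subject="Herman Melville", predicate="author")]
topics = [o for w in works for (_, _, o) in match(triples, subject=w, predicate="subject")]
print(works)
print(topics)
```

The second lookup is the part that index-based search does not offer directly: it starts from a person, moves to that person's works, and then moves on to what those works are about, all by following stated relationships rather than matching words.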
The semantic web has been an area of interest in the broader realm of the internet since the earliest days of the web. Tim Berners-Lee, the original creator of the web, articulated the concepts of the semantic web from the beginning. Yet these concepts were not fully realized as the web took root and grew into the massive global network that we know today. All along there have been efforts in various niche areas to introduce pockets of the semantic web into this global fabric, but in general terms, these efforts have not had a huge impact. Librarians resonate strongly with the concepts of the semantic web and open linked data, particularly as a way to make bodies of content as openly accessible as possible and to more fully operationalize some of the classification, categorization, and controlled vocabularies created through our long history of organizing information.
It seems like the library technology arena has been on the verge of a breakthrough into the realm of the semantic web for the past 5 years or more. There have been a number of initiatives and projects that attempt to exploit semantic web technologies and reshape some of the key resources of interest to libraries as open linked data. This month, we’ll take a quick tour through the intersection of the realm of linked data and libraries and assess the potential impact on the landscape of library information systems and technologies.
Talis Evangelizes the Semantic Web
One company in the library arena that planted a stake in semantic web technology early on was Talis Group Ltd., based in the U.K. Talis embraced a vision of developing an infrastructure based on semantic web technologies available as a platform for a variety of applications, primarily for libraries, but also with an eye to a broader commercial sphere. Talis was historically involved in the industry as the developer of the Alto integrated library system used in public and academic libraries in the U.K. Organizationally, Talis was the successor to the nonprofit BLCMP library cooperative based in Birmingham, U.K., and founded in the 1960s. In 1999, the cooperative was restructured into a for-profit organization, Talis Information Ltd.
Around 2006, Talis began its strategic initiative to create the Talis platform based on semantic web technologies. While it continued to support its legacy library automation products, the company focused most of its energies on developing semantic web technologies and on bringing the message of the benefits of this approach to the broader library community. Although the company marketed Alto and its other library automation products only in the U.K., its educational efforts were cast to a broader audience, including those in the U.S. Individuals including Paul Miller and Richard Wallis served as technology evangelists for Talis, actively promoting the concepts of the semantic web. Increasingly, Talis positioned itself more as a semantic web company and less as a library automation vendor, despite its dependence on Alto and related products as its main source of revenue. In March 2011, Talis divested its library automation business to the software and outsourcing firm Capita, PLC, apparently hoping to channel the proceeds from the sale into strengthening its semantic web activities. As it divested its library automation business, Talis retained its education division and the Talis platform.
That bold move ultimately did not prove to be a huge commercial success. In July 2012, Talis announced its withdrawal from business activities related to the semantic web. The company, now somewhat downsized, focuses primarily on the Talis Aspire reading list management product for higher education and on digitization services. Despite growing interest in the semantic web, Talis acknowledged in its press announcement that commercial opportunities remain limited. It seems to me that although Talis was ultimately not able to build a profitable business model based on semantic web technologies, it was successful in building awareness of these concepts in the library community and beyond.
OCLC Becomes a Champion for Linked Data
In recent years, OCLC has increasingly been involved in the realm of the semantic web and linked data. OCLC operates as a nonprofit membership organization with a wide variety of products and services; it also operates a research and development unit. In addition to its core services, the organization has identified linked data as an area of interest for its research efforts.
Unlike a small company such as Talis, OCLC has the flexibility to engage in research projects that do not necessarily result in revenue-generating activities in the short term but that advance tools or technologies that may be of interest to the library community in the longer term. Wallis, the aforementioned former technology evangelist for Talis, joined OCLC in April 2012, where he plays a role in promoting semantic web and linked data technologies similar to the one he had at Talis.
Several of the initiatives that have come out of the OCLC office involve offering OCLC resources as linked data. Those released so far include the following:
- VIAF (Virtual International Authority File)
- FAST (Faceted Application of Subject Terminology)
- Dewey Decimal Classification System
OCLC has also been working on ways to bring WorldCat within the realm of linked data. Every page representing a resource in WorldCat now includes an embedded section that presents the same content in RDF based on Schema.org, a method for providing structured data within webpages. The presentation of these pages as seen through web browsers remains unchanged. The RDF section is visible to anyone who views the source of the page, and it is also accessible to software tools designed to operate with this flavor of the semantic web. Although this is currently mostly a proof-of-concept effort, the layering of Schema.org into WorldCat records can be seen as at least a small step in bringing the semantic web to libraries.
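As a rough illustration of how software might read structured data that a browser does not display, the sketch below pulls labeled values out of a small page fragment using only the Python standard library. The fragment is hypothetical and is not actual WorldCat markup; it assumes RDFa-style vocab, typeof, and property attributes, which is one of several ways Schema.org terms can be embedded in HTML. The PropertyCollector class is likewise an invented name.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, NOT actual WorldCat markup.
PAGE = """
<div vocab="http://schema.org/" typeof="Book">
  <h1 property="name">Moby Dick</h1>
  <span property="author">Herman Melville</span>
  <p>Shelved in Fiction.</p>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect the text of every element that carries a `property` attribute.

    Simplification: nested elements inside a labeled element are not handled.
    """

    def __init__(self):
        super().__init__()
        self.current = None  # property name of the element being read, if any
        self.found = {}      # property name -> text content

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop is not None:
            self.current = prop

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.found[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


collector = PropertyCollector()
collector.feed(PAGE)
print(collector.found)
```

A person reading the rendered page sees a title, a name, and a shelving note. The program sees only the two values that were explicitly labeled, along with what kind of statement each one makes, and it ignores the unlabeled paragraph.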
While not yet at the point of operationalizing its services through semantic web technologies, OCLC stands as one of the few library organizations that at least is making experimental forays into this territory.
The Library of Congress Bibliographic Framework Transition Initiative
One of the main obstacles in the way of joining the realm of library technologies and information systems to the semantic web lies in the way that we encode bibliographic information. The MARC record formats were designed decades prior to the advent of the internet and the web, much less to the more recent surge of interest in the semantic web and open linked data. Developed for the exchange of bibliographic records between mainframe computers in times when storage and bandwidth were extremely scarce commodities, MARC was optimized to squeeze data into the smallest possible package, encoded using conventions that seem almost bizarre today. Unconstrained by such limitations, data today are exchanged using structures such as XML or RDF that can be easily parsed by computers and can be read by humans.
It is possible to express the data held in MARC records in an XML syntax, such as through the widely used MARCXML. But this mechanical translation cannot fully make intelligible the quirky MARC coding in terms of semantic relationships. Further, cataloging rules such as AACR2 result in data within most of the MARC tags that defy automated computer processing. AACR2 was designed for consistent presentation as text and punctuation on catalog cards, a presentation that was later carried into online catalogs. These rules were devised without regard for how those fields would operate as data for manipulation by computers.
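The point about presentation punctuation living inside the data can be seen in a small sketch. The record below is a hand-made MARCXML fragment written for illustration, not taken from any catalog, and the subfields helper is an invented name. It uses the standard MARCXML namespace and Python's built-in XML parser.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-made record for illustration only.
RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Melville, Herman,</subfield>
    <subfield code="d">1819-1891.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick /</subfield>
    <subfield code="c">Herman Melville.</subfield>
  </datafield>
</record>
"""


def subfields(record, tag):
    """Return {subfield code: text} for the first datafield with the given tag."""
    field = record.find(f"marc:datafield[@tag='{tag}']", NS)
    return {sf.get("code"): sf.text for sf in field.findall("marc:subfield", NS)}


root = ET.fromstring(RECORD)
title = subfields(root, "245")
print(title["a"])  # the title still carries its card-style " /" separator

# A program that wants the bare title has to know the punctuation convention.
clean_title = title["a"].rstrip(" /")
print(clean_title)
```

The XML wrapper makes the record easy to parse, but the values inside it still follow card-catalog display conventions, so the software has to undo them field by field.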
In order to bring the vast body of library bibliographic content to the semantic web, it is necessary to transform the lower-level transport structures as well as the conventions used to express data elements. The recent revision of AACR2 to RDA (Resource Description and Access) was intended to bring cataloging practices into a form a bit more consistent with the semantic web. To me, this has seemed like a very small step that did not fully separate data from presentation elements; it is thus still quite far from the desired result of fully machine-actionable bibliographic records. But regardless of whether it went far enough, RDA has gone through a long process of definition, testing, and implementation and is finding use in current cataloging operations.
A dramatic step needed to make the bibliographic universe of libraries compatible with the semantic web involves the replacement of MARC as its underlying carrier. Initially created as an important advancement in the interoperability of library automation systems, MARC has become quite an obstacle to progress in that it locks bibliographic records into structures more appropriate for the 1960s than for contemporary computing environments. Yet, MARC is so integral to automation systems, such as bibliographic utilities and integrated library systems, and to the craft of cataloging that its replacement will be traumatic and disruptive at best.
Progress on many fronts of library technologies demands the eventual demise of MARC. One of the seminal efforts to replace MARC 21 has been launched at the Library of Congress through its Bibliographic Framework Transition Initiative (see loc.gov/marc/transition). This project involves mapping all the elements of MARC 21 into a linked data structure. The Library of Congress has contracted with a consulting firm (zepheira.com) that specializes in the semantic web to lead an investigation of the transformation of MARC. The proposed mappings, vocabularies, and general discussion of the project have been made publicly available through the BIBFRAME.org website. A list and other opportunities for involvement have been made available to any interested party during this critical developmental phase.
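The general shape of such a mapping exercise can be sketched in a few lines. Everything below is hypothetical: the predicate names are placeholders rather than the BIBFRAME vocabulary, the record identifier is made up, and the sample values are illustrative. The sketch only shows the idea of turning tagged fields into Subject – Predicate – Object statements and keeping track of what has no mapping yet.

```python
# Hypothetical mapping from (MARC tag, subfield code) to a predicate name.
# These names are placeholders, not the BIBFRAME vocabulary.
MAPPING = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "c"): "date",
}

# A record reduced to (tag, subfield code, value) rows; values are illustrative.
marc_rows = [
    ("100", "a", "Melville, Herman,"),
    ("245", "a", "Moby Dick /"),
    ("260", "c", "1851."),
    ("300", "a", "635 p. ;"),  # no mapping defined, so it is reported separately
]


def to_triples(record_id, rows, mapping):
    """Convert mapped rows to triples and list the (tag, code) pairs left over."""
    triples, unmapped = [], []
    for tag, code, value in rows:
        predicate = mapping.get((tag, code))
        if predicate is None:
            unmapped.append((tag, code))
        else:
            # Crude cleanup of trailing display punctuation; real data needs more care.
            triples.append((record_id, predicate, value.rstrip(" ,/;.")))
    return triples, unmapped


triples, unmapped = to_triples("record:1", marc_rows, MAPPING)
for t in triples:
    print(t)
print("unmapped:", unmapped)
```

Even this toy version hints at why the transition is a large undertaking: every element needs a deliberate decision about what it means, and the leftover list grows quickly when real records are involved.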
I see the realm of the semantic web and open linked data as a very interesting area of research and development with great potential to reshape library technologies in the future, initially in the realm of discovery services but also in the systems used to manage library collections. But so far, these technologies have not played a significant role in the products and services available today. The path from experimental efforts and nascent work on new standards to operational systems is a long trek.
At this critical juncture, it seems almost inevitable to me that the semantic web and open linked data will be important factors in the future of library technologies. I’m confident that any organization involved in the creation of technology products or services is already watching developments in this area and beginning to plan strategies accordingly.
Changes to the foundational underpinnings, such as the MARC formats, will have major ramifications in many aspects of library automation. At a minimum, systems reliant on MARC will need to accommodate new formats, such as those that may ultimately emerge from the BIBFRAME conversations. Beyond just adjusting legacy products to support new formats, more substantial benefits will accrue to those products and services able to exploit the full potential of the semantic web and open linked data. While I don’t anticipate substantial progress in this area for the next few years, it seems time to focus more attention on these issues. I’m optimistic that the library technology products of the next decade will be deeply rooted in the semantic web.