What’s New With the Semantic Web
By Donald T. Hawkins
NISO (National Information Standards Organization) and the National Federation of Advanced Information Services (NFAIS) held a jointly sponsored virtual conference on Dec. 2, 2015, that reviewed semantic web updates and developments. It attracted more than 60 participants, mostly academic librarians. In his welcoming remarks, Todd Carpenter (executive director of NISO) noted that the groundbreaking 2001 Scientific American article on the semantic web (by Tim Berners-Lee, James Hendler, and Ora Lassila) suggested that machines could interact in a semantic environment. That vision has become a reality—many new resources and applications that use the semantic web have begun to appear, and the library world has an opportunity to take advantage of them.
In his keynote address, Matt Turner (CTO for media and entertainment at MarkLogic) said that today’s understanding of the semantic web is as groundbreaking as the technology itself was at its debut. A generational shift has occurred in the database market: Media and publishing are morphing into media and entertainment. We have moved from publishing form-based products to a dedicated infrastructure in which products are created from databases of information. The semantic web plays a major role in such activities. Semantics are a good way to organize disparate data, so we must think about a new class of information. Data helps us understand customers, authors, and specialized subject areas. The use of semantic data is one of the biggest changes in our industry.
So where do all these changes lead us, and what’s next? Turner said that it will be a revolution: data-driven integrated publishing. We must think about information delivery as a linked data-centric process. Silos are the bane of answers; by linking data together, semantics can break them down. We have come a long way from thinking about research as only search and retrieval; to look at data more holistically, we must now understand not only customers, but also authors and their institutions. Ontologies describe the concepts, and semantic data is the structure that does the binding. One role of publishers is to create knowledge graphs, which might be the strongest component of their intellectual property. People who understand how to create such data are in high demand; this is an excellent opportunity for information professionals.
Context, one of the biggest aspects of content, is another component of the publishing revolution. Machines by themselves cannot understand context, but semantics provide the links between pieces of data that allow context to drive search. Innovation is also important, and Turner suggested reading How Stella Saved the Farm: A Tale About Making Innovation Happen.
A Library ‘Knowledge Vault’
Jeff Mixter and Bruce Washburn (software engineers at OCLC Research) are creating a “knowledge vault” for libraries that is based on Google’s Knowledge Vault, a database of more than 1.6 billion facts that will power Google’s search functions when it is fully developed (see searchengineland.com/google-builds-next-gen-knowledge-graph-future-201640). Using resource description framework (RDF) triples to relate subjects, predicates, and objects, Mixter and Washburn are combining data from several providers and studying how a knowledge vault might be useful for library data. They are taking MARC records and enhancing them with controlled vocabularies and other elements of linked bibliographic data to create statements about entities and their relationships. There are several challenges with this approach: a single record can contain many subjects, the relationships between the subjects are not always clear, and the entities may lack identifiers in library systems.
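The subject-predicate-object model described above can be sketched in a few lines of plain Python. This is an illustrative toy, not OCLC's implementation; real systems store triples in a dedicated triple store, and the entity names below are invented.

```python
# A minimal sketch of RDF-style triples as plain Python tuples.
# Each statement relates a subject to an object via a predicate.
triples = [
    ("work:MobyDick", "dc:creator", "person:HermanMelville"),
    ("work:MobyDick", "dc:subject", "topic:Whaling"),
    ("person:HermanMelville", "schema:birthPlace", "place:NewYorkCity"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Everything asserted about one entity:
print(query(subject="work:MobyDick"))
```

Because every fact is an independent statement, data from different providers can be merged by simply concatenating their triples, which is what makes the model attractive for combining library records.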
OCLC has developed a system, EntityJS, that searches across entities and enhances the user experience by displaying related works, organizations, people, places, and other information in response to a simple query. This system is not publicly available yet; experiments are continuing.
Examples of Semantic Technologies in Vocabularies
Joan Cobb (principal IT project manager for IT services at the J. Paul Getty Trust) illustrated how linked terms for the Getty Vocabularies (vocab.getty.edu) come from several ontologies and are used in four databases: Art & Architecture Thesaurus (AAT), Getty Thesaurus of Geographic Names (TGN), Union List of Artist Names (ULAN), and Cultural Object Name Authority (CONA). Getty Vocabularies expressed as linked open data can be used to connect to many repositories, including those outside of Getty, and searches of Getty’s content are run across all its databases.
Mike Stabile (president of Knowledgelinks.io) and Jeremy Nelson (metadata and systems librarian for Colorado College and CIO of Knowledgelinks.io) continued the discussion of using RDF technology as a means of building vocabularies, combining multiple uses, and providing links to data from a variety of sources. RDF databases store data as graphs of linked statements rather than in relational tables, so it is not necessary to duplicate concepts across multiple tables—once resources are tagged and deduplicated, they are simple to find. For example, BIBFRAME (BIBliographic FRAMEwork) is an RDF vocabulary developed by the Library of Congress as an eventual replacement for MARC 21 records. The relationships used in the BIBFRAME system are shown in the image above.
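The describe-once, link-many idea behind BIBFRAME can be sketched as follows. This is a loose, hypothetical illustration, not the actual BIBFRAME vocabulary or any Library of Congress code; the keys and identifiers are invented (BIBFRAME's real terms include bf:Work, bf:Instance, and bf:instanceOf).

```python
# Hypothetical sketch: a conceptual Work is described once and linked from
# many Instances (print, ebook), rather than repeating its description in
# every catalog record. All identifiers and field names are invented.
works = {
    "work/1": {"title": "Pride and Prejudice", "creator": "Jane Austen"},
}
instances = [
    {"id": "inst/1", "instanceOf": "work/1", "carrier": "print"},
    {"id": "inst/2", "instanceOf": "work/1", "carrier": "ebook"},
]

def instances_of(work_id):
    """Find every Instance linked to a given Work."""
    return [inst for inst in instances if inst["instanceOf"] == work_id]

# Because the Work exists only once, correcting its title in one place
# is immediately reflected for every Instance that links to it.
works["work/1"]["title"] = "Pride and Prejudice (annotated)"
titles = {works[inst["instanceOf"]]["title"] for inst in instances_of("work/1")}
print(titles)
```

This single-source property is what produces the "uncluttered repository" described for BIBCAT below: a change propagates through links instead of being copied into every record.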
A pilot search and display system for BIBFRAME data (BIBCAT) has been developed using the principles that information should exist only once in the system and that changes will update everywhere, thus producing an uncluttered repository. The next step in BIBFRAME development will be to install and test BIBCAT in an academic library.
The International Dunhuang Project (idp.bl.uk) is digitizing thousands of manuscripts, paintings, and artifacts found in the Mogao Grottoes near Dunhuang, China (a station along the ancient Silk Road). Cultural considerations are having a large impact on the project, and, as with any international effort, communication among all participants is highly important. Semantic techniques are being applied to aid understanding of the metadata.
Europeana, a platform for Europe’s cultural heritage, provides access to more than 48 million objects from more than 2,000 galleries, museums, archives, and libraries. A data model (pro.europeana.eu/edm-documentation) has been created to harvest and enrich metadata using semantic technology. Prior to the model, there were no links between objects, so data on objects and contexts was mixed, and many mapping problems existed. Even though the model has been implemented in stages, some benefits have already been realized: Links between terms in different languages have been created; metadata can be enriched by third parties and collaborating partners; multilingual searches are possible; searching has been enhanced by adding auto-completion of query terms; and users are able to enrich the data by annotating records.
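The multilingual term links mentioned above can be illustrated with a small sketch: several language labels attached to one concept identifier, so a query in any of those languages resolves to the same objects. The identifiers, labels, and lookup logic are invented for illustration; this is not Europeana's actual data model.

```python
# Illustrative sketch of multilingual concept linking: each concept has one
# identifier with labels in several languages, and objects reference the
# concept, not a language-specific term. All data here is invented.
labels = {
    "concept/painting": {"en": "painting", "fr": "peinture", "de": "Gemälde"},
}
objects = [
    {"id": "obj/1", "subject": "concept/painting"},
]

def search(term):
    """Resolve a term in any language to its concept, then return matching objects."""
    hits = {cid for cid, langs in labels.items() if term in langs.values()}
    return [obj for obj in objects if obj["subject"] in hits]

# A French query and an English query resolve to the same concept,
# so both return the same objects.
print(search("peinture"))
```

Routing every query through a language-neutral concept identifier is what makes cross-language search possible without translating the records themselves.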
According to Jaqui Hodgkinson (VP of product development at Elsevier), her company is helping users in the drug discovery field keep up with the growing flood of articles now being published (nearly 1 million annually in the MEDLINE database alone) and identify the exact articles they want to read. Semantic technologies were used to normalize data from different taxonomies and create “biomarkers”—visualizations that allow viewing of research trends and publication data. Knowledge maps can help researchers navigate the data and find appropriate collaborators. Heat maps can show how compounds bind to one another as well as help users search for articles and patents and discover related articles.
The primary objects of libraries are books. Jason Clark (head of library informatics and computing at Montana State University Libraries) described a project to develop a prototype of reading software that would deliver books in web browsers by machine indexing them for semantic discovery. Two books were indexed by applying schema.org vocabularies to transform a webpage into linked data sources. The role of publishers will be disrupted when this process becomes widespread; they will need to develop skills in archiving, metadata, discovery, analytics, and sharing. Analytics will allow us to discover how readers move through a book, which pages are most popular, and how long they take to read, and every page can have its own unique web address. Linked and structured data will be the key to these processes.
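The schema.org approach described above can be sketched as a JSON-LD description of a book chapter, the kind of structured data a page could expose for semantic discovery. Chapter, Book, and Person are real schema.org types, but the titles, names, and URL below are invented examples, not data from the Montana State project.

```python
import json

# Illustrative sketch: describing one page (a book chapter) with schema.org
# vocabulary as JSON-LD, so the page carries machine-readable context and
# its own unique web address. All values here are invented examples.
book_page = {
    "@context": "https://schema.org",
    "@type": "Chapter",
    "name": "Chapter 3: Linked Data in the Browser",
    "position": 3,
    "url": "https://example.org/books/demo/chapter-3",
    "isPartOf": {
        "@type": "Book",
        "name": "An Example Book",
        "author": {"@type": "Person", "name": "A. Author"},
    },
}

# Serialized, this JSON-LD could be embedded in the page for crawlers
# and semantic search tools to index.
print(json.dumps(book_page, indent=2))
```

Because each chapter is a distinct, addressable resource, analytics and discovery tools can work at the page level rather than treating the book as one opaque file.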