Are We Approaching the Maturation of Library Linked Data Processes?

It’s nice to see that the processes involved in the creation of library linked data have evolved to the point where you might say they are approaching a degree of maturity. For a while there were a number of technical barriers, including seemingly simple things like deciding which of the many programming languages to invest your time in, or which of the many applications you need to accomplish your linked data goals. A number of useful tools have emerged in the last couple of years, and there are now enough people who have tried them with some success and are sharing their experiences.

One very useful contribution in this regard is Eric M. Hanson’s article, “A Beginner’s Guide to Creating Library Linked Data: Lessons from NCSU’s Organization Name Linked Data Project.” Hanson is the Electronic Resources Librarian at North Carolina State University (NCSU) Libraries, and the project described brings authority control to E-Matrix, their electronic resource management system.

Their process draws on “Best Practices for Publishing Linked Data,” a W3C Working Group Note released as a “work in progress” in January 2014. Although the note is intended to guide and facilitate the development of linked open data for open government initiatives, Hanson considers this W3C paper to be “… one of the best and most concise guides available on the subject of publishing linked data, and though it focuses on publishing government linked data, the project phases described could be applied to any type of linked data project.”

He distils the ten original W3C steps [1] down to the five “project phases” used by the NCSU project team: project definition and modelling; data clean up; data enhancement; converting data to RDF; and publishing. As Hanson notes, these phases are a “modified and reordered version of the framework described by [W3C] and more closely resembles how the NCSU ONLD project developed.”

For those who may not have direct access to this Serials Review article, I’ve condensed and outlined Hanson’s project phases below.

  1. Project Definition and Modelling
  • Select source data, weighing size and complexity against available time and resources
  • Define publishing goals, considering use cases and the needs of potential users
  • Identify suitable existing RDF vocabularies (e.g. Dublin Core, FOAF, SKOS, OWL)[2]
  • Define the structure of the Uniform Resource Identifiers (URIs) to be used in the linked data set (the conversion sketch after this list shows one possible pattern)
  2. Data Clean Up
  • Review the source file and ensure the data are well structured and error free
  • Common errors include inconsistent formatting, duplicate entries, and incomplete information
  • Consider differences based on the history, structure, and original purpose of the data
  • Convert the dataset to a structured format (e.g. XML, JSON) to take advantage of available tools and to minimize errors during the subsequent conversion (a minimal clean-up sketch follows this list)
  • Treat data clean-up as an ongoing part of the project life cycle
  3. Data Enhancement (see the reconciliation sketch below)
  4. Converting Data to RDF (see the rdflib sketch below)
  5. Publishing
  • More than simply uploading files to a server (see the content-negotiation sketch below)
  • Associate an open data licence (e.g. Creative Commons’ CC0, Open Data Commons’ ODC-BY and ODbL)
  • Announce availability to potential user groups (NCSU also registered with the Open Knowledge Foundation’s Datahub, a registry of published linked datasets)
  • Finalize plans for long-term maintenance of the dataset, including update frequency and contact information
  • Share the project files and tools created (e.g. NCSU has posted their scripts with sample data and serialization outputs)
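To make the outline concrete, here is a minimal sketch of the kind of clean-up pass phase 2 describes, assuming the source data is a simple CSV of organization names. The file and column names are hypothetical; Hanson describes NCSU’s process rather than prescribing this code, so treat it as one possible approach.

```python
import csv
import json

def clean_dataset(csv_path, json_path):
    """Normalize a CSV of organization names into structured JSON."""
    seen = set()
    records = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = " ".join(row["name"].split())  # collapse inconsistent whitespace
            if not name:
                continue  # drop incomplete entries
            key = name.casefold()
            if key in seen:
                continue  # drop duplicate entries
            seen.add(key)
            records.append({"name": name})
    # A structured format (here JSON) eases the later conversion to RDF
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

clean_dataset("organizations.csv", "organizations.json")
```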
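Phase 3 is only named in the outline; in a project like this, enhancement typically means reconciling each cleaned record against an external authority and recording any confidently matched URI. The lookup function below is a labelled placeholder, not a real service call:

```python
import json

def lookup_external_uri(name):
    """Hypothetical reconciliation against an external authority file.
    Returns a matched URI string, or None when there is no confident match."""
    return None  # stub: real matching logic is project-specific

with open("organizations.json", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    uri = lookup_external_uri(record["name"])
    if uri:
        record["same_as"] = uri  # carried forward into the RDF conversion

with open("organizations_enhanced.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```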
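Phases 1 and 4 come together at conversion time: the URI structure settled during modelling is what the conversion script mints. Here is a sketch using the Python rdflib library; the base URI and the choice of SKOS and FOAF terms are illustrative assumptions, not the ONLD project’s actual model:

```python
import json

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, OWL, RDF, SKOS

# Hypothetical base URI; a real project would decide this during modelling,
# with attention to HTTP resolvability and long-term persistence.
ONLD = Namespace("http://example.org/onld/organization/")

g = Graph()
g.bind("skos", SKOS)
g.bind("foaf", FOAF)
g.bind("owl", OWL)

with open("organizations_enhanced.json", encoding="utf-8") as f:
    records = json.load(f)

for i, record in enumerate(records, start=1):
    subject = ONLD[str(i)]  # mint one URI per organization
    g.add((subject, RDF.type, FOAF.Organization))
    g.add((subject, SKOS.prefLabel, Literal(record["name"], lang="en")))
    if record.get("same_as"):  # external link captured during enhancement
        g.add((subject, OWL.sameAs, URIRef(record["same_as"])))

# Serialize to Turtle; other serializations (RDF/XML, N-Triples) work the same way
g.serialize(destination="organizations.ttl", format="turtle")
```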
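Finally, “more than simply uploading files to a server” usually includes serving both human- and machine-readable views of each URI through content negotiation. A minimal sketch using Flask, my choice for illustration (the routes and file paths are assumptions):

```python
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/onld/organization/<org_id>")
def organization(org_id):
    accept = request.headers.get("Accept", "")
    if "text/turtle" in accept:
        # Machine clients asking for RDF get the Turtle serialization
        with open("organizations.ttl", encoding="utf-8") as f:
            return Response(f.read(), mimetype="text/turtle")
    # Everyone else gets a simple human-readable page
    return f"<html><body><h1>Organization {org_id}</h1></body></html>"

if __name__ == "__main__":
    app.run()
```

A real deployment would return only the triples for the requested organization and add caching, but the negotiation pattern is the essential point.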

I’ll leave you with this quote from Hanson’s conclusion:

“In these beginning stages of linked data, library staff have an excellent opportunity to influence the development of best practices in linked data publishing if we become prolific creators and users of linked data. The library community should share their expertise in authority control and metadata management and help make their data and the resources in collections an important part of the evolving Semantic Web.”

The article also includes a great Lessons Learned section and a collection of must-read references. Thanks to Hanson and the NCSU project team for sharing this useful project and this inspiring read. Read the complete article for the detailed story and for insights of benefit to anyone setting out on their own linked data project.

[1] Here’s a summary of the ten suggested W3C best practices described by Hyland and others:

  1. PREPARE STAKEHOLDERS: Prepare stakeholders by explaining the process of creating and maintaining Linked Open Data.
  2. SELECT A DATASET: Select a dataset that provides benefit to others for reuse.
  3. MODEL THE DATA: Modeling Linked Data involves representing data objects and how they are related in an application-independent way.
  4. SPECIFY AN APPROPRIATE LICENSE: Specify an appropriate open data license. Data reuse is more likely to occur when there is a clear statement about the origin, ownership and terms related to the use of the published data.
  5. GOOD URIs FOR LINKED DATA: The core of Linked Data is a well-considered URI naming strategy and implementation plan, based on HTTP URIs. Consideration for naming objects, multilingual support, data change over time and persistence strategies are the building blocks for useful Linked Data.
  6. USE STANDARD VOCABULARIES: Describe objects with previously defined vocabularies whenever possible. Extend standard vocabularies where necessary, and create vocabularies (only when required) that follow best practices whenever possible.
  7. CONVERT DATA: Convert data to a Linked Data representation. This is typically done by script or other automated processes.
  8. PROVIDE MACHINE ACCESS TO DATA: Provide various ways for search engines and other automated processes to access data using standard Web mechanisms.
  9. ANNOUNCE NEW DATA SETS: Remember to announce new data sets on an authoritative domain. Importantly, remember that as a Linked Open Data publisher, an implicit social contract is in effect.
  10. RECOGNIZE THE SOCIAL CONTRACT: Recognize your responsibility in maintaining data once it is published. Ensure that the dataset(s) remain available where your organization says it will be and is maintained over time.

[2] Hanson suggests using RDF vocabulary registries like the Open Metadata Registry and Linked Open Vocabularies as good sources.

Comments

  1. Ashley Denham from WhoIsHostingThis.com was kind enough to alert me to a broken link re: the W3C tutorial on XPath. I’ve corrected that link. Denham also suggested a similar resource, which you’ll find on their site: http://www.whoishostingthis.com/resources/xpath/. Many thanks for letting me know!