Are We Approaching the Maturation of Library Linked Data Processes?

It’s nice to see that the processes involved in the creation of library linked data have evolved to the point where you might say they are approaching a degree of maturity. For a while there were a number of technical barriers, including seemingly simple things like deciding which of the many programming languages to invest your time in, or which of the many applications you need to accomplish your linked data goals. A number of useful tools have emerged in the last couple of years, and there are now enough people who have tried them with some success and are sharing their experiences.

One very useful contribution in this regard is Eric M. Hanson’s article, “A Beginner’s Guide to Creating Library Linked Data: Lessons from NCSU’s Organization Name Linked Data Project.” Hanson is the Electronic Resources Librarian at North Carolina State University (NCSU) Libraries, and the project described brings authority control to E-Matrix, their electronic resource management system.

Their process draws on “Best Practices for Publishing Linked Data,” a W3C Working Group Note released as a “work in progress” in January 2014. Although the note is intended to guide and facilitate the development of linked open data for open government initiatives, Hanson considers this W3C paper to be “… one of the best and most concise guides available on the subject of publishing linked data, and though it focuses on publishing government linked data, the project phases described could be applied to any type of linked data project.”

He distils the ten original W3C steps [1] down to the five “project phases” used by the NCSU project team: project definition and modelling; data clean up; data enhancement; converting data to RDF; and publishing. As Hanson notes, these phases are a “modified and reordered version of the framework described by [W3C] and more closely resembles how the NCSU ONLD project developed.”

For those who may not have direct access to this Serials Review article, I’ve condensed and outlined Hanson’s project phases below.

  1. Project Definition and Modelling
  • Select source data, weighing size and complexity against available time and resources
  • Define publishing goals, considering use cases and the needs of potential users
  • Identify suitable existing RDF vocabularies (e.g. Dublin Core, FOAF, SKOS, OWL)[2]
  • Define the structure of the Uniform Resource Identifiers (URIs) to be used in the linked data set (the conversion sketch after this list shows one possible pattern)
  2. Data Clean Up
  • Review the source file and ensure the data are well structured and error free
  • Common errors include inconsistent formatting, duplicate entries, and incomplete information
  • Consider differences based on the history, structure, and original purpose of the data
  • Convert the dataset to a structured format (e.g. XML, JSON) to take advantage of available tools and to minimize errors during the subsequent conversion (a minimal clean-up sketch follows this list)
  • Treat data clean-up as an ongoing part of the project life cycle
  3. Data Enhancement (see the reconciliation sketch below)
  4. Converting Data to RDF (see the rdflib sketch below)
  5. Publishing
  • More than simply uploading files to a server (see the content-negotiation sketch below)
  • Associate an open data licence (e.g. Creative Commons’ CC0, Open Data Commons’ ODC-BY and ODbL)
  • Announce availability to potential user groups (NCSU also registered with the Open Knowledge Foundation’s Datahub, a registry of published linked datasets)
  • Finalize plans for long-term maintenance of the dataset, including update frequency and contact information
  • Share the project files and tools created (e.g. NCSU has posted their scripts with sample data and serialization outputs)
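To make the outline concrete, here is a minimal sketch of the kind of clean-up pass phase 2 describes, assuming the source data is a simple CSV of organization names. The file and column names are hypothetical; Hanson describes NCSU’s process rather than prescribing this code, so treat it as one possible approach.

```python
import csv
import json

def clean_dataset(csv_path, json_path):
    """Normalize a CSV of organization names into structured JSON."""
    seen = set()
    records = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = " ".join(row["name"].split())  # collapse inconsistent whitespace
            if not name:
                continue  # drop incomplete entries
            key = name.casefold()
            if key in seen:
                continue  # drop duplicate entries
            seen.add(key)
            records.append({"name": name})
    # A structured format (here JSON) eases the later conversion to RDF
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

clean_dataset("organizations.csv", "organizations.json")
```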
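Phase 3 is only named in the outline; in a project like this, enhancement typically means reconciling each cleaned record against an external authority and recording any confidently matched URI. The lookup function below is a labelled placeholder, not a real service call:

```python
import json

def lookup_external_uri(name):
    """Hypothetical reconciliation against an external authority file.
    Returns a matched URI string, or None when there is no confident match."""
    return None  # stub: real matching logic is project-specific

with open("organizations.json", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    uri = lookup_external_uri(record["name"])
    if uri:
        record["same_as"] = uri  # carried forward into the RDF conversion

with open("organizations_enhanced.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
```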
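Phases 1 and 4 come together at conversion time: the URI structure settled during modelling is what the conversion script mints. Here is a sketch using the Python rdflib library; the base URI and the choice of SKOS and FOAF terms are illustrative assumptions, not the ONLD project’s actual model:

```python
import json

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, OWL, RDF, SKOS

# Hypothetical base URI; a real project would decide this during modelling,
# with attention to HTTP resolvability and long-term persistence.
ONLD = Namespace("http://example.org/onld/organization/")

g = Graph()
g.bind("skos", SKOS)
g.bind("foaf", FOAF)
g.bind("owl", OWL)

with open("organizations_enhanced.json", encoding="utf-8") as f:
    records = json.load(f)

for i, record in enumerate(records, start=1):
    subject = ONLD[str(i)]  # mint one URI per organization
    g.add((subject, RDF.type, FOAF.Organization))
    g.add((subject, SKOS.prefLabel, Literal(record["name"], lang="en")))
    if record.get("same_as"):  # external link captured during enhancement
        g.add((subject, OWL.sameAs, URIRef(record["same_as"])))

# Serialize to Turtle; other serializations (RDF/XML, N-Triples) work the same way
g.serialize(destination="organizations.ttl", format="turtle")
```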
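Finally, “more than simply uploading files to a server” usually includes serving both human- and machine-readable views of each URI through content negotiation. A minimal sketch using Flask, my choice for illustration (the routes and file paths are assumptions):

```python
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/onld/organization/<org_id>")
def organization(org_id):
    accept = request.headers.get("Accept", "")
    if "text/turtle" in accept:
        # Machine clients asking for RDF get the Turtle serialization
        with open("organizations.ttl", encoding="utf-8") as f:
            return Response(f.read(), mimetype="text/turtle")
    # Everyone else gets a simple human-readable page
    return f"<html><body><h1>Organization {org_id}</h1></body></html>"

if __name__ == "__main__":
    app.run()
```

A real deployment would return only the triples for the requested organization and add caching, but the negotiation pattern is the essential point.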

I’ll leave you with this quote from Hanson’s conclusion:

“In these beginning stages of linked data, library staff have an excellent opportunity to influence the development of best practices in linked data publishing if we become prolific creators and users of linked data. The library community should share their expertise in authority control and metadata management and help make their data and the resources in collections an important part of the evolving Semantic Web.”

The article also includes a great Lessons Learned section and a collection of must-read references. Thanks to Hanson and the NCSU project team for sharing this useful project and this inspiring read. Read the complete article for the detailed story and for insights of benefit to anyone setting out on their own linked data project.

[1] Here’s a summary of the ten suggested W3C best practices described by Hyland and others:

  1. PREPARE STAKEHOLDERS: Prepare stakeholders by explaining the process of creating and maintaining Linked Open Data.
  2. SELECT A DATASET: Select a dataset that provides benefit to others for reuse.
  3. MODEL THE DATA: Modeling Linked Data involves representing data objects and how they are related in an application-independent way.
  4. SPECIFY AN APPROPRIATE LICENSE: Specify an appropriate open data license. Data reuse is more likely to occur when there is a clear statement about the origin, ownership and terms related to the use of the published data.
  5. GOOD URIs FOR LINKED DATA: The core of Linked Data is a well-considered URI naming strategy and implementation plan, based on HTTP URIs. Consideration for naming objects, multilingual support, data change over time and persistence strategies are the building blocks for useful Linked Data.
  6. USE STANDARD VOCABULARIES: Describe objects with previously defined vocabularies whenever possible. Extend standard vocabularies where necessary, and create vocabularies (only when required) that follow best practices whenever possible.
  7. CONVERT DATA: Convert data to a Linked Data representation. This is typically done by script or other automated processes.
  8. PROVIDE MACHINE ACCESS TO DATA: Provide various ways for search engines and other automated processes to access data using standard Web mechanisms.
  9. ANNOUNCE NEW DATA SETS: Remember to announce new data sets on an authoritative domain. Importantly, remember that as a Linked Open Data publisher, an implicit social contract is in effect.
  10. RECOGNIZE THE SOCIAL CONTRACT: Recognize your responsibility in maintaining data once it is published. Ensure that the dataset(s) remain available where your organization says it will be and is maintained over time.

[2] Hanson suggests using RDF vocabulary registries like the Open Metadata Registry and Linked Open Vocabularies as good sources.

Comments

  1. Ashley Denham from WhoIsHostingThis.com was kind enough to alert me to a broken link re: the W3C tutorial on XPath. I’ve corrected that link. Denham also suggested a similar resource, which you’ll find on their site: http://www.whoishostingthis.com/resources/xpath/. Many thanks for letting me know!