from rpi.edu - Tetherless World Constellation

PRE_PUBLICATION DRAFT --- DO NOT CITE
The Importance of Authoritative URI Design Schemes
for Open Government Data
Alexei Bulazel, Dominic DiFranzo, John S. Erickson, James A. Hendler
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY
{bulaza, difrad, erickj4, hendlj2}@rpi.edu
Abstract
A major challenge when working with open government data is managing,
connecting, and understanding the links between references to entities found
across multiple datasets when these datasets use different vocabularies to refer
to identical entities (e.g., one dataset may refer to Microsoft as “Microsoft”, another
may refer to the company by its SEC filing number as “0000789019”, and a third
may use its stock ticker “MSFT”). In this paper we propose a naming scheme
based on Web URLs that enables unambiguous naming and linking of datasets
and, more importantly, data elements, across the Web. We further describe our
ongoing work to demonstrate the implementation and authoritative management
of such schemes through a class of web service we refer to as the “instance
hub”.
When working with linked government data, provided either directly from
governments via open government programs or through other sources, the issue
of resolving inconsistencies in naming schemes is particularly important, as
various agencies have disparate conventions for referring to the same concepts
and entities. Using linked data technologies we have created instance hubs to
assist in the management and linking of entity references for collections of
categorically and hierarchically related entities. Instance hubs are of particular
interest to governments engaged in the publication of linked open government
data, as they can help data consumers make better sense of published data and
can provide a starting point for development of linked data applications.
In this paper we present our findings from the ongoing development of a
prototype instance hub at the Tetherless World Constellation at Rensselaer
Polytechnic Institute (TWC RPI). The TWC RPI Instance Hub enables
experimentation and verification of proposed URI design schemes for open
government data, especially those developed at TWC in collaboration with the
United States Data.gov program. We discuss core principles of the TWC RPI
Instance Hub design and implementation, and summarize how we have used our
instance hub to demonstrate the possibilities for authoritative entity references
across a number of heterogeneous categories commonly found in open
government data, including countries, federal agencies, states, counties, crops,
and toxic chemicals.
Introduction
Motivated by the Obama administration's government transparency initiatives, in
May 2009 the United States launched the Data.gov web portal with a catalog of
47 datasets containing government information that had previously not been
easily available online. [Kirkpatrick, 2009] The aggregation of datasets in a single
online portal at Data.gov made them easier to find and search, and has led to the
publication of datasets previously not available online. During its first year
Data.gov grew to include more than 250,000 datasets and has inspired the
creation of many hundreds of applications and services. [Kundra 2010]
The launch of Data.gov was a key event near the beginning of a worldwide
movement of open government data1 publication as governments, NGOs, and
many other institutions began to make their data openly accessible to interested
parties. During Data.gov’s first year, US municipalities including San Francisco
and New York City, and states including California, Utah, Michigan, and
Massachusetts launched open data portals; around the world, countries including
the UK, Canada, and Australia, as well as organizations such as the World Bank
followed the Data.gov lead. By early 2013, the International Open Government
Dataset Search2 project of the Tetherless World Constellation at Rensselaer
Polytechnic Institute (TWC RPI) had recorded nearly 200 catalogs from over 40
countries, totaling more than a million datasets spanning a vast array of topics.
[Erickson et al, 2011]
The significant growth in number and size of open government data catalogs
since 2009 has been made possible by the emergence of an open government
data ecosystem consisting of policy makers, agencies (as providers and
consumers), data experts, independent software developers and service
providers, academia, and citizen stakeholders. The publication of widely varied
data has inspired a wide range of applications and services, has provided
essential data for journalists, bloggers and activists, and has fueled academic
research. In turn, demand from stakeholders has increased the quantity, quality
and diversity of this data.
1 "Open government data" for the purposes of this paper refers to data released by government
agencies without license restrictions, usually for the purpose of fulfilling a transparency policy. In
the case of open government data released through the US Data.gov portal, this was the "Open
Government Directive" of December 2009. [Orszag 2009] Note that non-governmental groups
have proposed objective criteria for defining open government data, including and especially
"The Annotated Eight Principles of Open Government Data," [http://opengovdata.org/] but since
government providers such as Data.gov have not formally adopted its tenets, it is not appropriate
to use that definition here.
2 http://logd.tw.rpi.edu/page/international_dataset_catalog_search
The growth in the availability of open government data from providers around the
world is encouraging, but for potential users of this data including developers of
applications and services, the variety of formats and practices used to publish the
data can cause interoperability, scalability, and usability problems. Government
datasets are typically published "as is" (i.e., using a variety of structures and
formats), requiring substantial human workload to clean them up for machine
processing and to make them comprehensible. One of the most challenging
issues for consumers of government datasets is establishing the equivalence of
named entities within different datasets.
The Importance of Unambiguous Naming
An emerging issue for providers and consumers of open government data is the
creation and interconnection of names for common entities referenced in
datasets. For example, one dataset may refer to the state of New York as “NY”
while another may use the Federal Information Processing Standards (FIPS)
code of "36" in reference to the state. The lack of common naming schemes
across datasets hinders developers attempting to build applications or to
otherwise derive insights through data “mashups,” wherein datasets are
combined and analyzed together to show a broader context.
The process of finding common references to entities across multiple datasets is
further complicated by unintentional naming clashes that may arise. For example,
“NY” might refer either to the city or state of New York. It would be wrong to make
inferences about the entire state from data about the city, and vice versa. The
name “NY” is not sufficiently unique and does not provide us with enough
information to accurately link multiple datasets together. One solution would be to
name these common concepts using Uniform Resource Identifiers (URIs), the
fundamental identification scheme for resources on the Web. 3
3 The acronym “URI” stands for “uniform resource identifier”, which includes both URLs (uniform
resource locators) and URNs (uniform resource names). URLs specify how to reach a specific
resource online (e.g., http://example.com), and URNs identify a specific resource within a given
namespace (e.g., urn:isbn:0-12-385965-4 uniquely identifies “Semantic Web for the Working
Ontologist” by James Hendler and Dean Allemang within the space of all books with ISBNs).
URIs may be URLs, URNs, or both at once, and we use the term to broadly describe web
naming schemes that may be used to describe or identify data entities. For more discussion on
URIs, see <http://www.w3.org/TR/uri-clarification/> and RFC 3986 <http://tools.ietf.org/html/rfc3986>.
URIs are ideal for uniquely naming and linking common entities found throughout
open government data. Given careful design, the structure and syntax of URIs
can provide clarity as to the controlling authority for the entities they name and
some insight as to their purpose. For example, a news item found on the
"rpi.edu" domain referring to an event at RPI would likely be considered more
trustworthy than news from other sources, since Rensselaer Polytechnic Institute
is known to control the domain "rpi.edu." Similarly, government agencies can use
URIs in domains they control (e.g., “.gov” for the US government) to assert
authority over the names of entities they oversee.
URIs, Structured Data and Instance Hubs
The structural freedom of the Web has been vital to its growth, but in practice has
also led to the propagation of a large amount of factually incorrect information. As
the Web has grown and become increasingly vital to human civilization around
the world, online resources have naturally stratified into varying levels of authority
within their respective topical domains. A government agency's official web
presence can serve as the authoritative source of information about its activities,
in contrast to sources with no official status like Wikipedia, the news media, or
social networks.
The use of URIs with clear, systematically structured syntax to name resources
allows providers to express authoritative references that both name and describe
entities to both computers and humans. An instance hub is a web-based service
that implements an authoritative, structured URI scheme for collections of
categorically and hierarchically related entities. Instance hubs are of particular
interest when applied to linked open government data (LOGD), a method of
structured data publication that exposes, shares, and connects data, information,
and knowledge using URIs and the Resource Description Framework (RDF) 4.
[Ding 2012] Instance hubs can provide key infrastructure for managing
references to entities commonly found in government published datasets. In the
next section we present a proposed set of URI design principles suitable for
adoption by governments involved in the publication of LOGD, which we have
used as the basis for the TWC RPI Instance Hub implementation. These URI
Design Principles are also available online on the TWC website. 5
4 http://www.w3.org/RDF/
5 http://logd.tw.rpi.edu/instance-hub-uri-design
Design Principles
An increasing number of governments and government agencies have begun
publishing linked open government data. From this work, policies and best
practices have emerged, and continue to emerge, for the use of URIs in open
government data release. RPI TWC is working extensively with key players in the
United States and international linked open government data initiatives to
develop URI schemes that are useful today and in the future.
Developing ways of handling authoritative data release is especially important in
the open government data space as developers of linked data and Semantic
Web applications have few tools with which to verify that data they are using is
factually correct or at least has been obtained from authoritative sources.
Methods and infrastructure for clearly establishing a government's authority over
such data are essential. Governments in turn have similar concerns, and want to
ensure that users can tell the difference between data they directly publish and
data that has been released unofficially by non-authoritative sources. Linking to
authoritative data can be achieved by associating dataset entity references (to
government agencies, states, etc.) with named instance hub URIs that present
authoritative metadata about these entities (names, statistics, official logos, etc).
These authoritative URIs may then in turn be used by developers to foster easy
linking between datasets and disambiguate entity references.
The Linked Data Cloud [LODCloud 2011] illustrates the Semantic Web’s heavy
dependence on DBpedia as a central locus for online data linking and its role as
the de facto source of authoritative URIs. While DBpedia has been immensely
useful for building data infrastructure online, it is not well suited for this role when
applied to government data. DBpedia presents crowdsourced information from
Wikipedia that has been converted to RDF, but the organizations with authority
over the actual entities referenced in DBpedia data have no way to correct
inconsistencies or otherwise control the publication of information about entities
under their control, especially those appearing in other linked sources. Great
strides have been made in the publication of open government data over the past
several years using DBpedia as a source of entity identifiers, but linking
government data using DBpedia or other non-governmental sources as central
loci is not ideal. Instance hubs provide a way for government data providers to
share authoritative information about the entities referenced in their linked data
publications.
The Linked Open Data Cloud6 represents the inter-linking of datasets published in
Linked Data format, by contributors to the Linking Open Data community project and
other individuals and organisations. The depiction is based on metadata collected and
curated by contributors to datahub.io.
Instance Hubs and Entity Name Reconciliation
Instance hubs can serve as central loci for entity linking on the Semantic Web,
presenting authoritative, descriptive URIs which allow for easy disambiguation of
references. A common issue encountered when combining linked data from
disparate datasets is the use of different naming schemes to refer to the same
entity across datasets. Instance hubs were first conceived as a solution to this problem,
with the intent that they could present multiple alternative names for a given
entity and thereby foster linking between datasets using different naming schemes.
For example, states may be referred to by commonly used full names (e.g., “New
York”), official legal names (e.g., “Commonwealth of Massachusetts”), or any number
of other schemes such as two-letter abbreviations or FIPS codes.
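The reconciliation idea can be sketched in a few lines of Python. This is an illustration only, not the TWC implementation: the alias table and the function names are hypothetical, the table is assumed to cover state names only (a bare “NY” could otherwise mean the city), and the path /id/us/state/New_York follows the URI scheme described later in this paper.

```python
# Hypothetical alias table: each alternative label points at one hub path.
ALIASES = {
    "new york": "/id/us/state/New_York",  # commonly used full name
    "ny": "/id/us/state/New_York",        # two-letter abbreviation
    "36": "/id/us/state/New_York",        # FIPS state code
}


def reconcile(label):
    """Return the hub path for a label, or None if the label is unknown."""
    return ALIASES.get(label.strip().lower())


def same_entity(label_a, label_b):
    """True only when both labels resolve to the same known hub path."""
    a, b = reconcile(label_a), reconcile(label_b)
    return a is not None and a == b
```

Two datasets that use “NY” and “36” respectively can then be joined on the shared hub path rather than on their local labels.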
6 http://lod-cloud.net/versions/2011-09-19/lod-cloud.html
URIs managed through instance hubs are expected to be dereferenceable to
obtain descriptive data about the entities they represent, both as human-readable
web pages and as RDF exposed through HTTP content negotiation. Instance
hubs may also be used to aid in dataset discovery, by presenting links to
datasets associated with each instance housed within the instance hub.
Instance hubs serve several purposes at once, providing disambiguation of entity
references in datasets, presenting easily accessible metadata about entities,
fostering the linking of online open data, and aiding in dataset discovery.
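The dereferencing behavior described above (HTML for people, RDF through content negotiation) might look roughly like the following server-side sketch. The function name, the media-type table, and the file extensions are assumptions made for illustration, and quality values in the Accept header are deliberately ignored.

```python
# Assumed mapping from requested media types to document extensions.
EXTENSIONS = {
    "text/html": ".html",
    "text/turtle": ".ttl",
    "application/rdf+xml": ".xml",
}


def choose_representation(path, accept_header):
    """Pick the document path for the first supported type in the Accept header.

    Quality values (q=...) are ignored in this sketch: types are tried in
    the order the client listed them, and HTML is the fallback.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in EXTENSIONS:
            return path + EXTENSIONS[media_type]
    return path + ".html"
```

A production service would honor q-values and reply with an HTTP redirect rather than return a string.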
To help address these many potential uses of instance hub technology, we
created a URI scheme that provides rich descriptive information about entities for
humans examining open government data, while encouraging intuitive navigation
and exploration of this data. The implementation of this scheme in the instance
hub shows its utility for data publishing, and we hope to see government support
and adoption of this scheme in order to further test its viability.
Design Goal 1: Easily Rehosted URIs
One of our most important goals in creating the TWC instance hub was to
provide a way for government agencies to act as an authority in publishing linked
data, using entity naming schemes that they manage and create. Our solution
was to create a URI scheme that is easily rehostable, meaning that URIs are not
tightly bound to the specific domains at which they are hosted. These URIs
should easily be transformable across multiple “base” domain names. For
example, a URI pattern used to refer to toxic chemicals as classified by the US
Environmental Protection Agency is:
http://logd.tw.rpi.edu/id/us/fed/agency/Environmental_Protection_Agency/chemical/XXXXX
This URI can easily be rehosted on other sites without losing any meaning:
http://example.com/id/us/fed/agency/Environmental_Protection_Agency/chemical/XXXXX
A more interesting use case is if the Environmental Protection Agency decided to
itself publish RDF datasets on chemicals. In this case, since the agency’s interest
would be in providing its own authoritative but agency-specific view of toxic
chemicals and not necessarily serving as the sole representation of that
knowledge for the government, it might choose to adopt an alternative domain-specific URI scheme such as:
http://epa.gov/id/chemical/XXXXX
If Data.gov decided to also publish chemical data from EPA, they might do so at
http://data.gov/id/fed/agency/Environmental_Protection_Agency/chemical/XXXXX
skipping the “us” in the URI as Data.gov’s association with the US government is
obvious and implied. Shortening these URIs to minimal length is by no means a
design requirement, but it is a possibility for publishers not interested in
presenting a full explicit URI hierarchy.
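Rehosting as described here amounts to swapping the host while keeping the path that begins at “/id/”. The sketch below illustrates this; the function name is hypothetical, and discarding any deployment prefix that precedes “/id/” is an assumption of this example rather than a stated rule of the scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def rehost(uri, new_base):
    """Move an instance hub URI to another host, keeping its /id/... path."""
    parts = urlsplit(uri)
    start = parts.path.find("/id/")
    if start == -1:
        raise ValueError("not an instance hub URI: " + uri)
    # Assumption: anything before /id/ is a deployment prefix and is dropped.
    return urlunsplit((parts.scheme, new_base, parts.path[start:], "", ""))
```

Shortened variants such as the Data.gov form above, which omits “us”, would need an additional publisher-specific mapping that this sketch does not attempt.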
When chemicals are referenced in datasets from these agencies, the responsible
agencies can link the instance hub URI for each chemical, helping consumers of
this data to further link it with their own or with other datasets, as well as
providing them with additional metadata about the chemicals beyond what is
available in the datasets they are linking. This additional metadata can be used
to enrich applications and data mashups created by consumers and developers.
Design Goal 2: Concise URIs
A second guideline for creating instance hub URIs is that they should be concise,
with as little extra material as possible. Our practice in building the TWC Instance
Hub has been to create concise URIs that enable instance hub users to intuitively
navigate through the instance hub by interpreting and manipulating URIs.
For the TWC instance hub implementation it was important that users should be
able to navigate to any substring of given URI as denoted by slashes in the URI
and at the least be presented with a logical HTML page that reflects the content
one would expect to find at that page, including category and subcategory
listings. Concise and clean URI design has been an important part of realizing
this goal.
Phil Archer’s “10 Rules For Persistent URIs” [Archer 2012] provides design
guidance based on a 2012 EU survey studying approximately twenty cases from
EU agencies and services, and EU Member States and standardization bodies
and initiatives, where URI management and persistence have been subject to
policy. These rules have been met for the most part in the design and
implementation of the TWC instance hub.7
By keeping URI schemes consistent and only changing the extension associated
with the final “token” in the URI, we enable easy and consistent navigation
through URIs. For example, a user viewing the page at /id/us/state/New_York
can simply remove the last token of the URI (“/New_York”) and view all states at
/id/us/state, and from there move to /id/us/state/Connecticut, all without the
confusion of switching between id and id_page that was necessary in
previous instance hub versions.
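Because every slash-delimited prefix of a path is expected to resolve to a sensible page, the pages reachable by trimming a URI can be listed mechanically. The helper below is a hypothetical illustration of that navigation property, not code from the instance hub.

```python
def ancestors(path):
    """List each shorter path obtained by dropping trailing segments,
    nearest parent first, stopping at the /id root."""
    segments = path.strip("/").split("/")
    if segments[0] != "id":
        raise ValueError("path must start with /id")
    return ["/" + "/".join(segments[:n]) for n in range(len(segments) - 1, 0, -1)]
```

For /id/us/state/New_York this yields /id/us/state, then /id/us, then /id.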
Design Goal 3: Multi-domain URIs
This goal stipulates that URI design patterns should be applicable to many
domains of authority, including national identifiers (e.g. governmental agencies,
states, provinces, zip codes); state-level identifiers (e.g. counties, congressional
districts); and agency-level identifiers (e.g. EPA facilities). This is critical to
human-friendly design and contributes to intuitive interpretation of the generated
URIs, but can also potentially be the cause of lengthy URIs. While the resultant
URIs may be long, in the TWC RPI instance hub we have made them only as
long as needed, keeping them as concise as possible while maintaining a logical
hierarchical structure.
7 One rule we did not follow is Archer’s recommendation that URIs not contain file formats
(“.html”, “.ttl”, “.xml”, etc). When visiting a given URI in a web browser, the user is redirected to the
same URI with “.html” appended to the end, and they may then change this extension to view
other Semantic file formats. Traditional Semantic Web “content negotiation” via HTTP request
headers is still also supported. This behavior comes from the LODSPeaKr publishing framework
used to build the instance hub, discussed later in this paper.
In the case of entities that are logical sub-parts of other entities, URIs simply take
the larger entity’s URI and append a unique token on to it. This pattern is used in
sub-agency and state county URIs; for example the Department of Energy’s
Office of Science can be found at
/id/us/fed/agency/Department_of_Energy/Office_of_Science, and New York
state’s Rensselaer county is found at /id/us/state/New_York/Rensselaer. In the
RPI instance hub these sub-entities are associated with their larger “parent”
entities8, and are listed in documents describing the larger entities (both HTML
for humans and RDF documents about the entities).
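Since a sub-entity's URI is its parent's URI plus one token, the parent link can be derived from the URI itself. The sketch below emits that link as a single N-Triples line using skos:broader; the function name and the example base are hypothetical, and whether the instance hub generates its relations this way is not stated in this paper.

```python
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"


def broader_triple(base, child_path):
    """Return an N-Triples line stating that the child has the parent as its
    broader entity, where the parent is the child path minus its last token."""
    parent_path = child_path.rsplit("/", 1)[0]
    return "<{0}{1}> <{2}> <{0}{3}> .".format(
        base, child_path, SKOS_BROADER, parent_path
    )
```

This shortcut applies only to direct sub-entities such as counties and sub-agencies; paths that end in a category, or that follow one, need separate handling.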
In other cases, a category of entities may be logically related to another entity,
but may not be a direct sub-class, so a category identifier is appended to the
entity's URI before an identifier token for a related entity is placed at the end.
Examples of this pattern include crops as defined by the USDA and toxic
chemicals as defined by the EPA; a listing of crops may be found at
/id/us/fed/agency/Department_of_Agriculture/crop, with a crop-specific identifier
token following “crop” (e.g., “soybean”, “cauliflower”), and toxics may be found
at /id/us/fed/agency/Environmental_Protection_Agency/chemical with specific
identifiers following “chemical” (e.g., “hydrochloric_acid”,
“hexachlorocyclopentadiene”).
URI Design Details
The TWC RPI instance hub URI patterns implement and build upon a set of
design goals for persistent open government data URIs recommended earlier by
TWC researchers. As noted in the previous section, the three main design goals
set forward in our recommendation are that URIs should be easily rehosted,
should be concise, and should be cross-domain. The basic URI structure upon
which the instance hub is based is:
“http://” [base] / “id” / ( [category] / [token]*)+ / [token]+
8 Via skos:broader relations.
Finite state machine representation of URI structure via http://www.regexper.com
“Base” denotes the base domain of the site hosting the instance hub. This allows
other government organizations to easily rehost our instance hub URIs under
their own domain names. The “id” part of the URI scheme ensures that the base
namespace of the site is not "polluted" with identifiers for entities; from this root,
all other instance hub URIs follow. Instance hub entities are disambiguated by a
combination of categories and tokens included in their URIs. Categories are
broad and provide a classification for multiple entities. For example “agency”
could be a category that contains the government agencies of a particular
country. No entity URI may end with a category token, since categories are
themselves not entities. Examples of URIs ending with categories include
/id/country, or /id/us/fed/agency; these URIs might resolve to a presentation of a
listing of entities which may be described by the URI category hierarchy, but no
actual entity exists at the URI.
A tree style representation of URI “categories”, which present listings of the
entities that are logically categorized beneath them.
Example “category” listing of country resources available in the instance hub.
Tokens serve to provide unique identification of a given entity within a specific
category, or following from another token. For example, the token
“Department_of_Justice” is ambiguous by itself, but could uniquely identify the
United States Department of Justice at /id/us/fed/agency/Department_of_Justice
versus the Canadian Department of Justice at
/id/ca/fed/agency/Department_of_Justice. While all entities must proceed from
the root of “id” with at least one category in their URI, entity URIs may then be
formed by any combination of categories and tokens so long as the URI ends
with a token. Federal government subagencies may be specified as a token
proceeding from a federal government token: /id/us/fed/agency/Department_of_Defense
is a valid entity URI, but we may also view the Navy as a sub-agency at
/id/us/fed/agency/Department_of_Defense/Department_of_the_Navy.
Other combinations are also possible, such as presenting a listing of crops as
defined by the US Department of Agriculture at
/id/us/fed/agency/Department_of_Agriculture/crop (token
Department_of_Agriculture followed by category crop), and a specific crop
(cauliflower) at /id/us/fed/agency/Department_of_Agriculture/crop/cauliflower.
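The category and token rules can be checked mechanically once a category vocabulary is available. The following sketch assumes a small, hypothetical vocabulary (treating “us”, “ca”, and “fed” as categories is our reading of the examples, not something the scheme states explicitly) and tests that a path begins at /id, starts with a category, and ends with a token.

```python
# Hypothetical category vocabulary; a real hub would derive this from its data.
CATEGORIES = {"country", "us", "ca", "fed", "agency", "state", "crop", "chemical"}


def classify(path, categories=CATEGORIES):
    """Label each segment after /id as 'category' or 'token'."""
    segments = path.strip("/").split("/")
    if len(segments) < 2 or segments[0] != "id":
        raise ValueError("path must start with /id/")
    return [(s, "category" if s in categories else "token") for s in segments[1:]]


def is_entity_path(path, categories=CATEGORIES):
    """An entity path starts with a category and ends with a token."""
    kinds = [kind for _, kind in classify(path, categories)]
    return kinds[0] == "category" and kinds[-1] == "token"
```

Under these assumptions, /id/us/fed/agency/Department_of_Agriculture/crop/cauliflower is an entity path, while /id/us/fed/agency/Department_of_Agriculture/crop is a category listing.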
If crop data from the Canadian agriculture and food inspection agency,
Agriculture and Agri-Food Canada, were incorporated into the same instance hub
it would be located at /id/ca/fed/agency/Agriculture_and_Agri-Food_Canada/crop.
If the Canadian government attaches the same meaning to an entity called
“cauliflower” as the USDA does, links denoting their
equivalence could then be established
between /id/ca/fed/agency/Agriculture_and_Agri-Food_Canada/crop/cauliflower
and /id/us/fed/agency/Department_of_Agriculture/crop/cauliflower.
On the Semantic Web, equivalence between entities is commonly expressed
using the “sameAs” property within the OWL web ontology language (denoted
“owl:sameAs”). To state in an RDF document that “entity_A owl:sameAs entity_B”
means that (as “sameAs” would imply) these entities are in fact the same
thing. This may be useful as entities may be hosted on different sites, have
different local names, or as in the example of cauliflower given previously, come
from different naming authorities (USDA vs. AAFC). The owl:sameAs property is
commonly used on the Semantic Web to establish links between RDF
documents describing the same entity or concept. For example, in the TWC
instance hub, the US Army is found at
http://logd.tw.rpi.edu/ih2/id/us/fed/agency/Department_of_the_Army/Department_of_the_Army,
while on DBpedia the Army is found at
http://dbpedia.org/resource/United_States_Department_of_the_Army . The
instance hub RDF data on the Army states that it is owl:sameAs the DBpedia URI
for the Army, so anyone who views this RDF knows that it is appropriate to treat
any dataset reference to one of these as a reference to the other, and that any
property of the Army asserted on one site is also true for the Army entity
referenced in the other.
In the cauliflower example, the “links denoting their equivalence” would be
owl:sameAs links. Entity equality may be explicitly indicated across the instance
hub for unique names for given entities; for example, crop “XXXX” under a given
classification system could be owl:sameAs linked to crop “YYYY” in another if
they are in fact scientifically the same. If two dissimilar entities are referred to by
the same name in two different countries, no owl:sameAs link would be
established between them, since it would not be appropriate for data consumers to
infer that the entities are the same.
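The practical effect of owl:sameAs, that properties asserted about one URI also hold for its equivalents, can be illustrated without an RDF library. The sketch below is a simplified stand-in for real OWL reasoning: triples are plain string tuples, the predicate is matched by the literal string "owl:sameAs", and a union-find structure groups equivalent URIs.

```python
from collections import defaultdict


def merge_same_as(triples):
    """Group URIs linked by owl:sameAs and pool their other properties.

    `triples` is an iterable of (subject, predicate, object) strings.
    Returns a dict mapping each URI to the set of (predicate, object)
    pairs asserted about any URI in its equivalence group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    facts = []
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)  # union the two groups
        else:
            find(s)  # register the subject even if it has no sameAs link
            facts.append((s, p, o))

    pooled = defaultdict(set)
    for s, p, o in facts:
        pooled[find(s)].add((p, o))
    return {uri: set(pooled[find(uri)]) for uri in parent}
```

URIs that share a name but carry no owl:sameAs link stay in separate groups, so nothing is inferred between them.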
By allowing for specific classification of entities in this way, instance hubs can
enable developers of linked data applications to determine exactly what sort of
entities they are dealing with, while also not impeding them from gaining access
to a large amount of data about these entities. If a mapping exists between
“cauliflower” from US data sets and “cauliflower” from Canadian data sets, any
properties only associated with one of these entities may easily be associated
with the other. While crops may seem a trivial example, the implications of this
sort of linked data technology become more important in contexts where datasets
may not share so simple a common vocabulary as vegetable names, or where a
single entity may have a number of alternative identifiers. References in datasets
to corporate entities are a good example of this: they may use a common name,
a legal name, a stock ticker, a tax id, or various other identifier standards, any of
which might differ for the same corporation when conducting business in different
countries. [OC 2011] For example, datasets making references to Microsoft may
refer to it by names such as “Microsoft”, “Microsoft Corporation”,
“MICROSOFT CORP”, as well as its SEC filings identifier, “0000789019”, its US
Senate lobbying disclosure filings identifier “25204”, its ticker symbol “MSFT”, its
US federal tax identifier “91-1144442”, or any number of other unique ways.9,10
9 http://tw.rpi.edu/orgpedia/page/company/0000327629
10 “The Federal Tax Identification Number for Microsoft“, http://support.microsoft.com/kb/834344
TWC RPI Instance Hub Implementation
RPI TWC’s prototype instance hub was first implemented in PHP as an extension
to the TWC Linking Open Government Data (LOGD) Portal. [Ding 2011] This
version was an effective proof-of-concept but presented usability and system
development constraints, so for the second (current) iteration we adopted the
LODSPeaKr platform11, a linked data publishing framework developed at TWC.
[Graves 2012] The move to LODSPeaKr has made instance hub development
and management significantly easier, while also improving the general user
interactive experience.
LODSPeaKr uses a templating system, whereby templated HTML pages are
associated with individual RDF types. When the URI of an entity within the
instance hub is dereferenced (requested), LODSPeaKr initiates a SPARQL
query12 to determine the RDF type of the entity being requested and serves the
appropriate page as determined by the type. Once the type of the URI being
requested has been established, LODSPeaKr chooses a type template to display
the URI that is being dereferenced. LODSPeaKr’s templating system allows
developers to define SPARQL queries for data associated with each type that
they present in the instance hub, and it uses the results of these queries to fill in
HTML page templates for each “type” in the instance hub. For the TWC Instance
Hub we defined specific type templates for the classes of entities contained
within it: countries, US federal agencies, toxic chemicals, etc., allowing us to style
and incorporate available data differently for each one.
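The type-driven templating described above can be approximated in a few lines. This is not LODSPeaKr's API: the type names, templates, and the `render` function are invented for illustration, and a plain dictionary stands in for the results of the SPARQL type and data queries.

```python
from html import escape
from string import Template

# Hypothetical templates keyed by RDF type.
TEMPLATES = {
    "ex:State": Template("<h1>$label</h1><p>State page with map and flag.</p>"),
    "ex:Agency": Template("<h1>$label</h1><p>Agency page.</p>"),
}
DEFAULT = Template("<h1>$label</h1>")


def render(entity):
    """Choose a template by the entity's type and fill it with its data.

    `entity` stands in for query results: a dict with 'type' and 'label'.
    Values are HTML-escaped before being placed in the page.
    """
    template = TEMPLATES.get(entity["type"], DEFAULT)
    return template.substitute(label=escape(entity["label"]))
```

Keeping one template per type lets each class of entity be styled separately while the dispatch logic stays the same.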
11 http://lodspeakr.org
12 SPARQL, the “SPARQL Protocol and RDF Query Language” is a query language for Semantic
RDF data, much like SQL is a query language for tabular data.
Two example instance hub pages: the State of New York and the US President’s
Council of Economic Advisors. LODSPeaKr’s templates allowed us to style these
pages differently; for example, as a geographic entity, New York is shown on an
embedded map at the top of the page alongside its state flag. See appendix for
New York state RDF document.
Principles for both publishing and consuming linked data were applied in the
design of the TWC Instance Hub. In particular, some entity properties which the
instance hub presents are not natively stored in TWC’s SPARQL endpoint, but
dynamically pulled from other sources. The primary source of external data on
the instance hub is DBpedia. Each entity in the instance hub has an associated
owl:sameAs DBpedia URI, and selected DBpedia data is presented alongside
locally hosted data on instance hub HTML pages and RDF documents. As the
instance hub’s goal is to facilitate data linking and entity disambiguation, not to
serve as a large-scale repository for data itself, each page presents far less
information than the DBpedia page on the same entity. The data that the
instance hub pulls in from other sites serves to further
enhance the instance hub user experience by providing additional descriptive
resources (such as country or state flags, or for geographic entities, an
embedded Google map displaying the location of the entity in question, with the
location coordinates derived from DBpedia).
The TWC Instance Hub has been designed to demonstrate the full potential of
the instance hub concept, and therefore provides both internally and externally
sourced data. A government-hosted and managed instance hub may instead
choose to provide only internally sourced data. Authoritative government-run
instance hubs might provide links to external resources that reference the entities
that they house, but directly pulling resources from external sites could expose
users of the instance hub to potentially incorrect or inappropriate, unvetted data,
and potentially could present a vector for malicious individuals to carry out cross-site scripting attacks13. An early demo instance hub created by one of the authors
(Bulazel) during an internship with Data.gov intentionally used only government-hosted resources for this reason.
13 Cross-site scripting, or XSS, refers to a class of computer security vulnerabilities commonly
found in web applications. XSS enables malicious attackers to inject JavaScript into vulnerable
pages, and when this code is executed by other users it can present a serious security issue.
Relying on external data sources for instance hub data could present an XSS vulnerability for
governments. Even if the government sites do not themselves have a vulnerability allowing
malicious code to be injected, if an XSS vulnerability in an external data provider exists and is
exploited, and this code is then pulled into a government instance hub, users are harmed just the
same (and in turn, this code may then propagate to end users of applications developed relying
on government instance hub resources). Simply not using external data at all significantly
reduces the attack surface available to malicious individuals.
In order to generate instance hub pages that provide listings of categorical
information at descriptive partial URI paths (e.g., /id/us/state for US states and
everything else “below” them, or /id/country for countries and everything below
them), LODSPeaKr “service” pages were created, allowing for the presentation of
custom developer-defined content. Each page is populated by queries that
retrieve all entities of each type that should be displayed on the page. In the
interest of extensibility and ease of management, rather than implementing a single
query to retrieve all entities that should be presented on a single page (e.g., all
US states, agencies, counties, etc. under /id/us), a number of queries were
implemented, one for each of these categories. The queries used in service
pages are all stored in a common directory, and new service pages can easily be
created, with the developer simply writing a few lines of LODSPeaKr template
code and associating it with each query needed to populate the page they are
creating.
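The one-query-per-category arrangement can be pictured as a small registry that maps each service page path to the named queries that populate it. The Python sketch below is illustrative only: the query names, the example.org class URIs, and the exact set of categories listed under each path are assumptions, and LODSPeaKr itself stores such queries as files rather than in a Python dictionary.

```python
# Hypothetical registry of per-category queries and the service pages using them.
# Class URIs under example.org are placeholders, not the hub's real vocabulary.

CATEGORY_QUERIES = {
    "states": "SELECT ?e WHERE { ?e a <http://example.org/vocab/State> }",
    "agencies": "SELECT ?e WHERE { ?e a <http://example.org/vocab/Agency> }",
    "counties": "SELECT ?e WHERE { ?e a <http://example.org/vocab/County> }",
    "countries": "SELECT ?e WHERE { ?e a <http://example.org/vocab/Country> }",
}

# Each service page lists the categories shown at or "below" its path.
SERVICE_PAGES = {
    "/id/us": ["states", "agencies", "counties"],
    "/id/us/state": ["states", "counties"],
    "/id/country": ["countries"],
}


def queries_for_page(path: str) -> dict[str, str]:
    """Return the named queries that populate one service page, in page order."""
    return {name: CATEGORY_QUERIES[name] for name in SERVICE_PAGES[path]}
```

Under this arrangement, adding a new category means writing one query and listing its name under each page that should show it; no existing query needs to change.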
Instance Hub Data
The data presented in the current RPI TWC Instance Hub was chosen so as to
show the possibilities for the use of this technology within government. A variety
of hierarchically related and heterogeneously typed entities were chosen to
demonstrate the URI design of the instance hub. While most of the instance hub
data is currently from the United States, we plan on expanding its scope to
include agency and state/province level entities from other countries involved in
the publication of open government data.
As of April 2014 the TWC instance hub housed information about world
countries, United States federal agencies and subagencies, US states and their
counties, toxic chemicals as defined by the US EPA, and crops as defined by the
USDA.
Instance hub pages and RDF documents provide a variety of high level metadata
about the entities that they describe. Alternative entity names and links to
external sites (DBpedia, Freebase, etc) for each entity are presented so as to
assist with linking data. Additional data such as a brief description of the entity
from DBpedia and a link to a flag or logo are also provided to help developers
seeking to develop linked data applications using instance hub entities. Links
to related entities are also presented (sub-agencies, state counties, etc.).
One important use case for the instance hub is data exploration and discovery.
To this end, we have pursued work integrating links to TWC hosted linked open
government datasets with related instance hub entities. Currently, this data
comes from TWC’s International Open Government Dataset Catalogue Search
(IOGDS) dataset repositories. Each time an instance hub page is loaded, a query
for pertinent IOGDS data is sent out to the TWC endpoint to retrieve a related
dataset listing. Due to the very large number of IOGDS datasets
available from TWC (over one million), only a small selection of relevant datasets
is presented on each instance page, as a way to pique a viewer’s interest in
further exploring IOGDS. Better integration of these datasets with the instance
hub is still an area of ongoing research, and we hope to develop new techniques
to further enhance the instance hub’s capabilities as a linked data discovery
tool.
Future Work
In order to take full advantage of linked data, named entities expressed in natural
language form in datasets must be recognized and associated with
corresponding URIs administered in the relevant instance hub. When aliases or
incorrect spellings have been used to name a given entity, disambiguation is
required to ensure that the canonical URI is assigned. Additionally, if identifiers
from proprietary naming systems are used in the data, reconciliation of these with
instance hub URIs should take place. An emerging body of work focused on
named entity recognition, leveraging progress in natural language processing, is
contributing tools and methods to the problem of recognizing entity mentions
within data. [WoLI 2012] Future work should include a focus on the development
of efficient, automated workflows that apply named entity recognition algorithms
at scale and integrate effectively with a network of instance hubs, producing high-quality, useful linked data.
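The simplest part of this reconciliation step, mapping known aliases to a canonical URI, can be sketched as an inverted index. The Python below uses the identifiers listed for New York in the appendix; the function names are hypothetical, and the sketch handles only differences in case and whitespace. Misspellings, and identifiers that are ambiguous across entity categories, would need the fuzzy matching and disambiguation techniques discussed above and are not attempted here.

```python
from typing import Optional

NEW_YORK = "http://logd.tw.rpi.edu/id/us/state/New_York"

# Alternative identifiers taken from the New York record in the appendix.
ALIASES = {
    NEW_YORK: ["36", "ny", "new york", "New York", "NEW YORK", "NY"],
}


def normalise(label: str) -> str:
    """Collapse runs of whitespace and ignore case."""
    return " ".join(label.split()).casefold()


# Inverted index from normalised alias to canonical instance hub URI.
INDEX = {
    normalise(alias): uri
    for uri, names in ALIASES.items()
    for alias in names
}


def resolve(label: str) -> Optional[str]:
    """Return the canonical URI for a known alias, or None if unrecognised."""
    return INDEX.get(normalise(label))
```

Returning `None` for unrecognised labels, rather than guessing, leaves those mentions available for a later disambiguation pass or manual review.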
The core value of an instance hub lies in its implementation of an authoritative
URI scheme, but is greatly enhanced by the extent to which it provides links to
alternative data. The TWC RPI Instance Hub accomplishes this by providing links
for certain entities to alternative entity identifier schemes appropriate to the
category; for example, for toxic chemicals we link to identifiers for chemicals in
other canonical systems, including PubChem and ChemSpider.14,15 In the current
TWC RPI Instance Hub these alternative schemes have been identified by hand;
future work should include identifying scalable mechanisms to automate this
process using the global web graph and incorporating resources such as
Freebase and DBpedia.
Challenges and Guidance for Government Adopters
Government adoption of the implementation approach we have tested will likely
require several steps. First, governments will need to internally compile the
authoritative entity information that they would like to publish online for open data
consumers. Second, following the instance hub pattern this data will need to be
published as linked open data, requiring conversion into RDF. [Ding 2011, Ding
2012] Third, a web application will be required to present this data, and finally the
data should be published online. Governments should engage the open data
community during this process to ensure that they are in fact creating usable
systems that present desirable, useful data.
The United States’ Data.gov has been a world leader in promoting the use of
Semantic Web technologies in the publication of open government data [Kundra
2010]. During the Summer of 2012 the site hosted one of the authors (Bulazel)
as a summer intern exploring the use of instance hub technology for Data.gov’s
open data publication work. Much of TWC’s recent work with developing instance
hubs builds on this experience, and many valuable lessons for governments
seeking to publish linked data in Semantic formats can be taken from these
experiences.
14 http://pubchem.ncbi.nlm.nih.gov/
15 http://www.chemspider.com/
We recommend that government agencies seeking to implement instance hub
technology do the following: plan their URI schemes before implementation,
consider the security and authority implications of their applications, and finally
carry out implementation having fully prepared the content that will be presented.
Community and intra-government collaboration is crucial to this process, and
these stakeholders should be involved at each step.
For these reasons, we strongly recommend that agencies transparently plan out
the URI schemes that they propose using in their instance hubs, seeking input
from developers, academia, and other agencies. All three of these stakeholders
have valuable contributions to make to the planning process. Developers who will
ultimately use the instance hub in their open data applications have a vested
interest in ensuring that the hub is usable and meets their needs. Academics
have valuable insight into best practices for the deployment of experimental
Semantic Web applications like instance hubs, and should also be engaged in
the process. Finally, agencies should collaborate with other agencies that may be
developing instance hubs, sharing best practices and implementation tips and
ensuring the interoperability and linkability of their data.
The security and authority implications of instance hubs are another important
consideration for governments. While linked data applications often feature
information dynamically drawn from multiple sources, we recommend that
governments not do this, and instead either rely entirely on internally sourced
data or on vetted data from other sources that is cached for presentation and not
dynamically sourced. Given the relatively high profile of government sites, there is
the risk of malicious actors using dynamically sourced data as a way to spread
embarrassing misinformation, or worse, as a vector to spread malware via so-called “cross-site scripting” attacks using JavaScript.
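For agencies that do cache vetted external data, one further precaution is to escape every externally sourced string before it is placed in an HTML page. The minimal Python sketch below assumes a hypothetical render_description helper and relies on the standard library's html.escape; it is not drawn from the TWC Instance Hub code, and escaping addresses only script injection, not the accuracy of the data.

```python
import html


def render_description(entity_name: str, external_text: str) -> str:
    """Embed externally sourced text in an HTML fragment with markup neutralised."""
    # html.escape replaces &, < and > (and, by default, quote characters) with
    # HTML entities, so injected tags are shown as text instead of being executed.
    safe_name = html.escape(entity_name)
    safe_text = html.escape(external_text)
    return "<h1>%s</h1><p>%s</p>" % (safe_name, safe_text)
```

A description containing a script element would then appear on the page as literal text rather than running in the visitor's browser.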
The adoption of Semantic Web technologies by many government agencies
would be a large undertaking. For this reason, we recommend that open data
agencies with previous experience in the Semantic Web space (such as
Data.gov or Data.gov.uk) take the lead in exploring the possibilities for the use of
instance hub technologies in their respective governments.
Outcomes For Open Data Consumers
If governments use instance hubs to link entities within their hubs to datasets
referencing these entities, as the RPI TWC instance hub does, these instance
hubs can then serve as an entry point for dataset discovery.
After developers have chosen the datasets that they would like to work with,
instance hubs can play an important role in linking these datasets and making
them interoperable. In an ideal world, government published open datasets
would be easily available in machine readable RDF, and elements within these
datasets would be linked via authoritative instance hub URIs. In reality, many
open datasets are not even in simple, parseable open formats
like CSV, let alone RDF. [Ding 2011]
In the case that documents are not in RDF format, developers will need to first
convert these documents to RDF. Instance hubs can be used during this
process to craft richer RDF documents, as developers can integrate metadata
about referenced resources into these documents during conversion.
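As a rough illustration of such a conversion, the Python sketch below turns CSV rows keyed by a state abbreviation into N-Triples whose subjects are instance hub URIs. The column names, the example.org predicate, and the sample values are placeholders rather than real data; only the New York URI comes from the appendix. The sketch also assumes simple values that need no literal escaping.

```python
import csv
import io

# Lookup from a local identifier to an instance hub URI (URI from the appendix).
STATE_URIS = {"NY": "http://logd.tw.rpi.edu/id/us/state/New_York"}

VALUE_PREDICATE = "http://example.org/vocab/value"  # hypothetical predicate


def rows_to_ntriples(csv_text: str) -> list[str]:
    """Convert rows with 'state' and 'value' columns into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = STATE_URIS.get(row["state"])
        if subject is None:
            continue  # unmatched identifiers are skipped, not guessed
        triples.append('<%s> <%s> "%s" .' % (subject, VALUE_PREDICATE, row["value"]))
    return triples
```

Because every matched row now shares the same subject URI as any other dataset converted this way, the resulting triples can be merged without further identifier mapping.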
For data that is in RDF format, but not linked via instance hub URIs, developers
can use instance hubs to link this data to other data, based on commonalities in
entity reference schemes as negotiated by the instance hub.
After developers have linked their datasets together, instance hubs can further
help them by providing high-quality authoritative metadata about
entities referenced in these datasets. Instance hub metadata can include
government agency logos, official websites, short descriptions, or anything else
that governments think would be of use to developers. This metadata can then
be integrated into open data applications with ease, while allowing developers to
know that they are dealing with authoritative data.
Overall, we believe that instance hubs can provide great value to developers
interested in working with open data, and that they would lower the barriers to
entry for this sort of work by making data more easily interoperable.
Conclusion
The RPI instance hub project is an area of ongoing research, and will
continue to develop. Instance hubs provide a way of making open government
data (or any data) more readily usable in linked data applications.
What is needed now is adoption of instance hubs by government publishers of
open data. Instance hub technology and the URI schemes we have proposed
have not been developed in a vacuum; rather, they are the product of dialogue
with colleagues in government actively involved in open data publication. The
TWC Instance Hub is a realization of the ideas put forward in these discussions
and serves as an important demonstration of the viability of these ideas for use in
real world applications. TWC has proposed a scheme for the naming of
government entities and has created a demonstration of its implementation, but
its adoption remains within the domain of governments themselves. While TWC
will continue to be involved in the promotion of linked open government data and
the technologies such as instance hubs that may be used to promote it, backing
for this technology is now necessary from governments themselves in the form of
implementation in government information systems. As with the greater Web, the
Semantic Web needs the authoritative power of government agencies to help
ensure the quality and accountability of linked open data. We have provided the
first technological step towards that solution, but now need support from
government agencies to embrace and evolve it.
Acknowledgments
This work was supported by a generous gift to TWC RPI from Microsoft
Research. Jim Hendler serves as Open Data Advisor to the New York State
Government and to the US Data.gov project. The ideas expressed in this paper
are those of the authors and do not necessarily reflect the opinions of any of
these organizations.
References
[Archer 2012] Phil Archer, "D7.1.3 - Study on persistent URIs, with identification
of best practices and recommendations on the topic for the MSs and the EC."
Deliverable for ISA Action 1.1 on Semantic Interoperability (Dec 2012).
http://bit.ly/19Rj2M5
[Ding 2011] Ding, L., T. Lebo, J. S. Erickson, D. DiFranzo, A. Graves, G. T.
Williams, X. Li, J. Michaelis, J. Zheng, Z. Shangguan, et al., "TWC LOGD: A
Portal for Linked Open Government Data Ecosystems", Web Semantics:
Science, Services and Agents on the World Wide Web, vol. 9, no. 3: Elsevier,
2011. http://bit.ly/16tmY9q
[Ding 2012] Ding, L., V. Peristeras, and M. Hausenblas, "Linked Open
Government Data", IEEE Intelligent Systems, vol. 27, no. 3, Los Alamitos, CA,
USA, IEEE Computer Society, pp. 11-15, 2012. http://bit.ly/16YYb7s
[DOI] The Digital Object Identifier System. http://www.doi.org/
[Erickson 2011] Erickson, J. S., E. Rozell, Y. Shi, J. Zheng, L. Ding, and J. A.
Hendler, "TWC International Open Government Dataset Catalog", Proceedings of
the 7th International Conference on Semantic Systems - I-Semantics '11,
Graz, Austria: ACM Press, 2011.
http://bit.ly/1aD4GQq
[Graves 2012a] Alvaro Graves, "Creating Web Applications with LODSPeaKr."
(Feb 22, 2012) http://slidesha.re/16YUyOT
[Graves 2012b] Alvaro Graves, "Publishing Linked Data with LODSPeaKr." (Jul
07, 2012) http://slidesha.re/16YVsLo
[Juty 2012] Juty N., Le Novère N., Laibe C., "Identifiers.org and MIRIAM
Registry: community resources to provide persistent identification." Nucleic Acids
Research. 2012; 40 (Database issue): D580-D586
[Kirkpatrick 2009] Marshall Kirkpatrick, "Data.gov Now Live; Looks Nice But Short
on Data." ReadWriteWeb (created 21 May 2009; accessed 19 Mar 2014)
http://bit.ly/1nEmAwk
[Kundra 2010] Vivek Kundra, "Data.gov: Pretty Advanced for a One-Year-Old."
White House web site (created 21 May 2010; accessed 19 Mar 2014)
http://1.usa.gov/15BNGfC
[LinkedData 2006] Tim Berners-Lee, "Linked Data." W3C Design Issues (created
2006; latest revision 2009). http://www.w3.org/DesignIssues/LinkedData.html
[LODCloud 2011] Richard Cyganiak and Anja Jentzsch (eds.), The Linking Open
Data cloud diagram. (Edited Sep 2011). http://lod-cloud.net/
[OC 2011] The Sunlight Foundation, “Unique Corporate Identifiers.”
OpenCongress Wiki (accessed 30 Oct 2013).
http://www.opencongress.org/wiki/Unique_Corporate_Identifiers
[Orszag 2009] Peter R. Orszag, "Open Government Directive." M10-06,
MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND
AGENCIES (8 Dec 2009). http://1.usa.gov/1nEqANo
[Pereira 2012] Bianca Pereira, João C. P. da Silva, and Adriana S. Vivacqua,
"Discovering Names in Linked Data Datasets." In Proceedings of the Web of
Linked Entities Workshop in conjunction with the 11th International Semantic
Web Conference (ISWC 2012) Boston, USA (November 11, 2012).
[WoLI 2012] Web of Linked Entities Workshop 2012 http://ceur-ws.org/Vol-906/
[W3 2002] Klyne, Graham and Carroll, Jeremy (eds.), “Resource Description
Framework (RDF): Concepts and Abstract Data Model” (work in progress) (Aug
29, 2002), http://www.w3.org/TR/2002/WD-rdf-concepts-20020829
Appendix
Example of machine readable “Turtle” format representation of RDF data on the
State of New York:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ns0: <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/vocab/> .
@prefix ns1: <http://www.w3.org/2002/07/owl#> .
@prefix ns2: <http://purl.org/dc/terms/> .
@prefix ns3: <http://xmlns.com/foaf/0.1/> .
@prefix ns4: <http://open.vocab.org/terms/> .
@prefix ns5: <http://rdfs.org/ns/void#> .
@prefix ns6: <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/vocab/enhancement/1/> .
@prefix ns7: <http://www.w3.org/2004/02/skos/core#> .
<http://logd.tw.rpi.edu/id/us/state/New_York> rdf:type ns0:State ;
    ns1:sameAs <http://dbpedia.org/resource/New_York> ;
    ns2:isReferencedBy <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/version/2011-Oct09> ;
    ns2:identifier "36" ,
        "ny" ,
        "new york" ,
        "New York" ,
        "NEW YORK" ,
        "NY" ;
    ns2:title "New York" ;
    ns3:isPrimaryTopicOf <http://logd.tw.rpi.edu/id/us/state_page/New_York> ;
    ns4:csvRow "33"^^<http://www.w3.org/2001/XMLSchema#integer> ;
    ns5:inDataset <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/version/2011-Oct09> ;
    ns6:fips "36" .
<http://logd.tw.rpi.edu/id/us/state/New_York/Chautauqua> ns2:title "Chautauqua" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .
<http://logd.tw.rpi.edu/id/us/state/New_York/Chemung> ns2:title "Chemung" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .
<http://logd.tw.rpi.edu/id/us/state/New_York/Erie> ns2:title "Erie" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .
<http://logd.tw.rpi.edu/id/us/state/New_York/Herkimer> ns2:title "Herkimer" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .
[county listing truncated]