PUBLICACIÓN DE CUADROS DE MANDO PARA EVALUACIÓN
DE USO DE LAS BIBLIOTECAS DIGITALES UTILIZANDO
TECNOLOGÍAS DE DATOS ENLAZADOS
María Asunción Hallo Carrasco
Departamento de Lenguajes y Sistemas Informáticos
Thesis submitted to qualify for the degree of
DOCTOR BY THE UNIVERSIDAD DE ALICANTE
INTERNATIONAL DOCTORATE MENTION
Doctorado en Aplicaciones de la Informática
Supervised by:
Dr. Sergio Luján Mora
July 2016
DOCTORAL THESIS
AS A COMPENDIUM OF PUBLICATIONS
This document contains a summary of the work carried out by María Asunción Hallo Carrasco under the supervision of Dr. Sergio Luján Mora, to qualify for the degree of Doctor in Computer Science. It is submitted at the Universidad de Alicante and is structured according to the regulations for the presentation of doctoral theses as a compendium of publications: first, a summary of the work carried out is presented; second, the published scientific articles are included.
July 2016
Acknowledgements
This thesis could not have been completed without the support of several people, both technically and personally.
To Dr. Sergio Luján Mora, for his sound supervision and commitment throughout the development of the thesis.
To Dr. Alejandro Maté Morga, for his timely answers and support.
To Dr. Juan Carlos Trujillo, for his constant contributions, which made the completion of this thesis possible.
To Dr. Manuel Marco Such, for his contributions at the start of the research.
To Dr. Javier Dámaso, for his support on legal matters.
To the co-authors of the different articles, who made possible collaborative work and the joint generation of knowledge across the different projects.
To my children Lorena García and Iván García, and to my brother Francisco Hallo, for their constant support.
To the Escuela Politécnica Nacional of Quito, Ecuador, for its support of the research associated with this thesis.
My special thanks to the Universidad de Alicante for allowing me to pursue these studies, and for the support of the different administrative and academic units involved in the doctoral programme.
This work was partially supported by the GODAS-BI (TIN20137493-C03-03) and SEQUOIA-UA (TIN2015-63502-C3-3-R) projects funded by the Spanish Ministry of Economy and Competitiveness.
Table of Contents
Part I. Summary
1. Introduction
2. Objectives
3. Methodology
4. Description of the work carried out
4.1 Linked Data technologies and their applications
4.2 Linked Data and digital libraries
4.3 Version management of digital documents
4.4 Publishing bibliographic record metadata on the Semantic Web
4.5 Publishing digital library usage statistics on the Semantic Web
4.6 Publishing scorecards on the Semantic Web
Part II. Published Works
5. Compendium of publications comprising the thesis
6. Current State of Linked Data in Digital Libraries
7. Data model for storage and retrieval of legislative documents in Digital Libraries using Linked Data
8. An Approach to Publish Scientific Data of Open-Access Journals using Linked Data Technologies
9. Transforming Library Catalogs into Linked Data
10. An Approach to Publish Statistics from Open-Access Journals using Linked Data Technologies
11. Publishing a Scorecard for Evaluating the Use of Open-Access Journals using Linked Data Technologies
12. Evaluating Open Access Journals using Semantic Web Technologies and Scorecards
13. Conclusions and future work
References
List of Figures
Fig. 1: Linked Data publication process
Fig. 2: Architecture for publishing a scorecard in RDF
Fig. 3: Example query over the OJS (Open Journal Systems) visits cube
List of Acronyms
API: Application Programming Interface
BIBO: Bibliographic Ontology
BIBFRAME: Bibliographic Framework
BIO: Biographical Information
DC: Dublin Core
DOAJ: Directory of Open Access Journals
FOAF: Friend of a Friend
FRAD: Functional Requirements for Authority Data
FRBR: Functional Requirements for Bibliographic Records
GRDDL: Gleaning Resource Descriptions from Dialects of Languages
HTML: HyperText Markup Language
IFLA: International Federation of Library Associations and Institutions
ISBD: International Standard Bibliographic Description
OAI: Open Archives Initiative
OWL: Web Ontology Language
RDA: Resource Description and Access
RDF: Resource Description Framework
ORG: Organization Ontology
SPARQL: SPARQL Protocol and RDF Query Language
SKOS: Simple Knowledge Organization System
URI: Uniform Resource Identifier
VoID: Vocabulary of Interlinked Datasets
XML: eXtensible Markup Language
WGS84: World Geodetic System 1984
Part I
Abstract
This work contributes a framework for publishing scorecards for evaluating the use of digital libraries on the Semantic Web. At present, the indicators published in scorecards do not allow their reuse or their easy combination with other indicators to support better decision making; this work contributes to solving that problem. The thesis comprises a study of Linked Data technologies and their applications, their current uses in digital libraries, and the development of proposals for technical architectures and procedures for generating and publishing scorecards and bibliographic record metadata on the Semantic Web. In addition, special characteristics to be considered in data models for linking information were analysed, such as the versioning of legislative documents.
The results of this research were applied to the generation of scorecards from the metadata of one type of digital library, an open-access scientific journal, and from data provided by Google Analytics, adding new functionality without affecting the existing structures.
Keywords: Linked Data, Semantic Web, scorecards, digital libraries.
1. Introduction
This work is based on articles published between 2012 and 2015. The first part presents the objectives, methods and most relevant results of each publication. The second part presents the published articles.
The articles were published in the following journals and conferences:
Journals:
- Journal of Information Science, impact factor 1.158, ranked 58 of 139 in Computer Science, Information Systems. Source: 2014 Journal Citation Reports (Thomson Reuters), 2015. (2 articles).
The conferences at which publications were presented are:
- EDULEARN14 (6th International Conference on Education and New Learning Technologies). Barcelona, Spain, July 7-9, 2014.
- ICERI2014 (7th International Conference of Education, Research and Innovation). Seville, Spain, November 17-19, 2014.
- CITS2015 (2015 International Conference on Computer, Information and Telecommunication Systems). IEEE, Gijón, Spain, July 13-17, 2015.
- INTED2015 (9th International Technology, Education and Development Conference). Madrid, Spain, March 2-4, 2015.
- EDULEARN15 (7th annual International Conference on Education and New Learning Technologies). Barcelona, Spain, July 6-8, 2015.
The publications included in this thesis are, in order of elaboration:
1. Hallo, M., Luján-Mora, S. and Maté, A. (2016). Current State of Linked Data in Digital Libraries. Journal of Information Science, 42(2), 117-127. DOI: 10.1177/0165551515594729. URI: http://jis.sagepub.com/content/early/2015/07/18/0165551515594729.abstract.
2. Hallo, M., Luján-Mora, S. and Maté, A. (2015). Data model for storage and retrieval of legislative documents in Digital Libraries using Linked Data. Proceedings of the 7th annual International Conference on Education and New Learning Technologies (EDULEARN15). Barcelona, Spain, IATED, 7423-7430. URI: https://library.iated.org/view/HALLO2015DAT.
3. Hallo, M., Luján-Mora, S. and Chávez, C. (2014). An Approach to Publish Scientific Data of Open-access Journals using Linked Data Technologies. Proceedings of the 6th International Conference on Education and New Learning Technologies (EDULEARN14). Barcelona, Spain, IATED, 5940-5948. URI: https://library.iated.org/view/HALLO2014ANA.
4. Hallo, M., Luján-Mora, S. and Trujillo, J. (2014). Transforming Library Catalogs into Linked Data. Proceedings of the 7th International Conference of Education, Research and Innovation (ICERI2014). Seville, Spain, IATED, 1845-1853. URI: https://library.iated.org/view/HALLO2014TRA.
5. Hallo, M., Luján-Mora, S. and Trujillo, J. (2015). An Approach to Publish Statistics from Open-access Journals using Linked Data Technologies. Proceedings of the 9th International Technology, Education and Development Conference (INTED2015). Madrid, Spain, IATED, 5940-5948. URI: https://library.iated.org/view/HALLO2015ANA.
6. Hallo, M., Luján-Mora, S. and Maté, A. (2015). Publishing a Scorecard for Evaluating the Use of Open-Access Journals Using Linked Data Technologies. Proceedings of the 2015 International Conference on Computer, Information and Telecommunication Systems (CITS 2015). Gijón, Spain, IEEE, 105-109. URI: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7297730.
7. Hallo, M., Luján-Mora, S. and Maté, A. (2016). Evaluating Open Access Journals using Semantic Web Technologies and Scorecards. Journal of Information Science. Advance online publication. DOI: 10.1177/0165551515624353. URI: http://jis.sagepub.com/content/early/2016/01/13/0165551515624353.abstract.
Additionally, the following works related to the topics under study have been published, listed in chronological order of publication:
1. Hallo, M., De la Fuente, P. and Martínez-González, M. (2012). Las Tecnologías de Linked Data y sus aplicaciones en el gobierno electrónico. SCIRE, 18(1), 49-61.
2. Hallo, M., De la Fuente, P. and Martínez-González, M. (2013). Data models for version management of legislative documents. Journal of Information Science, 39(4), 557-572.
3. Hallo, M. (2014). Sistemas de Bases de Datos NoSQL. In: Tópicos avanzados de bases de datos. Proyecto Latin. URL: http://escritura.proyectolatin.org/topicos-avanzadosde-bd-version-autor-jpg/bases-de-datos-nosql/.
4. Hallo, M. and Mosquera, L. (2014). Data Mart para el Sistema de Servicios Sociales del Conadis: Consejo Nacional de Discapacidades. Revista Politécnica, 33(2), 1-9.
5. Hallo, M. and Luján-Mora, S. (2014). An architecture and process to publish scientific data using Linked Data technologies. Web Intelligence Summer School, August 25-29, 2014, Saint-Étienne, France.
6. Hallo, M., Luján-Mora, S., Maté, A. and Tubón, D. (2015). Evaluación del uso de la Revista Politécnica usando analíticas de Google y tecnologías de Linked Data. Memorias de las VI Jornadas de Ingeniería de Sistemas Informáticos y de Computación (JISIC 2015), Quito, Ecuador, 203-207. ISBN: 978-9978-383-34-6.
2. Objectives
The aim of this thesis is to improve the use of bibliographic resources, starting from the hypothesis:
The development and publication of a scorecard based on Linked Data for monitoring the use of bibliographic information will help to define better interrelated indicators and to enrich them with external information.
The specific objectives of the individual studies were:
1. To analyse Linked Data technologies and their applications (section 4.1).
2. To identify the current uses of Linked Data in digital libraries and their future potential (section 4.2).
3. To analyse the version management of documents in digital libraries that could benefit from Linked Data technologies (section 4.3).
4. To develop a framework for publishing bibliographic record metadata on the Semantic Web, and to apply it to metadata from scientific information and from digital libraries (section 4.4).
5. To develop a framework for publishing statistics and scorecards of digital libraries using Linked Data technologies (sections 4.5 and 4.6).
3. Methodology
The studies addressing objectives 1, 2 and 3 were carried out partially following the methodology of Kitchenham et al. (2009) [1] for the main stages of a systematic literature review, combined with guidelines for literature reviews based on concept mapping [2].
For objectives 4 and 5, a qualitative research method based on case studies [3] was used, which made it possible to understand a complex topic by adding new elements to those already known from previous research. The processes were developed iteratively and incrementally, based on good practices and the existing literature, and refined through their application to two case studies, achieving the generation and publication of RDF triples corresponding to metadata from a university digital library and from a university open-access scientific digital journal. In addition, processes were developed for building and publishing scorecards for the evaluation of digital libraries, with indicators interrelated and enriched with external information thanks to information linking using Linked Data technologies, which proved the hypothesis.
Partial results were presented at indexed conferences and in indexed journals.
4. Description of the work carried out
This work contributes a framework, defining processes and a technical architecture, for integrating into digital libraries functionality oriented to the storage and publication on the Semantic Web of scorecards for evaluating the use of digital libraries. The Semantic Web is an extension of the World Wide Web in which the meaning of information and services is defined, allowing information to be queried by both humans and machines [4].
Linked Data technologies are used on the Semantic Web to link data distributed across the Web, enabling the transition from the Web of documents to the Web of linked data [5]. The new link-exploitation functionality can be integrated into digital libraries without affecting the existing structures. Scorecards, tools for monitoring strategic objectives in a business [6], can be built from linked metadata to enrich analyses, and their data can be published on the Semantic Web to support more complex queries.
The study began with a review of the most widely used Linked Data technologies and their applications in e-government. In parallel, the use of Linked Data was analysed in a group of digital libraries that are important because of their worldwide use. Next, procedures for generating and publishing metadata on the Semantic Web were analysed and defined. At the same time, special characteristics of the legislative documents managed by digital libraries that could benefit from Linked Data, such as document version management, were analysed.
The study continued with the application of Linked Data to the evaluation of the use of digital libraries, which requires indicators that are interrelated and enriched with external information to better support the decision-making process.
The processes defined in this thesis cover the phases of generation, storage and publication on the Semantic Web of bibliographic record metadata and of scorecard indicators. Our proposal supports the basic functionality of digital libraries: searching for and retrieving documents, navigating over related documents, and evaluating digital libraries, mainly their use.
The study was based on the analysis and review of best practices found in the literature and on the incremental construction and application, through a case study, of guidelines for publishing scorecards for the evaluation of the use of digital libraries using Linked Data technologies.
This summary develops the following topics:
- Linked Data technologies and their applications.
- Linked Data and digital libraries.
- Publishing bibliographic record metadata on the Semantic Web.
- Publishing usage statistics of bibliographic records on the Semantic Web.
- Publishing scorecards on the Semantic Web.
4.1 Linked Data technologies and their applications
Motivated by the interest in developing a platform for integrating bibliographic resources on the Semantic Web in order to build scorecards with integrated indicators, a publication was produced: "Las Tecnologías de Linked Data y sus aplicaciones en el gobierno electrónico" [7]. It covers concepts and tools that can be useful for implementing Linked Data projects, both for publication and for consumption, as well as some applications in e-government and particularly in digital libraries, providing an overview of this new area and its potential use.
The term Linked Data refers to a set of best practices for publishing and linking structured data on the Web in such a way that it can be accessed by humans and computers [8]. It is based on URIs (Uniform Resource Identifiers) and on the RDF specifications.
In 2006, Tim Berners-Lee wrote a design note [9] proposing solutions to the problems that prevented the linking of data, through the application of the Linked Data principles, which are the following (a small sketch follows the list):
- Use URIs as names for resources.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, they should find useful information, using the RDF and SPARQL standards.
- Include links to other URIs so that people can discover more related resources.
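To make the four principles concrete, the following minimal sketch (in Python, using the rdflib library; the URIs are hypothetical examples, not identifiers used in this thesis) builds and serializes two linked RDF triples:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical HTTP URIs naming resources (principles 1 and 2).
EX = Namespace("http://example.org/library/")
book = EX["book/123"]

g = Graph()
# Useful information in RDF when the URI is looked up (principle 3).
g.add((book, RDF.type, EX.Book))
g.add((book, EX.title, Literal("Semantic Web Primer")))
# A link to another dataset, here DBpedia, so that consumers can
# discover more related resources (principle 4).
g.add((book, EX.subject, URIRef("http://dbpedia.org/resource/Semantic_Web")))

print(g.serialize(format="turtle"))
```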
There are reports of good development of new techniques with great potential for publishing and consuming Linked Data on the Semantic Web [10]. Within the W3C, groups specialized in several technologies, such as RDF, SPARQL and OWL, have been created to coordinate their development [11]. The adoption of Linked Data best practices has contributed to extending the global data space, connecting data from diverse domains [12].
Since 2009 there has been faster progress in the development of Linked Data technologies. Several W3C specifications, such as SPARQL, GRDDL, RDFa and VoID, have been released and used, the Linking Open Data project community has been formed, and growing use is observed in e-government, with numerous published data catalogues and applications with domain-oriented functionality that combine data from several Linked Data sources [13].
Several tools have been developed for publishing Linked Data, such as D2R Server, OpenLink Virtuoso, Talis, Pubby, Sesame, Jena and 4store, among others [7]. To facilitate the publication of Linked Data, data can be migrated to RDF from relational databases, spreadsheets, XML or existing websites, after being indexed and made available for querying via a SPARQL service. The output data can be obtained in formats such as XML or RDF. Each dataset should have its own metadata and should be catalogued to facilitate searching.
As an example, the DBpedia project uses PHP scripts to extract structured data from Wikipedia pages. These data are converted to RDF and stored in the OpenLink Virtuoso repository, which provides a SPARQL access point. A sketch of this kind of migration is shown below.
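As a hedged illustration of migrating relational data to RDF for publication through a SPARQL service (the table name, columns and URIs are hypothetical assumptions, not taken from DBpedia or from this thesis), the following Python sketch converts the rows of a relational table into RDF triples:

```python
import sqlite3
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/resource/")   # hypothetical URI space

conn = sqlite3.connect("catalog.db")             # hypothetical database
rows = conn.execute("SELECT id, title, author FROM books")

g = Graph()
for book_id, title, author in rows:
    subject = EX[f"book/{book_id}"]              # one URI per row
    g.add((subject, RDF.type, EX.Book))
    g.add((subject, DC.title, Literal(title)))
    g.add((subject, DC.creator, Literal(author)))

# The resulting RDF file can then be loaded into a store such as
# OpenLink Virtuoso and exposed through its SPARQL endpoint.
g.serialize(destination="books.rdf", format="xml")
```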
For data consumption, specialized browsers have been developed, such as Disco, Tabulator and Marbles. In addition, Linked Data search engines such as Sig.ma, Falcons and Watson have been developed [7].
For a large-scale implementation it is necessary to develop methodologies for generating and publishing Linked Data on the Web, algorithms for interconnecting data automatically, standards for recording information about data sources, mechanisms to ensure the privacy, security and preservation of the generated data, and tools to facilitate the search and display of Linked Data.
Another problem to consider is the use of static links, which continuously disappear; this means that even RDFa is not yet a solution for the simultaneous publication of HTML and RDF pages, owing to the difficulty of tracking lost links.
Experiences with Linked Data, such as that of the United Kingdom, have proved very useful for finding information related to a given case, to a type of domain, or to historical information. However, simple publication and consumption patterns for large-scale application at the government level do not yet exist, and alternative solutions that ease the implementation of this technology continue to be developed, such as APIs (Application Programming Interfaces), routines, protocols and tools that facilitate the consumption of Linked Data.
It has been reported that, in public administrations in general, applications are developed to cover the services of hierarchical administrative units, while citizens have horizontal needs for access to services from multiple agencies [14]. Linked Data technologies offer a way to satisfy the horizontal-integration and semantic-search requirements of e-government applications, which some countries are already starting to use.
The strategy of releasing data as soon as possible while, in parallel, evolving towards a model closer to the Semantic Web is the one chosen by administrations such as the United States, the United Kingdom and Spain. The fact that semantic layers can be built on top of the current Web facilitates work in this direction.
Furthermore, the study carried out in that article shows that digital libraries host Linked Data applications, with significant efforts to link library catalogues, allowing access to the contents of multiple libraries with information linked to other archives or knowledge bases. Numerous datasets published as Linked Data are also reported. This motivated the second study, on Linked Data and digital libraries, aimed at understanding the scope of this type of application and at proposing future work.
4.2 Linked Data and digital libraries
The article "Current State of Linked Data in Digital Libraries" [15] presents the results of a study on the benefits, problems and potential of the use of Linked Data in digital libraries, analysing the most important implementations reported by the W3C Library Linked Data Incubator Group [5].
Digital libraries provide online access to artifacts that contain digital knowledge. Libraries have traditionally worked on publishing data; however, solving more complex queries requires connecting their resources to external data sources. Moreover, knowledge needs to be shared on the Web, which implies improving the way current libraries are managed, including new artifacts such as ontologies, semantic content descriptions, data linking, and new forms of collaboration using social networks, specialized communities, wikis, collaborative games and mashups (applications that use and combine data, presentation and functionality from one or more sources).
On top of the metadata, applications can be developed that collect metadata and build services over them, such as adding maps to locate resources, putting together resources on similar topics, making recommendations, allowing semantic annotations by users, and so on.
Data integration and the building of new services can be favoured by Linked Data technologies, which use unique identifiers (URIs) for resources (places, people, events, etc.) and RDF to express the relations between resources. The use of identifiers allows different descriptions of the same resource, establishing relations with data obtained from sources external to the libraries. The libraries analysed are: the National Library of France, the National Library of Spain, the British Library, the Europeana library of the European Union, and the Library of Congress of the United States.
National libraries around the world are using architectures to obtain and publish Linked Data. The vocabularies and ontologies used vary in each implementation, but it is possible to establish equivalences and publish the relations while respecting the source data models. Table I presents a comparison of the Linked Data-related features of the selected libraries.
Benefits of publishing and using these digital library data are reported, such as improved data visibility and improved resource annotation processes, but there are not yet enough reports quantifying them. There are problems to be solved in order to use Linked Data technologies in digital libraries, such as the need for supporting tools, for data quality control mechanisms, and for technical staff with knowledge of these new technologies.
In the future, a larger number of dataset providers is expected, together with citizen participation in enriching the datasets through data annotation processes, and the development of more applications that consume Linked Data. It is also expected that library online catalogues can be enriched with recommendations and search rankings based on the popularity of an item and on user activity.
Table I. Comparison of Linked Data-related features of the selected libraries

National Library of France
- Benefits: Increased data visibility. Better queries over open data.
- Vocabularies and ontologies: FRBR, SKOS, InterMarc, XML-EAD, Dublin Core, RDF.
- Problems: Difficulty in cataloguing data. Absence of agreements for providing data.
- Future: Community participation in cataloguing and quality control. Greater community involvement with new datasets, specifying links and improving data quality. Publication of topics in SKOS.

Europeana
- Benefits: Enables interoperability without affecting the models of the data sources. Enables queries over linked data from multiple European institutions.
- Vocabularies and ontologies: RDF, OAI-ORE, SKOS, FOAF, Dublin Core.
- Problems: Difficulty in migrating data to the EDM model.
- Future: Involve users in checking the quality of the information sources.

Library of Congress (USA)
- Benefits: Interconnection to other data sources. List of topics in SKOS. Data linked to other datasets.
- Vocabularies and ontologies: FRBR, FRAD, FRSAD, IFLA, ISBD, RDF, DC, RDA, MADS/RDF, Dublin Core.
- Problems: Mapping problems: not all the basic FRBR relations can be extracted from MARC records. Browsers over BIBFRAME are required. Migration tools from Marc21 to BIBFRAME need to be improved.
- Future: Data with annotations and recommendations.

British Library
- Benefits: Models people, places and events related to a book, reusing existing schemas.
- Vocabularies and ontologies: BIBO, BIO, Dublin Core, ISBD, ORG, SKOS, RDF Schema, OWL, FOAF, WGS84, RDA.
- Problems: Need to develop tools to transform metadata into Linked Data. Lack of experts in the different areas for the transformations.
- Future: Identify information-linking needs.

National Library of Spain
- Benefits: Better visibility of resources related to the bibliographic resources, such as people, places, events and topics. Better data visibility by integrating the data into initiatives such as Schema.org.
- Vocabularies and ontologies: BIBO, BIO, Dublin Core, ISBD, ORG, SKOS, RDF Schema, OWL, FOAF, events ontology, WGS84.
- Problems: Shortage of applications that consume Linked Data.
- Future: Obtain feedback on the use of the Linked Data. Development of visualization tools. Refinement of mappings. Obtain new data sources. Obtain new links.
4.3 Version management of digital documents
In the management of digital libraries there are special document-handling characteristics, such as versioning, that could be improved with Linked Data and controlled with scorecards based on the corresponding metadata. One particular class of documents, legislative documents, requires special change-control treatment. The publication "Data models for version management of legislative documents" [16] analyses the main data models used in managing changes to legislative documents in digital libraries.
One of the main requirements when dealing with legislative documents is access to, and querying of, consolidated historical versions of documents and of their fragments. In this context, one may need to query the versions of legislative documents valid at a given time, in order to analyse their applicability to a particular case. Among the most frequently required queries are:
1) legislative documents valid on a given date;
2) historical evolution of legislative documents, with access to all versions;
3) historical evolution of legislative fragments;
4) laws modified by a law;
5) laws modifying a law;
6) validity interval of a legislative document;
7) validity interval of a legislative fragment;
8) effectiveness interval of a legislative document;
9) effectiveness interval of a legislative fragment.
The publication analyses how to apply the change history to a version of a legislative document in order to obtain a new version. It is also important to allow access to all versions of a document, in order to choose the correct version to be applied in a given case. Legal systems use several data models, but with different results in terms of storage cost and version maintenance [17]. Recently, some systems integrating Linked Data and Semantic Web technologies have been reported.
The data models analysed were classified according to criteria oriented to meeting the requirements of the users of this type of legislative document, and representative projects of each type of model were analysed. One group of models uses Linked Data technologies to manage the versions of documents and their fragments, facilitating the dynamic generation of the legislative documents valid at a given moment.
The comparison parameters were: types of data models, database systems, information entities used, time metadata, and query operations on documents and fragments.
There are metadata schemes specific to legislative documents, such as CEN MetaLex, whose representation in RDF facilitates the functions required for this special type of digital document.
The models analysed show a considerable degree of complexity that prevents their use in new projects. Based on this need, a proposal is developed for a simpler data model for managing the metadata of legislative documents and their fragments using Linked Data, without interfering with the current structures of the documents. This structure also allows queries on historical indicators for evaluating the use of legislative documents, and their subsequent storage on the Semantic Web.
The article "Data Model for Storage and Retrieval of Legislative Documents in Digital Libraries using Linked Data" [18] presents a proposal for a data model for storing and retrieving different versions of legislative documents and their fragments using Linked Data technologies. It also presents a method for publishing and managing the relations between legislative documents and their changes. Document structures, changes, metadata and historical query requirements are analysed.
The proposed model uses a tree model to store the legislative documents, stored in XML, and a graph model, represented in RDF, to store the metadata of the legislative documents and the links between modifying and modified documents. Changes are generally produced by fragments of new legislation referring to fragments of old legislation. The database system used for the tests was the one contained in the Virtuoso software, which allows data storage in the relational, XML and RDF models. Dynamic version generation is done by applying to the original documents the changes valid up to a given date. A sketch of this kind of metadata graph follows.
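As an illustration only, the following sketch (Python with rdflib; the property names and URIs are hypothetical, not the vocabulary used in the article) shows how such a metadata graph could link a modifying fragment to a modified one, together with validity dates:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

LEX = Namespace("http://example.org/legislation/")  # hypothetical URI space

g = Graph()
old_fragment = LEX["law/45/article/3"]
new_fragment = LEX["law/97/article/12"]

# Hypothetical properties linking modifying and modified fragments.
g.add((new_fragment, LEX.modifies, old_fragment))
g.add((old_fragment, LEX.validFrom, Literal("1998-01-01", datatype=XSD.date)))
g.add((old_fragment, LEX.validUntil, Literal("2014-06-30", datatype=XSD.date)))

# A consolidated version valid on a given date can then be generated
# dynamically by selecting the changes whose validity covers that date.
print(g.serialize(format="turtle"))
```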
The proposed model facilitates the historical querying of legislative documents and the consolidation procedures, allowing updates between documents and fragments to be carried out without changes to the original documents.
The model was tested with Ecuadorian legislative documents and could be applied to the legislative systems of other countries, since it is independent of the structure of current legislative systems. Likewise, the model could be used to manage changes to any type of document.
This model would also make it possible to define indicators for the historical evaluation of user access to the historical documents and their fragments, and to relate them to external indicators, such as indicators for the evaluation of processes, of the topics developed in the laws, or of specialists per topic, among others. In addition, the indicators can be automatically enriched with data obtained from the Web of data.
The previous studies showed the need to develop a reference model for the generation and publication of linked metadata of bibliographic resources, which is developed in the following two articles: one oriented to the publication of metadata of scientific publications that use the Open Journal Systems software, and the other based on metadata of digital libraries that use the Koha library management system.
4.4 Publishing bibliographic record metadata on the Semantic Web
The article "An Approach to Publish Scientific Data of Open-access Journals using Linked Data Technologies" [19] reports a process for publishing metadata of scientific articles on the Web using Linked Data technologies. In addition, methodological guidelines with related activities are presented. The proposed process was applied by extracting Dublin Core metadata from a digital journal published using the Open Journal Systems software. The extracted data were used to generate triples in RDF format, which were published using the Virtuoso software. In order to refine the process, it was then applied to the generation and publication of metadata in Marc21 format from a university digital library; that process is described in the next article, and a sketch of the metadata-extraction step is shown below.
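Open Journal Systems exposes article metadata through an OAI-PMH interface; the following minimal sketch (Python standard library only; the endpoint URL is a hypothetical placeholder, not the journal studied in the thesis) harvests Dublin Core records, the raw material for the RDF triples:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH endpoint of an OJS journal.
URL = ("http://journal.example.org/index.php/rp/oai"
       "?verb=ListRecords&metadataPrefix=oai_dc")

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

with urllib.request.urlopen(URL) as response:
    tree = ET.parse(response)

# Each record carries Dublin Core fields that can be mapped to RDF triples.
for record in tree.iter(OAI + "record"):
    for field in ("title", "creator", "date"):
        for elem in record.iter(DC + field):
            print(field, ":", elem.text)
```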
The article "Transforming Library Catalogs into Linked Data" [20] presents a process for publishing bibliographic resource metadata using Linked Data technologies. The process was applied to extract bibliographic resource metadata in Marc21 format from a university library, represent it in RDF, and publish it at a SPARQL endpoint (a service that allows users to query knowledge bases using SPARQL). A sketch of the Marc21-to-RDF step follows.
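As a hedged illustration of the Marc21-to-RDF transformation (the general technique, not the exact mapping used in the article), the following Python sketch uses the pymarc and rdflib libraries; the file name and URI space are hypothetical:

```python
from pymarc import MARCReader
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/catalog/")   # hypothetical URI space
g = Graph()

with open("catalog.mrc", "rb") as fh:           # hypothetical Marc21 dump
    for i, record in enumerate(MARCReader(fh)):
        subject = EX[f"record/{i}"]
        g.add((subject, RDF.type, EX.BibliographicResource))
        title = record["245"]                   # Marc21 title statement
        if title is not None and title["a"]:
            g.add((subject, DC.title, Literal(title["a"])))
        author = record["100"]                  # Marc21 main entry (personal name)
        if author is not None and author["a"]:
            g.add((subject, DC.creator, Literal(author["a"])))

# The triples can then be loaded into a store exposing a SPARQL endpoint.
g.serialize(destination="catalog.ttl", format="turtle")
```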
The proposed process was based on best practices and recommendations from several studies, adding tasks and activities considered important during the development of the project. The subject-related metadata were linked to external sources, such as other libraries, and related to the bibliographies of course syllabi in order to discover missing bibliographic resources or outdated bibliography.
Nowadays, libraries are giving importance to the Semantic Web in a variety of ways, such as creating metadata models and publishing Linked Data from authority files, bibliographic catalogues, digital project information, and information extracted from other projects such as Wikipedia; the process developed can therefore serve as a guide for generating RDF triples and publishing them.
Furthermore, it is necessary to understand and interrelate statistics on the use of digital libraries and to publish them on the Semantic Web, which motivated the study detailed in the next section.
4.5 Publishing digital library usage statistics on the Semantic Web
The next article presented in this thesis is "An Approach to Publish Statistics from Open-access Journals using Linked Data Technologies" [21]. Open-access scientific journals collect, publish and preserve scientific information in digital format using a peer-review process. The evaluation of the use of this kind of publication is expressed mostly as statistics, which need to be linked to external resources to provide better information for decision making. The statistics were modelled in a data mart to facilitate queries about accesses by different criteria. These data, linked to other datasets, provide more information, such as the number of accesses by author, by origin of access, or by research topic and its relation to the National Development Plan, among others.
The proposed process was applied by extracting statistical data from a university open-access journal and publishing them at a SPARQL endpoint. The RDF Data Cube vocabulary was used to model the multidimensional data; a sketch follows below. The visualization was carried out using the CubeViz software, which allowed filtering the information to be presented interactively in charts. However, statistics unrelated to institutional objectives are not very useful for an adequate evaluation, which is why the next article proposes the structuring and publication of scorecard data using Linked Data techniques.
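For illustration, the following hedged sketch (Python with rdflib; the dataset, dimension and measure URIs are hypothetical, not those of the published cube) encodes one usage observation with the RDF Data Cube vocabulary:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")   # RDF Data Cube vocabulary
EX = Namespace("http://example.org/stats/")           # hypothetical URI space

g = Graph()
obs = EX["obs/2014-10-journal1"]

g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/visits"]))
# Hypothetical dimensions: reference period and journal.
g.add((obs, EX.refPeriod, Literal("2014-10", datatype=XSD.gYearMonth)))
g.add((obs, EX.journal, EX["journal/revista-politecnica"]))
# Hypothetical measure: number of visits in the period.
g.add((obs, EX.visits, Literal(1532, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```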
4.6 Publishing scorecards on the Semantic Web
The articles "Publishing a Scorecard for Evaluating the Use of Open-Access Journals Using Linked Data Technologies" [22] and "Evaluating Open Access Journals using Semantic Web Technologies and Scorecards" [23] address aspects of the publication of scorecards of open-access journals on the Semantic Web, at different levels of detail. The second contains a broader description of concepts, processes and tests.
Suber [24] presents a classification of the different types of open access, based on the rights that authors retain to disseminate their work:
- Gold Open Access is adopted by peer-reviewed journals, giving access to research articles immediately upon publication after submission.
- Green Open Access allows authors to self-archive in repositories with the consent of the journal or the publishers. These repositories are discipline-specific or institutional.
- Pale Green Open Access allows authors to archive versions before printing (preprints).
- Gray Open Access allows authors to make their work accessible on institutional or personal websites.
Open-access journals publish scientific information in digital format, but it is difficult, not only for users but also for digital libraries, to evaluate their use. To this end, we propose introducing key performance indicators in scorecards, which make it possible to measure the performance of the journals against the objectives of the organizations.
A scorecard is a tool for monitoring strategic objectives in an organization. The balanced scorecard is one of the most widely used for helping organizations align with their strategic objectives.
In the case of digital libraries, few report indicators related to the institutional mission and objectives. A random sample of scientific journals taken from the DOAJ (Directory of Open Access Journals) directory was analysed; few of them publish institutional objectives tied to the mission, vision and strategic objectives. Furthermore, there are few reports of performance indicators linked to external indicators.
One alternative for providing better evaluations is to incrementally develop objective-driven scorecards, presenting them in multidimensional models with indicators interrelated thanks to information linking.
Moreover, Linked Data technologies make it possible to add information to the performance indicators by linking them to other related datasets published on the Web of data. Figure 1 presents the process of publishing scorecard data in RDF.
The scorecard developed [19] is oriented to:
- Monitoring the historical trends of use of the digital library.
- Performing self-diagnosis.
- Using the results to publicize the digital journal.
The indicators of the proposed scorecard are a subset of those presented in the ISO 2789:2013 and ISO 11620:2014 standards for evaluating the use of electronic documents, and were chosen based on interviews with librarians and local authorities and on assessments of the available data.
Figure 1: Linked Data publication process
Figure 2 presents the proposed technical architecture developed for generating and publishing digital library scorecards on the Semantic Web. The metadata are extracted from the database using Spoon, a free-software ETL tool. The data are cleaned using the OpenRefine program, separating the required fields. The cube is generated in RDF from CSV files using the OntoWiki program, and is later visualized using the CubeViz program.
Figure 2: Architecture for publishing a scorecard in RDF
Figure 3 presents an example of a query over the visits cube, stored in RDF, of the digital journal taken as a case study; a sketch of this kind of query follows the figure.
Figure 3: Example query over the OJS (Open Journal Systems) visits cube
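As an illustration of the kind of query Figure 3 shows, the following hedged sketch (Python with the SPARQLWrapper library; the endpoint URL and cube URIs are hypothetical, matching the sketch in section 4.5 rather than the actual deployment) totals the visits per journal recorded in the cube:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposing the visits cube.
sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setQuery("""
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/stats/>

    SELECT ?journal (SUM(?visits) AS ?totalVisits)
    WHERE {
        ?obs a qb:Observation ;
             qb:dataSet <http://example.org/stats/dataset/visits> ;
             ex:journal ?journal ;
             ex:visits ?visits .
    }
    GROUP BY ?journal
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["journal"]["value"], row["totalVisits"]["value"])
```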
From the work carried out, it can be concluded that the proposed architecture and procedures are adequate for the generation and publication of scorecards on the Semantic Web.
Part II
Published Works
5. Compendium of publications comprising the thesis
This section includes the versions of the published articles authorized by the publishers for dissemination in the institutional repository. In order of elaboration, they are:
1. Hallo, M., Luján-Mora, S. and Maté, A. (2016). Current State of Linked Data in Digital Libraries. Journal of Information Science, 42(2), 117-127. DOI: 10.1177/0165551515594729. URI: http://jis.sagepub.com/content/early/2015/07/18/0165551515594729.abstract.
2. Hallo, M., Luján-Mora, S. and Maté, A. (2015). Data model for storage and retrieval of legislative documents in Digital Libraries using Linked Data. Proceedings of the 7th annual International Conference on Education and New Learning Technologies (EDULEARN15). Barcelona, Spain, IATED, 7423-7430. URI: https://library.iated.org/view/HALLO2015DAT.
3. Hallo, M., Luján-Mora, S. and Chávez, C. (2014). An Approach to Publish Scientific Data of Open-access Journals using Linked Data Technologies. Proceedings of the 6th International Conference on Education and New Learning Technologies (EDULEARN14). Barcelona, Spain, IATED, 5940-5948. URI: https://library.iated.org/view/HALLO2014ANA.
4. Hallo, M., Luján-Mora, S. and Trujillo, J. (2014). Transforming Library Catalogs into Linked Data. Proceedings of the 7th International Conference of Education, Research and Innovation (ICERI2014). Seville, Spain, IATED, 1845-1853. URI: https://library.iated.org/view/HALLO2014TRA.
5. Hallo, M., Luján-Mora, S. and Trujillo, J. (2015). An Approach to Publish Statistics from Open-access Journals using Linked Data Technologies. Proceedings of the 9th International Technology, Education and Development Conference (INTED2015). Madrid, Spain, IATED, 5940-5948. URI: https://library.iated.org/view/HALLO2015ANA.
6. Hallo, M., Luján-Mora, S. and Maté, A. (2015). Publishing a Scorecard for Evaluating the Use of Open-Access Journals Using Linked Data Technologies. Proceedings of the 2015 International Conference on Computer, Information and Telecommunication Systems (CITS 2015). Gijón, Spain, IEEE, 105-109. URI: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7297730.
7. Hallo, M., Luján-Mora, S. and Maté, A. (2016). Evaluating Open Access Journals using Semantic Web Technologies and Scorecards. Journal of Information Science. DOI: 10.1177/0165551515624353. URI: http://jis.sagepub.com/content/early/2016/01/13/0165551515624353.abstract.
6. Current State of Linked Data in Digital Libraries
The content of this chapter corresponds to the following article:
Hallo, M., Luján-Mora, S. and Maté, A. (2016). Current State of Linked Data in Digital Libraries. Journal of Information Science, 42(2), 117-127. DOI: 10.1177/0165551515594729. URI: http://jis.sagepub.com/content/early/2015/07/18/0165551515594729.abstract.
Reproduced by permission of SAGE Publications Ltd., London, Los Angeles, New Delhi, Singapore and Washington DC. 2016.
Journal of Information Science, 2014 impact factor: 1.158 (JCR). Ranked 58/139 in Computer Science, Information Systems. Source: 2014 Journal Citation Reports (Thomson Reuters).
In order to learn about the advances in the use of Linked Data technologies in digital libraries, a study was carried out on the benefits, problems and potential of the use of Linked Data in digital libraries, analysing the most important implementations suggested in the report of the W3C Library Linked Data Incubator Group.
The results are presented in the article below.
Current State of Linked Data in Digital Libraries
Hallo Maria
National Polytechnic School (ECUADOR)
Luján-Mora Sergio
University of Alicante (SPAIN)
Maté Alejandro
University of Alicante (SPAIN)
Trujillo Juan
University of Alicante (SPAIN)
Abstract
The Semantic Web encourages institutions, including libraries, to collect, link and share their data across the Web in
order to ease its processing by machines to get better queries and results. Linked Data technologies enable us to connect
related data on the Web using the principles outlined by Tim Berners-Lee in 2006.
Digital libraries have great potential to exchange and disseminate data linked to external resources using Linked Data.
In this paper, a study about the current uses of Linked Data in digital libraries, including the most important
implementations around the world, is presented. The study focuses on selected vocabularies and ontologies, benefits and
problems encountered in implementing Linked Data on digital libraries. Besides, it also identifies and discusses specific
challenges that digital libraries face offering suggestions for ways in which libraries can contribute to the Semantic Web.
The study uses an adapted methodology for literature review, to find data available to answer research questions. It
is based on the information found in the library websites recommended by W3C Library Linked Data Incubator Group
in 2011, and scientific publications from Google Scholar, Scopus, ACM, and Springer from the last 5 years. The selected
libraries for the study are National Library of France, Europeana Library, Library of Congress of USA, British Library, and
National Library of Spain. In this paper, we outline the best practices found in each experience and identify gaps and future
trends.
Keywords:
Linked Data; Digital Libraries; Semantic Web.
1. Introduction
Digital libraries allow online access to artifacts that contain digital knowledge. Libraries have traditionally
worked on publishing data; however, in order to solve more complex queries it is necessary to connect to
external data sources. Moreover, it is necessary to share knowledge on the Web, which involves improving the
method of administration of current libraries, including new artifacts such as ontologies, semantic content
descriptions, data links and new forms of collaboration using: social networks, specialized communities, wikis,
collaborative games and mashups (applications that use and combine data, presentation and functionality from
one or more sources). Furthermore, applications can be developed on metadata to build services on top of them
such as adding maps to locate resources, putting together resources with similar issues, making
recommendations, allowing users to create semantic annotations, etc.
Linked Data is a publication technique using standard web technologies to connect related data and make
them available on the Web by following principles recommended by Tim Berners-Lee [1].
Data integration and the building of new services may be favored with the support of Linked Data
technology using unique identifiers, such as URI (Uniform Resource Identifier), for resources (places, people,
events, etc.) and RDF (Resource Description Framework) to express relations between resources. The use of
URIs allows for different descriptions of the same resource building relationships from external data sources
to libraries. Moreover, it is possible to add data in a collaborative way by providing information about a
resource that would be integrated into the overall graph.
In 2010, Tim Berners-Lee proposed a 5-star model to encourage people to publish in a Linked Open Data
environment. In Table I, a summary of the 5-star model is presented.
Table 1. 5-star Linked Open Data Model.
1-star (*): Available on the Web, but with an open license to be Open Data.
2-star (**): Available as machine-readable structured data.
3-star (***): All the above, plus in a non-proprietary format.
4-star (****): All the above, plus use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
5-star (*****): All the above, plus link your data to other people's data to provide context.
In recent years, digital libraries have given more importance to Linked Data in several ways: creating
metadata models such as Europeana Data Model [2] or the bibliographic framework of the Library of Congress
[3] and publishing Linked Data on library catalogs or integrating authority files of national libraries of several
continents using services like the Virtual International Authority File (VIAF). Other Linked Data has been
obtained by extracting information from Wikipedia on DBpedia, or from projects such as LinkedGeoData,
which exports data generated by users from those contained in OpenStreetMap. Some software systems such as
Evergreen, Hydra and Omeka have started producing Linked Data automatically. Several global digital
libraries such as the National Library of France, the Library of Congress of the USA, the British Library, the National
Library of Spain, among others, are implementing this new technology. In the future, it is expected to be able
to obtain relationships from the community to improve data quality, annotations, recommendations, links, etc.
[4].
However, libraries are facing several problems to implement Linked Data services. In 2010, Hannemann
and Kett [5] classified these problems as technical (e.g. poorly documented development tools), conceptual
(little experience with Linked Data models for libraries, variable schemes, e.g. Marc21 vs RDF) and legal
(publishing rights and licensing of Linked Data). Another problem they also exposed was the lack of experience
in defining Linked Data services.
The purpose of this study is to review Linked Data applications in digital libraries (LDDL), analyzing papers
and other publications across the last five years related to the world's major national digital libraries.
2. Methodology
This study was carried out based partially on the guidelines proposed by Kitchenham et al. (2009) for the main
stages of a Systematic Literature Review (SLR) [6], combined with Mapping Review (MP) Guides [7],
extracting information from a group of selected publications and national libraries websites that expose Linked
Data. The use of SLR provides a systematic approach to answering a research question by finding relevant
research outcomes from primary empirical studies. MP is an open form of SLR, intended to map out the
research that has been undertaken related to a topic. Such a study tries to identify gaps in the set of primary
studies, where new primary studies are required.
The stages followed in our research were:
- definition of research questions,
- search process,
- data collection,
- data synthesis, and
- discussion.
The definition of research questions outlines the problems to be addressed by the review. In the MP, the
research questions address a wider scope of the study.
The search process includes the identification of primary studies that may contain relevant research results.
This stage could be a targeted manual search of selected journals and conferences or an automated search.
According to Kitchenham [6] automated searches find more studies than manual restricted searches but they
may be of poor quality. Targeted manual searches intend to omit low-quality papers. In both cases, the selected
papers are examined for quality.
The data collection stage focuses on the design of data extraction procedures to record the information needed to address the review questions, such as the source of data, the topic area, and the authors and institutions working on the related research topics.
Data synthesis requires summarising the results of the included primary studies: data are tabulated in relation
to the review questions showing the features and findings of the studies. Data from the studies could be
presented narratively and/or statistically.
The discussion stage includes the analysis of results, the main findings related to the research questions, their meanings, the magnitude of the effects and the applicability of the findings.
3. Results
The methodology proposed in section 2 was applied to find the current state of the use of LDDL.
3.1. Definition of research questions
In the first stage, we decided to look for the current state of development of LDDL, taking the most important applications in the world. The research questions are:
(1) What are the benefits of LDDL?
(2) What are the vocabularies and ontologies used in LDDL?
(3) What are the problems reported for LDDL?
(4) What are the future trends in research on LDDL?
3.2. Search Process
Based on the study by the W3C (2011), five national digital libraries presenting advances in the use of Linked Data were chosen. Publications indexed in Google Scholar, Scopus, ACM and Springer from the last five years were studied. Besides, information from the websites of the libraries was also reviewed. The queries in the search process combined the terms "linked data" with the name of each studied library. The few publications found for each specific library are cited in the corresponding sections.
3.3. Data Collection
Data extracted from each study were focused on the research questions: benefits of using LDDL, vocabularies
used, current issues and future trends.
3.4. Data Synthesis
A summary of the studies started by the W3C Library Linked Data Incubator Group is presented in section
3.4.1 and a table of features found for each digital library is developed in Table II. In the following, we
present narratively the main features of each project.
3.4.1. W3C Library Linked Data Incubator Group
The mission of the W3C Library Linked Data Incubator Group was stated in 2010: “to help increase global
interoperability of library data on the Web, by bringing together people involved in Semantic Web activities—
focusing on Linked Data—in the library community and beyond, building on existing initiatives, and
identifying collaboration tracks for the future” [4]. In its final 2011 report, the W3C group summarized the benefits of Linked Data, the current situation of libraries and datasets published in the Semantic Web, the problems of copyright, and recommendations for using Semantic Web standards and Linked Data principles in order to ensure more visibility and reuse of resources such as bibliographic data and concept schemes.
According to the final report of the W3C Library Linked Data Incubator Group, “few libraries have
published their data as Linked Data with great variation in support and quality”. However, there has been
interest in Linked Data projects in national libraries of Sweden, Hungary, Germany, France, United States of
America, United Kingdom and international organizations such as the Food and Agriculture Organization of
the United Nations, among others. The recommendations cited in the study include defining an organization that assigns URIs to local resources to avoid duplication, and using Linked Data design guidelines. It is also recommended to include activities designed to preserve Linked Data, which mostly reside in the cloud.
The final report of the group also suggests that the success of Linked Data in any domain depends on the
ability to identify, reuse or connect data and data models available. Moreover, the same report highlights the
complexity of the vocabularies available and the lack of similarity between the vocabulary and Linked Datasets
used in research communities in the Semantic Web. In addition, the W3C study [4] presents an inventory of
vocabularies and ontologies mainly for classification systems, authority files, themes, thesauri and other
controlled vocabularies. Some examples of vocabularies and ontologies published in RDF widely used in
digital libraries are: SKOS (Simple Knowledge Organization System), Marc21 (Machine Readable
Cataloging), FRBR (Functional Requirements for Bibliographic Records), FOAF (Friend of a friend), VoiD
(Vocabulary of InterLinked Datasets), CITO (Citation Type Ontology), BIBO (Bibliographic Ontology),
Dublin Core, and NeoGeo to describe geographic data. Finally, a large number of national libraries are using Linked Data. In the following, we present the experiences of a group of them, chosen for the importance of their LDDL and the size of their datasets.
3.4.2. National Library of France (NLF)
The National Library of France (data.bnf.fr) brings together information from different catalogs using Semantic
Web technologies created from data in various formats such as INTERMARC Bibliographic format for the
main catalog, XML-EAD (Encoded Archival Description) for inventory files and Dublin Core for the digital
library [8]. These data are automatically retrieved, shaped, enriched and published in RDF. The NLF provides
URIs for resources built based on ARK (Archival Resource Key) identifiers and designed to access information
objects. Among the motivations for publishing Linked Data is helping to catalog a growing number of resources without increasing the staff. Moreover, if metadata are published as Linked Data, they can be disseminated more widely and enriched from external sources, also increasing their reuse [9]. Furthermore, data from relational systems, not accessed by search engines on the Web, could become more visible.
Most of the Linked Data in the National Library of France are based on the Functional Requirements for
Bibliographic Records (FRBR) model. The website generates web pages providing standardized information,
references and links on authors, works and themes. The web pages provide filtering capabilities by type, export
links to other online services and open data recovery. In addition, the subject vocabulary used is RAMEAU, which was transformed to a version of the Simple Knowledge Organization System (SKOS) vocabulary, exposing about 160,000 concepts as Linked Data [10].
3.4.3. Europeana
Europeana (data.europeana.eu) is a free-access European digital library, offering multilingual publications and linked metadata from multiple European institutions. This library provides information on millions of items digitized from museums, libraries, archives and multimedia collections [11].
The Europeana Data Model (EDM) is based on the principles and practice of the Semantic Web and Linked Data. The model is built using RDF standards and established vocabularies such as OAI-ORE (Open Archives Initiative Object Reuse and Exchange), SKOS and Dublin Core. EDM acts as a common top-level ontology that enables interoperability while maintaining original data models.
Other ontologies that are of special interest to EDM are Dublin Core and FOAF. Dublin Core provides a
vocabulary for describing the essential features of cultural objects (creators, relationships with other resources,
subject indexing, etc.) in an appropriate manner for the Semantic Web [12]. FOAF is an ontology describing
persons, their activities and their relations to other people and objects.
RDF has been used as the meta-model of EDM, and any object of interest in the space of Europeana (either
an object of cultural heritage, a person, a place, a concept, etc.) is considered a resource, identified by a URI.
EDM provides a set of constructors (classes and properties) that can be used by the metadata providers. The
model captures a description of the digital environment of the object sent to Europeana adding descriptive
information resources that are part of that environment. Furthermore, EDM includes descriptive and contextual
properties that capture the features of a resource and relate it to others in this context [13]. A pilot project has built links to several datasets, such as people in DBpedia, the GeoNames gazetteer and the GEMET thesaurus. The concepts are described using SKOS [14].
It is expected that the new knowledge generated with LDDL will play an important role in the usability of Europeana, in areas such as improved functionality and semantically related content search queries. The principal problems reported for the adoption of Linked Data are: lack of metadata expressed in EDM, lack of links to other sources and lack of agreements to provide data [15].
In the future, it is expected to solve the cited problems and to achieve a greater involvement of the community, resulting in the addition of new datasets, the definition of link specifications in each institution and the improvement of data quality. An increase in the visibility of data is also expected by integrating initiatives such as the collaborative project Schema.org (8).
3.4.4. Library of Congress of the USA
The Linked Data Service defined by the Library of Congress (www.loc.gov) began in 2009, exposing around
260,000 authority records. Currently, it provides access to standards and vocabularies promulgated by the
Library of Congress such as: classification files, themes, geographic areas, countries, languages, preservation
vocabularies, event types, types of software, etc. [16].
In 2011, the Library of Congress began a process of change to implement a new bibliographic environment facilitating the interconnection of network resources. The BIBFRAME (Bibliographic Framework) model of Linked Data was implemented as a starting point for integration in the Web of Data [17]. The BIBFRAME model contains the classes: creative work, instance, authority and annotation. The work/instance relationship may reflect FRBR relationships in terms of graphs rather than hierarchical relationships. In the context of the entity-relationship model, FRBR recognizes entities, attributes and relationships between entities as web resources. The BIBFRAME model is defined in RDF, identifying all entities (resources), attributes and relationships between entities (properties). Moreover, this model enables the use of annotations such as mappings to other vocabularies. In addition, there are connections defined between BIBFRAME and vocabularies such as Dublin Core, FOAF, SKOS and others.
Figure 1. Linked Data Service of the Library of Congress
To support the developed model, there are tools to migrate from Marc21 to the BIBFRAME model and interfaces for publication like Exhibit (7). This model could be integrated with an event model with annotation capabilities. Moreover, projects like Reference Extract can be used to extract references to answer questions in libraries that hold this service (this allows the user to verify the quality of the sources of the information provided). In addition, the Library of Congress uses the services of linkable authority files and the Virtual International Authority File (VIAF). In the future, the development of tools for migrating from Marc21 to BIBFRAME and their use are expected to improve the services associated with the model. Figure 1 shows the Linked Data Service of the Library of Congress.
3.4.5. British Library
The British Library (www.bl.uk) has millions of items from various countries in printed and digital format.
The British Library is developing a version of Linked Data from books and serial publications among others.
Unlike previous initiatives, they do not transform their collections from MARC records to RDF / XML. Instead,
they model things of interest such as people, places and events related to a book [18]. Figure 2 shows the
Linked Data Platform of the British National Library.
Following the practices adopted by various institutions on the Web, the British Library has modeled the “things of interest” reusing the most descriptive existing schemes and adding their own terms in the cases where there were no published terms. The following vocabularies are used to describe the resources of the library: BIBO (Bibliographic Ontology), BIO (Biographical Information), Dublin Core, ISBD (International Standard Bibliographic Description), ORG (Organization Ontology), SKOS, RDF Schema, OWL, FOAF, EVENT (Event Ontology), WGS84 Geo Positioning and a draft of RDA (Resource Description and Access). The data are connected to other sources of Linked Open Data, particularly VIAF, LCSH (Library of Congress Subject Headings), Lexvo, GeoNames (for country of publication), MARC (for country code and language), RDF Book Mashup (an application that makes information about books, their authors, reviews and online bookstores available on the Semantic Web) and Dewey.info (a prototype for linked Dewey Decimal Classification data) [19].
Figure 2. Linked Data Platform of the British National Library
3.4.6. National Library of Spain (NLS)
The project Linked Data on the NLS (datos.bne.es) began in 2011 with the support of the Polytechnic
University of Madrid. It is published using the IFLA (International Federation of Library Associations) models
and the vocabularies: FRBR, FRAD, FRSAD, ISBD (International Standard for Bibliographic Description).
The objectives of this project are: to expose the bibliographic and authority data as Linked Data for improving
user experience navigating with internal and external interrelated data, and to achieve multilingualism. In the project, RDF is used with elements of several vocabularies such as DC, RDA and MADS/RDF. Metadata are built in FRBR and RDF from Marc21 using the Marimba software. Authority files, monographs and sound recordings were migrated, establishing links to Open Library, VIAF, LIBRIS, SUDOC and DBpedia [20].
Several observations were made about the model, namely that field names are numeric codes based on IFLA standards rather than comprehensible names. This is correct internationally speaking, but makes the initial mapping more time-consuming. In addition, there are no reports about the reuse of published data outside the library environment. In the process, mapping problems were identified, as not all basic relations of FRBR could be extracted from MARC records [21].
In the future, it is expected to work with better tools for visualization, mapping refinements, links to selected databases, connections to new data sources and better cataloging procedures. Other advances include the development of the list of topics (subjects) in SKOS, the development of an NLS service for schools with teachers recommending annotated content, and MARC21 records converted with the Learning Object Model (LOM) ontology and enriched with other sources. It is also expected to migrate resources with specific vocabularies that are increasingly common [22].
3.5. Discussion
In this section, we discuss the answers to the research questions that were presented in section 2.1. A
comparison of the features of the selected libraries is shown in Table II.
3.5.1. Benefits of LDDL
The first research question was “What are the benefits of LDDL?”. The benefits of LDDL found in the libraries are summarized as follows:
• The visibility of the data is improved.
• It is possible to establish links to other online services.
• The transformation of topics into SKOS is facilitated.
• Open data recovery is improved.
• Interoperability is enabled without affecting the data source models.
• It is possible to query linked metadata from multiple institutions.
• It allows modelling things of interest related to a bibliographic resource, such as people, places, events and themes.
• End-user annotations of resources improve their credibility.
In 2004, the W3C recommended libraries to publish their data warehouses using Semantic Web technologies to increase their digital impact and social utility [23]. When library data become more richly linked and open, users will have improved capabilities for discovering and using data, browsing a global information graph. First, library Linked Data, using RDF as a unique format, can help to improve and link the knowledge in several domains, improving networked science. In addition, library data could be reused for new services inside and outside the library, helping to increase the impact of using library resources. Finally, different kinds of users will benefit from this technology: librarians, scientists, students, all citizens. Despite the cited benefits, no data have been presented to quantify them.
3.5.2. Vocabularies and ontologies used in LDDL
The second research question was “What are the vocabularies and ontologies used in LDDL?”.
In the analyzed libraries, the following vocabularies and ontologies are used: Dublin Core, BIBO, BIO, FOAF, FRBR, FRAD, FRSAD, IFLA, ISBD, INTERMARC, MADS/RDF, XML-EAD, OAI-ORE, RDF, RDF Schema, ORG, OWL, RDA, SKOS, WGS84 and EVENT. The cited vocabularies in RDF are represented using elements of RDF Schema (RDFS) and the Web Ontology Language (OWL), helping the interoperability and reuse of existing vocabularies where possible, but it is difficult to match the terms without a supporting tool. Moreover, each library is developing its own tool for RDF generation, such as Marimba in the National Library of Spain. Furthermore, the National Library of France has adapted the FRBR model, linking information about documents, persons, organizations and subjects. Finally, the RDF datasets must have the corresponding metadata to ensure adequate provenance identification, using ontologies such as PROV-O (Provenance Ontology).
Table 2. Comparison of features related to Linked Data in the selected libraries.

National Library of France
- Benefits: Increase data visibility. Links to other online services. Publication of topics in SKOS. Open data recovery.
- Vocabularies and ontologies: FRBR, SKOS, InterMarc, XML-EAD, Dublin Core, RDF.
- Problems: Difficulty of cataloging resources.
- Future: Community participation in cataloging and quality control (Citizen Science).

Europeana
- Benefits: Enables interoperability without affecting the source data models. Provides queries to multiple linked metadata from European institutions.
- Vocabularies and ontologies: RDF, OAI-ORE, SKOS, FOAF, Dublin Core.
- Problems: Lack of agreements to provide data; difficulty of migrating data to EDM.
- Future: Greater involvement of the community with new datasets, specifying links, improving the quality of data. Improve visibility by integrating with initiatives such as Schema.org.

Library of Congress USA
- Benefits: Interconnection to other data sources. Model people, places and events related to a book reusing existing schemes.
- Vocabularies and ontologies: BIBO, BIO, Dublin Core, ISBD, ORG, SKOS, RDF Schema, OWL, FOAF, WGS84, RDA.
- Problems: Need to develop tools to transform metadata to Linked Data. Lack of experts in different areas for the transformations.
- Future: The user could check the quality of the information sources. It is required to develop browsers on BIBFRAME. Improve Marc21 to BIBFRAME migration tools.

British Library
- Benefits: Data linked to other data sources. Shape things of interest related to a bibliographic resource, such as people, places, events, themes. Improved visibility.
- Vocabularies and ontologies: BIBO, BIO, Dublin Core, ISBD, ORG, SKOS, RDF Schema, OWL, FOAF, Event ontology, WGS84.
- Problems: Lack of applications consuming Linked Data.
- Future: Identify linking needs. Improve data linking tools. Obtain data use feedback.

National Library of Spain
- Benefits: Linking to other data sources. List of topics in SKOS. NLS scholar service with annotations and recommendations.
- Vocabularies and ontologies: FRBR, FRAD, FRSAD, IFLA, ISBD, RDF, DC, RDA, MADS/RDF, Dublin Core.
- Problems: Mapping problems: not all the basic relationships of FRBR could be extracted from MARC records.
- Future: Develop tools for visualization. Mapping refinement. To get new links. To obtain new data sources.
3.5.3. Problems
The third research question was “What are the problems reported for LDDL?”. In the analyzed publications, the following problems are reported:
• difficulty of cataloging resources,
• too many vocabularies for the same metadata,
• lack of agreements to provide data, difficulty of migrating data to new models,
• need to develop tools for Linked Data transformation,
• lack of experts in different areas for the transformations,
• lack of applications consuming Linked Data,
• mapping problems: for example, not all basic relations of FRBR could be extracted from MARC records,
• need for more useful links between datasets,
• ownership definition and control,
• quality control of the datasets, and
• lack of indicators about the use of Linked Data.
In this study, it is observed that the digital library community and the Semantic Web community have different vocabularies for the same metadata, causing a complication in the mapping process. In addition, the objectives for the use of the RDF datasets are not well defined; the potential users should be involved in this process to define priorities for datasets and applications to consume Linked Data. Another complex problem cited is ownership rights, since policies and rules vary from country to country, thus making it difficult to publish the datasets in an Open Linked Data environment.
Despite the presented problems, the migration of new datasets to RDF in national libraries is observed worldwide.
3.5.4. Future trends
The fourth research question was “What are the future trends in research on LDDL?”.
From the information gathered in our research, the following trends are expected in the near future:
• community participation in cataloging and quality control of published data,
• increased participation of the community with new datasets, specifying links to external sources,
• improved visibility through integration with initiatives such as Schema.org,
• quality of information sources checked by the user,
• new model browsers and data migration tools,
• definition of data linking needs,
• better tools for mapping the data links,
• feedback about data usage,
• tools for visualization, mapping refinement and data analysis,
• policies for managing RDF vocabularies and URIs,
• library data standards compatible with Linked Data, and
• a better discussion about the rights of Open Linked Data.
In short, a greater diffusion of this technology is generally expected with the contribution of librarians and the community and the development of standards and applications to exploit Linked Data. However, several cited issues related to technical, administrative and social aspects, such as the development of new tools or the definition of policies to publish and manage the Web of Data, among others, must be solved.
4. Conclusions
A growing number of national libraries worldwide are using architectures to obtain and publish Linked Data. Mainly, authority and bibliographic records are published using the RDF data model. The vocabularies and ontologies used vary depending on the implementations, but it is possible to establish and publish equivalence relations keeping the source data models.
Benefits of the publication and use of these Linked Data from digital libraries are reported, such as improved data visibility, data linked with external resources, an easier resource annotation process and the reuse of data, but there are not enough reports that quantify them. Four of the five reviewed libraries are achieving the 5-star level of the corresponding Linked Open Data model, publishing open bibliographic data with an open license, in machine-readable and non-proprietary formats, using open standards from the W3C (RDF and SPARQL) and linking the data to external sources.
There are problems to be solved for using Linked Data technologies in digital libraries, such as the need for support tools, mechanisms for data quality control, better querying interfaces, the lack of technical staff with knowledge of these new technologies and the difficulty of defining Linked Data rights, among others. Furthermore, few data about the use of Linked Data in library evaluation, most of them presented as statistics, were found.
It is expected to have a larger number of dataset providers, achieving citizen participation in enriching data through data annotation processes, and to develop more applications to consume Linked Data. It is also expected that online library catalogues will be enriched with recommendations and search rankings based on the popularity of an item and the activity of the user. Moreover, the preservation of Linked Datasets should be considered among the librarian’s tasks. In addition, the dissemination of best-practice design patterns will help to increase library participation in the Semantic Web. In conclusion, libraries should adopt the Web of Data considering the 5-star model, by publishing datasets following the new non-proprietary open formats, linking to other resources, bringing new services and using the community to enrich the data and its quality, to improve knowledge discovery and management.
Notes
1. http://dbpedia.org.
2. http://linkedgeodata.org.
3. http://www.openstreetmap.org.
4. http://evergreen-ils.org/.
5. http://projecthydra.org/.
6. http://omeka.org/.
7. http://www.cni.org/topics/digital-libraries/exhibit-3-0/.
8. http://schema.org/
Acknowledgements
This work was supported by the Prometeo Project from the Secretary of Higher Education, Science, Technology and
Innovation (SENESCYT) of the Ecuadorian Government and by the project GEODAS-BI (TIN2012-37493-C03-03)
supported by the Ministry of Economy and Competitiveness of Spain (MINECO). Alejandro Maté was funded by the
Generalitat Valenciana (APOSTD/2014/064).
References
[1] Berners-Lee T. Design Note: Linked Data, http://www.w3.org/DesignIssues/LinkedData.html (2006, accessed December 2014).
[2] Doerr M, Gradmann S, Hennicke S, Isaac A, Meghini C and Van de Sompel H. The Europeana Data Model (EDM). In: Proc. of the World Library and Information Congress: 76th IFLA general conference and assembly, Gothenburg, Sweden, 10-15 August 2010, pp. 10-15.
[3] Kroeger A. The road to BIBFRAME: the evolution of the idea of bibliographic transition into a post-MARC future. Cataloging & Classification Quarterly, 2013; 51(8): 873-890.
[4] Library Linked Data Incubator Group. W3C Incubator Group Report 25 October 2011. http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025. (2011, accessed February 2015).
[5] Hannemann J and Kett J. Linked Data for libraries. In: Proc. of the World Library and Information Congress: 76th IFLA general conference and assembly, Gothenburg, Sweden, 10-15 August 2010, pp. 1-11.
[6] Kitchenham B, Pearl O, Budgen D, Turner M, Bailey J and Linkman S. Systematic literature reviews in software engineering. Information and Software Technology 2009; 51(1): 7-15.
[7] Petersen K, Feldt R, Mujtaba S and Mattsson M. Systematic mapping studies in software engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering 2008; 17(1).
[8] Simon A, Wenz R, Michel V and Di Mascio A. Publishing bibliographic records on the Web of data: opportunities for the BnF (French National Library). In: The Semantic Web: Semantics and Big Data. 2013; 563-577.
[9] Wenz R. Linked Data and libraries. Catalogue & Index 2010; 160: 2-5.
[10] Wenz R. Linked open data for new library services: the example of data.bnf.fr. Italian Journal of Library & Information Science, 2013; 4(1).
[11] Haslhofer B and Isaac A. data.europeana.eu: The Europeana Linked Open Data Pilot. In: International Conference on Dublin Core and Metadata Applications, Netherlands, 21-23 September 2011; 94-104.
[12] Borst T, Fingerle B, Neubert J and Seiler A. How do libraries find their way onto the Semantic Web? Liber Quarterly, 2010; 19(3/4): 336-343.
[13] Aloia N, Concordia C, Meghini C and Trupiano L. EuropeanaLabs: An Infrastructure to Support the Development of Europeana. In: Bridging Between Cultural Heritage Institutions, Springer Berlin. 2014; 53-58.
[14] Europeana. Europeana Data Model Primer. Technical report. August 2010. http://version1.europeana.eu/web/europeana-project/technicaldocuments/. (2010, accessed January 2014).
[15] Haslhofer B, Momeni E, Gay M and Simon R. Augmenting Europeana content with linked data resources. In: Proc. of the 6th International Conference on Semantic Systems. ACM. 2010; 40.
[16] Gheen T. Library of Congress launches beta release of Linked Data classification. In Custodia Legis, http://blogs.loc.gov/law/2012/07/library-of-congress-launches-beta-release-of-linked-data-classification/ (2012, accessed January 2015).
[17] Miller E, Ogbuji U, Mueller V and MacDougall K. Bibliographic Framework as a Web of Data: Linked Data Model and Supporting Services. Technical report, Library of Congress, Washington, DC, USA, 2012.
[18] Bizer C, Cyganiak R and Gauß T. The RDF Book Mashup: From Web APIs to a Web of Data. In: Proc. of the 3rd International Workshop on Scripting for the Semantic Web (SFSW2007), June 6, 2007, Austria, online: CEUR-WS.org/Vol-248/paper4.pdf.
[19] Deliot C. Publishing the British National Bibliography as Linked Open Data. Catalogue & Index, 2014; 174: 13-18. http://www.bl.uk/bibliographic/pdfs/publishing_bnb_as_lod.pdf. (2014, accessed January 2015).
[20] Vila Suero D and Escolano Rodríguez E. Linked Data at the Spanish National Library and the Application of IFLA RDFS Models. IFLA SCATNews, 2011; 35: 5-6.
[21] Vila Suero D, Villazón Terrazas B and Gómez-Pérez A. datos.bne.es: A library linked dataset. Semantic Web, 2013; 4(3): 307-313.
[22] Pattuelli M. Personal name vocabularies as linked open data: A case study of jazz artist names. Journal of Information Science, 2012; 38(6): 558-565.
[23] Miller E. Digital libraries and the Semantic Web. W3C. 2004. http://www.w3.org/2001/09/06-ecdl/Overview.html. (2004, accessed June 2015).
7. Data model for storage and retrieval of legislative documents in Digital Libraries using Linked Data
The content of this chapter corresponds to the following article:
Hallo, M., Luján-Mora, S. and Maté, A. (2015). Data model for storage and retrieval of
legislative documents in Digital Libraries using Linked Data. Proceedings of 7th annual
International Conference on Education and New Learning Technologies EDULEARN15.
Barcelona, Spain, IATED, 7423-7430.
URI: https://library.iated.org/view/HALLO2015DAT.
In the management of digital libraries there are special document-handling features, such as versioning, that could be improved with Linked Data and controlled with dashboards based on the corresponding metadata. Legislative documents are a particular case that requires special change-control treatment. The publication “Data model for storage and retrieval of legislative documents in Digital Libraries using Linked Data” analyzes the main data models used in the management of changes to legislative documents in digital libraries, and proposes a data model for the storage and retrieval of different versions of legislative documents and their fragments using Linked Data technologies. Furthermore, a method for publishing and managing relations between legislative documents and their changes is presented. Document structures, changes, metadata and historical query requirements are analyzed.
DATA MODEL FOR STORAGE AND RETRIEVAL OF LEGISLATIVE
DOCUMENTS IN DIGITAL LIBRARIES USING LINKED DATA
María Hallo1, Sergio Luján-Mora2, Alejandro Maté3
1 Department of Computer Science, National Polytechnic School, Quito (ECUADOR)
2 National Polytechnic School, Quito (ECUADOR)
2,3 Department of Software and Computing Systems, University of Alicante (SPAIN)
Abstract
Many countries have provided online access to some types of legislative documents by subject,
keywords or date. Nevertheless, the possibility of querying historical versions of the documents is
usually an uncommon feature. The dispersion of laws and other legislative documents and their continuous changes make it difficult to generate and query valid legislative information at a given date. Furthermore, the ripple effect of modifications such as updates, insertions or derogations
affecting the entire body of a law or part of it is not always visible for the citizens who are looking for
legislative information.
Some issues related to change management of legislative documents can be identified: how to apply
the history of changes to a version of a legislative document to obtain a new version, and what type of
data model might be better to satisfy temporal queries, to store new versions of documents or to
obtain them dynamically. Access to all versions of a document and its fragments is important in legislative queries to determine which law was in force when a case happened.
Law documents are produced and stored in information systems with different data models to access
and retrieve information about them in a large-scale manner, but most of them do not have law
change management functions. Web standards, such as XML, XSLT and RDF, facilitate the
separation between content, presentation and metadata, thus contributing to a better annotation and
exploitation of information from these documents and their fragments to improve the historical queries
and the version generation of legislative documents.
This paper presents a proposal of a data model for the storage and retrieval of different versions of legislative documents using Linked Data, a method of publishing structured interlinked data, for managing relations between legislative documents and their changes. Document structures, changes to legislation, metadata and requirements of historical queries are analyzed in this work. Furthermore, the
proposed model facilitates historical querying of legislative documents and consolidation procedures, allowing updates of relationships between documents and fragments without changes to the original documents. The model has been tested with Ecuadorian laws, but it could be used for the law systems of other countries because the model is independent of the legislative framework.
Keywords: Digital libraries, Linked Data, data models, legislative documents.
1 INTRODUCTION
Legislative documents can be classified into laws, decrees, resolutions and agreements. They are produced by different official bodies and published in official gazettes, such as the BOE in Spain, the State Gazette in Bulgaria, the Gazette in the United Kingdom, the Official Gazette in Ecuador and so on.
Many countries have provided online access to some types of legal documents through information systems that allow searches by subject, keyword or chronological access. One of the most important requirements is querying enacted and consolidated versions of valid laws at a certain date. In order to facilitate the self-description, comparison and integration of documents and special queries, some law systems use standards such as XML (Extensible Markup Language), XSLT (Extensible Stylesheet Language Transformations) and RDF (Resource Description Framework) [1,2].
However, recent experiences using Linked Data and Semantic Web with legislative information in
England, Italy, Netherlands, Spain, and others [3,4] have shown that the updating process is slow,
manual consolidations of versions are difficult, and there is a lot of outdated information.
In Ecuador and in other countries, there are legal systems maintained by private companies and official websites. The National Assembly of Ecuador began to publish laws on its website after 2009, without relations between documents. In Ecuador, consolidations of common laws are made periodically and published after approval by the National Assembly.
This paper presents a data model to meet the needs of historical queries and of the generation of valid versions of legislative documents at a given date, allowing updates of relationships between documents without modifications to the original documents. The proposed model supports the consolidation of legislative documents, developing consolidated documents by applying the corresponding changes. The model could be applied to the legal systems of several countries because it is independent of the legislative framework.
2 GENERAL CONCEPTS

The proposed model to store and retrieve legislative documents is based on the Linked Data principles for publishing and interlinking structured data on the Web. In addition, the model uses metadata (information about the data) to make it easier to find and work on legislative documents and their fragments.
2.1 Linked Data

Linked Data is a method of publishing data on the Web, using standards such as RDF, XML and XSLT. RDF is a data model for describing resources on the Web. It models statements about resources in the form of subject (resource), predicate (property), object (value) triples, building a graph. As such, RDF is suited to knowledge representation. RDF has been used to develop semantic models for legislation.
In 2006, Tim Berners-Lee presented the design principles of Linked Data [2]:
• Use URIs (Uniform Resource Identifiers) as resource names.
• Use HTTP URIs so that people can find these names.
• When someone looks up a URI, useful information should be provided, using RDF and SPARQL standards.
• Include links to other URIs so that people can find more related resources.
These principles are useful to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.
2.2 Metadata

Metadata has been described as data about data. Metadata can be obtained from controlled vocabularies, such as ontologies and thesauri, or from uncontrolled vocabularies of particular designs. Metadata may be defined at the level of items or collections, and may be embedded in the described object or defined outside it, with pointers to the objects they describe. There are multiple metadata standards, many of them for the same purpose, with defined patterns of interaction allowing interoperability between them.
Definitions and criteria for comparing metadata schemas, useful for reusing vocabularies, are presented in the book “Introduction to Metadata” edited by Murtha Baca [5]. This proposal reuses existing vocabularies with a few additions for the amendments of laws, as specified in section 3.6.
3 DETAILS OF THE PROPOSAL
In this work, we suggest the use of a tree data model for legislative documents stored in XML format, and a graph data model, represented in RDF, for the metadata of legislative documents and the links relating modifying and modified documents. The changes are generally produced from fragments of new legislation referencing fragments of old legislation. The tree data model fits the hierarchical structure of the documents, while the graph data model fits well when component interconnectivity is a key feature of legislative change management systems. The database system used is Virtuoso (1), a database management system supporting relational, XML and RDF data. The dynamic generation of versions is made by applying the valid amendments until a given date to the original documents. The details of the model are shown in the following sections.

1. Virtuoso Open-Source edition: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/
3.1 Content of Legislative Documents

The legislative documents may be of different types: laws, decrees, resolutions and so on. Each one has its own hierarchical structure and could have appropriate tags in XML.
A law in the case of Ecuador has the following parts:
− Heading
− Preamble
− Body
− Provision
− Annexes
The Body is composed of Titles, Chapters, Sections, Articles and Paragraphs, but not all these components are always found in a law. In some cases, titles, chapters or sections are omitted.
Metadata are developed to describe the content of legislative documents, fragments and change links.
3.2 Metadata on Legislative Documents

Metadata of legislative documents and their changes are represented in RDF in our proposal using the LegX ontology. This helps to dynamically query the metadata of a legislative item as it was enacted, its contents and the history of its amendments at a given date. For the specification of official versions, the FRBR (Functional Requirements for Bibliographic Records) vocabulary [6] is used. FRBR represents the products described in bibliographic records as work (law), expression (versions), manifestation (formats) and item (copy). Associating legislative information with the FRBR vocabulary, the work is a legislative document; it has expressions (versions) and each expression has manifestations (formats). Each manifestation can have items (copies).
Each work has a number of expressions that represent a different version of the legislation, each one with its respective date of enactment, represented with the Dublin Core term dct:issued (date of formal issuance of the resource). Dublin Core is a metadata model for use in resource description.
The FRBR vocabulary is used for structuring information about legislative documents. For the definition of links, we used a vocabulary similar to METS [7]. METS is a metadata standard for the encoding and transmission of digital objects.
The concepts and relations between the legislative documents, fragments and amendments are modelled in the proposed LegX (Legislative) ontology, represented in Figure 1. The term legX:Changes and the properties changeFrom and changeTo, pointing to the modified and modifying legislative documents, are added to the proposed model. Moreover, metadata about global documents, fragments and amendments must be defined in each case using any of the existing metadata standards oriented to legislative systems, such as CEN MetaLex or Akoma-Ntoso (2). CEN MetaLex [8] defines a machine-readable representation (in XML format) of parliamentary and legislative documents. Moreover, the standard suggests a naming convention for URI-based identifiers for all structural elements of a legal document.

3.3 Identifiers

Similarly to the proposal of legislation.gov.uk (3), URIs are used as identifiers for legislative documents and fragments, and also for each change link. URIs must be defined hierarchically, showing the parts and version dates. A proposed URI structure for each component is described below:
• The URI structure for legislative documents (work) is:
{domain/id/publication-year/publication-number/{legislative-item(promulgation-date)}
Legislative-item {doc-type|title|chapter|section|article}
Doc-type {law|decree..}
2. Akoma Ntoso XML for parliamentary, legislative & judiciary documents: http://www.akomantoso.org/
3. UK legislation website by National Archives: http://www.legislation.gov.uk
• The URI structure for documents-versions (expression) is:
{domain/publication-year/law-id/{legislative-item/ver(version-date)}
Legislative-item {doc-type|title|chapter|section|article}
Doc-type {law|decree..}
• The URI structure for documents-formats (manifestations) is:
{domain/publication-year/law-id/{legislative-item/ver(version-date)}/nom_file.extension}
Legislative-item {doc-type|title|chapter|section|article}
Doc-type {law|decree..}
• The URI structure for documents-formats-copies (items) is:
{domain/publication-year/law-id/{legislative-item/ver(version-date)}/nom_file.extension/numcopy}
Legislative-item {doc-type|title|chapter|section|article}
Doc-type {law|decree..}
• The URI structure for amendments links is:
{domain/legX/link/link-number/validity-date}
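For instance, under this scheme a hypothetical law published as number 15 of 2010 and promulgated on 2010-10-12 would be identified by URIs like the following (the domain and all values are illustrative only, not identifiers of an actual system):

http://mydomain/gob/ec/id/2010/15/law(2010-10-12)                  (work)
http://mydomain/gob/ec/2010/15/law/ver(2014-08-06)                 (expression)
http://mydomain/gob/ec/2010/15/law/ver(2014-08-06)/law15.pdf       (manifestation)
http://mydomain/gob/ec/2010/15/law/ver(2014-08-06)/law15.pdf/1     (item)
http://mydomain/gob/ec/legX/link/125/2014-08-06                    (amendment link)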
3.4 Temporal modeling of legislative documents and fragments

There have been several studies to define time dimensions related to legislative documents and applications. Boer et al. [9] define the following time data: publication date, enactment date, start of validity date (which in some cases matches the publication date), repeal date (the end of the validity), effectiveness date and applicability date. Moreover, Boer et al. define the state of a document as active or inactive. In addition, Grandi et al. [10] propose timestamps for multiversion XML documents: start and end of validity, and effectiveness of transactions.
The time dimensions may be associated with the works (legislative documents) and expressions
(versions of legislative documents) being inherited by the manifestations (formats). It is proposed to
use the following data: publishing date, start of validity date, end of validity date (repeal), start of
effectiveness date, end of effectiveness date for versions of documents and fragments.
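As an illustration only, these time data could be attached to a version (expression) as RDF properties; the property names below (legX:startValidity, legX:startEffectiveness) are hypothetical labels for the dates listed above, not terms confirmed by the published ontology, and the namespace declarations of Table 1 are omitted for brevity:

<frbr:Expression rdf:about="http://mydomain/gob/ec/2010/15/law/ver(2014-08-06)">
  <!-- Publishing date plus hypothetical validity and effectiveness start dates -->
  <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2014-08-06</dct:issued>
  <legX:startValidity rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2014-08-06</legX:startValidity>
  <legX:startEffectiveness rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2014-08-06</legX:startEffectiveness>
</frbr:Expression>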
Figure 1. LegX ontology.
3.5 Changes in legislation

Acts and regulations change continuously. Changes must be recorded and applied to old laws. The changes are generally produced from fragments of new legislation referencing fragments of old legislation. Different types of changes can be produced: insertions, updates, repeals. A common request is to view summaries of significant changes to specific acts or regulations. For applying changes, we define links with the following attributes:
• ChangeDate: the date on which the amendment applies to the target document.
• From: URI of the modifying fragment.
• To: URI of the modified fragment.
• Type of change: inserted, deleted, repealed, updated.
• Description of change: description of the reasons for the change.
The changes applied to the original document generate a new version dynamically.
3.6 Description in RDF/XML

For the description of legislative documents, versions, modifications and fragments, we propose to reuse the vocabularies shown in Table 1. These vocabularies provide the concepts needed for our model.
Table 1. Vocabularies used in the data model.

Vocabulary     Prefix   Namespace
RDF Schema     rdfs     http://www.w3.org/2000/01/rdf-schema#
Dublin Core    dct      http://purl.org/dc/terms/
FRBR           frbr     http://purl.org/vocab/frbr/core#
Legislative    legX     http://mydomain/gob/ec/legX
The legX ontology was developed to represent the FRBR elements and the links relating the changes. Versioned documents (expressions) point to their manifestations (formats) with the Dublin Core element dct:hasFormat. A legislative item of type work points to its parent item with dct:isPartOf. A legislative item of type work points to its component items with dct:hasPart. Each legislative document is linked with its versions with dct:hasVersion, with a reverse link dct:isVersionOf, as shown in Figure 1.
In the following, we present a fragment in RDF/XML format for the metadata of a legislative document from Ecuador defined at the work (document) level, showing the relation to two consolidated versions. Each element has its own tag in XML.
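A minimal sketch of what such a work-level description could look like, assuming the namespaces of Table 1 and purely hypothetical URIs, titles and dates:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dct="http://purl.org/dc/terms/"
         xmlns:frbr="http://purl.org/vocab/frbr/core#">
  <!-- A legislative document at work level, pointing to two consolidated versions -->
  <frbr:Work rdf:about="http://mydomain/gob/ec/id/2010/15/law(2010-10-12)">
    <dct:title>Hypothetical law 15 of 2010</dct:title>
    <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2010-10-12</dct:issued>
    <dct:hasVersion rdf:resource="http://mydomain/gob/ec/2010/15/law/ver(2012-03-01)"/>
    <dct:hasVersion rdf:resource="http://mydomain/gob/ec/2010/15/law/ver(2014-08-06)"/>
  </frbr:Work>
</rdf:RDF>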
Metadata about changes are represented by links in RDF/XML. In the next example, we show the metadata of a link relating the modifying and the modified norms, identified in the (link:from) and (link:to) tags.
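A sketch of such a link, assuming a link prefix bound to the LegX namespace and hypothetical URIs and property names derived from the attributes of section 3.5 (the exact tags of the original figure may differ):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:link="http://mydomain/gob/ec/legX#">
  <!-- A change link: a fragment of the modifying norm updates a fragment of the modified norm -->
  <rdf:Description rdf:about="http://mydomain/gob/ec/legX/link/125/2014-08-06">
    <link:changeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2014-08-06</link:changeDate>
    <link:from rdf:resource="http://mydomain/gob/ec/2014/22/law/article-1/ver(2014-08-06)"/>
    <link:to rdf:resource="http://mydomain/gob/ec/2010/15/law/article-80/ver(2010-10-12)"/>
    <link:typeOfChange>updated</link:typeOfChange>
    <link:descriptionOfChange>Hypothetical update of the article wording</link:descriptionOfChange>
  </rdf:Description>
</rdf:RDF>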
3.7 Consolidation

Consolidation is the process of applying to a legislative document all the changes made over time. In order to have official resulting documents, they must obtain the approval of the competent bodies, as in the case of the National Assembly in Ecuador.
As an aid to the consolidation process, for each legislative document or fragment it is possible to query the changes made to it at a given date or in a range of dates.
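A sketch of such a consolidation-support query in SPARQL, reusing the hypothetical link properties and URIs of the previous examples (illustrative names, not the paper's actual schema):

PREFIX link: <http://mydomain/gob/ec/legX#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Changes applied to a given fragment between two dates, in chronological order
SELECT ?amendment ?type ?date
WHERE {
  ?amendment link:to <http://mydomain/gob/ec/2010/15/law/article-80/ver(2010-10-12)> ;
             link:changeDate ?date ;
             link:typeOfChange ?type .
  FILTER (?date >= "2010-10-12"^^xsd:date && ?date <= "2014-12-31"^^xsd:date)
}
ORDER BY ?date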
The proposed model allows handling official document versions and dynamically generated unofficial consolidated versions, supporting the official consolidations or serving as a reference for system users.
3.8 Queries

Most legislative information systems allow querying the legislative document valid at a specific date and performing a full-text search over the legislative document. Moreover, the historical evolution of a document is also covered by several systems, but the historical evolution of a fragment is less supported. In the same way, information about the amending and amended acts of a rule is not always considered, and queries related to validity and efficacy are less supported [11]. In general, the Linked Data oriented proposals support a wider set of queries using SPARQL [12].
Our proposed model allows the following queries:
a) A valid legal document at a certain date.
b) Historical development of a legislative document with access to all the approved versions.
c) A piece of legislation and its amendments.
d) Laws modified by a law.
e) Validity range of a legislative document.
f) Validity range of a piece of legislation.
g) Efficacy range of a legislative document.
h) Efficacy range of a piece of legislation.
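As an illustration, query (a) could be written in SPARQL against the Virtuoso endpoint roughly as follows; the validity properties (legX:startValidity, legX:endValidity) are the hypothetical labels sketched in section 3.4, not terms confirmed by the published ontology:

PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX legX: <http://mydomain/gob/ec/legX#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Versions of a given law that were valid on 2014-08-06
SELECT ?version
WHERE {
  <http://mydomain/gob/ec/id/2010/15/law(2010-10-12)> dct:hasVersion ?version .
  ?version legX:startValidity ?start .
  OPTIONAL { ?version legX:endValidity ?end }
  FILTER (?start <= "2014-08-06"^^xsd:date &&
          (!bound(?end) || ?end > "2014-08-06"^^xsd:date))
}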
3.9 Advantages of the proposed model

With this proposal, metadata are separated from the legislative documents and can be expanded (maintained independently), adding new metadata without changes to the original document. This model allows generating a new version of a legislative document dynamically by applying the changes to the original document. Moreover, the model allows linking, in the future, the documents and the modifications to other resources, such as the proponents, the voting results, etc. [13]. This model has been implemented using Virtuoso, software with capabilities to manage relational data, XML fields and RDF structures.
4 CONCLUSIONS AND FUTURE WORKS

The proposed model integrates the metadata of legislative documents and amendments using RDF. Moreover, it is possible to add other relationships in RDF in order to cover the process of issuing laws and future changes. In addition, metadata are separated from the documents and can be expanded independently. Furthermore, the proposed model supports the consolidation of legislative documents, developing integrated documents with the corresponding amendments, and could be applied to the legal systems of several countries. For the initial loading of metadata in RDF, it is necessary to define an extraction process from XML fields, according to the type of data sources and their corresponding metadata. In the future, other kinds of legislative processes should be analyzed to integrate new documents and relationships into the model. In addition, performance tests will be addressed.
ACKNOWLEDGMENTS
This work has been partially supported by the Prometeo Project by SENESCYT, Ecuadorian
Government.
REFERENCES
[1] Peri, A. (2009). SIDRA: XML en la gestión y explotación de la documentación jurídica. SCIRE, 15, pp. 111-124.
[2] Martínez, M., et al. (2003). Relationship-based dynamic versioning of evolving legal documents. Web-knowledge Management and Decision Support, Lecture Notes on Artificial Intelligence, 2543, pp. 298-314.
[3] Sheridan, J. and Tennison, J. (2010). Linking UK Government Data. Statistics, ACM Press, North Carolina, pp. 1-4.
[4] Hoekstra, R. (2011). The MetaLex Document Server, in VoxPopuLII. Available from: http://blog.law.cornell.edu/voxpop/2011/10/25/the-metalex-document-server/. [Accessed: 2 Jan 2015].
[5] Baca, M., Gill, T., Gilliland, A.J., Whalen, M. and Woodley, M. (2008). Introduction to Metadata, Getty Research Institute, Getty Publications, Los Angeles.
[6] IFLA Study Group. (1997). Functional Requirements for Bibliographic Records, IFLA Series on Bibliographic Control, 19, Munich: K.G. Saur Verlag.
[7] Cundiff, M. V. (2004). An introduction to the metadata encoding and transmission standard (METS). Library Hi Tech, 22(1), pp. 52-62.
[8] Palmirani, M., Cervone, L. and Vitali, F. (2009). Legal metadata interchange framework to match CEN MetaLex. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL '09). ACM, New York, NY, USA, pp. 232-233.
[9] Boer, A., Hoekstra, R., Winkels, R., van Engers, T. and Breebaart, M. (2004). Time and Versions in METALex XML. In: Proceedings of the Workshop on Legislative XML. Kobaek Strand.
[10] Grandi, F., et al. (2005). Temporal Modelling and Management of Normative Documents in XML Format. Data & Knowledge Engineering, 54, pp. 327-354.
[11] Hallo, M., Martínez, M. and De la Fuente, P. (2013). Data models for version management of legislative documents. Journal of Information Science, 39(4), pp. 557-572.
[12] Sartor, G., Palmirani, M., Francesconi, E. and Biasiotti, M. (eds.). (2011). Legislative XML for the Semantic Web, 1st Ed. Netherlands: Springer.
[13] Hallo, M., Luján-Mora, S. and Trujillo, J. (2014). Transforming Library Catalogs into Linked Data. ICERI2014 Proceedings, pp. 1845-1853.
8. An Approach to Publish Scientific Data of Open-access Journals using Linked Data Technologies
The content of this chapter corresponds to the following article:
Hallo, M., Luján_Mora, S. and Chavez, C. (2014). An Approach to Publish Scientific Data
of Open-access Journals using Linked Data Technologies. Proceedings of the 6th
International Conference on Education and New Learning Technologies
(EDULEARN2014). Barcelona, Spain, IATED, 5940-5948.
URI: https://library.iated.org/view/HALLO2014ANA.
This study reports a process for publishing the metadata of scientific articles on the Web using Linked Data technologies. In addition, methodological guidelines with related activities are presented. The proposed process was applied by extracting Dublin Core metadata from a digital journal published with the Open Journal Systems software, and the publication was carried out using the community edition of the Virtuoso software.
AN APPROACH TO PUBLISHING SCIENTIFIC DATA OF OPEN-ACCESS JOURNALS USING LINKED DATA TECHNOLOGIES
M. Hallo1, S. Luján-Mora2, C. Chávez1
1 National Polytechnic School, Faculty of System Engineering (ECUADOR)
2 National Polytechnic School, University of Alicante, Department of Software and Computing Systems (SPAIN)
Abstract
The Semantic Web encourages digital libraries, including open access journals, to collect, link and share their data across the Web in order to ease their processing by machines and humans, obtaining better queries and results. Linked Data technologies enable connecting related data across the Web using the principles and recommendations set out by Tim Berners-Lee in 2006.
Several universities develop knowledge through scholarship and research, with open access policies for the generated knowledge, using several ways to disseminate information. Open access journals collect, preserve and publish scientific information in digital form related to a particular academic discipline through a peer review process, and have a big potential for exchanging and spreading their data linked to external resources using Linked Data technologies. Linked Data can increase those benefits with better queries about the resources and their relationships.
This paper reports a process for publishing scientific data on the Web using Linked Data technologies. Furthermore, methodological guidelines with related activities are presented. The proposed process was applied by extracting data from a university Open Journal Systems installation and publishing them in a SPARQL endpoint using the open source edition of OpenLink Virtuoso. In this process, the use of open standards facilitates the creation, development and exploitation of knowledge.
Keywords: Scientific data, Linked Data, Open access journals, Semantic Web.
1 INTRODUCTION

Open access (OA) is free, unrestricted online access to digital content. The open access movement began in the 1990s, at the same time the World Wide Web became widely available and open access journal publishing began to grow. In 2003, the Budapest Open Access Initiative (BOAI) launched a worldwide campaign for open access to peer-reviewed research [1].
Several universities, like Harvard, Stanford and MIT, have adopted guides to good practices for university open access policies. Universities transmit knowledge through scholarship and research. In both roles there are many initiatives to openly share knowledge and resources. For example, in education there is the Open Educational Resources (OER) initiative [2]. On the other hand, in different research fields several forms of open knowledge diffusion have been implemented, such as open archives, open access journals, blogs and websites, helping users to easily create new developments and new knowledge.
Open access journals collect, preserve and publish scientific information in digital form related to a particular subject. The development of Information and Communication Technologies (ICTs) has increased the number of open access scientific journals in digital format, speeding up dissemination of and access to content [3]. However, bibliographic data are dispersed, without relationships between resources and data sets, making their discovery and reuse by other information systems difficult. To address these issues, we propose a process for publishing bibliographic data from open journal systems following the Linked Data principles.
The proposed approach has been applied in a case study, "Revista Politécnica", in the context of the interuniversity project for publishing library bibliographic data using Linked Data technologies, funded by CEDIA ("Consorcio Ecuatoriano para el Desarrollo de Internet Avanzado") and developed in Ecuador by the National Polytechnic School, the University of Cuenca and the Private Technical University of Loja.
Open access journals have great potential for exchanging and spreading their data linked to external resources using Linked Data technologies, especially in the context of the Open Data movement [4]. The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web in a human and machine readable way [5]. The Linked Data principles are:
• Use Uniform Resource Identifiers (URIs) as names for things.
• Use HTTP URIs so that people can look up those names.
• When someone looks up a URI, provide useful information, using common standards such as RDF (Resource Description Framework) and SPARQL (RDF query language).
• Include links to other URIs so that they can help to discover more things.
The URI is used to identify a web resource. In addition, RDF is used for modelling and representation of information resources as structured data. In RDF, the fundamental unit of information is the subject-predicate-object triple. In each triple the "subject" denotes the source; the "object" denotes the target; and the "predicate" denotes a verb that relates the source to the target. Using a combination of URIs and RDF, it is possible to give identity and structure to data. However, using these technologies alone, it is not possible to give semantics to data.
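The following Python sketch, using the rdflib library, illustrates this building block; the article URI and the choice of a Dublin Core property for the predicate are illustrative assumptions, not part of the paper's toolchain:

    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")

    g = Graph()
    # Hypothetical URI identifying the article resource (the subject).
    article = URIRef("http://example.org/article/41")
    # One subject-predicate-object triple: the article has a title.
    g.add((article, DC.title, Literal("Linked Data - The Story So Far")))

    print(g.serialize(format="turtle"))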
The Semantic Web Stack (Architecture of the Semantic Web) includes two technologies: RDFS (RDF
Schema) and OWL (Web Ontology Language). RDFS is an extension of RDF that defines a
vocabulary for the description of entity-relationships [6]. RDFS provides metadata terms to create
hierarchies of entity types (referred to as “classes”) and to restrict the domain and range of predicates.
OWL is an extension of RDFS [7], which provides additional metadata terms for the description of
complex models, which are referred to as “ontologies”.
Some movements, such as LODLAM (Linked Open Data in Library, Archives and Museums, http://lodlam.net), are working on sharing knowledge, tools and expertise using Linked Data in libraries. Several national libraries, such as the British Library, the French National Library and the Spanish National Library, and university libraries, such as those of Michigan, Stanford and Cambridge, have published linked datasets of bibliographic data that they have created. The European Library is promoting Linked Open Data innovations in libraries across Europe [8].
The proposed process for publishing bibliographic linked data was developed based on best practices
and recommendations from several authors and tested with data from the electronic version of the
journal “Revista Politécnica” edited by National Polytechnic School.
Some existing vocabularies and ontologies are used, such as FOAF (Friend of a friend), BIBO
(Bibliographic Ontology), ORG (Organization Ontology) and DC (Dublin Core). In addition, the dataset
created was linked to external data, providing information that goes far beyond the bibliographic data offered by publishers, such as authors publishing papers on similar subjects or organizations sponsoring research in specific areas.
2 THE PROCESS FOR PUBLISHING OPEN JOURNAL SCIENTIFIC LINKED DATA
Several approaches have been proposed to generate and publish linked data [9, 10], each one represented by activities and each activity composed of several tasks. Our approach proposes six main activities: data source analysis, metadata extraction, modelling, RDF generation, linking and publishing. Fig. 1 shows the life cycle of Open Linked Data.
Fig. 1 Open Linked Data Life Cycle.
2.1 Data source analysis
The objective of this activity is to identify data sets whose reuse provides benefits to others. The selected data sets are analyzed looking for attributes useful for answering the queries. The steps in this activity are:
a) Identification of the data source and the attributes of interest to be published and linked to other datasets.
In this study, we have chosen a dataset with publications from a university open journal considering
the importance of the diffusion and interlinking of this information.
b) Engaging stakeholders
In this step we explain stakeholders (principals of several universities and a funding organization) the
process and benefits of creating and maintaining Linked Open Data related to a scientific information
published in the academic journal, after we develop an inter university project for funding.
c) Data source analysis
Several journals affiliated with the open access initiative have adopted the Open Journal System (https://pkp.sfu.ca/ojs/), software that provides an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) endpoint. Although this protocol has been widely adopted, it has some problems, including the use of non-dereferenceable identifiers and limitations on selective access to metadata [11]. The open journal analyzed uses the open source OJS (Open Journal Systems) for the management of peer-reviewed academic journals. The data set used is stored in a MySQL database. In order to gain better knowledge about the scientific publications, the work focused on the articles stored with Dublin Core metadata, a vocabulary for resource description. Some examples coded with Dublin Core metadata follow:
• Identifier (dc:identifier): http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf
• Title (dc:title): "Linked Data - The Story So Far"
• Author (dc:creator): "Tim Berners-Lee"
• Keywords (dc:subject): "Linked Data, Web of Data, Semantic Web, Data Sharing".
These data, linked to other datasets, will give us better knowledge about similar published subjects, the authors who work on them, and the organizations sponsoring similar research.
d) Identification of the licensing and provenance information
There is general information about the licensing in the analyzed open journal. It is possible to get information from the online journal and reproduce it citing the source. This text is added in a dc:rights property.
Provenance information about a data item is information about the history of the item, including information about its origins. It is a measure of data quality.
In our case study, the provenance data are the name of the journal, the type of publication (peer review) and the name of the publisher. In the future, further data about the peer review process could be added.
2.2 Metadata extraction
In this activity the metadata are extracted from the original source and stored in an intermediate
database for cleaning. The tasks in this activity are:
a) Metadata extraction and storage
Metadata were extracted using the open source software Spoon-Pentaho Data Integration and stored
in a relational database. The data extracted were metadata from the entities: article, authors and
organization.
b) Disambiguation of entities with different values
In the case study we found some problems in the data, such as typographic mistakes and several authors with similar names; the same occurred with the affiliation data. Additionally, author names were formatted differently. For example, the same author could appear in one document as "Lujan, S.", in another as "Luján-Mora, S." and in another as "Luján, Sergio". A data cleaning process matches the documents of this author and groups these name variants together so that authors, even if cited differently, are linked to their papers. In this step, author names are grouped under a single identifier number; the process matches author names based on their affiliation and email address, grouping together all of the documents written by each author.
When grouping author names under a unique author identifier number, we should take into consideration last name variations, all possible combinations of first and last names, and the author name with and without initials. As a result, searches for a specific author include a preferred name and variants of the preferred name. This problem is solved in some systems, such as Scopus, by showing potential author matches, as in Fig. 2. In addition, authors have the possibility of reporting mistakes. This functionality is planned to be added to our system in a later stage.
An initial pre-processing of the data, applying data cleaning techniques, was performed. Spoon and Silk were used to build the catalogues of authors and author affiliations.
Discipline (data delivered by the authors in the case study) and keywords were cleaned by matching them against feature catalogues and thesauri for linking with SKOS (Simple Knowledge Organization System) concepts.
Fig. 2 Scopus author profile with different name formats.
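A minimal Python sketch of the kind of name normalization involved in this step (illustrative only: the grouping key is deliberately simplistic, the sample variants come from the example above, and the real process relied on Spoon and Silk together with affiliation and email matching):

    import unicodedata
    from collections import defaultdict

    def strip_accents(text: str) -> str:
        return "".join(c for c in unicodedata.normalize("NFKD", text)
                       if not unicodedata.combining(c))

    def author_key(name: str) -> str:
        """Crude grouping key: first surname token plus first initial, accent-free."""
        name = strip_accents(name.replace("-", " ").strip())
        if "," in name:
            last, first = [p.strip() for p in name.split(",", 1)]
        else:
            parts = name.split()
            last, first = " ".join(parts[1:]) or parts[0], parts[0]
        return f"{last.split()[0].lower()} {first[0].lower()}"

    groups = defaultdict(list)
    for variant in ["Lujan, S.", "Luján-Mora, S.", "Luján, Sergio"]:
        groups[author_key(variant)].append(variant)
    print(dict(groups))  # all three variants share the key 'lujan s'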
2.3 Modelling
The goal of this activity is to design and implement a vocabulary for describing the data sources in
RDF. The steps in this activity are:
a) Selection of vocabularies
The most important recommendation from several studies is to reuse available vocabularies as much as possible when developing the ontologies. An ontology represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts [12]. We use the following controlled vocabularies and ontologies for modelling journals, authors and affiliations:
• BIBO (The Bibliographic Ontology, http://bibliontology.com/) provides main concepts and properties for describing citations and bibliographic references (e.g. books, articles, etc.) on the Semantic Web using RDF.
• Dublin Core (http://dublincore.org/documents/dces/) is a set of terms that can be used to describe web resources as well as physical resources such as books. It consists of fifteen fields, e.g., creator, contributor, format, identifier, language, publisher, relation, rights, source, title, type, subject, coverage, description, and date. The full set of terms can be found in the DCMI Metadata Terms (http://dublincore.org/documents/dcmi-type-vocabulary/index.shtml). Dublin Core metadata may be used to provide interoperability in Semantic Web implementations, combining metadata vocabularies of different metadata standards.
• FOAF (Friend of a Friend, http://www.foaf-project.org/) is a machine-readable ontology describing persons, their activities and relations to other people and objects in RDF format.
• ORG (The Organization Ontology, http://www.w3.org/TR/vocab-org/) is an ontology for organizational structures, aimed at supporting linked data publishing of organizational information. It is designed to add classification of organizations and roles, as well as extensions to support information such as organizational activities. The namespaces used are shown in Table 1.
b) Vocabulary development and documentation
The vocabulary was documented using Protégé (http://protege.stanford.edu/), an ontology editor tool.
Table 1. Vocabularies and Namespaces
  ORG       http://www.w3.org/ns/org#
  FOAF      http://xmlns.com/foaf/0.1/
  DC        http://xmlns.com/dc/0.1/
  DCTERMS   http://purl.org/dc/terms/
  BIBO      http://purl.org/ontology/bibo/
c) Vocabulary validation
Ontology validation is a key activity in different ontology engineering scenarios, such as development and selection, that is, assessing the quality and correctness of ontologies [13]. The generated vocabulary was validated with OOPS! (Ontology Pitfall Scanner, http://www.oeg-upm.net/oops).
d) Specify a license for the dataset
The license chosen to publish the datasets was Creative Commons (http://creativecommons.org/).
2.4 RDF generation
The goal of this activity is to define a method and technologies to transform the source data into RDF and produce a set of mappings from the data sources to RDF. The tasks in this activity are:
a) Selection of technologies for RDF generation
For the case study, the Triplify tool (http://triplify.org/) with some modifications was used to transform the intermediate relational database into RDF.
b) Mappings from data sources to RDF
Mappings to RDF were defined from the intermediate database containing the metadata extracted from the source system.
c) Transformation of data
The transformation process was run with the open source software Triplify, producing RDF triples stored in RDF/XML format. Fig. 3 shows part of this process.
Fig. 3 RDF Generation using Triplify.
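Triplify itself is configured with SQL-to-RDF mappings; the following Python sketch illustrates the same row-to-triples idea under stated assumptions (an in-memory table stands in for the intermediate MySQL database, and only two Dublin Core mappings are shown):

    import sqlite3
    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    BASE = "http://www.revistapolitecnica.epn.edu.ec/ojs2/triplify/article/"

    # In-memory table standing in for the intermediate MySQL database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE article (id INTEGER, title TEXT, creator TEXT)")
    db.execute("INSERT INTO article VALUES (41, 'Linked Data - The Story So Far', 'Tim Berners-Lee')")

    g = Graph()
    for art_id, title, creator in db.execute("SELECT id, title, creator FROM article"):
        subject = URIRef(BASE + str(art_id))           # one dereferenceable URI per row
        g.add((subject, DC.title, Literal(title)))     # column-to-property mappings
        g.add((subject, DC.creator, Literal(creator)))

    g.serialize("articles.rdf", format="xml")          # RDF/XML, as produced by Triplify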
2.5 Interlinking
The objective of this activity is to improve the connectivity to external datasets enabling other
applications to discover additional data sources.
The tasks corresponding to this activity are:
a) Target datasets discovery and selection
For this task we used the Datahub website (http://datahub.io/) to find datasets useful for linking. We found several open linked datasets from scientific journals.
b) Linking to external datasets
The open source software Silk (http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/) was used to find relationships between data items of our dataset and the external datasets, generating the corresponding RDF links, which were stored in a separate dataset.
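Silk expresses such linkage rules in its own link specification language; purely as an illustration of the underlying idea, the Python sketch below generates owl:sameAs links from a string-similarity score above a threshold (the author URIs, names and the 0.8 threshold are hypothetical):

    from difflib import SequenceMatcher
    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    # Hypothetical local and external author resources to be matched.
    local = {"http://example.org/author/1": "Sergio Lujan Mora"}
    external = {"http://dbpedia.org/resource/Sergio_Lujan": "Sergio Lujan"}

    links = Graph()
    for l_uri, l_name in local.items():
        for e_uri, e_name in external.items():
            score = SequenceMatcher(None, l_name.lower(), e_name.lower()).ratio()
            if score > 0.8:  # acceptance threshold, tuned per linkage task
                links.add((URIRef(l_uri), OWL.sameAs, URIRef(e_uri)))

    links.serialize("links.nt", format="nt")  # kept in a separate dataset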
2.6 Publication
The goal of this activity is to make RDF datasets available on the Web to the users following the
Linked Data principles. The steps in this activity are:
a) Dataset and vocabulary publication on the web
The generated triples were loaded into a SPARQL endpoint (a conformant SPARQL protocol service) based on OpenLink Virtuoso (http://virtuoso.openlinksw.com/), a database engine that combines the functionality of an RDBMS, virtual databases, RDF triple stores, an XML store, a web application server and file servers. On top of OpenLink Virtuoso, Pubby (http://www4.wiwiss.fu-berlin.de/pubby/) is used as a Linked Data interface to the RDF data. Fig. 4 shows a view of the SPARQL endpoint with a partial result of a query about an article on a test platform:

select * from <http://192.168.203.128:8890/DAV/home/datasetojs3>
where {<http://www.revistapolitecnica.epn.edu.ec/ojs2/triplify/article/41> ?y ?z}
Fig. 4 SPARQL endpoint query.
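The same endpoint can also be queried programmatically; a sketch using the SPARQLWrapper Python library follows (the address is the internal test platform from Fig. 4, so it is reachable only there, and the /sparql path is assumed as the usual Virtuoso default):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Internal test endpoint from Fig. 4; substitute a public endpoint to run this.
    sparql = SPARQLWrapper("http://192.168.203.128:8890/sparql")
    sparql.setQuery("""
        SELECT ?y ?z FROM <http://192.168.203.128:8890/DAV/home/datasetojs3>
        WHERE { <http://www.revistapolitecnica.epn.edu.ec/ojs2/triplify/article/41> ?y ?z }
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["y"]["value"], "->", row["z"]["value"])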
b) Metadata definition and publication
Metadata recommended for publishing Linked Data sets are: organization and/or agency, creation date, modification date, version, frequency of updates, and contact email address [14].
The metadata were published on the Datahub site using DCAT (Data Catalog Vocabulary, http://www.w3.org/TR/vocab-dcat/), an RDF vocabulary designed to facilitate interoperability between data catalogues published on the Web.
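A minimal rdflib sketch of such a DCAT description (the dataset URI, dates and contact address are hypothetical placeholders):

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF, XSD

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    ds = URIRef("http://datahub.io/dataset/revista-politecnica-lod")  # hypothetical
    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.publisher, Literal("National Polytechnic School")))
    g.add((ds, DCTERMS.issued, Literal("2014-05-01", datatype=XSD.date)))
    g.add((ds, DCTERMS.modified, Literal("2014-06-01", datatype=XSD.date)))
    g.add((ds, DCAT.contactPoint, URIRef("mailto:contact@example.org")))

    print(g.serialize(format="turtle"))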
The whole architecture used in this project is shown in Fig. 5.
Fig. 5 Open Linked Data architecture
3 CONCLUSIONS AND FUTURE WORK
In this paper we analyze and use a process for publishing scientific data from Open Journal Systems on the Web using Linked Data technologies. The process was based on best practices and recommendations from several studies, adding tasks and activities considered important during the project development. The process was applied to the transformation of metadata from "Revista Politécnica" into RDF. For publishing we used OpenLink Virtuoso and Pubby.
The process could also be applied to bibliographic metadata harvested through the OAI-PMH protocol, linking integrated metadata from Open Journal Systems.
The Dublin Core standard used in the source metadata was sufficient to integrate data from open journal systems and help answer our questions. For other bibliographic resources it would be important to analyze the FRBR (Functional Requirements for Bibliographic Records) and RDA (Resource Description and Access) standards to achieve interoperability with other digital libraries.
In the future, a new interface will be developed to ask users to fix errors in author disambiguation, the grouping of papers per author, and organization disambiguation. In addition, a team will carry out maintenance tasks in order to publish all new data with the best possible quality; data curation is key to the success of Linked Data. Moreover, we will work on using SKOS (Simple Knowledge Organization System) to link paper subjects and disciplines to other works in order to offer better queries to users. We are also analyzing the best way to validate the generated external links. Another line of future work is the alignment of the data model with the activities of the publication process.
ACKNOWLEDGEMENT
This research has been partially supported by the Prometeo project by SENESCYT, Ecuadorian
Government and by CEDIA (Consorcio Ecuatoriano para el Desarrollo de Internet Avanzado)
supporting the project: “Platform for publishing library bibliographic resources using Linked Data
technologies”.
REFERENCES
[1] Chan, L. et al. (2002). Budapest Open Access Initiative. Available at: http://www.budapestopenaccessinitiative.org/read.
[2] Center for Educational Research and Innovation (CERI) (2007). Giving Knowledge for Free: The Emergence of Open Educational Resources. Organisation for Economic Co-operation and Development. Available at: http://www.oecd.org/dataoecd/35/7/38654317.pdf [Accessed May 8, 2014].
[3] Harnad, S. (2009). Open access scientometrics and the UK Research Assessment Exercise. Scientometrics, 79(1), pp. 147-156.
[4] Suber, P. (2009). Timeline of the open access movement. Available at: http://www.earlham.edu/~peters/fos/timeline.htm [Accessed May 12, 2014].
[5] Berners-Lee, T. (2006). Linked Data - Design Issues. Available at: http://www.w3.org/DesignIssues/LinkedData.html [Accessed May 15, 2014].
[6] Guha, R.V. and Brickley, D. (2004). RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, W3C. Available at: http://www.w3.org/TR/2004/REC-rdf-schema-20040210/ [Accessed May 15, 2014].
[7] Hayes, P., Patel-Schneider, P.F. and Horrocks, I. (2004). OWL Web Ontology Language Semantics and Abstract Syntax. W3C Recommendation, W3C. Available at: http://www.w3.org/TR/2004/REC-owl-semantics-20040210/ [Accessed May 10, 2014].
[8] European Library (2013). Linked Open Data. Available at: http://www.theeuropeanlibrary.org/tel4/lod [Accessed May 11, 2014].
[9] Villazón-Terrazas, B., Vilches-Blázquez, L.M., Corcho, O. and Gómez-Pérez, A. (2011). Methodological guidelines for publishing government linked data. In Linking Government Data. Springer New York, pp. 27-49.
[10] Hausenblas, M. et al. (2013). Linked Data. Manning Publications Company.
[11] Hakimjavadi, H., Masrek, M.N. and Alam, S. (2012). SW-MIS: A Semantic Web Based Model for Integration of Institutional Repositories Metadata Records. Science Series Data Report, 4(11), pp. 57-66.
[12] Kim, J.A. and Choi, S.Y. (2007). Evaluation of Ontology Development Methodology with CMM-i. In Software Engineering Research, Management & Applications (SERA 2007), 5th ACIS International Conference, IEEE, pp. 823-827.
[13] Poveda-Villalón, M., Suárez-Figueroa, M. and Gómez-Pérez, A. (2012). Validating ontologies with OOPS!. In Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW'12), ten Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H. and d'Aquin, M. (Eds.). Springer-Verlag, Berlin, Heidelberg, pp. 267-268.
[14] Gómez-Pérez, A., Vila-Suero, D., Montiel-Ponsoda, E., Gracia, J. and Aguado-de-Cea, G. (2013). Guidelines for multilingual linked data. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics (WIMS '13). ACM, New York, NY, USA.
9. Transforming Library Catalogs into Linked Data
The content of this chapter corresponds to the following article:
Hallo, M., Luján-Mora, S. and Trujillo, J. (2014). Transforming Library Catalogs into Linked Data. Proceedings of the 7th International Conference of Education, Research and Innovation (ICERI2014). Seville, Spain, IATED, 1845-1853.
URI: https://library.iated.org/view/HALLO2014TRA.
The article "Transforming Library Catalogs into Linked Data" presents a process for publishing bibliographic resource metadata using Linked Data technologies. The process was applied to extract metadata of bibliographic resources in Marc21 format from a university library, represent them in RDF and publish them at a SPARQL endpoint. This study made it possible to validate and extend the process presented in Chapter 8.
TRANSFORMING LIBRARY CATALOGS INTO LINKED DATA
M. Hallo¹, S. Luján-Mora², J. Trujillo²
¹ National Polytechnic School, Faculty of System Engineering (ECUADOR)
² University of Alicante, Department of Software and Computing Systems (SPAIN)
Abstract
Traditionally, in most digital library environments, the discovery of resources takes place mostly through the harvesting and indexing of metadata content. Such search and retrieval services provide very effective ways for people to find items of interest, but lack the ability to lead users to potentially related resources or to support more complex queries. In contrast, modern web information management techniques related to the Semantic Web, a new form of the Web, encourage institutions, including libraries, to collect, link and share their data across the web in order to ease their processing by machines and humans, offering better queries and results and increasing the visibility and interoperability of the data.
Linked Data technologies enable connecting related data across the Web using the principles and recommendations set out by Tim Berners-Lee in 2006, resulting in the use of URIs (Uniform Resource Identifiers) as identifiers for objects and the use of RDF (Resource Description Framework) for link representation.
Today, libraries are giving increasing importance to the Semantic Web in a variety of ways, such as creating metadata models and publishing Linked Data from authority files, bibliographic catalogs, digital project information or crowdsourced information from other projects such as Wikipedia.
This paper reports a process for publishing library metadata on the Web using Linked Data technologies. The proposed process was applied to extract metadata from a university library, represent them in RDF format and publish them using a SPARQL endpoint (an interface to a knowledge database). The library metadata from a subject were linked to external sources, such as other libraries, and then related to the bibliography from course syllabi in order to discover missing subjects and new or out-of-date bibliography. In this process, the use of open standards facilitates the exploitation of knowledge from libraries.
Keywords: Linked Data, Semantic Web, Library Catalogs, RDF, Marc21.
1 INTRODUCTION
Libraries and other cultural institutions are experiencing a time of change. Metadata generated through the use of contemporary metadata standards and technical formats is mainly designed for human consumption rather than machine processing, failing to interoperate with external information providers [1]. One possible improvement is provided by the standards established by the World Wide Web Consortium (W3C) to build the Semantic Web, a new form of the Web, to increase the visibility and interoperability of the data. Linked Data, in particular, is an implementation of these standards useful for working with the metadata produced and maintained by libraries and other cultural institutions [2].
The term Linked Data refers to a set of best practices for publishing and interlinking structured data on
the Web in a human and machine readable way [3]. The Linked Data principles are:
• Use Uniform Resource Identifiers (URIs) as names for things.
• Use HTTP URIs so that people can look up those names.
• When someone looks up a URI, provide useful information, using common standards such as
RDF (Resource Description Framework) and SPARQL (RDF query language).
• Include links to other URIs so that they can help to discover more things.
The URIs are used to identify web resources. In addition, RDF is used for modelling and representation of information resources as structured data. In RDF, the fundamental unit of information is the subject-predicate-object triple. In each triple the "subject" denotes the source; the "object" denotes the target; and the "predicate" denotes a verb that relates the source to the target. However, using these technologies alone, it is not possible to give meaning to data.
The Semantic Web Stack (Architecture of the Semantic Web) includes two technologies: RDFS (RDF
Schema) and OWL (Web Ontology Language). RDFS is an extension of RDF that defines a
vocabulary for the description of entity-relationships [4]. RDFS provides metadata terms to create
hierarchies of entity types (referred to as “classes”) and to restrict the domain and range of predicates.
OWL is an extension of RDFS [5], which provides additional metadata terms for the description of
complex models, which are referred to as “ontologies”.
Some movements, such as LODLAM (Linked Open Data in Library, Archives and Museums, http://lodlam.net), are working on sharing knowledge, tools and expertise using Linked Data in libraries. Several national libraries, such as the British Library, the French National Library and the Spanish National Library, and university libraries, such as those of Michigan, Stanford and Cambridge, have published linked datasets of bibliographic data that they have created. The European Library is promoting Linked Open Data innovations in libraries across Europe [6].
This paper reports a process for publishing library metadata on the Web using Linked Data technologies. The source metadata are Marc21 records from libraries using the Koha software system; the process is based on best practices and recommendations from several authors. The proposed approach has been applied in a case study, the Electrical Engineering Library, in the context of the interuniversity project for publishing library bibliographic data using Linked Data technologies, funded by CEDIA ("Consorcio Ecuatoriano para el Desarrollo de Internet Avanzado"). The extracted metadata were published in RDF format at a SPARQL endpoint. The library metadata were linked to external sources, such as other libraries, and related to the course syllabi to discover missing subjects or out-of-date bibliography. In this process, the use of open standards facilitates the exploitation of knowledge from libraries.
Some existing vocabularies and ontologies are used, such as FOAF (Friend of a Friend), BIBO (Bibliographic Ontology), ORG (Organization Ontology) and DC (Dublin Core). In addition, the dataset created was linked to external data, providing information that goes far beyond the bibliographic data offered by traditional libraries, such as books used by similar educational institutions and faculties, books on similar subjects, authors of similar books, books cited in syllabi that should be replaced, etc.
2 LIBRARY METADATA
Zeng and Qin define four kinds of metadata standards used in the library profession [7]:
• Structures like the Dublin Core Metadata Element Set (DCMES).
• Content like the Anglo-American Cataloging Rules, Second Edition (AACR2).
• Values like the Library of Congress Subject Headings (LCSH).
• Exchange like the MARC 21 Format for Bibliographic Data (MARC 21).
The data structure standards "will normally specify the metadata elements that are included in the scheme by giving each of them a name and a definition" [1]. Content standards "specify how values for metadata elements are selected and represented". Zeng and Qin note that data value standards "include controlled term lists, classification schemes, thesauri, authority files, and lists of subject headings". Finally, the data exchange standards allow libraries to exchange metadata coherently [7].
“MARC21 was developed by the Library of Congress (LC) in the mid-1960s, primarily to enable the
computer production of catalog cards that could subsequently be distributed through the Cataloging
Distribution Service” [1].
In addition to metadata standards, the metadata itself falls generally into three categories: descriptive, administrative and structural [8]. "Traditional library cataloging viewed as metadata is primarily descriptive"; however, digital resources are more complex and require more than traditional description [7].
In many ways, libraries are traditional and continue to employ a variety of open or proprietary information models such as MARC21, as in the Koha integrated library system from which this work originates. In contrast, modern web information management techniques related to the Semantic Web encourage institutions, including libraries, to collect, link and share their data across the web in order to ease their processing by machines and humans and to get better queries and results. Linked Data technologies enable connecting related data across the Web using the principles and recommendations set out by Tim Berners-Lee in 2006 [9].
and linked data is based on the library interests (constructing vocabularies, describing properties of
resources, identifying resources, exchanging and aggregating metadata) that are driving the
development of Semantic Web technologies [10].
3 THE PROCESS FOR PUBLISHING LINKED DATA FROM MARC21 METADATA
Several approaches have been proposed to generate and publish linked data [11, 12], each one represented by activities and each activity composed of several tasks. Our approach uses six main activities: data source analysis, metadata extraction, modelling, RDF generation, linking and publishing. This process was also used to extract and publish scientific metadata from Open Journal Systems.
3.1 Data source analysis
The selected data sets were analyzed looking for attributes useful for answering the main queries. The
steps in this activity are:
a) Identification of the data source and the attributes of interest to be published and linked to other datasets.
In this study, we have chosen a dataset with bibliographic metadata from the Electrical Engineering Faculty of the National Polytechnic School; it will be extended to the integrated library metadata, considering the importance of the diffusion and interlinking of this information for students and teachers.
b) Data source study
The sample library uses the Koha system. Koha is a web-based integrated library system, with a MySQL database, which provides a simple and clear interface for library users to perform tasks such as searching for and reserving items and suggesting new items. Several national libraries use the Koha system. In order to gain better knowledge about the books and their relation to scientific curricula, the work focused on the bibliographic metadata stored in the Marc21 standard. The analyzed system has the bibliographic material shown in Table 1:
Table 1: Bibliographic material in the analyzed library
  CAT    Technical catalogs
  LIEE   Books specialized in Electrical and Electronic Engineering
  NTEC   Technical standards
  OLIT   Literary works
  PTEC   Institutional technical publications
  RELE   Electronic resources (CDs, DVDs)
  REV    Journals
  TIEE   Engineering theses
The data in a MARC bibliographic record is organized into variable fields, each identified by a three-character numeric tag that is stored in the directory entry for the field. The data fields, shown in Table 2, are grouped into blocks according to the first character of the tag, which, with some exceptions, identifies the function of the data within the record. The type of information in the field is identified by the remainder of the tag.
Table 2: Field types in Marc21 (Marc21 Format for Bibliographic Data: http://www.loc.gov/marc/bibliographic/)
  0XX   Control information, identification and classification numbers, etc.
  1XX   Main entries
  2XX   Titles and title paragraph (title, edition, imprint)
  3XX   Physical description, etc.
  4XX   Series statements
  5XX   Notes
  6XX   Subject access fields
  7XX   Added entries other than subject or series; linking fields
  8XX   Series added entries, holdings, etc.
  9XX   Reserved for local implementation
Within the 1XX, 4XX, 6XX, 7XX and 8XX blocks, certain parallels of content designation are usually preserved. The meanings, with some exceptions, are given by the final two characters of the field tag (see Table 3):
Table 3: Additional meanings of fields in Marc21
  X00   Personal names
  X10   Corporate names
  X11   Meeting names
  X30   Uniform titles
  X40   Bibliographic titles
  X50   Topical terms
  X51   Geographic names
These data, linked to other datasets, will give us better knowledge about other books by an author, books published on similar subjects, the authors who work on a subject, the origin of an author, etc. The metadata are stored in the biblio, biblioitem and itemtype tables.
Table 4: Some Marc21 fields used in the test
  MARC    Description
  003     Control Number Identifier
  020 a   International Standard Book Number
  041 a   Language code of text/sound track or separate title
  082 2   Edition number
  100 a   Personal name
  245 a   Title
  250 a   Edition statement
  260 a   Place of publication, distribution, etc.
  260 c   Date of publication, distribution, etc.
  856 u   Uniform Resource Identifier
c) Identification of the licensing and provenance information
There is general information about the licensing of the data sets. In our case the data set has a Creative Commons Attribution-ShareAlike 4.0 International license (https://creativecommons.org/licenses/).
Provenance information about a data item is information about the history of the item, including information about its origins. It is a measure of data quality.
In our case study, the provenance data are the name of the library (Electrical Engineering Library of the National Polytechnic School) and the initial load date: 01-09-2014.
3.2 Metadata extraction
In this activity the metadata are extracted from the original source and stored in an intermediate
database for cleaning. The tasks in this activity are:
a) Metadata extraction and storage
Metadata were extracted using the open source software Spoon (Pentaho Data Integration) and stored in a relational database. The data extracted were metadata for the entities: work (book, journal, etc.), format and language (expression), editions (manifestation), authors and organization.
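A minimal Python sketch of this extraction step (illustrative only: the real project used Spoon-Pentaho Data Integration, and here an in-memory SQLite table with one sample row stands in for Koha's MySQL biblio table):

    import sqlite3

    # Stand-in for Koha's MySQL catalogue; in production this would be a MySQL
    # connection, and further fields live in biblioitem and the Marc21 record.
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE biblio (biblionumber INTEGER, title TEXT, author TEXT)")
    src.execute("INSERT INTO biblio VALUES (1, 'Electric Machinery', 'Fitzgerald, A.')")  # sample row

    staging = sqlite3.connect("staging.db")
    staging.execute("CREATE TABLE IF NOT EXISTS work (biblionumber INTEGER, title TEXT, author TEXT)")

    # Copy only the attributes of interest into the intermediate cleaning database.
    rows = src.execute("SELECT biblionumber, title, author FROM biblio")
    staging.executemany("INSERT INTO work VALUES (?, ?, ?)", rows)
    staging.commit()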
b) Data cleaning
In the case analyzed we found missing data in several fields that could be filled from other open data sets.
An initial pre-processing of the data, applying data cleaning techniques, was performed. Spoon and Silk were used to build the catalogues of authors, works, expressions, manifestations and organizations.
3.3 Modelling
The goal of this activity is to design and implement a vocabulary for describing the data sources in
RDF.
The steps in this activity are:
a) Selection of vocabularies
The most important recommendation from several studies is to reuse available vocabularies as much as possible when developing the ontologies. An ontology represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts [13]. We use the following controlled vocabularies and ontologies for modelling works (books, articles), manifestations (formats, language), expressions (editions), authors and organizations:
• BIBO (The Bibliographic Ontology, http://bibliontology.com/) provides main concepts and properties for describing citations and bibliographic references (e.g. books, articles, etc.) on the Semantic Web using RDF.
• Dublin Core (http://dublincore.org/documents/dces/) is a set of terms that can be used to describe web resources as well as physical resources such as books. It consists of fifteen fields, e.g., creator, contributor, format, identifier, language, publisher, relation, rights, source, title, type, subject, coverage, description, and date. The full set of terms can be found in the DCMI Metadata Terms (http://dublincore.org/documents/dcmi-type-vocabulary/index.shtml). Dublin Core metadata may be used to provide interoperability in Semantic Web implementations, combining metadata vocabularies of different metadata standards.
• FOAF (Friend of a Friend, http://www.foaf-project.org/) is a machine-readable ontology describing persons, their activities and relations to other people and objects in RDF format.
• ORG (The Organization Ontology, http://www.w3.org/TR/vocab-org/) is an ontology for organizational structures, aimed at supporting linked data publishing of organizational information. It is designed to add a classification of organizations and roles, as well as extensions to support information such as organizational activities.
• FRBR (Functional Requirements for Bibliographic Records, http://www.ifla.org/publications/functional-requirements-for-bibliographic-records) is a conceptual model relating data from bibliographic records.
• SKOS (Simple Knowledge Organization System, http://semanticweb.com/introduction-to-skos_b33086) is an OWL ontology that provides a way to represent controlled vocabularies, taxonomies and thesauri.
b) Vocabulary development and documentation
The vocabulary was documented using Protégé (http://protege.stanford.edu/) and is shown in Table 5.
c) Vocabulary validation
Ontology validation is a key activity in different ontology engineering scenarios, such as development and selection, that is, assessing the quality and correctness of ontologies [14]. The generated vocabulary was validated with OOPS! (Ontology Pitfall Scanner, http://www.oeg-upm.net/oops).
d) Specify a license for the dataset
The license to publish the datasets is Creative Commons Attribution-ShareAlike 4.0 International (http://creativecommons.org/).
Table 5. Vocabularies and Namespaces
  ORG        http://www.w3.org/ns/org#
  FOAF       http://xmlns.com/foaf/0.1/
  DC         http://xmlns.com/dc/0.1/
  DCTERMS    http://purl.org/dc/terms/
  BIBO       http://purl.org/ontology/bibo/
  SKOS       http://www.w3.org/2004/02/skos/core#
  RDFS       http://www.w3.org/2000/01/rdf-schema#
  OWL        http://www.w3.org/2002/07/owl#
  FRBR-RDA   http://rdvocab.info/uri/schema/FRBRentitiesRDA/
3.4 RDF generation
The goal of this activity is to define a method and technologies to transform the source data into RDF and produce a set of mappings from the data sources to RDF. The tasks in this activity are:
a) Selection of development technologies for RDF generation.
For the case study, the Triplify tool (http://triplify.org/) with some modifications was used to perform the transformation from the intermediate relational database to RDF.
b) Mappings from data sources to RDF.
Mappings to RDF were defined from the intermediate database containing the metadata extracted from the source system.
c) Transformation of data.
The transformation process was run with the open source software Triplify 1.0, producing RDF triples stored in RDF/XML format. Fig. 1 shows part of this process.
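As an illustration of the Marc21-to-RDF mapping behind this step (a sketch only: it uses the pymarc Python library rather than Triplify, and the input file name and URI base are hypothetical), the title (245 a) and personal name (100 a) fields from Table 4 could be mapped to Dublin Core properties as follows:

    from pymarc import MARCReader
    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    BASE = "http://example.org/library/record/"  # hypothetical URI base

    g = Graph()
    with open("records.mrc", "rb") as fh:        # hypothetical Marc21 export file
        for i, record in enumerate(MARCReader(fh)):
            subject = URIRef(BASE + str(i))
            for field in record.get_fields("245"):      # 245 a: title
                for value in field.get_subfields("a"):
                    g.add((subject, DC.title, Literal(value)))
            for field in record.get_fields("100"):      # 100 a: personal name
                for value in field.get_subfields("a"):
                    g.add((subject, DC.creator, Literal(value)))

    g.serialize("library.rdf", format="xml")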
3.5 Interlinking
The objective of this activity is to improve the connectivity to external datasets enabling other
applications to discover additional data sources.
The tasks corresponding to this activity are:
a) Target datasets discovery and selection
For this task we used the Datahub website (http://datahub.io/) to find datasets useful for linking. We found several open linked datasets, such as Open Library (http://datahub.io/dataset/open-library), with book records useful to help in the cataloguing process, Europeana Linked Open Data (http://datahub.io/dataset/europeana-lod) and the Library of Congress Subject Headings (http://datahub.io/dataset/lcsh).
b) Linking to external datasets
The open source software Silk (http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/) was used to find relationships between data items of our dataset and the external datasets, generating the corresponding RDF links, which were stored in a separate dataset. Links to textbooks from the syllabi of the courses of the Electrical Engineering Faculty at the National Polytechnic School were also generated in order to discover missing subjects and new or out-of-date bibliography.
Fig. 1 Marc21 to RDF transformation
3.6 Publication
The goal of this activity is to make RDF datasets available on the Web to the users following the
Linked Data principles. The steps in this activity are:
a) Dataset and vocabulary publication on the web.
The generated triples were loaded into a SPARQL endpoint (a conformant SPARQL protocol service) based on OpenLink Virtuoso (http://virtuoso.openlinksw.com/), a database engine that combines the functionality of an RDBMS, virtual databases, RDF triple stores, an XML store, a web application server and file servers. On top of OpenLink Virtuoso, Elda (the linked-data API in Java, http://www.epimorphics.com/web/tools/elda.html) is used as a Linked Data interface to the RDF data. Fig. 2 shows a view of the SPARQL endpoint with a partial result of a query about an article on a test platform.
b) Metadata definition and publication
Metadata recommended for publishing Linked Data sets are: organization and/or agency, creation date, modification date, version, frequency of updates, and contact email address.
The metadata will be published on the Datahub site using DCAT (Data Catalog Vocabulary, http://www.w3.org/TR/vocab-dcat/), an RDF vocabulary designed to facilitate interoperability between data catalogues published on the Web.
Fig. 2 SPARQL endpoint query
The whole architecture used in this project is shown in Fig. 3.
Fig. 3 Open Linked Data architecture
4 CONCLUSIONS AND FUTURE WORK
In this paper we refine and use a process for publishing metadata from Koha library systems on the Web using Linked Data technologies. The process was based on best practices and recommendations from several studies, adding tasks and activities considered important during the project development. The process was applied to the transformation of metadata from a Koha library system to RDF. The source metadata are bibliographic records in Marc21 format. A preliminary mapping from Marc21 to RDF was made before the generation of RDF. For publishing we used OpenLink Virtuoso and Elda, which was tested as an interface to the SPARQL endpoint. In the future, we will work on using SKOS (Simple Knowledge Organization System) to link subjects and disciplines to other works in order to offer better queries to users. We are also analyzing the best way to validate the generated external links. Another line of future work is the alignment of the data model with the activities of the publication process.
ACKNOWLEDGEMENT
This research has been partially supported by the Prometeo project by SENESCYT, Ecuadorian
Government, by CEDIA (Consorcio Ecuatoriano para el Desarrollo de Internet Avanzado) supporting
the project: “Platform for publishing library bibliographic resources using Linked Data technologies”
and by the project GEODAS-BI (TIN2012-37493-C03-03) supported by the Ministry of Economy and
Competitiveness of Spain (MINECO).
REFERENCES
[1] Caplan, P. (2003). Metadata Fundamentals for All Librarians. Chicago: American Library Association.
[2] Baker, T., Bermès, E., Coyle, K., Dunsire, G., Isaac, A., Murray, P., Panzer, M., Schneider, J., Singer, R., Summers, E., Waites, W., Young, J. and Zeng, M. (2011). Library Linked Data Incubator Group Final Report. World Wide Web Consortium. Available at: www.w3.org/2005/Incubator/lld/XGR-lld-20111025 [Accessed September 18, 2012].
[3] Berners-Lee, T. (2006). Linked Data - Design Issues. Available at: http://www.w3.org/DesignIssues/LinkedData.html [Accessed September 15, 2014].
[4] Guha, R.V. and Brickley, D. (2004). RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation, W3C. Available at: http://www.w3.org/TR/2004/REC-rdf-schema-20040210/ [Accessed September 16, 2014].
[5] Hayes, P., Patel-Schneider, P.F. and Horrocks, I. (2004). OWL Web Ontology Language Semantics and Abstract Syntax. W3C Recommendation, W3C. Available at: http://www.w3.org/TR/2004/REC-owl-semantics-20040210/ [Accessed May 10, 2014].
[6] European Library (2013). Linked Open Data. Available at: http://www.theeuropeanlibrary.org/tel4/lod [Accessed September 2, 2014].
[7] Zeng, M.L. and Qin, J. (2008). Metadata. New York: Neal-Schuman.
[8] NISO (2004). Understanding Metadata. Available at: www.niso.org/publications/press/UnderstandingMetadata.pdf [Accessed September 18, 2012].
[9] Berners-Lee, T. (2006). Linked Data - Design Issues. Available at: http://www.w3.org/DesignIssues/LinkedData.html [Accessed September 15, 2014].
[10] Heery, R. (2004). Metadata Futures: Steps Toward Semantic Interoperability. In Metadata in Practice, edited by D.I. Hillman and E.L. Westbrooks, pp. 257-271. Chicago: American Library Association.
[11] Auer, S. and Lehmann, J. (2010). Making the Web a Data Washing Machine - Creating Knowledge Out of Interlinked Data. Semantic Web Journal. Available at: www.semantic-web-journal.net/content/new-submission-making-web-data-wash [Accessed September 18, 2012].
[12] Villazón-Terrazas, B., Vilches-Blázquez, L.M., Corcho, O. and Gómez-Pérez, A. (2011). Methodological guidelines for publishing government linked data. In Linking Government Data. Springer New York, pp. 27-49.
[13] Kim, J.A. and Choi, S.Y. (2007). Evaluation of Ontology Development Methodology with CMM-i. In Software Engineering Research, Management & Applications (SERA 2007), 5th ACIS International Conference, IEEE, pp. 823-827.
10. An Approach to Publish Statistics from Open-access Journals using Linked Data Technologies
The content of this chapter corresponds to the following article:
Hallo, M., Luján-Mora, S. and Trujillo, J. (2015). An Approach to Publish Statistics from Open-access Journals using Linked Data Technologies. Proceedings of the 9th International Technology, Education and Development Conference (INTED2015). Madrid, Spain, IATED, 5940-5948.
URI: https://library.iated.org/view/HALLO2015ANA.
Usage statistics of scientific publications need to be linked to external resources in order to provide better information for decision making. The statistics were modelled in a data mart to facilitate queries about accesses by different criteria. These data, linked to other datasets, give more information, such as the number of accesses by author, by origin of the visits, or by research topic, among others. The proposed process was applied by extracting statistical data from an open university journal and publishing them at a SPARQL endpoint. The RDF data cube vocabulary was used to model the multidimensional data. Visualization was carried out using the CubeViz software, which makes it possible to filter the information to be presented interactively in charts.
AN APPROACH TO PUBLISH STATISTICS FROM OPEN-ACCESS JOURNALS USING LINKED DATA TECHNOLOGIES
M. Hallo¹, S. Luján-Mora², J. Trujillo³
¹ National Polytechnic School (ECUADOR)
² Visiting teacher at the National Polytechnic School, University of Alicante (SPAIN)
³ University of Alicante (SPAIN)
Abstract
The Semantic Web encourages digital libraries, including open access journals, to collect, link and share their data across the web in order to ease their processing by machines and humans, enabling better queries and results. Linked Data technologies enable connecting structured data across the web using the principles and recommendations set out by Tim Berners-Lee in 2006.
Several universities develop knowledge, through scholarship and research, under open access policies and use several channels to disseminate information. Open access journals collect, preserve and publish scientific information in digital form using a peer review process. The evaluation of the usage of this kind of publication needs to be expressed in statistics and linked to external resources to give better information about the resources and their relationships. Statistics expressed in a data mart facilitate queries about the history of journal usage by several criteria. These data, linked to other datasets, give more information, such as the research topics, the origin of the authors, the relation to national plans, and the relations with study curricula.
This paper reports a process to publish an open access journal data mart on the web using Linked Data technologies in such a way that it can be linked to related datasets. Furthermore, methodological guidelines are presented with related activities. The proposed process was applied by extracting data from a university open journal system data mart and publishing it in a SPARQL endpoint using the open source edition of the software OpenLink Virtuoso. In this process the use of open standards facilitates the creation, development and exploitation of knowledge. The RDF data cube vocabulary has been used as a model to publish the multidimensional data on the web. The visualization was made using CubeViz, a faceted browser that filters observations to be presented interactively in charts. The proposed process helps to publish statistical datasets in an easy way.
Keywords: Linked Data, university institutional repositories, semantic web, statistical data, RDF data cube vocabulary, SDMX, data modeling, data transformation, knowledge management.
1 INTRODUCTION
Open access journals collect, preserve and publish scientific information in digital form related to a
particular subject. The development of Information and Communication Technologies (ICTs) has
increased the number of open access scientific journals in digital format, speeding up dissemination
and access to content [1].
A growing number of scholarly journals are using Open Journal Systems (OJS), an open source software platform specially designed to manage articles through author submission, peer review, editing and publication. This system provides the journal manager with the ability to extract year-by-year statistics about the history of the journal's usage, with data on submissions, editorial practices, and users grouped by editor and reviewer [2].
Statistical data from open journal systems are important for policy definition, planning and control, with a relevant impact on society. Libraries routinely collect statistics on digital collection use for assessment and evaluation purposes. Libraries then report those statistics to a variety of stakeholders. However, statistics from bibliographic data are dispersed, without relationships between resources and data sets, making their discovery and reuse by other information systems difficult. Moreover, there is no simple way for researchers, journalists and interested people to compare statistical data retrieved from different data stores on the web because of the lack of standardization.
To address these issues, we propose a process to publish a data mart (a part of a data warehouse) of statistical data from OJS following the Linked Data principles.
The proposed approach was developed based on best practices and recommendations from several authors [3, 4, 5] and tested with data from the electronic version of the journal "Revista Politécnica", edited by the National Polytechnic School of Quito, Ecuador. In addition, the dataset created was linked to external data, providing information that goes far beyond the bibliographic data offered by publishers, such as authors publishing papers on similar subjects with a high number of visits, organizations sponsoring research in specific subjects, statistical indicators below national standards, etc.
The term Linked Data refers to a set of best practices for publishing and interlinking structured data on
the web in a human and machine readable way [6].
The URI (Uniform Resource Identifier) is used to identify a web resource. In addition, RDF
(Resource Description Framework) is used for modeling and representation of information resources
as structured data. In RDF, the fundamental unit of information is the subject-predicate-object triple. In
each triple the “subject” denotes the source; the “object” denotes the target; and, the “predicate”
denotes a verb that relates the source to the target. Using a combination of URIs and RDF, it is
possible to give identity and structure to data. However, using these technologies alone, it is not
possible to give semantics to data.
The Semantic Web Stack (Architecture of the semantic web) includes two technologies: RDFS (RDF
Schema) and OWL (Web Ontology Language). RDFS is an extension of RDF that defines a
vocabulary for the description of entity-relationships [7]. RDFS provides metadata terms to create
hierarchies of entity types (referred to as “classes”) and to restrict the domain and range of predicates.
OWL is an extension of RDFS [8], which provides additional metadata terms for the description of
complex models, which are referred to as “ontologies”.
Some existing vocabularies and ontologies are used, such as FOAF (Friend of a friend), BIBO
(Bibliographic Ontology), ORG (Organization Ontology) and DC (Dublin Core).
The RDF data cube vocabulary has been proposed to describe statistical data organized in a multidimensional model, helping to publish, discover and link statistical data in a uniform way. The model is based on the ISO 17369:2013 standard [9].
There are a number of benefits to being able to publish multidimensional data, such as statistics, using RDF and Linked Data technologies. The W3C recommendation on the RDF data cube vocabulary presents the following benefits:
• The individual observations, and groups of observations, become (web) addressable. This allows publishers and third parties to annotate and link to this data, for example for fine-grained provenance trace-back.
• Data can be flexibly combined across datasets. The statistical data becomes an integral part of the web of linked data.
• For publishers who currently only offer static files, publishing as linked data offers a flexible, non-proprietary, machine-readable means of publication that supports an out-of-the-box web API for programmatic access.
• It enables reuse of standardized tools and components.
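As a minimal illustration of the vocabulary (a sketch only: the dataset URI, dimension properties and figures are hypothetical, and a complete cube additionally requires a qb:DataStructureDefinition), a single visits observation could be expressed with rdflib as:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    QB = Namespace("http://purl.org/linked-data/cube#")
    EX = Namespace("http://example.org/ojs-stats/")  # hypothetical base

    g = Graph()
    obs = EX["obs/2014-05-article41"]
    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, QB.dataSet, EX["dataset/visits"]))
    # Dimensions: which article, in which month, from which location.
    g.add((obs, EX.article, URIRef("http://www.revistapolitecnica.epn.edu.ec/ojs2/triplify/article/41")))
    g.add((obs, EX.refPeriod, Literal("2014-05", datatype=XSD.gYearMonth)))
    g.add((obs, EX.location, Literal("Quito")))
    # Measure: the number of visits observed.
    g.add((obs, EX.numVisits, Literal(127, datatype=XSD.integer)))

    print(g.serialize(format="turtle"))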
2 LINKED DATA PUBLICATION PROCESS FOR STATISTICAL DATA
The proposed process makes it possible to publish a statistical data mart from an OJS data mart in RDF format using common vocabularies such as the RDF data cube vocabulary.
Our approach proposes five main activities:
• Data source analysis.
• RDF data modeling.
• RDF generation.
• Linking.
• Publishing.
2.1 Data source analysis
In this activity the data mart for publication is selected and the licensing and provenance information is defined. We describe these steps below:
• Data source selection
In this step we chose a data mart to publish in RDF format, considering that linking it to other datasets will give us better knowledge.
A data mart is a subset of the data warehouse, usually oriented to a specific business topic, and is represented with the multidimensional (or cube) model, comprised of three basic components: dimensions, measures and attributes. In this case we have worked with a data mart about OJS article visits.
The open journal selected for analysis uses the open source OJS1 for the management of peer-reviewed academic journals. Several universities have adopted OJS, software that provides an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) endpoint.
The data mart used for the test of this proposal is shown in Fig. 1.
[Fig. 1 shows the OJS visits data mart: a star schema whose fact table VISIT_FACTS (measure: num_visits) is linked to the dimensions VISIT_TIME_DIM (Visit_Time_Id, Visit_Date, Visit_Year, Visit_Month), ARTICLE_DIM (Article_Id, Article_Title, Article_Language, Article_Academic_Discipline, Article_Subject_Classification), KEYWORDS (Keyword_Id, Keyword_Name, Keyword_Concept), AUTHORS (Author_Id, Author_Name, Author_Gender, Author_Rol, Author_Affiliation) and GEOGRAPHIC_LOCATION (Location_Id, Location_Name, Location_Type_Name, State_Name).]
Fig. 1. OJS visits data mart
The dataset used is stored in a MySQL database.
Linking this data to other datasets will give us better knowledge about similar published subjects, the authors who work on them, and the organizations sponsoring similar research.
• Identification of licensing and provenance information
There is general licensing information in the analyzed open journal: it is possible to get statistical information from the online journal and reproduce it citing the source.
Provenance information about a data item is information about the history of the item, including its origins. It is an indicator of the quality of the data.
In our case study, the provenance data are documented using terms of the PROV data model2. PROV defines a data model for building representations of the entities, people and processes involved in producing a piece of data or thing in the world.
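For instance, a minimal sketch of a PROV statement for the published data mart could look as follows (the source and agent URIs are hypothetical, not the ones used in the project):

@prefix prov: <http://www.w3.org/ns/prov#> .
# The published dataset is derived from the OJS data mart and attributed to the library
<http://opendata.epn.edu.ec/dc/ojsvisits/dataset/dataset-ojsvisits> a prov:Entity ;
    prov:wasDerivedFrom <http://example.org/source/ojs-mysql-datamart> ;
    prov:wasAttributedTo <http://example.org/agent/epn-library> .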
1 Open Journal System: https://pkp.sfu.ca/ojs/
2 PROV: http://www.w3.org/TR/prov-primer/
In the future the source data mart will be documented using the SDMX (Statistical Data and Metadata Exchange) standard, which defines a set of cross-domain concepts, code lists and categories in order to provide compatibility and interoperability across institutions.
2.2 RDF data modeling
The goal of this activity is to design and implement a vocabulary for describing the statistics datasets
in RDF. The steps in this activity are:
• Selection of vocabularies
The most important recommendation from several studies is to reuse available vocabularies as much as possible to develop the ontologies. An ontology represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts [10]. We use the following controlled vocabularies and ontologies for modelling statistical datasets in RDF:
• The RDF data cube vocabulary3 is a standard that provides a means to publish multi-dimensional data, such as statistics, on the web.
• BIBO4 (The Bibliographic Ontology) provides the main concepts and properties for describing citations and bibliographic references (e.g. books, articles, etc.) on the semantic web using RDF.
• Dublin Core5 is a set of terms that can be used to describe web resources as well as physical resources such as books. The full set of terms can be found in the DCMI Metadata Terms6 specification. Dublin Core metadata may be used to provide interoperability in semantic web implementations, combining metadata vocabularies of different metadata standards.
• FOAF7 (Friend of a Friend) is a machine-readable ontology describing persons, their activities and their relations to other people and objects in RDF format.
• ORG8 (Organization) is an ontology for organizational structures, aimed at supporting linked data publishing of organizational information. It is designed to allow the classification of organizations and roles, as well as extensions to support information such as organizational activities.
• SKOS9 (Simple Knowledge Organization System) is used for concepts and concept schemes.
The namespaces used are shown in Table 1.
• Vocabulary development and documentation
The vocabulary was documented using the Protégé ontology editor10.
Table 1. Vocabularies and namespaces

Vocabulary/Ontology   Namespace
QB                    http://purl.org/linked-data/cube#
ORG                   http://www.w3.org/ns/org#
FOAF                  http://xmlns.com/foaf/0.1/
DC                    http://xmlns.com/dc/0.1/
DCTERMS               http://purl.org/dc/terms/
BIBO                  http://purl.org/ontology/bibo/
SKOS                  http://www.w3.org/2004/02/skos/core#

3 RDF data cube vocabulary: http://www.w3.org/TR/vocab-data-cube/
4 The Bibliographic Ontology: http://bibliontology.com/
5 Dublin Core Metadata Element Set, version 1.1: http://dublincore.org/documents/dces/
6 DCMI Metadata Terms: http://dublincore.org/documents/dcmi-type-vocabulary/index.shtml
7 The Friend of a Friend (FOAF) project: http://www.foaf-project.org/
8 The Organization Ontology (ORG): http://www.w3.org/TR/vocab-org/
9 Simple Knowledge Organization System (SKOS): http://www.w3.org/2004/02/skos/
10 Protégé: http://protege.stanford.edu/
The reduced RDF data cube model used in this work is presented in Fig. 2.
Fig. 2. A reduced RDF data cube vocabulary
Following the RDF data cube vocabulary, we define the data model corresponding to the selected data mart with its dimension, measure and attribute components.
The reduced RDF data model developed is presented in Fig. 3.
Fig. 3. The reduced OJS RDF data model
Each concept is mapped with the corresponding concept of the multi-dimensional model, such as
dimension, measure, code list, etc.
The URI structure was defined as follows:
1. Schema components (dimensions, measures, and attributes) are identified by a URI of the form {Base_URI}/dc/cube_name/prop/{dimension_name | measure_name | attribute}.
2. Datasets are identified by {Base_URI}/dc/cube_name/dataset/{DatasetName}, and the dataset component specifications by {Base_URI}/dc/cube_name/dccs/{dimension_name | measure_name}.
3. Concepts and their values reused across multiple datasets are identified by {Base_URI}/concept/{ConceptName} and {Base_URI}/concept/{ConceptName}/{value}.
• Specify a license for the dataset
The license used to publish the RDF datasets is Creative Commons Attribution-ShareAlike 4.0 International11.
2.3 RDF generation
The goal of this activity is to define a method and technologies to transform the source data into RDF
and produce a set of mappings from the data sources to RDF. The tasks in this activity are:
• Selection of development technologies for RDF generation
For the case study, the Open Refine12 tool has been used to perform the transformation from the multidimensional model stored in a relational database to RDF.
• Mappings from data sources to RDF
Mappings were defined from the multidimensional database to RDF.
This step involves mapping the dataset's concepts to the RDF data cube elements, e.g., dimensions as qb:DimensionProperty, measures as qb:MeasureProperty or attributes as qb:AttributeProperty, and identifying the data (observations) as qb:Observation instances. Concepts within the datasets may be mapped to other concepts and code lists, providing compatibility and interoperability. The mappings are used to create the dataset's structure, the dataset itself and the observations, using the appropriate URI scheme for each type of resource. A default URI scheme has been designed as an input to this step to easily map the instances of the data cube vocabulary and the resources. The code lists that are used to give a value to any of the components are also defined using the SKOS vocabulary. The data are then exported as RDF in an RDF compliant serialization, such as RDF/XML, and validated. Fig. 4 shows part of the data cube generated.
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Dimension and measure properties of the cube
<http://opendata.epn.edu.ec/dc/ojsvisits/prop/articleDim> a qb:DimensionProperty ;
    rdfs:label "ArticleDim"@en .
<http://opendata.epn.edu.ec/dc/ojsvisits/prop/timeDim> a qb:DimensionProperty ;
    rdfs:label "TimeDim"@en .
<http://opendata.epn.edu.ec/dc/ojsvisits/prop/OJSvisitsMeasures> a qb:MeasureProperty ;
    rdfs:label "OJSVisits"@en .

# The dataset and its structure
<http://opendata.epn.edu.ec/dc/ojsvisits/dataset/dataset-ojsvisits> a qb:DataSet ;
    rdfs:comment "OJS Visits Data Set"@en ;
    a qb:DataStructureDefinition .

# Component specifications binding the dimensions to the structure
<http://opendata.epn.edu.ec/dc/dataset-ojsvisits/dccs/articleDim> a qb:ComponentSpecification ;
    qb:dimension <http://opendata.epn.edu.ec/dc/ojsvisits/prop/articleDim> .
<http://opendata.epn.edu.ec/dc/ojsvisits/dataset/dataset-ojsvisits>
    qb:component <http://opendata.epn.edu.ec/dc/dataset-ojsvisits/dccs/articleDim> .
<http://opendata.epn.edu.ec/dc/dataset-ojsvisits/dccs/timeDim> a qb:ComponentSpecification ;
    qb:dimension <http://opendata.epn.edu.ec/dc/ojsvisits/prop/timeDim> .
Fig. 4. Partial Turtle code generated for the data cube
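Fig. 4 shows the schema side of the cube. For completeness, the following is a hedged sketch of what a single observation could look like under the URI scheme defined above (the observation URI, the article concept and the values are hypothetical, not taken from the generated dataset):

@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# One data point: number of visits to one article in one month
<http://opendata.epn.edu.ec/dc/ojsvisits/obs/o1> a qb:Observation ;
    qb:dataSet <http://opendata.epn.edu.ec/dc/ojsvisits/dataset/dataset-ojsvisits> ;
    <http://opendata.epn.edu.ec/dc/ojsvisits/prop/articleDim> <http://opendata.epn.edu.ec/concept/Article/article-15> ;
    <http://opendata.epn.edu.ec/dc/ojsvisits/prop/timeDim> "2014-06"^^xsd:gYearMonth ;
    <http://opendata.epn.edu.ec/dc/ojsvisits/prop/OJSvisitsMeasures> 124 .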
11 Creative Commons: http://creativecommons.org/
12 Open Refine: http://openrefine.org/
• Transformation of data
The transformation process was run with the Open Refine software, producing RDF triples stored in RDF/XML format using the RDF data cube vocabulary.
2.4 Interlinking
The objective of this activity is to improve the connectivity to external datasets enabling other
applications to discover additional data sources.
The different versions of code lists coming from the same resource are interlinked with each other using the appropriate linking property, e.g. skos:exactMatch for concepts.
The tasks corresponding to this activity are:
• Target dataset discovery and selection
For this task we used the website the Datahub13 to find datasets useful for linking. We found several open linked statistics datasets from scientific journals.
• Linking to external datasets
The open source software Silk14 was used to find relationships between data items of our dataset and the external datasets, generating the corresponding RDF links, which were stored in a separate dataset. The code lists coming from the resources are interlinked with similar ones using the appropriate linking property, e.g. skos:exactMatch for concepts.
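As an illustration, the RDF links generated in this step are triples of the following kind (a hedged sketch: the keyword concept URI is hypothetical, while the DBpedia resource is a real one used only as an example target):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Link triple stored in the separate linkset dataset
<http://opendata.epn.edu.ec/concept/Keyword/semantic-web>
    skos:exactMatch <http://dbpedia.org/resource/Semantic_Web> .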
2.5 Publication
The goal of this activity is to make RDF datasets available on the web to the users following the Linked Data principles. The steps in this activity are:
• Dataset and vocabulary publication on the web
The generated triples were loaded into a SPARQL endpoint (a conformant SPARQL protocol service) based on OpenLink Virtuoso15, which is a database engine that combines the functionality of RDBMS, virtual databases, RDF triple stores, XML store, web application server and file servers. On top of OpenLink Virtuoso, CubeViz16 is used as a Linked Data interface to the RDF data cube [11]. Datasets may be further "announced" to the public, to be more discoverable, by publishing the data to international or national open data portals.
Fig. 5 shows a view of the SPARQL endpoint with a partial result of the query about the structure of
the OJSvisits data cube.
13 Datahub: http://datahub.io/
14 Silk, a link discovery framework for the web of data: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
15 Virtuoso Universal Server: http://virtuoso.openlinksw.com/
16 CubeViz: http://cubeviz.aksw.org/
Fig. 5. SPARQL endpoint query results about the structure of the OJSvisits data cube
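A query of this kind can be written directly against the qb vocabulary. The following is an illustrative sketch, not necessarily the exact query used, that lists the components of the cube and the properties they bind:

PREFIX qb: <http://purl.org/linked-data/cube#>
# Retrieve the component specifications of the OJSvisits data cube
SELECT ?component ?property
WHERE {
  <http://opendata.epn.edu.ec/dc/ojsvisits/dataset/dataset-ojsvisits> qb:component ?component .
  ?component (qb:dimension|qb:measure|qb:attribute) ?property .
}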
• Metadata definition and publication
The metadata about the dataset produced was published on the Datahub site using DCAT17 (Data Catalog Vocabulary), an RDF vocabulary designed to facilitate interoperability between data catalogues published on the web [12]. In addition, provenance data were added.
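A minimal sketch of such a DCAT description, assuming hypothetical title and keyword values, might be:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
# Catalogue record for the published data cube
<http://opendata.epn.edu.ec/dc/ojsvisits/dataset/dataset-ojsvisits> a dcat:Dataset ;
    dcterms:title "OJS visits data cube" ;
    dcterms:license <http://creativecommons.org/licenses/by-sa/4.0/> ;
    dcat:keyword "open access journals", "usage statistics" .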
The whole architecture used in this project is shown in Fig. 6.
Fig. 6. Architecture of statistical RDF publication
3 CONCLUSIONS AND FUTURE WORK
In this paper we analysed and used a process for publishing statistical scientific data from Open Journal Systems on the web using Linked Data technologies. The process was based on best practices and recommendations from several studies, adding tasks and activities considered important during the project development. The process was applied to the transformation of a data mart from "Revista Politécnica" to RDF. For publishing we used OpenLink Virtuoso and CubeViz. The RDF data cube vocabulary and the Open Refine software were successfully applied in the RDF generation process. The process could also be applied to data marts built from bibliographic metadata harvested through the OAI-PMH protocol.
In the future, to get statistical data from OJS we can restrict site- and article-level access through the user registration interface. The advantage of selecting these options is that anyone wanting to read the content will need to register, providing reliable readership statistics. In addition, the Logging and Auditing option enables logging of submission actions and user emails sent by the system. This is a very useful feature to make available to the readers. Furthermore, we will explore looking for similar properties and classes in open linked dataset catalogues to link project results.
17 Data Catalog Vocabulary (DCAT): http://www.w3.org/TR/vocab-dcat/
ACKNOWLEDGMENTS
This work has been partially supported by the Prometeo Project by SENESCYT, Ecuadorian
Government.
REFERENCES
[1] Harnad, S. (2009). Open access scientometrics and the UK Research Assessment Exercise. Scientometrics, 79(1), 147-156.
[2] Brian, D., & Willinsky, E. (2010). A survey of scholarly journals using Open Journal Systems. Scholarly and Research Communication, 1(2), 1-22.
[3] Salas, P. E. R., Mota, F. M. D., Martin, M., Auer, S., Breitman, K., & Casanova, M. A. (2012). Publishing statistical data on the web. In Proc. IEEE ICSC 2012 (Sept. 19th-21st, 2012), Palermo, Italy, 285-292.
[4] Ermilov, I., Martin, M., Lehmann, J., & Auer, S. (2013). Linked open data statistics: Collection and exploitation. In Knowledge Engineering and the Semantic Web, 242-249.
[5] Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., & Gómez-Pérez, A. (2011). Methodological guidelines for publishing government linked data. In Linking Government Data, 27-49.
[6] Berners-Lee, T. (2006). Linked Data - Design Issues. Available at: http://www.w3.org/DesignIssues/LinkedData.html. [Accessed Jan 15, 2015].
[7] Guha, R. V., & Brickley, D. (2004). RDF Vocabulary Description Language 1.0: RDF Schema. W3C Recommendation. Available at: http://www.w3.org/TR/2004/REC-rdf-schema-20040210/. [Accessed Jun 15, 2015].
[8] Hayes, P., Patel-Schneider, P. F., & Horrocks, I. (2004). OWL Web Ontology Language Semantics and Abstract Syntax. W3C Recommendation. Available at: http://www.w3.org/TR/2004/REC-owl-semantics-20040210/. [Accessed Jan 10, 2015].
[9] Cyganiak, R., Reynolds, D., & Tennison, J. (2014). The RDF Data Cube Vocabulary. W3C Recommendation. Available at: http://www.w3.org/TR/vocab-data-cube/. [Accessed Jan 2, 2015].
[10] Kim, J. A., & Choi, S. Y. (2007). Evaluation of ontology development methodology with CMM-i. In Software Engineering Research, Management & Applications (SERA 2007), 5th ACIS International Conference, 823-827.
[11] Mader, C., Martin, M., & Stadler, C. (2014). Facilitating the exploration and visualization of linked data. In Linked Open Data: Creating Knowledge Out of Interlinked Data, 90-107.
[12] Cyganiak, R., Maali, F., & Peristeras, V. (2010). Self-service linked government data with DCAT and Gridworks. In Proceedings of the 6th International Conference on Semantic Systems, 37-39.
11. Publishing a Scorecard for Evaluating the Use of Open-Access Journals Using Linked Data Technologies
The content of this chapter corresponds to the following article:
Hallo, M., Luján-Mora, S. and Maté, A. (2015). Publishing a Scorecard for Evaluating the Use of Open-Access Journals Using Linked Data Technologies. Proceedings of the 2015 International Conference on Computer, Information and Telecommunication Systems (CITS 2015). Gijón, Spain, IEEE, 105-109.
URI: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7297730.
Statistical data unrelated to institutional objectives are not very useful for a proper evaluation, so this article proposes an approach for structuring and publishing the data of a scorecard using Linked Data techniques, adding new activities to the process proposed in the article of Chapter 10.
Publishing a Scorecard for Evaluating the Use of Open-Access Journals Using Linked Data Technologies

María Hallo
Department of Computer Science
National Polytechnic School
Quito, Ecuador
[email protected]

Sergio Luján-Mora
Department of Software and Computing Systems
University of Alicante, Alicante, Spain
Visiting teacher at the National Polytechnic School
[email protected]

Alejandro Maté
Department of Software and Computing Systems
University of Alicante, Alicante, Spain
[email protected]
Abstract: Open access journals collect, preserve and publish scientific information in digital form, but it is still difficult not only for users but also for digital libraries to evaluate the usage and impact of this kind of publication. This problem can be tackled by introducing Key Performance Indicators (KPIs), allowing us to objectively measure the performance of the journals related to the objectives pursued. In addition, Linked Data technologies constitute an opportunity to enrich the information provided by KPIs, connecting them to relevant datasets across the web. This paper describes a process to develop and publish a scorecard on the semantic web based on the ISO 2789:2013 standard using Linked Data technologies in such a way that it can be linked to related datasets. Furthermore, methodological guidelines are presented with activities. The proposed process was applied to the open journal system of a university, including the definition of the KPIs linked to the institutional strategies, the extraction, cleaning and loading of data from the data sources into a data mart, the transformation of data into RDF (Resource Description Framework), and the publication of data by means of a SPARQL endpoint using the OpenLink Virtuoso application. Additionally, the RDF data cube vocabulary has been used to publish the multidimensional data on the web. The visualization was made using CubeViz, a faceted browser to present the KPIs in interactive charts.

Keywords: Linked Data, semantic web, RDF data cube vocabulary, knowledge management.

I. INTRODUCTION

Open access journals collect, preserve and publish scientific information related to a particular subject in digital form [1]. Open access (OA) is the free unrestricted online access to digital content. A growing number of scholarly journals are using Open Journal Systems (OJS), a software platform designed to manage articles through author submission, the peer review process, editing and publication [2]. While such a system fosters the publication process, little attention has been paid to analysing the impact of digital libraries (DL).

Libraries routinely collect statistics about the use of their digital collections for evaluation purposes. However, these statistics are dispersed, stored across data stores lacking a standard structure, and unrelated to the business objectives. As a result, it is difficult for researchers and users to compare statistical information, while for DL it becomes a challenge to develop policies, assess the impact of OJS in society, and share their discoveries.

In order to tackle this problem, this paper proposes a scorecard for evaluating and comparing digital libraries based on statistics suggested in the ISO 2789:2013 standard [3], as well as a technical architecture for publishing them based on Linked Data technologies. The proposed approach was developed based on best practices and recommendations from several authors [4, 5] and tested with data extracted from the journal Revista Politécnica1, edited by the National Polytechnic School of Quito (Ecuador). In addition, the dataset created was linked to external data providing information that goes far beyond the bibliographic data supplied by publishers, such as: number of papers in similar subjects, number of visits, statistical indicators below national standards, etc. The results of these evaluation strategies can have a number of significant implications for the continued development of digital libraries.

The remainder of this paper is structured as follows. Section II presents the background on Linked Data technologies. Section III describes the metrics used for evaluating DL. Section IV presents our proposal for defining and publishing a scorecard for the evaluation of DL. Finally, Section V describes the conclusions and sketches future work.

1 Revista Politécnica: http://www.revistapolitecnica.epn.edu.ec/
II. BACKGROUND

The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the web in a human and machine readable way [6]. It is based on the URI (Uniform Resource Identifier) and RDF (Resource Description Framework) specifications.

URI is used to identify a web resource, whereas RDF is used for modeling and representing information resources as structured data. In RDF, the fundamental unit of information is the subject-predicate-object triple.

Using a combination of URIs and RDF, it is possible to give identity and structure to data. However, using only these technologies, it is not possible to add semantics to data.

The Semantic Web Architecture includes two technologies: RDFS (RDF Schema) and OWL (Web Ontology Language). RDFS is an extension of RDF that defines a vocabulary for the description of entities and relationships [7]. OWL is an extension of RDFS [8], which provides additional metadata terms for the description of complex models, which are referred to as "ontologies".

For our work, some existing vocabularies and ontologies are used, such as FOAF (Friend of a Friend), BIBO (Bibliographic Ontology), ORG (Organization Ontology), and DC (Dublin Core). In addition to these standards, it is necessary to describe the KPIs in a multidimensional model in order to enable their analysis; with this purpose we use the RDF data cube vocabulary to publish, discover, and link statistical data organized in a multidimensional model.

Using these technologies we are able to publish scorecards as multidimensional data using RDF and Linked Data technologies, obtaining a number of advantages as described by the W3C recommendation [9]:

• The individual observations, and groups of observations, become (web) addressable. This allows publishers and third parties to annotate and link to this data.
• Statistical data can be combined across datasets.
• Publishing scorecards as Linked Data offers a flexible, non-proprietary, machine readable means of publication.
• It enables reuse of standardized tools and components.

III. EVALUATION OF THE USE OF DIGITAL LIBRARIES

The evaluation approaches, methods, and criteria vary among the existing DL evaluation studies [10, 11, 12, 13]. The majority of the studies adopt Information Retrieval (IR) evaluation approaches at a restricted level (either at the system or the user level) while employing traditional criteria, such as precision, search time, error rate, etc. Very few address the benefits of a DL on the user. Furthermore, there are few metrics devised specifically for this goal interlinked with external information.

A. Scorecards

A scorecard is a tool to monitor strategic objectives in a business. The Balanced Scorecard is one of the best-known corporate scorecards; it is used to help organizations align themselves with their strategic objectives [14].

B. Scorecards and libraries

Performance metrics and indicators should be related to the institutional and library mission and objectives [15]. However, analyzing a random sample of OJS from DOAJ2 (Directory of Open Access Journals), few of them publish their vision, mission, strategic objectives, or statistics.

2 DOAJ: http://www.doaj.org

A primary purpose of using library performance indicators is self-diagnosis, including comparisons within the same library over several years [16]. We focus our study mainly on this requirement using Linked Data technologies to allow future analysis based on interlinked indicators.

The ISO 2789:2013 standard defines statistics for "evaluation and comparison of libraries as well as for promoting, marketing and advocating the value that libraries provide for their population and for society". The objectives of the library statistics defined in the ISO 2789:2013 standard are summarized as follows:

• to monitor operating results against standards and data of similar organizations;
• to monitor trends over time;
• to provide a base for planning, decision making, improving service quality, and feedback of the results;
• to inform national and regional organizations in their support, funding and monitoring roles;
• to demonstrate the value of library services obtained by users, including the potential value to users in future generations.

For our work, we have developed a scorecard to: monitor use trends over time, make self-diagnosis, and use the results in marketing. The proposed model can be used as a strategic scorecard which can also be navigated. We have used a subset of indicators of the ISO 2789:2013 and ISO 11620:2014 [17] standards, for the use of electronic documents, based on interviews with librarians and local authorities, and on the data that it was possible to retrieve from the OJS records.

The indicators are: (i) number of visits, (ii) number of rejected accesses, (iii) number of downloads, (iv) number of internet accesses, (v) % external users, (vi) % of items not
used, (vii) user satisfaction, (viii) number of downloads by
document, (ix) number of digital documents stored, (x)
number of digital documents added. Along with these
indicators extracted from the standard, we have included
several dimensions of analysis that help in aggregating or
disaggregating the information at hand: (i) visit time, (ii)
article, (iii) author, (iv) geographic location, (v) keywords, (vi)
objective.
Linking this data to other datasets will give us better knowledge about: similar subjects, the authors who work on them, and the objectives accomplished related to national goals. However, in order to be able to link this data, we need to transform it into RDF.
IV. LINKED DATA PUBLICATION PROCESS FOR A SCORECARD

In order to publish and feed a scorecard from an OJS data mart transformed into RDF format we propose five main activities:
• Data source analysis.
• RDF data modeling.
• RDF generation.
• Linking.
• Publishing.

A. Data source analysis

In this initial activity, we analyzed the information provided by the OJS data source that could be useful for the proposed scorecard. This data source has the information about publications, which we needed to link with other datasets to gain better knowledge about the use of publications. First, we represented the OJS data source in the form of a multidimensional model, comprised of three basic components: dimensions, measures, and attributes. This allowed us to approach the data source as a data mart, a subset of the target data warehouse for DL evaluation. Data marts are usually oriented to specific business topics (the topic in this case would be publications), and they allow us to build specialized scorecards for each area.

The data mart obtained as a result of this activity for testing our proposal is shown in Fig. 1, and is implemented in MySQL.

Fig. 1. OJS use data mart

B. RDF data modeling

The goal of this activity is to design and implement the vocabularies for describing the datasets in RDF. The most important recommendation from several studies is to reuse available vocabularies as much as possible to develop the ontologies. An ontology represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts [18]. To this aim, we use the following controlled vocabularies and ontologies for modelling statistical datasets in RDF:
• RDF data cube vocabulary3 is a standard to publish multi-dimensional data, such as statistics, on the web.
• BIBO4 (The Bibliographic Ontology) provides concepts and properties for describing citations and bibliographic references on the semantic web using RDF.
• Dublin Core5 is a set of terms that is used to describe web resources as well as physical resources. Dublin Core Metadata may be used to provide interoperability in semantic web implementations.
• FOAF6 (Friend of a Friend) is an ontology describing persons, their activities and relations to other people and objects in RDF format.
• ORG7 (Organization) is an ontology for describing organizations, roles and organizational activities.
• SKOS8 (Simple Knowledge Organization System) is a standard for sharing and linking concepts and concept schemes.

The reduced RDF data cube model obtained as a result of this step is presented in Fig. 2. In this RDF model, each concept is mapped with the corresponding concept of the multi-dimensional model, such as dimension, measure, code list, etc.

3 RDF data cube vocabulary: http://www.w3.org/TR/vocab-data-cube/
4 The Bibliographic Ontology: http://bibliontology.com/
5 Dublin Core Metadata Element Set, version 1.1: http://dublincore.org/documents/dces/
6 The Friend of a Friend (FOAF) project: http://www.foaf-project.org/
7 The Organization Ontology (ORG): http://www.w3.org/TR/vocab-org/
8 Simple Knowledge Organization System (SKOS): http://www.w3.org/2004/02/skos/
Fig. 2. A reduced RDF data cube vocabulary

The URI structure was defined by:
• Schema components (dimensions, measures, and attributes), which are identified by: {Base_URI}/dc/cube_name/prop/{dimension_name | measure_name | attribute}.
• Datasets, which are identified by: {Base_URI}/dc/cube_name/dataset/{DatasetName}.
• Dataset components, which are specified by: {Base_URI}/dc/cube_name/dccs/{dimension_name | measure_name}.
• Concepts and their values reused across multiple datasets, which are identified by: {Base_URI}/concept/{ConceptName} and {Base_URI}/concept/{ConceptName}/{value}.

C. RDF generation

The goal of this activity is to define a method and technologies to transform the source data into RDF and produce a set of mappings from the data sources to RDF. For the case study we have used the Open Refine9 tool to perform the transformation from the multidimensional model stored in a relational database to the RDF data cube vocabulary.

Mappings were defined from the multidimensional database to RDF data cube elements, e.g., dimensions as qb:DimensionProperty, measures as qb:MeasureProperty or attributes as qb:AttributeProperty, with the identification of the data (observations) as qb:Observation instances. Concepts within the datasets may be mapped with other concepts and code lists (controlled vocabularies) providing compatibility and interoperability. The mappings are used to create the dataset's structure, the dataset itself and the observations, using the appropriate URI scheme for each type of resource [19]. The code lists that are used to give a value to each of the components are also defined using the SKOS vocabulary. The data are then exported as RDF in an RDF compliant serialization, such as RDF/XML.

D. Interlinking

The objective of this activity is to improve the connectivity to external datasets enabling other applications to discover additional data sources. For this task we perform two steps: (i) discovery, and (ii) linking.

Discovery comprises finding new target datasets. For this step we used the website the Datahub10. We found several open linked statistics datasets from scientific journals.

Linking allows us to relate external sources for additional information. For this step we used the open source software Silk11 to find relations between data items in our datasets and the external datasets, generating the corresponding RDF links, which were stored in a separate dataset.

E. Publishing

The goal of this activity is to make RDF datasets available on the web to the users, following the Linked Data principles. For this activity, we need an RDF server, usually in the form of a SPARQL endpoint. In our case the generated triples were loaded into a SPARQL endpoint (a conformant SPARQL protocol service) based on OpenLink Virtuoso12, which is a database engine that combines the functionality of RDBMS, virtual databases, RDF triple stores, XML store, web application server and file servers. On top of OpenLink Virtuoso, CubeViz13 is used as a Linked Data interface to the RDF data cube [20]. Datasets may be further "announced" to the public, to be more discoverable, by publishing the data to international or national open data portals. Fig. 3 shows a view of the SPARQL endpoint with a partial result of the query on the OJS visits data cube, giving the number of visits by subject and by article.

Fig. 3. Query example on the OJS visits data cube

The architecture used in this case is shown in Fig. 4.

9 Open Refine: http://openrefine.org/
10 Datahub: http://datahub.io/
11 Silk: http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
12 Virtuoso Universal Server: http://virtuoso.openlinksw.com/
13 CubeViz: http://cubeviz.aksw.org/
Fig. 4. Architecture of scorecard RDF publishing

V. CONCLUSIONS AND FUTURE WORK

In this paper we described a process for publishing a scorecard about the use of scientific data from Open Journal Systems on the web using the principles of Linked Data. The process is based on best practices and recommendations from several studies, adding tasks and activities considered important during the project. The process was applied to the development and transformation of a scorecard from Revista Politécnica into RDF using the RDF data cube vocabulary. For publishing we used the OpenLink Virtuoso, OntoWiki and CubeViz applications. The Open Refine software was applied for the RDF generation process. As a result, the developed process fulfilled the requirements of the study.

In the future, we will develop a user registration interface, to be accessed before downloading the articles, in order to get more data for analyzing and comparing search history data. Moreover, we will design metrics to evaluate the performance of the proposed process for the development of new scorecards oriented to other strategic objectives. Finally, we will look for the possibility of finding related open linked dataset catalogues to link project results.

ACKNOWLEDGMENTS

This work has been partially supported by the Prometeo Project by SENESCYT, Ecuadorian Government.

REFERENCES

[1] S. Harnad, "Open access scientometrics and the UK Research Assessment Exercise," Scientometrics, vol. 79(1), pp. 147-156, 2009.
[2] D. Brian and E. Willinsky, "A survey of scholarly journals using Open Journal Systems," Scholarly and Research Communication, vol. 1(2), pp. 1-22, 2010.
[3] ISO 2789:2013: Information and documentation - International library statistics, 2013.
[4] …, Springer Berlin Heidelberg, pp. 778-792, 2012.
[5] I. Ermilov, M. Martin, J. Lehmann, and S. Auer, "Linked open data statistics: Collection and exploitation," in Knowledge Engineering and the Semantic Web, pp. 242-249, 2013.
[6] C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee, "Linked data on the web," in Proceedings of the 17th International Conference on World Wide Web, ACM, pp. 1265-1266, 2008.
[7] D. Brickley and R. V. Guha, "RDF Vocabulary Description Language 1.0: RDF Schema," W3C Recommendation, 2004. Available at: http://www.w3.org/TR/2004/REC-rdf-schema-20040210. [Accessed Feb 15, 2015].
[8] P. Hayes, P. Patel-Schneider, and I. Horrocks, "OWL Web Ontology Language Semantics and Abstract Syntax," W3C Recommendation, 2004. Available at: http://www.w3.org/TR/2004/REC-owl-semantics-20040210/. [Accessed Feb 10, 2015].
[9] R. Cyganiak, D. Reynolds, and J. Tennison, "The RDF Data Cube Vocabulary," W3C Recommendation, 2014. Available at: http://www.w3.org/TR/vocab-data-cube/. [Accessed Feb 2, 2015].
[10] T. Reeves, X. Apedoe, and Y. Woo, "Evaluating digital libraries: A user friendly guide," University Corporation for Atmospheric Research, 2005. Available at: http://www.dpc.ucar.edu/projects/evalbook/EvaluatingDigitalLibraries.pdf. [Accessed Jan 15, 2015].
[11] Y. Zhang, "Developing a holistic model for digital library evaluation," J. Am. Soc. Inf. Sci. Technol., vol. 61(1), pp. 88-110, 2010.
[12] C. Klas et al., "A logging scheme for comparative digital library evaluation," in Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, pp. 267-278, 2006.
[13] L. Pinto, P. Ochôa, and M. Vinagre, "Integrated approach to the evaluation of digital libraries: an emerging strategy for managing resources, capabilities and results," in Library Statistics for the 21st Century World, pp. 273-288, 2009.
[14] R. Banker, H. Chang, and M. Pizzini, "The balanced scorecard: Judgmental effects of performance measures linked to strategy," The Accounting Review, vol. 79(1), pp. 1-23, 2004.
[15] …, in Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, pp. 102-115, 2012.
[16] L. Melo and C. Pires, "Performance evaluation of academic libraries: implementation model," in 17th Hellenic Conference of Academic Libraries, Ioannina, vol. 2, 2008.
[17] ISO 11620:2014: Information and documentation - Library performance indicators, pp. 1-99, 2014.
[18] D. Sánchez, M. Batet, D. Isern, and A. Valls, "Ontology-based semantic similarity: A new feature-based approach," Expert Systems with Applications, vol. 39(9), pp. 7718-7728, 2012.
[19] M. Hallo, S. Luján-Mora, and J. Trujillo, "Transforming library catalogs into Linked Data," in Proceedings of the 7th International Conference of Education, Research and Innovation (ICERI 2014), Seville, Spain, pp. 1845-1853, 2014.
[20] C. Mader, M. Martin, and C. Stadler, "Facilitating the exploration and visualization of linked data," in Linked Open Data: Creating Knowledge Out of Interlinked Data, pp. 90-107, 2014.
12. Evaluating Open Access Journals using Semantic Web technologies and scorecards
The content of this chapter corresponds to the following article:
Hallo, M., Luján-Mora, S. and Maté, A. (2015). Evaluating Open Access Journals using Semantic Web technologies and scorecards. Journal of Information Science. DOI: 10.1177/0165551515624353.
URI: http://jis.sagepub.com/content/early/2016/01/13/0165551515624353.abstract.
Reproduced by permission of SAGE Publications Ltd., London, Los Angeles,
New Delhi, Singapore and Washington DC. 2016
One alternative for providing better evaluations is to incrementally develop objective-driven scorecards, presented in multidimensional models with indicators that are interrelated thanks to the linking of information.
This article contains a broader description of the concepts, processes and tests for publishing a scorecard for evaluating the use of open access journals.
Article

Evaluating open access journals using Semantic Web technologies and scorecards

Journal of Information Science, 1-14, © The Author(s) 2016
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0165551515624353
jis.sagepub.com

Maria Hallo
National Polytechnic School, Ecuador

Sergio Luján-Mora
University of Alicante, Spain

Alejandro Maté
University of Alicante, Spain
Abstract
This paper describes a process to develop and publish a scorecard from an OAJ (Open Access Journal) on the Semantic Web using
Linked Data technologies in such a way that it can be linked to related datasets. Furthermore, methodological guidelines are presented
with activities related to each step of the process. The proposed process was applied to a university OAJ, including the definition of
the KPIs (Key Performance Indicators) linked to the institutional strategies, the extraction, cleaning and loading of data from the data
sources into a data mart, the transformation of data into RDF (Resource Description Framework), and the publication of data by
means of a SPARQL endpoint using the Virtuoso software. Additionally, the RDF data cube vocabulary has been used to publish the
multidimensional data on the Web. The visualization was made using CubeViz, a faceted browser to present the KPIs in interactive
charts.
Keywords
Knowledge management; linked data; Semantic Web; RDF data cube vocabulary
1. Introduction
OA (Open Access) is the free unrestricted online access to digital content. OAJ (Open Access Journals) are scholarly
journals that are available online to the reader ‘without financial, legal, or technical barriers other than those inseparable
from gaining access to the internet itself’ [1]. A classification suggested by Suber [2] for the OA content based on the
rights that authors keep to disseminate their work is summarized as follows:
• Gold Open Access is adopted by peer-reviewed journals, making the published version freely available from the publisher's server without any other rights or permissions being granted.
• Green Open Access allows authors to self-archive in repositories with the consent of the journal or publishers. These repositories are discipline-specific or institutional.
• Pale Green Open Access allows authors to archive preprints.
• Grey Open Access allows authors to make their work accessible on institutional or personal websites.
In another proposal [3], OAJ are classified as traditional, pure open access and hybrid:
Corresponding author:
María Hallo, National Polytechnic School, Isabel la Católica E11-253, Quito, Ecuador, PO-Box 17-01-2759, Ecuador.
Email: [email protected]
• Traditional subscription-based journals charge annual subscription fees and deliver their content to subscribers only.
• Pure open access journals make all articles available for free online immediately on publication.
• Hybrid journals are subscription journals which offer an option for immediate open access for individual articles. The authors have the option to pay to provide OA to everybody.
A growing number of scholarly journals are using OJS (Open Journal Systems), a software platform designed to manage articles through author submission, the peer review process, editing and publication [4, 5]. While such a system fosters the publication process, little attention has been paid to evaluating the use of OAJ.
OAJ routinely collect statistics about the use of their digital collections for evaluation purposes. However, these statistics are dispersed, stored across repository files lacking a standard structure and unrelated to the business objectives. As a result, it is difficult for researchers and users to compare statistical information, while for OAJ it becomes a challenge to develop policies, assess the impact of their use in society and share their discoveries.
In order to tackle this problem, this paper proposes a scorecard, a tool to monitor strategic objectives in a business [6],
for evaluating and comparing OAJ based on statistics suggested in the ISO 2789:2013 standard [7], as well as a technical
architecture for publishing them based on Linked Data technologies. The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web in a human and machine readable way [8].
The proposed approach for evaluating the use of OAJ using Linked Data technologies was developed based on best
practices and recommendations from several authors [9, 10] and tested with a case study based on the journal Revista
Politécnica,1 in the context of an inter-university project for publishing library bibliographic data using Linked Data technologies, developed by the National Polytechnic School of Quito (Ecuador) and other universities. Revista Politécnica is
a scientific OAJ whose data were used to demonstrate the value of the proposed linked open data analytics approach.
The dataset contained metadata of scientific articles published in 2014 under an open license. In addition, the dataset
created was linked to external data providing information that goes far beyond the bibliographic data supplied by publishers, such as number of papers in similar subjects, number of visits, statistical indicators below national standards,
etc. The results of these evaluation strategies can have a number of significant implications for the continued development and improvement of OAJ.
The remainder of this paper is structured as follows. Section 2 presents the background on Linked Data technologies
describing the principles and more important characteristics of vocabularies and formats. Section 3 describes scorecards
and the metrics used for evaluating OAJ. Section 4 presents our proposal for defining and publishing a scorecard for the
evaluation of OAJ using RDF formats. Finally, Section 5 describes the conclusions and sketches future works.
2. Background
Below we present some concepts used in this work: Linked Data, URIs, RDF, OWL and multidimensional data models.
2.1. Linked Data
The term Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web in a
human and machine readable way [8]. It is based on the URI (Uniform Resource Identifier)2 and RDF (Resource
Description Framework) specifications.3 The Linked Data principles are:
• Use URIs as names for things.
• Use HTTP URIs so that people can look up those names.
• When someone looks up a URI, provide useful information, using controlled vocabulary terms and common standards such as RDF (Resource Description Framework) and SPARQL (the RDF query language).
• Include links to other URIs so that they can help to discover related data.
2.2. Naming things with URIs
In Linked Data, the items in a domain of interest and their relations are identified by HTTP URIs. An HTTP URI should
be de-referenceable, helping clients to retrieve a description of the resource that is identified by the URI. The document
Cool URIs for the Semantic Web2, presented by the W3C Interest Group, describes strategies to make URIs de-referenceable.
2.3. RDF data model
RDF is used for publishing Linked Data on the Web, modelling and representing information resources as structured
data. In RDF, the fundamental unit of information is the triple (subject, predicate, object), a type of sentence that represents a simple fact about a resource.
Figure 1 shows graphically the structure of a RDF triple and two examples of RDF triples with the creator and title of
an article:
• http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf, dc:creator, 'Tim Berners-Lee';
• http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf, dc:title, 'Linked Data - The Story So Far';
• dc: is an abbreviation for http://purl.org/dc/elements/1.1/, which means that dc:creator and dc:title are labels defined at this http address.
In each triple, the ‘subject’ denotes the resource being described and it is represented by a URI. The ‘predicate’
denotes a property of the subject or a relation between the subject, and the object. The predicate is generally a term from
a well-known vocabulary or ontology represented by a URI. The ‘object’ denotes the value of a property or another
resource which is the target of the relation. Objects can be literals or URIs.
Triples can be represented in different formats. For example, Figure 2 describes the triples from the Figure 1 in RDF/
XML, a syntax defined by W3C to express an RDF graph as an XML document.4
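The same two triples can also be written in the Turtle serialization; the following is a minimal sketch equivalent to the RDF/XML of Figure 2:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
# Two statements about the same subject, grouped with a semicolon
<http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf>
    dc:creator "Tim Berners-Lee" ;
    dc:title "Linked Data - The Story So Far" .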
Using a combination of URIs and RDF, it is possible to give identity and structure to data. However, using only these
technologies, it is not possible to add semantics to data. Ontologies are used to provide semantics to data. An ontology
represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts.
2.4. RDFS and OWL
The Semantic Web Architecture includes two technologies: RDFS (RDF Schema) and OWL (Web Ontology Language).
RDFS is an extension of RDF that defines a vocabulary for the description of entities and relationships.5 RDFS describes
subclass hierarchies and property hierarchies. Some elements of the RDFS vocabulary are defined in Table 1, such as rdfs:Class, rdfs:Resource and rdfs:subClassOf. In addition, RDFS adopts a property-centric approach: the properties are
defined in terms of the classes of resources to which they apply using rdfs:range and rdfs:domain which are instances of
rdf:Property. This schema allows anyone to extend the description of existing resources. For example, we could define
an eg vocabulary with eg:coauthor property, the domain eg:article and the range eg:person. Afterwards, anyone could
define additional properties with the same domain and range using this RDF property-centric approach.
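Following the eg example in the text, a hedged Turtle sketch of such a property definition (the eg namespace URI is hypothetical) could be:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg: <http://example.org/terms#> .
# eg:coauthor applies to articles and takes persons as values
eg:coauthor a rdf:Property ;
    rdfs:domain eg:article ;
    rdfs:range eg:person .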
Figure 1. Examples of RDF triples.
Table 1. Core classes and properties of the RDF Schema vocabulary

Element           Comment
rdfs:Resource     The class resource, everything
rdfs:Class        The class of classes
rdfs:Literal      The class of literal values
rdf:Property      The class of RDF properties
rdfs:Datatype     The class of RDF datatypes
rdf:type          Instance of rdf:Property used to state that a resource is an instance of a class
rdfs:subClassOf   Property used to state that the instances of one class are instances of another
rdfs:range        A range of the subject property
rdfs:domain       A domain of the subject property
rdfs:label        A human-readable version of a resource's name
rdfs:comment      A human-readable description of a resource
Figure 2. RDF/XML document.
Figure 3. An example of RDF/XML document using RDFS class elements.
Figure 4. An example of RDF/XML document using the rdfs:range and rdfs:domain elements.
In Figure 3, we present an example of RDF/XML document stating that an article is a class and a subclass of a document class.
In Figure 4, we have an example of a RDF/XML document stating that author is a property of article and takes literals
as values.
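Since the original figures are images, a Turtle sketch equivalent to what Figures 3 and 4 state (with hypothetical eg names) may help the reader:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix eg: <http://example.org/terms#> .
# Figure 3: an article is a class and a subclass of a document class
eg:article a rdfs:Class ;
    rdfs:subClassOf eg:document .
# Figure 4: author is a property of article and takes literals as values
eg:author rdfs:domain eg:article ;
    rdfs:range rdfs:Literal .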
OWL is an extension of RDFS5 [8], in the sense that it uses classes and properties while providing additional metadata terms for the description of ontologies, giving efficient reasoning support. Ontologies are formalized vocabularies of terms
covering a specific domain and shared by a community of users. A set of concepts (e.g. entities, attributes and processes), their definitions and their inter-relationships [11] are defined in ontologies. They are primarily exchanged as RDF documents and can be used along with information written in RDF. The first version of the OWL Web Ontology Language was released in 2004 as a recommendation from the W3C OWL working group. A new version of the OWL6 language was published in 2012. In the latest version, any OWL 2 ontology can also be viewed as an RDF graph.
2.5. Multidimensional data model and RDF

In addition to the standards described in the previous sections, it is necessary to describe the KPIs (Key Performance Indicators), measurable values that demonstrate how effectively a company is achieving key business objectives [12], in a multidimensional data model in order to enable their analysis. With this purpose we use the RDF Data Cube vocabulary to publish, discover and link statistical data. The multidimensional data model has dimensions and facts or measures. A dimension represents a business perspective under which data analysis is to be performed and is organized in a hierarchy of levels, which correspond to different ways to group its elements of analysis. The relational implementation of the multidimensional data model is typically a star schema or a snowflake schema [13, 14]. A star schema is a convention for organizing the data into dimension and fact tables. A snowflake schema is a variation of the star schema; snowflaking is a form of dimensional modelling in which dimensions are stored in multiple related dimension tables.
Using these technologies we are able to publish scorecards implemented in a multidimensional data model using RDF and Linked Data technologies, obtaining a number of advantages as described by the W3C recommendation:

• The individual observations, and groups of observations, become (Web) addressable. This allows publishers and third parties to annotate and link to this data.
• Statistical data can be combined across datasets.
• Publishing scorecards as Linked Data offers a flexible, non-proprietary, machine readable means of publication.
• It enables reuse of standardized tools and components.
For our work, some existing vocabularies and ontologies are used, such as:

• RDF data cube vocabulary7 is a standard to publish multidimensional data on the Web in such a way that it can be linked to related data sets and concepts. The current version of the RDF vocabulary does not enable the aggregation of data from different granularity levels along a dimension hierarchy. This vocabulary defines:
  - datasets, representing the container of some data;
  - dimensions, meaning some analysis criteria (for example a time period, location, etc.);
  - measures, representing a piece of data (e.g. a cell in a table), a KPI; and
  - attributes, expressing characteristics of dimensions.
• Dublin Core8 is a set of terms that is used to describe web resources as well as physical resources. Dublin Core Metadata may be used to provide interoperability in Semantic Web implementations. Some terms of this vocabulary are dc:identifier, dc:title, dc:creator and dc:subject.
• BIBO9 (The Bibliographic Ontology) provides concepts and properties for describing bibliographic resources and relations on the Semantic Web using RDF. Terms of this ontology are academic article, book and proceedings, object properties such as dc:title, dc:creator and rdf:about, and data properties such as bibo:edition, bibo:issue and bibo:volume.
• FOAF10 (Friend of a Friend) is an ontology describing persons, their activities and relations to other people and objects in RDF format. Some terms in the FOAF vocabulary are foaf:name, foaf:homepage, foaf:person and foaf:familyName.
• ORG11 (Organization) is the ontology for describing organizations, roles and organizational activities. Some terms from this ontology are org:organization, org:agent, org:event and org:site.
• SKOS12 (Simple Knowledge Organization System) is a standard for sharing and linking concepts and concept schemes. Some terms from this ontology are skos:concept, skos:collection, skos:semanticRelation, skos:mappingRelation, skos:closeMatch, skos:member and skos:topConceptOf.
• VoID13 (Vocabulary of Interlinked Datasets) allows expressing metadata about RDF datasets. VoID covers four areas of metadata:
  - General metadata follows the Dublin Core model. Examples of terms are dcterms:title for the name of the dataset and dcterms:license, to point to the license under which a dataset has been published.
  - Access metadata describes how RDF data can be accessed using various protocols. An example of access metadata is void:sparqlEndpoint.
  - Structural metadata describes the structure and schema of datasets and is useful for tasks such as querying and data integration. VoID also provides a number of properties for expressing numeric statistics about a dataset, such as the number of RDF triples it contains, or the number of entities it describes.
  - Linksets metadata describes links between datasets; it is helpful for understanding how multiple datasets are related and can be used together. An example of this kind of metadata is void:Linkset.

Table 2. OAJ scorecard fragment

Goal                  Objective                      KPI                     Actual value        Target value
Make self-diagnosis.  To monitor trends over time.   number of users/month   1000 users/month    10,000 users/month
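To make the VoID description above concrete, the following is a minimal hedged sketch (the dataset URI, endpoint and values are hypothetical):

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
# General, access and structural metadata for a dataset
<http://example.org/dataset/oaj-usage> a void:Dataset ;
    dcterms:title "OAJ usage statistics" ;
    dcterms:license <http://creativecommons.org/licenses/by-sa/4.0/> ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:triples 12500 .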
3. Evaluation of the use of open access journals
The evaluation approaches, methods and criteria vary among the existing digital library (DL) evaluation studies [15–18]. A DL is a collection of information stored in digital formats and accessible by computers [19]. OAJ are a type of specialized DL. The majority of the studies adopt information retrieval evaluation approaches at a restricted level (either the system or the user level) while employing traditional criteria, such as precision, search time and error rate. Very few evaluate the benefits of an OAJ to the user. Furthermore, there are few metrics devised specifically for this goal that are interlinked with external information. For these reasons, scorecards are an ideal candidate for covering these deficiencies.
3.1. Scorecards
A scorecard is a tool to monitor progress toward a corporate goal in a business. The Balanced Scorecard is one of the best-known corporate scorecards; it is used to help organizations align their activities with their strategic objectives [20]. The overall strategic goals are broken down into a series of objectives that enable the organization to meet its strategic goals. Each of these objectives is associated with one or more KPIs, so progress towards each objective can be measured. KPIs are business metrics used to evaluate factors that are crucial to the success of an organization. In order to use KPIs, measures for the actual value, the target value and the variance should be defined.
Table 2 shows a fragment of an OAJ scorecard. In this example, the OAJ has the following goal: make self-diagnosis. An objective is to monitor trends over time. This goal requires monitoring performance against targets by defining KPIs, such as the number of users, and managing them through a scorecard. In the example, the managers expect to increase the actual number of visits per month from 1,000 to 10,000 within a year.

Table 2. OAJ scorecard fragment
Goal: Make self-diagnosis.
Objective: To monitor trends over time.
KPI: number of users/month
Actual value: 1,000 users/month
Target value: 10,000 users/month
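The scorecard structure just described can be captured in a few lines of code. The following Python sketch is a minimal illustration, assuming our own field names and defining the variance simply as the gap between the target and the actual value.

from dataclasses import dataclass

@dataclass
class KPI:
    goal: str
    objective: str
    name: str
    actual: float  # measured value for the current period
    target: float  # value the organization wants to reach

    @property
    def variance(self) -> float:
        # Gap still to be closed in order to meet the target.
        return self.target - self.actual

# The scorecard fragment of Table 2 expressed with this structure.
visits = KPI(
    goal="Make self-diagnosis.",
    objective="To monitor trends over time.",
    name="number of users/month",
    actual=1000,
    target=10000,
)
print(visits.variance)  # 9000 users/month still needed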
3.2. Scorecards and OAJ
Performance metrics and indicators should be related to the institutional and OAJ mission and objectives [20]. However, an analysis of a random sample of OJS-based journals from DOAJ14 (Directory of Open Access Journals) shows that few of them publish their vision, mission, strategic objectives or statistics.
The ISO 2789:2013 [7] standard defines statistics for ‘evaluation and comparison of libraries as well as for promoting, marketing and advocating the value that libraries provide for their population and for society’. The objectives of the
library statistics defined in the ISO 2789:2013 standard are summarized as follows:
• to monitor operating results against standards and data of similar organizations;
• to monitor trends over time;
• to provide a base for planning, decision making, improving service quality and feedback of the results;
• to inform national and regional organizations in their support, funding and monitoring roles; and
• to demonstrate the value of library services obtained by users, including the potential value to users in future generations.
For our work, we have developed a scorecard to:
• monitor use trends over time;
• make self-diagnosis; and
• use the results in marketing.
Another related standard used is ISO 11620:2014 [21]. This standard specifies the requirements of a performance
indicator for libraries and establishes a set of indicators to be used by libraries of all types. This international standard
offers accepted, tested and publicly accessible methodologies and approaches to measure a range of library services.
Performance indicators can be used for comparison over time within the same library.
A primary purpose of using library performance indicators is self-diagnosis, including comparisons within the same library across several years [22]. We focus our study mainly on this requirement, using Linked Data technologies to allow future analyses based on interlinked indicators.
The proposed model can be used as a strategic scorecard which can also be navigated. We have used a subset of indicators of the ISO 2789:2013 and ISO 11620:2014 standards for the use of electronic documents, selected based on interviews with librarians and local authorities and on the data that could be retrieved from the OJS records.
The selected indicators are:
I1: number of virtual visits (the number of virtual visits to the library website, regardless of the number of pages or elements viewed, during the reporting period).
I2: number of rejected accesses (the total number of unsuccessful requests of licensed electronic services provided by the library, due to exceeding the simultaneous user limit).
I3: number of downloads (the total number of successful content unit downloads requested from a library-provided online service).
I4: percentage of external users (the percentage of the library's total accesses originating from countries other than the library's own country).
I5: percentage of documents not used (percentage of documents not accessed).
I6: user satisfaction (the average rating by users of the library services).
I7: number of digital documents stored.
I8: number of digital documents added.
Along with these indicators extracted from the standard, we have included several dimensions of analysis that help in
aggregating or disaggregating the information at hand:
D1: time (analysis time).
D2: article (published article).
D3: author (article author).
D4: geographic location (visiting geographic location).
D5: keyword (keywords defined in the articles).
D6: objective (OAJ strategic objectives).
The indicators I1–I5 can be analysed for all the dimensions. The indicators I6–I8 can be analysed for all the dimensions except D4 (geographic location) and D2 (article). A monthly granularity is defined for all the measures.
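To show how one of these indicators aggregates along a dimension, the short Python sketch below computes I4 (percentage of external users) from visit counts broken down by the geographic location dimension (D4); the counts and the home country are invented sample values.

# Monthly visit counts broken down by geographic location (invented sample data).
visits_by_country = {"EC": 640, "ES": 210, "US": 150}
home_country = "EC"  # assumption: the library's own country

total = sum(visits_by_country.values())
external = total - visits_by_country.get(home_country, 0)
i4 = 100.0 * external / total  # I4: percentage of external users
print(f"I4 = {i4:.1f}% external visits")  # prints: I4 = 36.0% external visits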
4. Linked Data publication process for a scorecard
In order to publish and feed a scorecard from an OAJ data mart transformed into RDF format we propose five main
steps executed interactively as shown in Figure 5.
In Sections 4.1–4.5, we describe each step of the proposed process, from data source identification and analysis to publishing on a SPARQL platform (SPARQL endpoint).
4.1. Data source analysis
In this initial step, we analyse the information provided by the OAJ data source that could be useful for the proposed scorecard. This data source contains the information about publications, which we needed to link with other datasets to give us better knowledge about the use of publications. First, we represented the OAJ data source in the form of a multidimensional data model, comprising three basic components: dimensions, measures and attributes. This allowed us to approach the data source as a data mart, a subset of the target data warehouse for OAJ evaluation.

Figure 5. Linked Data publication process.

Figure 6. OAJ visits snowflake schema.
Data marts are usually oriented to specific business topics (the topic in this case would be publications), and they
allow us to build specialized scorecards for each area. The data mart is implemented in a multidimensional data model.
The relational representation of the resulting multidimensional data model is a star schema or a snowflake schema. A star
schema presents the data in dimension tables and fact tables. The snowflake schema is a type of star schema in which
the dimension tables are partly or fully normalized. In Figure 6 we present a snowflake schema corresponding to the
OAJ data mart. The fact table contains the KPIs or measures, and the dimension tables contain the criteria of analysis (time, article, objective, geographical location and author). Dependency relationships are drawn from the dimensions to the fact table; a dependency is a directed relationship showing that some elements depend on other model elements.
Linking this data to other datasets will give us better knowledge about similar subjects, the authors who work on them, and the accomplished objectives related to national goals. However, in order to be able to link this data, we need to transform it into RDF.
4.2. RDF data modelling
The goal of this step is to design and implement the vocabularies for describing the datasets in RDF. The most important recommendation from several studies [23, 24] is to reuse available vocabularies as much as possible to develop new ontologies [25]. With this aim, we use the controlled vocabularies and ontologies described in Section 2 for modelling statistical datasets in RDF: the RDF Data Cube vocabulary, BIBO, Dublin Core, FOAF, ORG, SKOS and VoID.
The reduced RDF data cube model obtained as a result of this step is presented in Figure 7. In this RDF model, each concept is mapped to the corresponding concept of the Data Cube vocabulary, such as dimension, measure or code list. In RDF, each resource is identified by a URI. URIs enable interaction with the Web using specific protocols. The URI structure for our proposal was defined as follows (a code sketch of this scheme follows the list):
• Datasets are identified by {base_URI}/dc/{cube_name}/dataset/{datasetName}. For example, the ojsvisits dataset is represented by http://opendata.epn.edu.ec/dc/ojs/dataset/ojsvisits. A dataset is a collection of statistical data corresponding to a defined structure.
• The data structure definition defines the structure of one dataset by referencing a set of component specifications; it defines the dimensions, attributes and measures. It is identified by {base_URI}/dc/{cube_name}/dsd/{dataStructureName}. For example, http://opendata.epn.edu.ec/dc/ojs/dsd/dsd-ojs.
• The dataset components stand for dimensions, measures and attributes, represented as RDF properties in the Data Cube vocabulary and specified by {base_URI}/dc/{cube_name}/prop/{dimension_name|measure_name}. For example, the article dimension is represented by http://opendata.epn.edu.ec/dc/ojs/prop/article.
• Concepts and their values reused across multiple datasets are identified by {base_URI}/concept/{conceptName} and {base_URI}/concept/{conceptName}/{value}. For example, http://opendata.epn.edu.ec/concept/physics.
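The URI scheme above can be captured in a few helper functions. The Python sketch below is our own illustration: the function names are assumptions, and only the URI templates themselves come from the scheme described in the list.

BASE = "http://opendata.epn.edu.ec"  # the base_URI used in the examples above

def dataset_uri(cube: str, name: str) -> str:
    # {base_URI}/dc/{cube_name}/dataset/{datasetName}
    return f"{BASE}/dc/{cube}/dataset/{name}"

def dsd_uri(cube: str, name: str) -> str:
    # {base_URI}/dc/{cube_name}/dsd/{dataStructureName}
    return f"{BASE}/dc/{cube}/dsd/{name}"

def prop_uri(cube: str, component: str) -> str:
    # {base_URI}/dc/{cube_name}/prop/{dimension_name|measure_name}
    return f"{BASE}/dc/{cube}/prop/{component}"

def concept_uri(name: str, value: str = "") -> str:
    # {base_URI}/concept/{conceptName} and {base_URI}/concept/{conceptName}/{value}
    return f"{BASE}/concept/{name}/{value}" if value else f"{BASE}/concept/{name}"

print(dataset_uri("ojs", "ojsvisits"))  # .../dc/ojs/dataset/ojsvisits
print(prop_uri("ojs", "article"))       # .../dc/ojs/prop/article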
4.3. RDF generation
The goal of this activity is to define a method and technologies to transform the source data into RDF and to produce a set of mappings from the data sources to RDF. For the case study we used the Open Refine15 tool to perform the transformation from the multidimensional model stored in a relational database to the RDF Data Cube vocabulary.
Mappings were defined from the multidimensional database to RDF Data Cube elements: for example, dimensions as qb:DimensionProperty, measures as qb:MeasureProperty, attributes as qb:AttributeProperty, and the data items (observations) as qb:Observation instances. Concepts within the datasets may be mapped to other concepts and code lists (controlled vocabularies), providing compatibility and interoperability. The mappings are used to create the dataset's structure, the dataset itself and the observations, using the appropriate URI scheme for each type of resource [26, 27]. The code lists that are used to give a value to each of the components are also defined using the SKOS vocabulary. The data are then exported as RDF in an RDF-compliant serialization, such as RDF/XML, as shown in Figure 8.
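In our case these mappings were defined in Open Refine's RDF extension; the Python sketch below reproduces the same idea for a single fact-table row, with invented column names and values, and exports the result in RDF/XML as in Figure 8.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
OJS = Namespace("http://opendata.epn.edu.ec/dc/ojs/")

# One row of the fact table (invented sample values and column names).
row = {"month": "2015-01", "article_id": "rp-042", "number_of_visits": 57}

g = Graph()
g.bind("qb", QB)

obs = URIRef(OJS[f"obs/{row['month']}/{row['article_id']}"])
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, OJS["dataset/ojsvisits"]))
# Dimensions map to qb:DimensionProperty instances, measures to qb:MeasureProperty.
g.add((obs, OJS["prop/time"], Literal(row["month"], datatype=XSD.gYearMonth)))
g.add((obs, OJS["prop/article"], OJS[f"article/{row['article_id']}"]))
g.add((obs, OJS["prop/numberOfVisits"], Literal(row["number_of_visits"], datatype=XSD.integer)))

print(g.serialize(format="xml"))  # RDF/XML serialization, as in Figure 8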
4.4. Interlinking
The objective of this step is to improve the connectivity to external datasets, enabling other applications to discover additional data sources. For this task we perform two steps: discovery and linking.
Discovery comprises finding new target datasets. For this step we used the website 'the Datahub'.16 We found the DOAJ directory and several open linked datasets from scientific journals. Moreover, we found statistics from several countries related to business, organizations and research topics. We will focus the analysis on the most visited articles, looking for links to similar topics in datasets like DBpedia, to increase the information about authors, research networks, organizations sponsoring similar works, research articles on similar topics, etc.

Figure 7. Outline of the RDF Data Cube vocabulary.

Figure 8. Partial result of the RDF/XML code generated.
Linking allows us to relate external sources for additional information. For this step we used the open source software Silk17 to find relations between data items in our datasets and the external datasets, generating the corresponding RDF links, which were stored in a separate dataset. This data will help us to develop new interrelated KPIs; for example, if we have the number of visits by country and extend it with the number of students by country from another dataset, we can obtain the number of visits per student by country. In addition, we could link the keywords from the articles to keywords from research funding institutions to obtain information about visits to articles by funding institution, and so on.
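Silk expresses such link conditions declaratively. As a rough illustration of the underlying idea only, the Python sketch below generates owl:sameAs links by exact label match between a local dataset and an external one; Silk automates this with far more flexible similarity metrics, and all the resources here are invented.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

# Invented sample resources: local authors and candidate entities from an external dataset.
local = {"Jane Roe": URIRef("http://opendata.epn.edu.ec/concept/author/jane-roe")}
external = {"Jane Roe": URIRef("http://dbpedia.org/resource/Jane_Roe")}

links = Graph()
links.bind("owl", OWL)
for label, subject in local.items():
    target = external.get(label)  # naive exact-match link condition
    if target is not None:
        links.add((subject, OWL.sameAs, target))  # links kept in a separate dataset

print(links.serialize(format="turtle"))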
4.5. Publishing
The goal of this activity is to make RDF datasets available on the Web to users, following the Linked Data principles.
For this activity, we need an RDF server, usually in the form of a SPARQL endpoint. In our case the generated triples
were loaded into a SPARQL endpoint (a conformant SPARQL protocol service) based on OpenLink Virtuoso,18 which is
a database engine that combines the functionality of RDBMS, virtual databases, RDF triple stores, XML store, web application server and file server. On top of OpenLink Virtuoso, the CubeViz19 and OntoWiki20 components are used as a Linked Data interface to datasets complying with the RDF Data Cube vocabulary [28]. Datasets may be further 'announced' to the
public, to be more discoverable, by publishing the data to international or national open data portals. Figure 9 shows a view of the SPARQL endpoint with a partial result of the query on the OJS visits data cube, giving the number of visits by subject and by article. It will be possible in the future to annotate the relationships on the Web and to add more links, increasing the knowledge and the possibility of posing more complex queries.
In Figure 9, the y-axis presents the names of the published articles in Spanish and the x-axis contains the number of visits. Table 3 presents the translation of the labels from Spanish to English. In this graphic we can see the measure (number of visits) and two dimensions of analysis (disciplines and articles).

Figure 9. Query result example on the OJS visits data cube.

Table 3. Translation of axis values in Figure 9
Spanish: Aplicaciones de Procesamiento de Lenguaje Natural
English: Applications of Natural Language Processing
Spanish: Estudio de la generación de hidrocarburos marcadores del proceso de irradiación en carne de cerdo
English: Study of hydrocarbon generation markers of the irradiation process in pork
Spanish: Desarrollo de Modelos Digitales para la Dosimetría de Cobalto-60 de la Escuela Politécnica Nacional
English: Development of Digital Models for Cobalt-60 Dosimetry of the National Polytechnic School
Spanish: Evaluación experimental del problema de flujo no divisible de costo mínimo con única fuente mediante la aplicación de algoritmos genéticos
English: Experimental evaluation of the single-source non-divisible minimum-cost flow problem by applying genetic algorithms
Spanish: Algoritmos
English: Algorithms
Spanish: Procesamiento de lenguaje natural
English: Natural Language Processing
Spanish: Radiaciones
English: Radiation
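A query like the one behind Figure 9 can also be posed programmatically against the endpoint. In the Python sketch below (using SPARQLWrapper), the endpoint URL and the property URIs are assumptions consistent with the URI scheme of Section 4.2, not values confirmed by the deployed service.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://opendata.epn.edu.ec/sparql")  # assumed endpoint location
sparql.setQuery("""
PREFIX qb:   <http://purl.org/linked-data/cube#>
PREFIX prop: <http://opendata.epn.edu.ec/dc/ojs/prop/>

SELECT ?article ?subject (SUM(?visits) AS ?totalVisits)
WHERE {
  ?obs a qb:Observation ;
       prop:article ?article ;
       prop:subject ?subject ;
       prop:numberOfVisits ?visits .
}
GROUP BY ?article ?subject
ORDER BY DESC(?totalVisits)
""")
sparql.setReturnFormat(JSON)

# Print the number of visits by subject and by article, as in Figure 9.
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["article"]["value"], b["subject"]["value"], b["totalVisits"]["value"])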
5. Technical architecture
The architecture used in this proposal is shown in Figure 10. Spoon software was used to extract metadata from the OAJ; Open Refine was used to transform data from the snowflake schema into RDF triples using the RDF Data Cube vocabulary. The generated triples were stored in OpenLink Virtuoso and visualized using CubeViz. Users can also access the RDF data using the SPARQL language through the Virtuoso software. Silk was used to discover related datasets for linking the generated RDF triples.

Figure 10. Architecture for scorecard RDF publishing.
6. Conclusions and future work
In this paper, we have described a process for evaluating the use of scientific data from OAJ on the Web using scorecards
and the principles of Linked Data. The process is based on best practices and recommendations from several studies, adding tasks and activities considered important during the project. The process begins with the scorecard development, the
transformation into a multidimensional model and afterwards into RDF using the RDF Data Cube vocabulary. For publishing the RDF multidimensional model we used the OpenLink Virtuoso, OntoWiki and CubeViz applications. The Open Refine software was applied for the RDF generation process. In order to get better KPIs, the proposal also allows us to reuse existing information already published in RDF format. Traditional evaluation methods, such as the proposal of Project COUNTER or the ISO 2789:2013 and ISO 11620:2014 standards, do not offer the possibility of automatically linking indicators and analysis features to external data. The power of linking measures with Linked Data goes far beyond hyperlinks, giving the possibility to annotate and reference statistical data and the nature of the relationships on the Web. In addition, it is possible to dynamically add more links to new resources. By providing context to the connection, it creates knowledge, because the link itself is knowledge. The proposed model can help us find new things inferred from the stored triples. As a result, the developed process fulfilled the requirements of the study.
In the future, we plan to develop a user registration interface, to be accessed before downloading the articles, in order
to get more data for analysing and comparing search history data. Moreover, we will design metrics to evaluate the performance of the proposed process for the development of new scorecards oriented to other strategic objectives.
Furthermore, we will develop and look for related open linked dataset catalogues to link project results, enhancing the associated information and creating new interlinked KPIs. Finally, we are planning to develop a recommender system linking information to datasets from OAJs.
Funding
This research was supported by the National Polytechnic School of Quito, Ecuador. Alejandro Maté is funded by the Generalitat
Valenciana (APOSTD/2014/064).
Notes
1. Revista Politécnica: http://www.revistapolitecnica.epn.edu.ec/. Revista Politécnica is a scientific journal from National
Polytechnic School.
2. W3C. Cool URIs for the Semantic Web, http://www.w3.org/TR/2008/NOTE-cooluris-20081203/
3. W3C. RDF Resource Description Framework, http://www.w3.org/RDF/
4. W3C. RDF/XML Syntax Specification, http://www.w3.org/TR/REC-rdf-syntax/ (2004)
5. W3C. RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, http://www.w3.org/TR/2004/REC-rdf-schema-20040210/
6. W3C. OWL 2 Web Ontology Language Document Overview (second edition), http://www.w3.org/TR/owl2-overview/
7. RDF data cube vocabulary, http://www.w3.org/TR/vocab-data-cube/
8. Dublin Core Metadata Element Set, version 1.1, http://dublincore.org/documents/dces/
9. The Bibliographic Ontology, http://bibliontology.com/
10. The Friend of a Friend (FOAF) project, http://www.foaf-project.org/
11. The Organization Ontology (ORG), http://www.w3.org/TR/vocab-org/
12. Simple Knowledge Organization System (SKOS), http://www.w3.org/2004/02/skos/
13. Vocabulary of Interlinked Datasets (VoID), http://www.w3.org/TR/void/
14. DOAJ, http://www.doaj.org. DOAJ is an online directory that indexes open access peer-reviewed journals.
15. Open Refine, http://openrefine.org/
16. Datahub, http://datahub.io/. Datahub is a free data management platform used to publish RDF datasets.
17. Silk, http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
18. Virtuoso Universal Server, http://virtuoso.openlinksw.com/
19. CubeViz, http://cubeviz.aksw.org/
20. Ontowiki, http://aksw.org/Projects/OntoWiki.html
References
[1] Harnad S. Open access scientometrics and the UK Research Assessment Exercise. Scientometrics 2009; 79(1): 147–156.
[2] Suber P. Open access. London: The MIT Press Essential Knowledge Series, 2012.
[3] The University of Sheffield. The University Library: Open Access Key Concepts, www.sheffield.ac.uk/library/openaccess/concepts#JournalTypes (2012, accessed 10 July 2015).
[4] Brian D and Willinsky E. A survey of scholarly journals using Open Journal Systems. Scholarly and Research Communication 2010; 1(2): 1–22.
[5] Hallo M, Luján-Mora S, Maté A and Trujillo J. Current state of Linked Data in digital libraries. Journal of Information Science, Epub ahead of print 21 July 2015; DOI: 10.1177/0165551515594729.
[6] Poll R and Payne P. Impact measures for libraries and information services. Library Hi Tech 2006; 24(4): 547–562.
[7] ISO 2789:2013. Information and documentation – International library statistics.
[8] Bizer C, Heath T, Idehen K and Berners-Lee T. Linked Data on the Web. In: Proceedings of the 17th international conference on World Wide Web, 2008, pp. 1265–1266.
[9] Banker R, Chang H and Pizzini M. The balanced scorecard: Judgmental effects of performance measures linked to strategy. The Accounting Review 2004; 79(1): 1–23.
[10] Ermilov I et al. Linked open data statistics: Collection and exploitation. Communications in Computer and Information Science 2013; 394: 242–249.
[11] Sánchez D, Batet M, Isern D and Valls A. Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications 2012; 39(9): 7718–7728.
[12] Setijono D and Dahlgaard J. Customer value as a key performance indicator (KPI) and a key improvement indicator (KII). Measuring Business Excellence 2007; 11(2): 44–61.
[13] Luján-Mora S, Trujillo J and Song I. Multidimensional modeling with UML package diagrams. In: Proceedings of the 21st international conference on conceptual modeling (ER 2002). Lecture Notes in Computer Science, Vol. 2503. Berlin: Springer, 2002, pp. 199–213.
[14] Luján-Mora S, Trujillo J and Song I. Extending the UML for multidimensional modelling. In: Proceedings of the 5th international conference on the Unified Modeling Language (UML 2002). Lecture Notes in Computer Science, Vol. 2460. Berlin: Springer, 2002, pp. 290–304.
[15] Reeves T, Apedoe X and Woo Y. Evaluating digital libraries: A user friendly guide. University Corporation for Atmospheric Research, www.dpc.ucar.edu/projects/evalbook/EvaluatingDigitalLibraries.pdf (2005, accessed 15 January 2015).
[16] Zhang Y. Developing a holistic model for digital library evaluation. Journal of the Association for Information Science and Technology 2010; 61(1): 88–110.
[17] Klas CP, Albrechtsen H, Fuhr N, Hansen P, Kapidakis S, Kovacs L and Jacob E. A logging scheme for comparative digital library evaluation. Research and Advanced Technology for Digital Libraries 2006; 267–278.
[18] Pinto L, Ochôa P and Vinagre M. Integrated approach to the evaluation of digital libraries: An emerging strategy for managing resources, capabilities and results. Library Statistics for the 21st Century World 2009; 273–288.
[19] Heradio R, Fernández-Amorós D, Cabrerizo F and Herrera-Viedma E. A review of quality evaluation of digital libraries based on users' perceptions. Journal of Information Science 2012; 38(3): 269–283.
[20] Maté A, Trujillo J and Mylopoulos J. Conceptualizing and specifying key performance indicators in business strategy models. In: Proceedings of the 2012 conference of the Center for Advanced Studies on Collaborative Research, 2012, pp. 102–115.
[21] ISO 11620:2014. Information and documentation – Library performance indicators.
[22] Melo L and Pires C. Performance evaluation of academic libraries: Implementation model. Paper presented at the 17th Hellenic conference of academic libraries, 24–26 September 2008, Ioannina, Greece.
[23] Pesch O. Implementing SUSHI and COUNTER: A primer for librarians. The Serials Librarian 2015; 69(2): 107–125.
[24] Uschold M and Gruninger M. Ontologies: Principles, methods and applications. Knowledge Engineering Review 1996; 11(2): 93–126.
[25] Keith A, Cyganiak R, Hausenblas M and Zhao J. Describing linked datasets. In: Proceedings of the Linked Data on the Web workshop (LDOW2009), 2009.
[26] Hallo M, Luján-Mora S and Trujillo J. Transforming library catalogs into Linked Data. In: Proceedings of the 7th international conference of education, research and innovation, 2014, pp. 1845–1853.
[27] Candela G, Escobar P, Marco-Such M and Carrasco R. Transformation of a library catalogue into RDA linked open data. Research and Advanced Technology for Digital Libraries 2015, pp. 321–325.
[28] Mader C, Martin M and Stadler C. Facilitating the exploration and visualization of linked data. In: Auer S et al. (eds), Linked open data – creating knowledge out of interlinked data. Lecture Notes in Computer Science, Vol. 8661. London: Springer, 2014, pp. 90–107.
5. Conclusions and future work
The works that make up this thesis were developed in the period 2013–2016. The results of the research have been published in international conferences and in indexed journals with a JCR impact factor.
The most important results achieved in this work are:
1. The development of a framework (processes and a technical architecture) for the generation and publication of metadata of bibliographic records using Linked Data technologies.
2. The development of a framework (processes and a technical architecture) for the evaluation of the use of digital resources using scorecards and Linked Data technologies.
The research results have been applied to the Revista Politécnica, a scientific journal published in print and digital form by the Escuela Politécnica Nacional of Quito (Ecuador), generating metadata sets in RDF and publishing them at the SPARQL endpoint associated with the website. In addition, scorecard datasets with usage indicators of the Revista Politécnica have been generated in RDF. These datasets, together with their metadata, have been published on datahub.io. The developed framework has been applied successfully.
For the future, we propose working on the integration of recommender systems using Linked Data, developing or reusing the necessary ontologies. Another area of future work comprises the development and refinement of guidelines for the elaboration of objective-driven scorecards, linking them to indicators external to the organization.