How to integrate - ICS

Semantic Integration of Data
Yannis Tzitzikas
Computer Science Department, University of Crete, GREECE
&
Information Systems Laboratory (ISL)
Institute of Computer Science (ICS)
Foundation for Research and Technology – Hellas (FORTH)
ITN-DCH summer school 2016 (@CGI'16), Heraklion, Crete, Greece, June 27, 2016
Outline
• Motivation
• Requirements
• Case Study: Marine Species Data
• Challenges and Conclusions
Time plan: 25’ presentation, 5’ questions and discussion
Slides: will be publicly available after the school
Yannis Tzitzikas, ITN-DCH summer school 2016 (@CGI'16)
Motivation
Huge amounts of data are available and this amount constantly increases.
Almost everyone produces data (and everything will produce data).
Almost everyone needs data (and everything will need data).
Motivation
However, data and information are not integrated.
Hundreds or thousands of CKAN catalogs exist, each containing hundreds or thousands of datasets.
Motivation
In several domains and applications one has to fetch and assemble pieces of information coming from more than one source in order to answer complex queries (that are not answerable by individual sources) or to analyze the integrated data. This is important for science but also for our daily life.
This is true in many domains:
• Biodiversity domain
• Cultural domain
• E-Government
• Science in general
• …
• Personal data
50%-90% of time for data collection and cleaning
It has been written that:
• Data scientists spend from 50 percent to 80 percent of their time in collecting and preparing unruly digital data, before it can be explored for useful nuggets.
• If you’re trying to reconcile a lot of sources of data that you don’t control it can take 80% of your time
• One-Third of BI Pros Spend Up to 90% of Time Cleaning Data
Indicative Complex Queries
Thunnus Albacares
El Greco
Marine Domain
• Given the scientific name of a species (say Thunnus albacares), find the ecosystems, water areas and countries that this species is native to, the common names that are used for this species in each of these countries, and their commercial codes.
Cultural Domain
• Give me all paintings of El Greco that are now exhibited in Greece and their location, as well as all articles or books about these paintings published between 2000 and 2016.
• Give me the paintings of El Greco referring to persons that were born between 0 and 300 BC.
• Give me all events related to El Greco that will take place this month in Heraklion.
Why is integration difficult?
• Datasets are kept or produced by different organizations in different formats, models, locations and systems.
• The same real-world entities or relationships are referred to by different names and in different natural languages (natural languages have synonyms and homonyms).
• Datasets usually contain complementary information.
• Datasets can contain erroneous or contradicting information.
• Datasets about the same domain may follow different conceptualizations of the domain.
… names
Thunnus Albacares
348 common names in 82 different languages!
ed Pa'ak Pukeu Sisek kuneng Geelvin-tuna Geelvin-tuna Tuna Tambakol Gubad Jaydher Kababa Shak zoor Tuna sirip kuning Rambayan Tambakol Bangkulis Bankulis
Bronsehan Buyo Kikyawon Paranganon Manguro O'maguro Tag-hu Taguw Taguw peras Taguw tangir Barelis Bariles Barilis Carao Karaw Pak-an Pala-pala Panit Panitto Pirit
Tulingan Kacho Bariles Karaw Panit M'Bassi Mbasi bankudi Mibassi mibankundri Thon a nageoires jaunes Thon jaune Ton zonn Z'ailes jaunes Albacora Atum olede Chefarote
Rabo-seco Gulfinnet tun Gulfinnet tunfisk Bariles Bugo Karaw Geelvintonijn 'Fin Albacore Allison tuna Allison tuna Allison tuna Allison's tuna Atlantic yellowfin tuna Autumn
albacore Long fin tunny Longfin Pacific long-tailed tuna Tuna Tuna Yellow fin tuna Yellow tunny Yellow-fin tuna Yellow-fin tuna Yellow-fin tuna Yellow-fin tunny Yellowfin
Yellowfin Yellowfin tuna Yellowfinned albacore Kulduim-tuun Gegu Tuna Yatu Yatunitoga Keltaevatonnikala Albacore Gegu Grand fouet Guegou Thon Thon a nageoires jaunes
Thon jaune Thon rouge Atu igu mera Albacore Gelbflossen-Thunfisch Gelbflossenthun Tonnos macrypteros Gedar Gedara Ahi Kahauli Kanana Maha'o Palaha Shibi Bantalaan Panit Oriles Tambakul Tonno albacora Tonno monaco Tunnu monicu Tiklaw Vahuyo Kihada Panit Baewe Baibo Baiura Te baewe Te baibo Te bairera Te baitaba Te
ingamea Te ingimea Te inginea Bokado Olwol Malaguno Tambakol Gantarangang Lamatra Aya Aya tuna Bakulan Gelang kawung Kayu Tongkol Tuna Tuna ekor kuning Tuna
sirip kuning Poovan-choora Kannali-mas Bariles Panit Bugudi Gedar Kuppa Pimp Bwebwe Tetena keketina Vahakula Albacore Albakor To'uo Balang kuni Ghidar Albakora
Tunczyk zoltopletwy a. albakora Albacora Albacora Albacora Albacora Albacora Albacora da laje Albacora de lage Albacora-cachorra Albacora-da-lage Albacora-de-laje
Albacora-lage Albacora-lajeira Alvacor Alvacora Alvacora-lajeira Atum Atum Atum albacora Atum albacora Atum albacora Atum rabil Atum-albacora Atum-amarelo Atum-debarbatana-amarela Atum-de-galha-a-re Atum-galha-amarela Galha a re Ielofino Peixe de galha a re Peixe-de-galha-a-re Peixinho da ilho Rabao Rabil Rabo-seco Albacora Ton
galben Albacor Tikhookeanskij zheltoperyj tunets Zheltokhvostyj tunets Asiasi Gaogo Ta'uo To'uo Tuna zutoperka Zutorepi tunj As geddi kelawalla Howalla Howalla Kelawalla
Kelawalla Pihatu kelawalla Yajdar-baal-cagaar Albacora Albacora Albacora aleta amarilla Aleta amarilla Aleta amarilla Atun aleta amarilla Atun aleta amarilla Atun aleta amarilla
Atun aleta amarilla Atun aleta amarilla Atun de aleta amarilla Atun de aleta amarilla Atun de aleta amarilla Atun de aleta amarilla Atun de aleta amarilla Atun de aletas amarillas
Rabil Rabil Rabil Rabil Bariles Jodari Albacora Gulfenad tonfisk Albakora Badla-an Barilis Buyo Tambakol A'ahi A'ahi 'oputea A'ahi 'oputi'i A'ahi hae A'ahi mapepe A'ahi maueue
A'ahi patao A'ahi tari'a'uri A'ahi tatumu A'ahi teaamu A'ahi tiamatau A'ahi vere Otara Kelavai Soccer Soccer Tekuu Atutaoa Kahikahi Kakahi Kakahi/lalavalu Takuo Takuo
Kahikahi Kakahi Sari kanat orkinos Sar?kanatorkinoz bal?g? Sar?kanatton bal?g? Te kasi Ca bo vang Ca Ng? vay vang Nkaba Badla-an Balarito Malalag Painit Panit Baliling
Panit Tiwna melyn Doullou-doullou Ouakhandar Wakhandor Waxandor Wockhandor
… names
argentina
… complementary views
[Figure: several datasets, each capturing a partial view of the same reality]
…different conceptualizations
reality
General Requirements or Tasks related to Information Integration
• Dataset discovery
• Dataset selection (or sub-dataset selection)
• Dataset access and query
• Fetch and transformation of data
• Data and dataset linking
• Data cleaning
• Data completion (through context, inference, prediction or other methods)
• Management of data provenance
• Measuring and testing the quality of datasets (especially the integrated ones)
• Management (and understanding) of the evolution of datasets
• Monitoring, production of overviews, visualization of datasets
• Interactive browsing and exploration of datasets
• Data summarization, preservation
Outline
• Main approaches for integration
• The notion of Semantic Warehouse
• Case Study: Integrating information about marine species
• The role of Top-Level Ontologies
• Automating the process
• Measuring the quality of semantic integration
• Provenance Issues
• Exploitation
Main approaches for Integration
In general there are two main approaches for integration:
Warehouse approach (materialized integration)
• Design Phase: the underlying sources (and their parts) have to be selected
• Creation Phase: the process for getting the data and creating the warehouse
• Maintenance Phase: the ability to create the warehouse from scratch, and/or the ability to update parts of it
• Mappings are exploited to extract information from the data sources, to transform it to the target model and then to store it at the central repository
Mediator approach (virtual integration)
• The mediator receives a query formulated in terms of the unified model/schema. The mappings are used to enable query translation. The derived sub-queries are sent to the wrappers of the individual sources, which transform them into queries over the underlying sources. The results of these sub-queries are sent back to the mediator, where they are assembled to form the final answer
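The difference between the two approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy sources, their field names and the unified field names (name, code, common_name) are hypothetical and are not taken from any of the systems discussed in this talk.

```python
# Two toy sources, each with its own vocabulary (hypothetical data).
SOURCE_A = [{"sci_name": "Thunnus albacares", "fao_code": "YFT"}]
SOURCE_B = [{"species": "Thunnus albacares", "common": "Yellowfin tuna"}]
SOURCES = {"A": SOURCE_A, "B": SOURCE_B}

# Mappings from each source's fields to the unified schema.
MAPPINGS = {
    "A": {"sci_name": "name", "fao_code": "code"},
    "B": {"species": "name", "common": "common_name"},
}


def transform(record, mapping):
    """Rename the fields of a source record to the unified schema."""
    return {mapping[field]: value for field, value in record.items()}


def build_warehouse():
    """Materialized integration: extract, transform and store up front."""
    warehouse = []
    for source_id, rows in SOURCES.items():
        warehouse.extend(transform(row, MAPPINGS[source_id]) for row in rows)
    return warehouse


def query_warehouse(warehouse, name):
    """Answer the query from the stored copy only."""
    merged = {}
    for row in warehouse:
        if row["name"] == name:
            merged.update(row)
    return merged


def query_mediator(name):
    """Virtual integration: translate the query per source at query time."""
    merged = {}
    for source_id, rows in SOURCES.items():
        mapping = MAPPINGS[source_id]
        # Translate the unified field "name" into the source's own field.
        source_field = next(f for f, u in mapping.items() if u == "name")
        for row in rows:  # the "wrapper" evaluates the sub-query on the live source
            if row[source_field] == name:
                merged.update(transform(row, mapping))
    return merged


warehouse = build_warehouse()
print(query_warehouse(warehouse, "Thunnus albacares"))
print(query_mediator("Thunnus albacares"))
```

In this sketch a later change in a source is visible to query_mediator immediately, while query_warehouse keeps answering from the stored copy until the warehouse is rebuilt.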
Main approaches for integration (cont.)
Warehouse
• Benefit: Flexibility in transformation logic (including ability to curate and fix problems)
• Benefit: Decoupling of the release management of the integrated resource from the management cycles of the underlying sources
• Benefit: Decoupling of access load from the underlying sources.
• Benefit: Faster responses (in query answering but also in other tasks, e.g. if one wants to use it for applying an entity matching technique).
• Shortcomings: You have to pay the cost for hosting the warehouse. You have to refresh the warehouse periodically.
Mediator
• Benefit: One advantage (but in some cases disadvantage) of virtual integration is the real-time reflection of source updates in integrated access
• Comment: The higher complexity of the system (and the quality of service demands on the sources) is only justified if immediate access to updates is indeed required.
Case Study: Marine Species
Context: iMarine
Id: An FP7 Research Infrastructure Project (2011-2014)
Final goal: launch an initiative aimed at establishing and operating an e-infrastructure supporting the principles of the Ecosystem Approach to fisheries management and conservation of marine living resources.
Partners:
Continuation in BlueBRIDGE
BlueBRIDGE (Building Research environments for fostering
Innovation, Decision making, Governance and Education to
support Blue growth), H2020-EINFRA-2015-1
Sept. 2015- Feb. 2018
Marine Information: in several sources
• WoRMS: World Register of Marine Species. Registers more than 200K species
• ECOSCOPE: A Knowledge Base About Marine Ecosystems (IRD, France)
• FLOD (Fisheries Linked Data) of the Food and Agriculture Organization (FAO) of the United Nations
• FishBase: Probably the largest and most extensively accessed online database of fish species.
• DBpedia
Marine Information: in several sources, storing complementary information
• Taxonomic information
• Ecosystem information (e.g. which fish eats which fish)
• Commercial codes
• General information, occurrence data, including information from other sources
• General information, figures
Marine Information: in several sources, accessed through different technologies
• Web services (SOAP/WSDL)
• RDF + OWL files
• SPARQL Endpoint
• Relational Database
• SPARQL Endpoint
How to integrate
Scope Control (how to control it?)
• We use the notion of competency queries.
• A competency query is a query that is useful for the community at hand, e.g. for a human member, or for building applications for that domain
• Indicative competency queries for our running example:
Materialization or Mediation?
In both cases we need a unified model/schema
The Top Level Ontology: MarineTLO
MarineTLO aims at being a global core model that
– provides a common, agreed-upon understanding of the concepts and relationships holding in the marine domain, to enable knowledge sharing, information exchange and integration between heterogeneous sources
– covers the marine domain with suitable abstractions to enable the most fundamental queries, and can be extended to any level of detail on demand, and
– allows data originating from distinct sources to be adequately mapped and integrated
• MarineTLO is not supposed to be the single ontology covering the entirety of what exists
Benefits:
• reduced effort for improving and evolving the model: the focus is on one model, rather than many (the results are beneficial for the entire community)
• reduced effort for constructing mappings: this approach avoids the inevitable combinatorial explosion and complexities that result from pair-wise mappings between individual metadata formats and/or ontologies
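The combinatorial argument can be made concrete with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one (undirected) mapping per pair of schemas in the pair-wise case and one mapping per schema in the top-level-ontology case; with directed mappings the pair-wise count would double.

```python
def pairwise_mappings(n: int) -> int:
    """Mappings needed when every pair of n schemas is mapped directly."""
    return n * (n - 1) // 2


def hub_mappings(n: int) -> int:
    """Mappings needed when each of n schemas is mapped once to a top-level ontology."""
    return n


for n in (5, 20, 100):
    print(n, pairwise_mappings(n), hub_mappings(n))
```

For the five sources of the running example this gives 10 pair-wise mappings against 5 mappings to the top-level ontology, and the gap grows quadratically with the number of sources.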
MarineTLO: Query capabilities
It should allow formulating the competency queries.
Indicative examples of queries that can be formulated
1. Given the scientific name of a species, find its predators with the related taxon-rank classification and with the different codes that the organizations use to refer to them.
2. Given the scientific name of a species, find the ecosystems, water areas and countries that this species is native to, and the common names that are used for this species in each of the countries.
The MarineTLO currently contains around 90 classes and 40 properties.
More at www.ics.forth.gr/isl/MarineTLO
Materialization or Mediation?
• We will focus on the materialization case, i.e. on the construction and maintenance of a MarineTLO-based semantic warehouse
Semantic warehouse
Integration process
The warehouse construction and evolution process
1. Define requirements in terms of competency queries (queries expressed over MarineTLO)
2. Fetch the data from the selected sources (SPARQL endpoints, services, etc.)
3. Transform and ingest to the warehouse (produces triples)
4. Inspect the connectivity of the warehouse
5. Formulate rules creating sameAs relationships (rules for instance matching)
6. Apply the rules to the warehouse (produces sameAs triples)
7. Ingest the sameAs relationships to the warehouse
8. Test and evaluate the warehouse (using the competency queries and the connectivity metrics)
[Figure: flowchart of the above steps; several of them are marked as supported by MatWare]
How to connect the fetched pieces of information?
The Semantic Approach
• Use URIs instead of strings
– You can establish links in this way
– You can avoid the problem of homonyms
• Use owl:sameAs to connect equivalent URIs
• Various other semantic relationships
Linked Data is a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It builds upon standard Web technologies such as HTTP, RDF and URIs. This enables data from different sources to be connected and queried.
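Since owl:sameAs is symmetric and transitive, a set of sameAs links partitions the URIs into equivalence classes. A minimal Python sketch of computing these classes with a union-find structure follows; the function name is hypothetical, the third pair is made-up data, and this is an illustration rather than the way any particular triple store handles sameAs.

```python
from collections import defaultdict


def equivalence_classes(same_as_pairs):
    """Group URIs using the symmetric, transitive closure of sameAs links."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = defaultdict(set)
    for uri in list(parent):
        classes[find(uri)].add(uri)
    return list(classes.values())


DBPEDIA = "http://www.dbpedia.com/Thunnus_Albacares"
ECOSCOPE = "http://www.ecoscope.com/thunnus_albacares"
FLOD = ("http://www.fao.org/figis/flod/entities/codedentity/"
        "636cdcea-c411-43ad-97ff-00c9304f5e60")

pairs = [(DBPEDIA, ECOSCOPE), (ECOSCOPE, FLOD), ("ex:a", "ex:b")]
for cls in equivalence_classes(pairs):
    print(sorted(cls))
```

Once the classes are known, a query about one URI can be expanded to all URIs of its class, which is what makes the fetched pieces of information connect.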
How to link
We need Entity Matching
• Both automatic methods and handcrafted rules are required
Example: Suffix-based URI equivalence
http://www.dbpedia.com/Thunnus_Albacares ≡ http://www.ecoscope.com/thunnus_albacares
• prefix removal: Thunnus_Albacares, thunnus_albacares
• underscore removal: ThunnusAlbacares, thunnusalbacares
• lower case conversion: thunnusalbacares = thunnusalbacares
last(u): the string obtained by (a) getting the substring after the last "/" or "#", (b) turning the letters of that substring to lowercase, and (c) deleting any underscores that might exist.
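The definition of last(u) translates almost directly into code. Below is a minimal Python sketch; the helper suffix_equivalent is a hypothetical name introduced here to show how URIs with the same last(u) value could be grouped.

```python
import re
from collections import defaultdict


def last(uri: str) -> str:
    """Substring after the last '/' or '#', lowercased, underscores removed."""
    suffix = re.split(r"[/#]", uri)[-1]
    return suffix.lower().replace("_", "")


def suffix_equivalent(uris):
    """Group URIs whose last(u) values coincide (hypothetical helper)."""
    groups = defaultdict(list)
    for uri in uris:
        groups[last(uri)].append(uri)
    return [group for group in groups.values() if len(group) > 1]


uris = [
    "http://www.dbpedia.com/Thunnus_Albacares",
    "http://www.ecoscope.com/thunnus_albacares",
]
print(last(uris[0]), last(uris[1]))
print(suffix_equivalent(uris))
```

Note that a URI ending in "/" or "#" yields an empty suffix, so a real rule would probably want to skip such URIs instead of declaring them all equivalent.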
Example: Entity Matching-based URI Equivalence
Matching Rule: If the skos:prefLabel of an Ecoscope individual, converted to lower case, is the same as the label attribute of a FLOD individual, then these two individuals are the same.
[Figure: http://www.ecoscope.com/thunnus_albacares (skos:prefLabel) sameAs http://www.fao.org/figis/flod/entities/codedentity/636cdcea-c411-43ad-97ff-00c9304f5e60 (label); the label values shown are "Thunnus Albacares" and "thunnus albacares"]
Yannis Tzitzikas et al., LWDM 2014, Athens
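A matching rule of this kind can be sketched in a few lines of Python. The sketch follows the rule literally (the Ecoscope prefLabel is lowercased, the FLOD label is compared as-is); which of the two individuals carries which label value in the example data is an assumption, and in practice such rules are expressed in a link discovery tool rather than hand-coded.

```python
def match_by_label(ecoscope, flod):
    """Produce sameAs triples between Ecoscope and FLOD individuals.

    ecoscope: dict mapping URI -> skos:prefLabel value
    flod:     dict mapping URI -> label value
    """
    # Index the FLOD individuals by label for constant-time lookup.
    flod_index = {}
    for uri, label in flod.items():
        flod_index.setdefault(label, []).append(uri)

    links = []
    for uri, pref_label in ecoscope.items():
        for target in flod_index.get(pref_label.lower(), []):
            links.append((uri, "owl:sameAs", target))
    return links


ECOSCOPE = {"http://www.ecoscope.com/thunnus_albacares": "Thunnus Albacares"}
FLOD = {
    "http://www.fao.org/figis/flod/entities/codedentity/"
    "636cdcea-c411-43ad-97ff-00c9304f5e60": "thunnus albacares"
}

for triple in match_by_label(ECOSCOPE, FLOD):
    print(triple)
```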
How to measure the quality of the warehouse?
Connectivity Assessment
Why is it useful to measure connectivity?
• For assessing how much the aggregated content is connected
• For getting an overview of the warehouse
• For quantifying the value of the warehouse (query capabilities)
o Poor connectivity negatively affects the query capabilities of the warehouse.
• For making its monitoring easier after reconstruction
• For measuring the contribution of each source to the warehouse, and hence deciding which sources to keep or exclude (there are already hundreds of SPARQL endpoints), i.e. identifying redundant or unconnected sources
In general, connectivity has two main aspects: Schema and Instance.
• Regarding Schema Connectivity, our running example uses a top-level ontology (MarineTLO) and schema mappings in order to associate the fetched data with the schema of the top-level ontology.
• Regarding Instance Connectivity, one has to inspect and test the connectivity of the “draft” warehouse through the competency queries and a number of connectivity metrics that we have defined, and then formulate rules for instance matching.
Connectivity Metrics Definition
Connectivity Metrics: Increase in the average degree
• Suffix canonicalization: the average degree is increased from 18.72 to 23.92.
• Entity Matching: the average degree of all sources is significantly bigger than before.
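Why linking raises the average degree can be seen on a toy example. The Python sketch below uses a simplified definition chosen for illustration only (the degree of an entity is the number of triples in which it, or a URI declared equivalent to it, appears as subject or object); the metric definitions in the referenced papers may differ, and the triples are made up.

```python
def average_degree(triples, entities, canon=None):
    """Average number of triples mentioning each entity as subject or object.

    canon optionally maps a URI to the representative of its sameAs class.
    Assumes a non-empty entity list.
    """
    canon = canon or {}

    def rep(uri):
        return canon.get(uri, uri)

    counts = {}
    for s, _p, o in triples:
        for node in {rep(s), rep(o)}:
            counts[node] = counts.get(node, 0) + 1
    reps = {rep(e) for e in entities}
    return sum(counts.get(r, 0) for r in reps) / len(reps)


TRIPLES = [
    ("eco:thunnus_albacares", "isNativeAt", "eco:atlantic"),
    ("dbp:Thunnus_Albacares", "label", "Yellowfin tuna"),
    ("dbp:Thunnus_Albacares", "genus", "dbp:Thunnus"),
]
ENTITIES = ["eco:thunnus_albacares", "dbp:Thunnus_Albacares"]
SAME_AS = {"dbp:Thunnus_Albacares": "eco:thunnus_albacares"}

print(average_degree(TRIPLES, ENTITIES))           # before linking
print(average_degree(TRIPLES, ENTITIES, SAME_AS))  # after linking
```

After linking, the two URIs count as one entity that participates in all three triples, so the average rises.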
Connectivity Metrics: Exchanging
The metrics can also be exchanged for assisting dataset discovery or dataset selection
(in a mediator-based architecture). We have extended VoID (Vocabulary of Interlinked
Datasets) for representing and exchanging such metrics (VoIDWarehouse)
VoIDWarehouse
Connectivity Metrics: Exchanging (cont.)
1. Compute the connectivity metrics (production of matrices)
2. Describe the connectivity metrics with the proposed VoID extension
3. Store these triples in a separate graph space
4. Retrieve/query these values from the warehouse using SPARQL queries
Provenance
Provenance
It is important to keep the provenance of each piece of data in the warehouse.
We have realized that the following 4 levels of provenance support are usually required:
[a] Conceptual level
[b] URIs and Values level
[c] Triple level
[d] Query level
Level [a] can be supported at the conceptual model level. In our application context we use the MarineTLO, and the transformation rules do the required transformations.
MatWare also offers support for levels [b]-[d].
Provenance
a) Conceptual modeling level
Example: Assignment of identifiers to species
• MarineTLO models the provenance of species names, codes etc., and the transformation rules of MatWare transform the ingested data according to this model.
[Figure: Thunnus albacares isIdentifiedBy YFT, which hasCodeType FAOCode; Thunnus albacares isIdentifiedBy 127027, which hasCodeType WoRMSCode]
Provenance
b) URIs and Literals:
i. Adopting the namespace mechanism for URIs:
- The prefix of the URI provides information about the origin of the data.
- e.g. www.fishbase.org/entity/ecosystem#mediterannean_sea
ii. Ability to attach @Source to every literal coming from a source:
- e.g. select the scientific name and authorship of Yellowfin Tuna
- This policy allows formulating source-centric queries in a relatively simple way:
SELECT ?scientificname
WHERE {
?species tlo:has_scientific_name ?scientificname
FILTER(langMatches(lang(?scientificname), "worms"))
}
Provenance
c) Triple Level Provenance
• Store the fetched triples of each source in a separate graphspace:
FISHBASE: http://www.ics.forth.gr/isl/Fishbase
DBpedia: http://www.ics.forth.gr/isl/DBpedia
FLOD: http://www.ics.forth.gr/isl/FLOD
Ecoscope: http://www.ics.forth.gr/isl/Ecoscope
WoRMS: http://www.ics.forth.gr/isl/WoRMS
• By asking for the graph that each triple comes from, we retrieve the provenance of the data.
• This enables refreshing only one part of the warehouse.
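The idea of one graphspace per source can be mimicked with a small in-memory structure. The Python sketch below is only an illustration under stated assumptions: the class QuadStore and its methods are hypothetical names, the triples are made up, and a real warehouse would rely on the named-graph support of its triple store.

```python
class QuadStore:
    """Toy store that keeps the triples of each source in its own graph."""

    def __init__(self):
        self.graphs = {}  # graph URI -> set of (subject, predicate, object)

    def load(self, graph, triples):
        """Replace the content of one graphspace, i.e. refresh a single source."""
        self.graphs[graph] = set(triples)

    def provenance(self, triple):
        """Return the graphs (sources) that contain the given triple."""
        return sorted(g for g, ts in self.graphs.items() if triple in ts)


ECOSCOPE_GRAPH = "http://www.ics.forth.gr/isl/Ecoscope"
FLOD_GRAPH = "http://www.ics.forth.gr/isl/FLOD"

store = QuadStore()
native = ("ecoscope:thunnus_albacares", "MarineTLO:isNativeAt", "ex:atlantic")
store.load(ECOSCOPE_GRAPH, [native])
store.load(FLOD_GRAPH, [native, ("ex:atlantic", "ex:hasCode", "ex:27")])
print(store.provenance(native))

store.load(FLOD_GRAPH, [])  # refresh only the FLOD part
print(store.provenance(native))
```

Because every source lives in its own graph, reloading one source leaves the triples of the other sources untouched.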
Provenance
d) Query Level Provenance
• MatWare offers a query rewriting functionality that exploits the contents of the graphspaces for returning the sources that contributed to the query results (including those that contributed to the intermediate steps).
• Let q be a SPARQL query that has n variables in the SELECT clause and contains k triple patterns of the form (?s_i, ?p_i, ?o_i):
SELECT ?o_1 ?o_2 … ?o_n WHERE {
?s_1 ?p_1 ?o_1 .
?s_2 ?p_2 ?o_2 .
…
?s_k ?p_k ?o_k
}
• The rewriting produces a query q’ that has n+k variables in the SELECT clause, in which each triple pattern (?s_i ?p_i ?o_i) has been replaced by: graph ?g_i {?s_i ?p_i ?o_i}. The rewritten query q’ is:
SELECT ?o_1 ?o_2 … ?o_n ?g_1 ?g_2 … ?g_k WHERE {
graph ?g_1 { ?s_1 ?p_1 ?o_1 } .
graph ?g_2 { ?s_2 ?p_2 ?o_2 } .
…
graph ?g_k { ?s_k ?p_k ?o_k }
}
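The rewriting step can be sketched as a small string-building function in Python. This is a simplified illustration, not MatWare's implementation: the function name is hypothetical, and it assumes the query is already available as a list of select variables plus a list of plain triple patterns (no parsing of arbitrary SPARQL, no OPTIONAL, FILTER or nested groups).

```python
def rewrite_with_provenance(select_vars, triple_patterns):
    """Wrap each triple pattern in GRAPH ?g_i { ... } and select the ?g_i too."""
    graph_vars = [f"?g_{i}" for i in range(1, len(triple_patterns) + 1)]
    head = " ".join(list(select_vars) + graph_vars)
    body = "\n".join(
        f"  GRAPH {g} {{ {pattern} }} ."
        for g, pattern in zip(graph_vars, triple_patterns)
    )
    return f"SELECT {head} WHERE {{\n{body}\n}}"


query = rewrite_with_provenance(
    ["?faocode"],
    [
        "ecoscope:thunnus_albacares MarineTLO:isNativeAt ?waterarea",
        "?waterarea MarineTLO:LXrelatedIdentifierAssignment ?faocode",
    ],
)
print(query)
```

A query with n select variables and k triple patterns thus gets n+k select variables, and each ?g_i binds to the graphspace (source) that contributed the corresponding triple.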
Provenance
Example of Query Level Provenance:
Query: For the scientific name of a species (e.g. Thunnus albacares), find the FAO codes of the water areas in which the species is native.
Query in SPARQL:
select ?faocode
?source1 ?source2
where {
graph ?source1 {
ecoscope:thunnus_albacares MarineTLO:isNativeAt ?waterarea
}.
graph ?source2 {
?waterarea MarineTLO:LXrelatedIdentifierAssignment ?faocode
} }
RESULT:
Architecture of MatWare
• To create a warehouse from scratch, one should specify
– the type of the repository
– the names of the graphs that correspond to the different sources
– the URL, username and password needed to connect to the repository
• To add a new source, one should
– (a) include the fetcher class for the specific source as a plug-in
– (b) provide the mapping files (schema mappings)
– (c) include the transformer class for the specific source as a plug-in
– (d) provide the SILK rules as XML files
The resulting MarineTLO-based Warehouse (1/2)
Integrated information about Thunnus albacares from different sources
The resulting MarineTLO-based Warehouse (2/2)
[Table: concepts covered per source. Columns: Ecoscope, FLOD, WoRMS, DBpedia, FishBase. Rows: Species, Scientific Names, Authorships, Common Names, Predators, Ecosystems, Countries, Water Areas, Vessels, Gears, EEZ. The cell values were not preserved.]
iMarine 2nd Review, September 2013, Brussels
Evolution over Time
< some plots>
• Need for visualization
Exploitation of the Semantic
Warehouse
Exploitation of the Semantic Warehouse
A) Semantic Processing of Search Results (e-infrastructure service)
B) Fact Sheet Generator (web application)
C) Species Identification Tool
D) Interactive 3D visualization
A) For Semantic Post-Processing of Search Results: The process
[Figure: process diagram. Components: web browsing / contents; query terms; (top-L) results (+ metadata); Entity Mining; entities / contents; MarineTLO Warehouse; Semantic Analysis (grouping, ranking, retrieving more properties); semantic data; Visualization/Interaction (faceted search, entity exploration, annotation, top-k graphs, etc.)]
A) For Semantic Post-Processing of Search Results: Example (X-Search)
[Screenshot of X-Search. Callouts: search results; result of entity mining; result of textual clustering; "the Warehouse is used" (two places)]
Example of the EntityCard of Thunnus albacares
[Screenshot: the Warehouse is used; the card shows information from DBpedia, from FLOD, from Ecoscope and from WoRMS]
A’) XSearch as a bookmarklet
Dynamic annotation of entities over any Web page
[Screenshot: entity exploration; the Warehouse is used]
B) Fact Sheet Generator & Android Application
Fact Sheet Generator
Ichthys
C) Species Identification Tool
Species identification through Preference-enriched Faceted Search over the
semantic descriptions of fish species
D) Interactive 3D Visualization of Datasets
The metrics are exploited for producing interactive 3D visualization of datasets
This approach is general
Integrated information about Thunnus albacares from different sources
Datasets about Toledo
Datasets about Art
Datasets about Crete
The big picture (core concepts and relations)
[Figure: the core conceptualization of Earth Sciences. Concepts: Complex System; Species; Samples or Specimen (bio, geo); Observations; Records; Publications; Forecasts; Simulation; Products of mathem. Models; Molecular world and parts; Human Activities; Activities; Situation; Time; Place; services; databases; collections; global indices. Relations: part of; exemplifies; is about; cross reference; from; use to appear in; create; maintain; occurs in; based on; describe]
Y. Marketakis and Y. Tzitzikas (FORTH), Edinburgh, March 2012
What’s next
Challenges and our ongoing research
Emphasis on
• Dataset discovery, dataset recommendation, dataset selection (e.g. in mediator-based integration)
• Finding all URIs of an entity
• Finding all triples of an entity
• More effective visualizations, monitoring, quality testing, trust estimation
Concluding Remarks
• Semantic integration could boost data-intensive scientific discovery, but it requires tackling several challenging issues
• We have discussed the main requirements and challenges in designing, building, maintaining and evolving a real and operational semantic warehouse for marine resources
• We have presented the process and the related tools that we have developed for supporting it, with emphasis on scope control, connectivity assessment, provenance, reconstructability and extensibility
• Currently we focus on applying this approach to a large number of datasets
Links (1/2)
• MatWare (for automating the warehouse construction process)
– http://www.ics.forth.gr/isl/MatWare/
• MarineTLO (top-level ontology)
– http://www.ics.forth.gr/isl/MarineTLO/
• Semantic Warehouses
– MarineTLO-Warehouse: http://virtuoso.i-marine.d4science.org:8890/sparql
– also browsable through http://virtuoso.i-marine.d4science.org:8890/fct
• XSearch (exploiting semantic warehouses in searching)
– http://www.ics.forth.gr/isl/X-Search/
• Xlink (exploiting semantic warehouses for entity identification in texts)
– http://www.ics.forth.gr/isl/X-Link
Links (2/2)
• Hippalus: Preference-enriched Faceted Search
– www.ics.forth.gr/isl/Hippalus
– Select a dataset from the Marine Biology domain for enacting the species identification through PFS
• Interactive 3D Visualization of the LOD Cloud
– www.ics.forth.gr/isl/3DLod/
• LODSyndesis (measuring the commonalities in the entire LOD)
– www.ics.forth.gr/isl/LODsyndesis
References
On connectivity metrics
• Y. Tzitzikas et al., Quantifying the Connectivity of a Semantic Warehouse, 4th International Workshop on Linked Web Data Management (LWDM'14 @ EDBT'14)
• M. Mountantonakis et al., Extending VoID for Expressing the Connectivity Metrics of a Semantic Warehouse, 1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES'14), ESWC'14
• M. Mountantonakis, N. Minadakis, Y. Marketakis, P. Fafalios and Y. Tzitzikas, Quantifying the Connectivity of a Semantic Warehouse and Understanding its Evolution over Time, International Journal on Semantic Web and Information Systems (IJSWIS), accepted for publication in 2016
Recent work on integrating a large number of datasets
• M. Mountantonakis and Y. Tzitzikas, On Measuring the Lattice of Commonalities Among Several Linked Datasets, VLDB’16, Sept 2016
Acknowledgements
Joint work with
• Michalis Mountantonakis
• Nikos Minadakis
• Yannis Marketakis
• Pavlos Fafalios
• Panagiotis Papadakos
• Chryssoula Bekiari
• Martin Doerr
• Maria Papadaki
Thank you for your attention