Erhard Rahm Ontologies Ontology Matching Instance

Transcription

Erhard Rahm Ontologies Ontology Matching Instance
Erhard Rahm
http://dbs.uni-leipzig.de
June 3, 2009
Ontologies
` Ontology Matching
`
¾ Problem
¾ Match
`
techniques and prototypes (e.g., GLUE)
Instance-based matching in COMA++
¾ Constraint-
/ Content-based Matching
¾ Matching web directories
`
Matching by Instance overlap
¾ Similarity
measures
¾ Evaluation: Product catalogs,
g , biomedical ontologies
g
Stability of ontology mappings
` Conclusions
`
2
`
Support a shared understanding of terms/concepts
in a domain
¾ Annotation
ontology
of data instances by terms/concepts of an
`
Semantically organize information of a domain
`
Support data integration
`
Sample ontologies
¾ Find
data instances based on concepts (queries,
navigation)
¾ e.g.
by mapping data sources to shared ontology
¾ Product
catalogs of companies
companies, e
e.g.
g online shops
¾ Web directories
¾ Biomedical ontologies
3
Hierarchical categorization of products
` Instances: product
d
d
descriptions
` Often very large: ten thousands categories, millions
of products
`
4
`
Categorization of websites
Instances: website descriptions
p
((URL,, name,, content
description)
Manual vs. automated category assignment of instances
`
General lists or specialized (per region, topic, etc.), e.g.
`
`
¾
Yahoo! Directory
¾
Dmoz – Open Directory Project (ODP)
¾
Google
l Directory – based
b
d on Dmoz
¾
Business.com
¾
Vfunk: Global Dance Music Directory
5
`
Many ontologies for different disciplines, e.g.
¾ Molecular
Biology,
etc.
Biology Anatomy,
Anatomy Health etc
Largest ontologies (> 10,000 concepts), e.g., Gene
gy (GO),
( ), NCI Thesaurus
Ontology
` Ontologies used to annotate genes and proteins
`
9 Support for “functional” data analysis
`
Instances: annotated objects; separate from ontology
Protein
SwissProt
Gene
E
Entrez
6
instance associations
Molecular Function
GO
Biological Process
Genetic Disorders
OMIM
GO
AgBase
7
Focus on practically used ontologies
` Ontology
O
l
O consists off a set off concepts/categories
interconnected by relationships (e.g. of type „is-a“
or „part-of
part-of“)). O is represented by a DAG and has a
designated root concept.
¾ Concepts
p have attributes, e.g.
g Id,, Name,, Description
p
¾ Concepts may have associated instances
` Ontologies may be versioned
` Instances
`
¾ May
be managed together with ontology or
i d
independently
d
l
¾ May be associated to several concepts
schemas even per concept
¾ May have heterogeneous schemas,
8
O1
pp g
Mapping:
Matching
O2
O1, O2
iinstances
t
`
O1.e11, O2.e23, 0.87
O1.e13,O2.e27, 0.93
…
further input,
e.g. dictionaries
Process of identif
identifying
ing semantic correspondences
between 2 ontologies
¾ Result:
ontology mapping
¾ Mostly equivalence mappings: correspondences specify
equivalent ontology concepts
`
Variation of schema matching problem
9
Shopping.Yahoo.com
Electronics
DVD Recorder
Beamer
Digital Cameras
Digital Photography
Digital Cameras
`
Ontology mappings useful for
Electronics & Photo
TV & Video
DVD Recorder
Projectors
Camera & Photo
Digital Cameras
IImproving
i
query results,
l
e.g. to fi
find
d specific
ifi products
d
` Advanced (cross-site) product recommendations
` Automatic categorization of products in different catalogs
` Merging catalogs
`
10
Amazon.com
Protein
SwissProt
instance associations
Gene
Entrez
`
Molecular Function
G
GO
?
Genetic Disorders
OMIM
GO
?
Ontology mappings useful for
Improved analysis
` Answering questions such as “Which Molecular Functions are involved in which
Biological Processes?
Processes?”
` Validation (curation) and recommendation of instance
associations
` Ontology merge or curation, e.g. to reduce overlap between
ontologies
`
11
High degree of semantic heterogeneity in
independently developed ontologies
` Syntactic differences
`
`
`
¾
Different models and languages
g g
¾
¾
Different is-a and part-of hierarchies
O l
Overlapping
i
categories
i
¾
Naming ambiguities and conflicts
`
Different scope
Heterogeneous instance representations
Structural differences
Semantic differences
Modeling errors / inconsistencies
` Instance / content differences
`
`
`
12
Biological Process
automatic generic solutions ?
Fully automatic,
Metadata-based
Element
• Names
• Descriptions
`
Structure
Constraintbased
Linguistic
• Types
• Keys
Instance-based
Element
Constraintbased
Linguistic
• Parents
• Children
• Leaves
L
• IR (word
frequencies,
k tterms))
key
Reuse-oriented
Element
Structure
• Dictionaries
• Thesauri
• Previous match
results
Constraintbased
• Value pattern
and ranges
Matcher combinations
Hybrid matchers
` Composite matchers
`
* Rahm, E., P.A. Bernstein: A Survey of Approaches to Automatic Schema Matching.
VLDB Journal 10(4), 2001
13
`
semantics of a category may be better expressed by
g y than by
y
the instances associated to category
metadata (e.g. concept name, description)
¾ Categories
with most similar instances should match
Main problem: Availability of (shared/similar)
instances for most/all concepts
` Common
C
cases:
?
?
O1
O2
O1
`
ontology
associations
Common
instances
a) Common instances (separate from ontologies)
Example: Documents/Objects annotated by
O1, O2 terms / concepts
14
O1
instances
?
O2
O2
instances
b) Ontology-specific instances
b1) with shared instances
b2) without shared (but similar) instances
Many prototypes for schema or ontology matching *
` Instance
based schema matching (XML,
Instance-based
(XML relational)
`
¾ SEMINT
¾ LSD
¾ Clio
¾ iMap
¾ Dumas
`
Instance-based ontology matching (OWL)
¾ GLUE,
GLUE
U off Washington
W hi t
¾ COMA++, U Leipzig (supports schema + ont. matching)
¾ FOAM / QOM
QOM, U Karlsruhe
¾ Sambo, Linköping U, Sweden
¾ Falcon-AO, South East U, China
¾ RiMOM, Tsinghua U, China
* Euzenat/Shvaiko: Ontology matching. Springer 2007
15
Mappings for O1 , Mappings for O2
Relaxation Labeling
Similarity Matrix
Common
Knowledge
C
K
l d &
Similarity Estimator
Domain Constraints
Similarityy Function
Joint Probability Distribution P(A,B), P(A’, B)…
Meta Learner
Distribution
Base Learner
Base Learner Estimator
Taxonomy O1
(tree structure + data instances)
`
`
`
`
16
Taxonomy O2
(tree structure + data instances)
Use of machine learning to find ontology mappings
Base learners use concept names + data instances (description)
Similarity measures computed from “joint probability distribution”
of concepts
Evaluation on comparatively small ontologies: 3 match tasks,
per ontology: 34-331 concepts, 6-30 non-leaf concepts, 150014000 instances,
instances 34-236
34 236 correspondences
* Doan, AH; et al: Learning to Match Ontologies on the Semantic Web.
VLDB Journal, 12(4):303-319, 2003
A,S
Concept A
Concept S
¬A,
A S
A,¬S
Hypothetical
universe of
f
all examples
¬A,¬S
P(A ∩ S)
Sim(Concept A, Concept S) =
P(A ∪ S)
[Jaccard]
=
P(A,S)
P(A,¬S) + P(A,S) + P(¬A,S)
Joint Probability Distribution: P(A,S),P(¬A,S),P(A,¬S),P(¬A,¬S)
different similarity measures usable based on JPD
17
Mutual use of trained classifiers to determine instanceinstance-concept
associations (requires no shared but only similar instances)
¬A,¬S
A
Taxonomy 1
¬A,S
A,¬S
Taxonomy 2
¬A,¬SS
¬S
¬A
A,¬S
CLA
A,S
A,S
A
¬A,S
CLS
¬A
JPD estimated by counting the sizes of the partitions
18
S
¬S
System for aligning and merging biomedical
ontologies
` Framework to find similar concepts in overlapping
g
for alignment
g
and merge
g tasks
OWL ontologies
`
`
Combined use of different matchers and auxiliary
information
` Linguistic
Linguistic, structure-based,
structure-based
constraint-based
` Instance-based matching
… Based
B d on ttexts
t ((e.g., papers))
… Two concepts are similar if a
document describes both concepts
`
description
d
i i
llogic
i reasoner
checks results for ontology
consistency and cycles
* Lambrix, P; Tan, H.: SAMBO – A system for aligning and merging biomedical ontologies.
Journal of Web Semantics, 4(3):196-206 , 2006
19
`
Dataset
¾ extracted
d
ffrom G
Google,
l Y
Yahoo
h
and
d Looksmart
L k
web
b
directory
¾ More than 4,500
4 500 simple node matching tasks,
tasks no
instances
Comparison of matching quality results (top-3 systems of each year )
•
20
In 2008 the systems
y
together
g
did not manage to discover
48% of the total number of
positive correspondences
OAEI (Ontology Alignment Evaluation Initiative) Alignment Contest,
http://oaei.ontologymatching.org
Ontologies
` Ontology Matching
`
¾ Problem
¾ Match
`
techniques and prototypes (e.g., GLUE)
Instance-based matching in COMA++
¾ Constraint-
/ Content-based Matching
¾ Matching web directories
`
Matching by Instance overlap
¾ Similarity
measures
¾ Evaluation: Product catalogs,
g , biomedical ontologies
g
Stability of ontology mappings
` Conclusions
`
21
Extends previous COMA prototype (VLDB2002)
` Matching of XML & rel
rel. Schemas and OWL ontologies
` Several match strategies: Parallel (composite) and
`
sequential
q
matching;
g Fragment-based
g
matching
g for large
g
schemas; Reuse of previous match results
Model
M
d l Pool
P l
Directed
graphs
S1
S2
Import,
Load,
Save
Model
Manipulation
22
Match Strategy
Component
identification
{s11, s12, ...}}
{s21, s22, ...}
Mapping Pool
Matcher
execution
Similarity
Combination
Matcher 1
s11↔s21
s12↔s22
s13↔s23
Matcher 2
Matcher 3
Similarity cube
Mapping
s
Mapping
Nodes, ...
Paths, ...
Name, Children,
Leaves,
NamePath, …
Aggregation,
Direction,
Selection,
CombinedSim
Diff, Intersect,
Union,
MatchCompose,
Eval, ...
Component
Types
Matcher
Library
Combination
Library
Mapping
Manipulation
*Schema and Ontology Matching with COMA++. Proc. SIGMOD 2005
Repository (persistent) &
Workspace (in‐memory)
Current Mapping
Domains
Schemas/
Ontologies
Mappings
Source Schema
Schema/
mapping info
pp g
Target Schema
23
Configuration of matcher
Metadata-based
Reuse-based
Instance-based
User-programmed
24
Configuration of match strategies
S
Source Schema
S h
I t
Intermediate Schema
di t S h
Mapping
Excel <‐> Noris
Excel
< > Noris
T
Target Schema
tS h
Mapping
Noris <‐
Noris
< > Noris_Ver2
> Noris Ver2
25
`
Instance matchers introduced in 2006
Constraint-based matching
` Content-based matching: 2 variations
`
`
Coma++ maintains instance value set per element
XML schema instances
(
Ontology instances
26
`
Instance constraints are assigned to schema elements
¾
¾
¾
General constraints: always applicable
Example: average length and used characters (letters
(letters, numeral
numeral,
special char.)
Numerical constraints: for numerical instance values
Example
p : p
positive or negative,
g
, integer
g or float
“M @
“[email protected]”
il
” vs.
Pattern constraints:
“[email protected]”
Example: Email and URL
Use of constraint similarity matrix to determine
element
l
t similarity
i il it (lik
(like d
data
t type
t
matching)
t hi )
` Simple and efficient approach
`
¾
`
Effectiveness depends on availability of constrained value ranges / pattern
A
Approach
h does
d
nott require
i shared
h
d iinstances
t
Determine Constraints
General constraints
element1
Numerical constraints
element2
Pattern constraints
Compare
Constraints
c11↔c21
…
c1h↔c2k
Similarity
value
Using
U
i Constraint
C t i t
Similarity Matrix
27
`
◦
◦
◦
`
◦
◦
2 variations
Value Matching: pairwise similarity comparison of instance
values
Document (value set) matching: combine all instances into a
virtual document and compare documents
Both approaches do not require shared instances
Value matching
Use any similarity measure for pairwise value comparison
Aggregate individual similarity values (similarity matrix) into a
combined concept similarity (e
(e.g.,
g based on Dice)
Compare pair-wise
Instance Values
28
element1
instance11↔instance21
instance12↔instance21
element2
…
i
i
instance
1n↔instance
2m
Aggregation
Similarity
value
Similarity Matrix
`
Document matching
1 iinstance d
document per category or selected
l
d string
i
category attribute (e.g. description)
¾ Document comparison
p
based on TF-IDF to focus on most
significant terms
¾
`
Two options to deal with multiple string attributes
All values
l
ffor these
h
attributes
ib
are h
handled
dl d as one virtual
i
l
document
¾ Independent
p
matching
g per
p attribute and aggregation
gg g
of the
similarity values
¾
element1
element2
Compare virtual documents
document1
document2
instance11
instance21
instance12 Similarity function instance22
…
instance1n
based on
TF IDF
TF-IDF
Similarity
value
…
instance2m
29
39 of 51 test cases based on instances
` 2966 correspondences in reference alignment
`
2400
2200
2000
1800
1600
1400
1200
1000
800
600
400
200
0
All Corresp.
Correct Corresp.-
ConstraintM hi
Matching
F-Measure: 0.15
30
ContentM hi
Matching
0.61
NameType
0.64
Content +
NameType
0.82
Constraint+
C t t+N
T
Content+NameType
0.82
`
Instance-based matching between 4 web directories, limited to online shops
Dmoz
D
746
15,304
#Categories
#Direct instances
Google
G
l
728
15,082
Web
W
b
418
13,673
Dmoz
Yahoo
Y
h
3,234
34,949
Yahoo
Clothing
Sports
Sports
Swimwear
Water Sports
Swimming and Diving
Swimming and Diving
Gear and Equipment
Apparel
URL =http://www.beachwear.net
Name =The Beachwear Network
Description
=Selection
Selection
beachwear.
URL
=http://www
http://www.skinzwear.com/
skinzwearofcom/
Name =Skinz Deep
Description
=Swimwear, bikinis and
URL
=http://www.ritchieswimwear.com/
streetwear.
Name
= Ritchie Swimwear
Description
p
=Designer
g brand for
women, men and little girls.
URL =www.skinzwear.com
Name =Skinz Deep,
p, Inc.
Description
=Bikinis, swimwear,
URL =www.ritchieswimwear.com
beachwear,
and streetwear
Name =Ritchie
Swimwearfor men and
women.
Description =Offers bathing suits, beachware,
and cover-ups for men, women, and children.
Stores located throughout South Florida.
* Massmann, S., Rahm, E.: Evaluating . Evaluating Instance-based Matching of Web
Directories. Proc. WebDB 2008
31
`
`
Instances are shop websites
Instance-based matching on 3 attributes: shop URL, name, description
¾
`
Use of directly and indirectly associated instances
URL matcher based an value matching
¾
After URL preprocessing, equal URLs are needed (same shops in different
directories) to find matching categories
goodbooks.net/index.asp
http://www.goodbooks.net
Preprocessing
Preprocessing
goodbooks.net
32
GoodBooks.net
Preprocessing
`
Name matcher based on value matching
`
Description matcher based on document matching
`
Name / description
p
matching
g do not need shared instances
`
Six match tasks Æ six reference mappings
# Corresp
Dmoz ↔
Google
Dmoz ↔
Web
Dmoz ↔
Yahoo
Google ↔
Web
729
218
436
211
(manually created)
Google ↔ Web ↔
Yahoo
Yahoo
416
235
∑ 2245
33
`
Combination of three instance-based matchers (URL, name,
description) and six metadata-based matchers
minimum and
maximum
i
values for the
six match
tas s
tasks
best single
metadata-based
matche
34
best single
instance-based
matcher
Combination: all 3 instance-based
and 3 metadata-based matchers
(Path Name,
Name Parent),
Parent)
(Path,
average Fmeasure: 0.79
Ontologies
` Ontology Matching
`
¾ Problem
¾ Match
`
techniques and prototypes (e.g., GLUE)
Instance-based matching in COMA++
¾ Constraint-
/ Content-based Matching
¾ Matching web directories
`
Matching by Instance overlap
¾ Similarity
measures
¾ Evaluation: Product catalogs,
g , biomedical ontologies
g
Stability of ontology mappings
` Conclusions
`
35
`
Use of instance overlap for ontology matching:
t
two
concepts
t are related
l t d / similar
i il
if th
they share
h
a
significant number of associated objects
`
Different measures to determine the instance-based
similarity
`
`
Base-K; Dice, Min, Jaccard …
Extensions:
Consideration of indirect instance associations
` Combination with other match approaches
` Consideration
C
id
ti
off similar
i il (b
(butt non-identical)
id ti l) objects
bj t
`
* Thor, A; Kirsten, T; Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. BTW, 2007
Kirsten, T, Thor, A; Rahm, E.: Instance-based matching of large life science ontologies. Proc. DILS, 2007
36
Amazon
by Brands
Microsoft
Novell
Softunity
by Category
Books
DVD
Kid & H
Kids
Home
Software
Software
Languages
Utilities & Tools
Traveling
B i
Business
&P
Productivity
d ti it
Operating System
Windows
Burning
Software
Operating
System
Handheld
Software
Linux
Id =158298302X
EAN = "662644467122"
Title = "SuSE Linux 10.1 (DVD)"
Price = 49.99
Ranking = 180
Id = B0002423YK
EAN = 0805529832282
Title = "Windows XP Home Edition incl. SP2"
Price = 191.91
Ranking = 47
Id = ECD435127K
EAN = 0662644467122
ProductName = "SuSE Linux 10.1"
DateOfIssue = 02.06.2006
P i =Id59
Price
59.95
= 95
ECD851350K
EAN = 0805529832282
ProductName = "WindowsXP Home"
DateOfIssue = 15.10.2004
Price = 238.90
37
`
Baseline similarity SimBaseK
⎧1 , if N c1c2 >= K
SimBaseK (c1 , c2 ) = ⎨
⎩ 0 , if N c1c2 < K
ƒ Dice similarity SimDice
2 ⋅ Nc c
Sim Dice (c1 , c2 ) =
Nc + Nc
c1∈O1
Example:
4
3
I
1 2
1
∩=2
c2∈O2
2
SimBase1 = SimBase2 = 1, SimBase3 = 0
S Dice = 2*2/(4+3)
Sim
* /( ) = 0.57
ƒ Minimum similarity SimMin
SimMin (c1 , c2 ) =
38
SimMin
= 2/3
= 0.67
N c1c2
min( N c1 , N c2 )
0 ≤ SimDice ≤ SimMin ≤ SimBase1 ≤ 1
`
Computation of precision & recall needs a perfect
mapping
i
((reference
f
alignment)
li
t)
¾ Laborious
for large ontologies
¾ Might not be well-defined
well defined
Syntactic measures to “approximate” recall / precision
` Match coverage: fraction of matched categories
`
MatchCover ageO1 =
`
| CO1 |
∈ [0...1]
|C
| + | CO 2− Match |
InstMatchC overage = O1 − Match
| CO1 − Inst | + | CO2 − Inst |
Combined
Match ratio: #correspondences per matched concept
MatchRatioO1 =
`
| CO1 − Match |
| CorrO1−O 2 |
≥1
| CO1− Match |
CombinedMatchRatio =
2⋅ | CorrO1−O 2 |
≥1
| CO1− Match | + | CO 2− Match |
Goal: high Match Coverage with low Match Ratio
39
Amazon (AM) vs. Softunity (SU)
` Baseline1: max
max. Match
Coverage, high Match Ratios
` SimMin: good Match Coverage,
moderate Match Ratios
` SimDice: low Match Coverage,
low Match Ratios
`
O1
CorrO1O2
O2
SU
RatioO1 RatioO2
InstO1O2
5.4
2.1
AM
# concepts (product categories)
470
1,856
# concepts having instances
170
1,723
(products))
# instances (p
2,576
,
18,024
,
# direct associations
2,576
25,448
1
≈ 1.4
≈15
15
≈15
15
# associations / # instances
# Instances / #concepts
SU
I2
SU
1872
335
2.7
1.1
AM
SU
AM
SU
849
71
1.2
1.0
AM
27%
80%
Baseline1
40
AM
100%
CoverageO1O2
I1
711
SU
AM
Min (100%)
SU
535
AM
Dice (50%)
`
Ontologies
¾3
¾
subontologies
g
of GeneOntology
gy
Genetic disorders of OMIM
Instances: Ensembl proteins of 3 species, i.e. homo
sapiens, mouse, rat
Homo Sapiens Mus Musculus
` Only subset of concepts
3,018
2,810
288
110
h associated
has
i t d iinstances
t
201
`
2,452
Molecular
Function
Biological
Process
Cellular
Component
77
Genetic
Disorder
47
133
2 709
2,709
Rattus Norvegicus
Ensembl Proteins
of different species
Number of associated
Biological Processes
(total # processes: 12,555)
41
Human
1,0
0,8
0,6
0,4
0,2
,
0,0
Rat
Match Coverage
M t h Ratios
Match
R ti per ontology
t l
MF - CC
BP - CC
SimKaappa
Sim Dice
Sim
mMin
SimB
Base
SimKaappa
Sim Dice
Sim
mMin
Sim
mMin
SimB
Base
MF - BP
MF - BP
42
Mouse
SimB
Base
`
SimKaappa
`
SimBase: high Match Coverage (99%) w.r.t. concepts having instances,
very high Match Ratios
SimDice: low Coverage (< 20%) and low Match Ratios
SimMin: good Coverage (60%-80%) with moderate Match Ratios
Sim Dice
`
MF
BP
Base
20.4
17.0
Min
4.4
4.0
Dice
1.3
1.2
(Match Ratios for Homo Sapiens,
MF-BP task)
Simple matcher on concept names
Relatively low Match Coverage (however w.r.t.
w r t all
concepts including instance-free concepts)
`
`
` No correspondences for similarity ≥ 0.9
` Low similarity thresholds (e.g.< 0.6) too imprecise
Match Coverage p er ontology
0,5
0,6
0,7
0,8
0,50
0,45
0,40
0,35
0,30
0,25
0,20
0,15
0,10
,
0,05
0,00
MF - BP
MF
MF
BP
MF
MF - BP
CC
BP
MF - CC
CC
BP - CC
Match Coverage per ontology
BP
0.5
4.4
6.9
06
0.6
24
2.4
29
2.9
0.7
1.4
1.4
0.8
1.1
1.1
Match Ratios per ontology
43
Combinations between instance- (SimMin) and
metadata-based match approach
`
` Union:
U i
Increased
I
dM
Match
t hC
Coverage an M
Match
t hR
Ratios
ti
` Intersection: Low Match Coverage (<1%)
Low overlap between instance- and metadatab
based
d mappings
`
0,5
Matc
ch Coverage pe
er Ontology
1,00
0,6
0,7
0,8
Match Coverage per ontology
for combined mappings
0,80
0,60
MF - BP
0,40
MF
BP
∪
4.1
3.7
∩
1.0
1.0
0,20
0,00
MF
BP
MF - BP
44
Match Ratios per ontology
(Name threshold 0.7)
CC
MF
MF - CC
BP
CC
BP - CC
(SimMin = 1.0, Homo Sapiens)
Automatically vs. manually assigned annotations
` Example: Annotations in Ensembl (July 2008) – 46,704
46 704 proteins
`
Automatically assigned
M
Manually
ll assigned
i
d
Sum
returns small mappings of
likely improved quality
all
¾ Restriction to manual annotations
82%
18%
BP
57824
22951
80775
72%
28%
|CorrBP_MF|
Ontology mappings for Base3,Min
man
`
MF
82466
17729
100195
Base3
Min ∩ Base3
Base3
Min ∩ Base3
21386
3275
3835
758
|CBP|
|CMF|
1939
1107
899
435
1393
1107
533
285
man
all
MCBP MCMF MRBP MRMF
45
Base3
0,13
0,17
11,0
15,4
Min ∩ Base3
0 08
0,08
0 13
0,13
30
3,0
30
3,0
Base3
0,06
0,06
4,3
7,2
Min ∩ Base3
0,03
0,03
1,7
2,7
Ontologies
` Ontology Matching
`
¾ Problem
¾ Match
`
techniques and prototypes (e.g., GLUE)
Instance-based matching in COMA++
¾ Constraint-
/ Content-based Matching
¾ Matching web directories
`
Matching by Instance overlap
¾ Similarity
measures
¾ Evaluation: Product catalogs,
g , biomedical ontologies
g
Stability of ontology mappings
` Conclusions
`
46
Continous evolution of ontologies (many versions)
` Evolution analysis of 16 life science ontologies:
`
¾ Average
of 60% growth in last four years
¾ Deletes and changes also common
Ontology
NCI Thesaurus
GeneOntology
-- Biological Process
-- Molecular Function
-- Cellular Components
ChemicalEntities
size
|C| start
|C| last
grow |C|, start, last
large
35,814
17,368
8,625
7,336
1,407
10,236
63,924
25,995
15,001
8,818
2,176
18,007
1.78
1.50
1.74
1.20
1.55
1.76
www.izbi.de/onex
Full period (May. 04 - Feb. 08)
##monthly
thl
changes:
Ontology
Add
Del
Obs
adr
NCI Thesaurus
GeneOntology
-- Biological Process
-- Molecular Function
-- Cellular Components
ChemicalEntities
627
200
146
36
18
256
2
12
7
3
2
62
12
4
2
2
0
0
42.4
12.2
16.2
6.8
8.9
4.1
Last year (Feb. 07 - Feb. 08)
add-frac del-frac obs-frac
1.3%
0.9%
1.2%
0.4%
1.0%
1.8%
0.0%
0.1%
0.1%
0.0%
0.1%
0.5%
0.0%
0.0%
0.0%
0.0%
0.0%
0.0%
Add
Del
Obs
416
222
133
69
19
384
0
20
10
7
3
67
5
5
2
3
0
0
Hartung, M; Kirsten, T; Rahm, E.: Analyzing the Evolution of Life Science Ontologies and Mappings.
Proc. 5th Intl. Workshop on Data Integration in the Life Sciences (DILS), 2008
47
`
High change rates in
¾ Ontologies
O
l i
¾ Instances
¾ Annotations
`
(instance-concept associations)
Ontology
gy mappings
pp g ((between versions of two
ontologies) also change frequently, especially for
instance-based match approaches
¾ correspondences
versions
`
48
may disappear in newer mapping
Consideration of instance overlap or metadatametadata
bases similarity may not be sufficient for
„good“ ontology
determining
g „g
gy mappings
pp g
`
Standard match approaches only consider
i f
information
ti
about
b t currentt ontology
t l
versions
i
and
d
ignore evolution history
Is the black correspondence as
good as the red one?
¾ Possible instabilities of match
correspondences due to
evolution
l ti off ontologies
t l i and/or
d/
related data source
ƒ Idea: Consider the evolution of a match correspondence to assess its
stability/quality in the current version
* Thor, A; Hartung, M; Gross, A; Kirsten, T; Rahm, E.:An Evolution-based Approach for
Assessing Ontology Mappings - A Case Study in the Life Sciences. Proc. BTW, 2009
49
`
Average Stability
1 n −1
stabAvg n ,k (a, b, m ) = 1 − ⋅ ∑ simi +1 (a, b, m ) − simi (a, b, m ) ∈ [0,1]
k i =n−k
`
Weighted Maximum Stability
¾ Proximity of similarities in the last versions compared to
the current
version
⎡ simn (a, b, m ) − simn −i (a, b, m ) ⎤
stabWM n ,k (a, b, m ) = 1 − max ⎢
⎥ ∈ [0,1]
i =1...k
i
⎣
⎦
Correspondence
similaritty
1
stabAvg 6,5 stabWM 6,5
0,8
0,6
0,4
0,2
0
1
50
2
3
4
Version
5
6
(a1,b1)
0.9
0.95
(a2,b2)
0.7
0.9
(a3,b
b3 )
09
0.9
06
0.6
Setting
` Mapping GO Biological Processes to Molecular Functions
` Instance based matching (using Ensembl source)
` Result: 2
2,497
497 correspondences (Base3 ∩ Min≥ 00.8)
8) which
existed in the last 5 versions
` Selection of correspondences based on similarity and stability
`
accepted
55%
candidates
15%
questionable
30%
51
`
Instance-based match approaches
`
`
`
Important since instances reflect well semantics of categories
Availability of usable instances may be restricted to subset of
concepts (consideration of indirectly associated instances helpful)
Need to be combined with metadata-based
metadata based techniques
Correct ontology mappings NOT limited to 1:1
correspondences
` High
h change
h
rates for
f ontologies/instances
l
may result
l in
unstable ontology mappings
` Matching
atc
g based on
o s
shared
a ed instances
sta ces
`
`
`
`
Instance based matching in COMA++
Instance-based
`
`
52
Different similarity measures to consider instance overlap
Especially applicable in bioinformatics (frequent annotations)
3 basic instance matchers (constraint-based, content-based) not
requiring shared instances
l bl combination
b
h many metadata-based
d
b
d approaches
h and
d
Flexible
with
different match strategies
Evaluation and validation of large ontology
mappings
i
` Combined study of ontology matching and instance
(entity) matching
`
Correspondences based on instance similarity not equality
¾ Entity matching utilizing category similarity
¾ Automatic instance categorization
¾
Scalable instance match approaches based on
machine
hi learning
l
i
` Ontology Evolution
` Ontology
O t l
M
Merging
i
`
53
http://se-pubs.dbs.uni-leipzig.de
54
`
`
`
`
`
`
`
55
Aumüller, D., Do, H., Massmann, S., Rahm, E.: Schema and
ontology matching with COMA++. Proc. ACM SIGMOD, 2005
Hartung M, Kirsten T., Rahm E.: Analyzing the evolution of life
science ontologies and Mappings. Proc. DILS 2008. Springer LNCS
5109
Kirsten T., Thor A., Rahm E.: Instance-based matching of large
life science ontologies. Proc. DILS 2007. Springer LNCS 4544
Massmann, S.; Rahm, E.: Evaluating Instance-based
Instance based matching of
web directories. Proc. 11th Int. Workshop on the Web and
Databases (WebDB), 2008
Rahm, E., Bernstein, P.: A survey of approaches to automatic
schema matching.
matching The VLDB Journal
Journal, 10(4): 334
334-350,
350 2001
2001.
Thor A., Hartung M., Groß A., Kirsten T., Rahm E.: An evolution-
based approach for assessing ontology mappings - A case study
in the life sciences. Proc. 13th German Database Conf. (BTW),
2009
Thor, A., Kirsten, T., Rahm, E.: Instance-based matching of
hierarchical ontologies. Proc. 12th German Database Conf. (BTW),
2007