Erhard Rahm Ontologies Ontology Matching Instance
Transcription
Erhard Rahm Ontologies Ontology Matching Instance
Erhard Rahm http://dbs.uni-leipzig.de June 3, 2009 Ontologies ` Ontology Matching ` ¾ Problem ¾ Match ` techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ ¾ Constraint- / Content-based Matching ¾ Matching web directories ` Matching by Instance overlap ¾ Similarity measures ¾ Evaluation: Product catalogs, g , biomedical ontologies g Stability of ontology mappings ` Conclusions ` 2 ` Support a shared understanding of terms/concepts in a domain ¾ Annotation ontology of data instances by terms/concepts of an ` Semantically organize information of a domain ` Support data integration ` Sample ontologies ¾ Find data instances based on concepts (queries, navigation) ¾ e.g. by mapping data sources to shared ontology ¾ Product catalogs of companies companies, e e.g. g online shops ¾ Web directories ¾ Biomedical ontologies 3 Hierarchical categorization of products ` Instances: product d d descriptions ` Often very large: ten thousands categories, millions of products ` 4 ` Categorization of websites Instances: website descriptions p ((URL,, name,, content description) Manual vs. automated category assignment of instances ` General lists or specialized (per region, topic, etc.), e.g. ` ` ¾ Yahoo! Directory ¾ Dmoz – Open Directory Project (ODP) ¾ Google l Directory – based b d on Dmoz ¾ Business.com ¾ Vfunk: Global Dance Music Directory 5 ` Many ontologies for different disciplines, e.g. ¾ Molecular Biology, etc. Biology Anatomy, Anatomy Health etc Largest ontologies (> 10,000 concepts), e.g., Gene gy (GO), ( ), NCI Thesaurus Ontology ` Ontologies used to annotate genes and proteins ` 9 Support for “functional” data analysis ` Instances: annotated objects; separate from ontology Protein SwissProt Gene E Entrez 6 instance associations Molecular Function GO Biological Process Genetic Disorders OMIM GO AgBase 7 Focus on practically used ontologies ` Ontology O l O consists off a set off concepts/categories interconnected by relationships (e.g. of type „is-a“ or „part-of part-of“)). O is represented by a DAG and has a designated root concept. ¾ Concepts p have attributes, e.g. g Id,, Name,, Description p ¾ Concepts may have associated instances ` Ontologies may be versioned ` Instances ` ¾ May be managed together with ontology or i d independently d l ¾ May be associated to several concepts schemas even per concept ¾ May have heterogeneous schemas, 8 O1 pp g Mapping: Matching O2 O1, O2 iinstances t ` O1.e11, O2.e23, 0.87 O1.e13,O2.e27, 0.93 … further input, e.g. dictionaries Process of identif identifying ing semantic correspondences between 2 ontologies ¾ Result: ontology mapping ¾ Mostly equivalence mappings: correspondences specify equivalent ontology concepts ` Variation of schema matching problem 9 Shopping.Yahoo.com Electronics DVD Recorder Beamer Digital Cameras Digital Photography Digital Cameras ` Ontology mappings useful for Electronics & Photo TV & Video DVD Recorder Projectors Camera & Photo Digital Cameras IImproving i query results, l e.g. to fi find d specific ifi products d ` Advanced (cross-site) product recommendations ` Automatic categorization of products in different catalogs ` Merging catalogs ` 10 Amazon.com Protein SwissProt instance associations Gene Entrez ` Molecular Function G GO ? Genetic Disorders OMIM GO ? Ontology mappings useful for Improved analysis ` Answering questions such as “Which Molecular Functions are involved in which Biological Processes? Processes?” ` Validation (curation) and recommendation of instance associations ` Ontology merge or curation, e.g. to reduce overlap between ontologies ` 11 High degree of semantic heterogeneity in independently developed ontologies ` Syntactic differences ` ` ` ¾ Different models and languages g g ¾ ¾ Different is-a and part-of hierarchies O l Overlapping i categories i ¾ Naming ambiguities and conflicts ` Different scope Heterogeneous instance representations Structural differences Semantic differences Modeling errors / inconsistencies ` Instance / content differences ` ` ` 12 Biological Process automatic generic solutions ? Fully automatic, Metadata-based Element • Names • Descriptions ` Structure Constraintbased Linguistic • Types • Keys Instance-based Element Constraintbased Linguistic • Parents • Children • Leaves L • IR (word frequencies, k tterms)) key Reuse-oriented Element Structure • Dictionaries • Thesauri • Previous match results Constraintbased • Value pattern and ranges Matcher combinations Hybrid matchers ` Composite matchers ` * Rahm, E., P.A. Bernstein: A Survey of Approaches to Automatic Schema Matching. VLDB Journal 10(4), 2001 13 ` semantics of a category may be better expressed by g y than by y the instances associated to category metadata (e.g. concept name, description) ¾ Categories with most similar instances should match Main problem: Availability of (shared/similar) instances for most/all concepts ` Common C cases: ? ? O1 O2 O1 ` ontology associations Common instances a) Common instances (separate from ontologies) Example: Documents/Objects annotated by O1, O2 terms / concepts 14 O1 instances ? O2 O2 instances b) Ontology-specific instances b1) with shared instances b2) without shared (but similar) instances Many prototypes for schema or ontology matching * ` Instance based schema matching (XML, Instance-based (XML relational) ` ¾ SEMINT ¾ LSD ¾ Clio ¾ iMap ¾ Dumas ` Instance-based ontology matching (OWL) ¾ GLUE, GLUE U off Washington W hi t ¾ COMA++, U Leipzig (supports schema + ont. matching) ¾ FOAM / QOM QOM, U Karlsruhe ¾ Sambo, Linköping U, Sweden ¾ Falcon-AO, South East U, China ¾ RiMOM, Tsinghua U, China * Euzenat/Shvaiko: Ontology matching. Springer 2007 15 Mappings for O1 , Mappings for O2 Relaxation Labeling Similarity Matrix Common Knowledge C K l d & Similarity Estimator Domain Constraints Similarityy Function Joint Probability Distribution P(A,B), P(A’, B)… Meta Learner Distribution Base Learner Base Learner Estimator Taxonomy O1 (tree structure + data instances) ` ` ` ` 16 Taxonomy O2 (tree structure + data instances) Use of machine learning to find ontology mappings Base learners use concept names + data instances (description) Similarity measures computed from “joint probability distribution” of concepts Evaluation on comparatively small ontologies: 3 match tasks, per ontology: 34-331 concepts, 6-30 non-leaf concepts, 150014000 instances, instances 34-236 34 236 correspondences * Doan, AH; et al: Learning to Match Ontologies on the Semantic Web. VLDB Journal, 12(4):303-319, 2003 A,S Concept A Concept S ¬A, A S A,¬S Hypothetical universe of f all examples ¬A,¬S P(A ∩ S) Sim(Concept A, Concept S) = P(A ∪ S) [Jaccard] = P(A,S) P(A,¬S) + P(A,S) + P(¬A,S) Joint Probability Distribution: P(A,S),P(¬A,S),P(A,¬S),P(¬A,¬S) different similarity measures usable based on JPD 17 Mutual use of trained classifiers to determine instanceinstance-concept associations (requires no shared but only similar instances) ¬A,¬S A Taxonomy 1 ¬A,S A,¬S Taxonomy 2 ¬A,¬SS ¬S ¬A A,¬S CLA A,S A,S A ¬A,S CLS ¬A JPD estimated by counting the sizes of the partitions 18 S ¬S System for aligning and merging biomedical ontologies ` Framework to find similar concepts in overlapping g for alignment g and merge g tasks OWL ontologies ` ` Combined use of different matchers and auxiliary information ` Linguistic Linguistic, structure-based, structure-based constraint-based ` Instance-based matching Based B d on ttexts t ((e.g., papers)) Two concepts are similar if a document describes both concepts ` description d i i llogic i reasoner checks results for ontology consistency and cycles * Lambrix, P; Tan, H.: SAMBO – A system for aligning and merging biomedical ontologies. Journal of Web Semantics, 4(3):196-206 , 2006 19 ` Dataset ¾ extracted d ffrom G Google, l Y Yahoo h and d Looksmart L k web b directory ¾ More than 4,500 4 500 simple node matching tasks, tasks no instances Comparison of matching quality results (top-3 systems of each year ) • 20 In 2008 the systems y together g did not manage to discover 48% of the total number of positive correspondences OAEI (Ontology Alignment Evaluation Initiative) Alignment Contest, http://oaei.ontologymatching.org Ontologies ` Ontology Matching ` ¾ Problem ¾ Match ` techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ ¾ Constraint- / Content-based Matching ¾ Matching web directories ` Matching by Instance overlap ¾ Similarity measures ¾ Evaluation: Product catalogs, g , biomedical ontologies g Stability of ontology mappings ` Conclusions ` 21 Extends previous COMA prototype (VLDB2002) ` Matching of XML & rel rel. Schemas and OWL ontologies ` Several match strategies: Parallel (composite) and ` sequential q matching; g Fragment-based g matching g for large g schemas; Reuse of previous match results Model M d l Pool P l Directed graphs S1 S2 Import, Load, Save Model Manipulation 22 Match Strategy Component identification {s11, s12, ...}} {s21, s22, ...} Mapping Pool Matcher execution Similarity Combination Matcher 1 s11↔s21 s12↔s22 s13↔s23 Matcher 2 Matcher 3 Similarity cube Mapping s Mapping Nodes, ... Paths, ... Name, Children, Leaves, NamePath, … Aggregation, Direction, Selection, CombinedSim Diff, Intersect, Union, MatchCompose, Eval, ... Component Types Matcher Library Combination Library Mapping Manipulation *Schema and Ontology Matching with COMA++. Proc. SIGMOD 2005 Repository (persistent) & Workspace (in‐memory) Current Mapping Domains Schemas/ Ontologies Mappings Source Schema Schema/ mapping info pp g Target Schema 23 Configuration of matcher Metadata-based Reuse-based Instance-based User-programmed 24 Configuration of match strategies S Source Schema S h I t Intermediate Schema di t S h Mapping Excel <‐> Noris Excel < > Noris T Target Schema tS h Mapping Noris <‐ Noris < > Noris_Ver2 > Noris Ver2 25 ` Instance matchers introduced in 2006 Constraint-based matching ` Content-based matching: 2 variations ` ` Coma++ maintains instance value set per element XML schema instances ( Ontology instances 26 ` Instance constraints are assigned to schema elements ¾ ¾ ¾ General constraints: always applicable Example: average length and used characters (letters (letters, numeral numeral, special char.) Numerical constraints: for numerical instance values Example p : p positive or negative, g , integer g or float “M @ “[email protected]” il ” vs. Pattern constraints: “[email protected]” Example: Email and URL Use of constraint similarity matrix to determine element l t similarity i il it (lik (like d data t type t matching) t hi ) ` Simple and efficient approach ` ¾ ` Effectiveness depends on availability of constrained value ranges / pattern A Approach h does d nott require i shared h d iinstances t Determine Constraints General constraints element1 Numerical constraints element2 Pattern constraints Compare Constraints c11↔c21 … c1h↔c2k Similarity value Using U i Constraint C t i t Similarity Matrix 27 ` ◦ ◦ ◦ ` ◦ ◦ 2 variations Value Matching: pairwise similarity comparison of instance values Document (value set) matching: combine all instances into a virtual document and compare documents Both approaches do not require shared instances Value matching Use any similarity measure for pairwise value comparison Aggregate individual similarity values (similarity matrix) into a combined concept similarity (e (e.g., g based on Dice) Compare pair-wise Instance Values 28 element1 instance11↔instance21 instance12↔instance21 element2 … i i instance 1n↔instance 2m Aggregation Similarity value Similarity Matrix ` Document matching 1 iinstance d document per category or selected l d string i category attribute (e.g. description) ¾ Document comparison p based on TF-IDF to focus on most significant terms ¾ ` Two options to deal with multiple string attributes All values l ffor these h attributes ib are h handled dl d as one virtual i l document ¾ Independent p matching g per p attribute and aggregation gg g of the similarity values ¾ element1 element2 Compare virtual documents document1 document2 instance11 instance21 instance12 Similarity function instance22 … instance1n based on TF IDF TF-IDF Similarity value … instance2m 29 39 of 51 test cases based on instances ` 2966 correspondences in reference alignment ` 2400 2200 2000 1800 1600 1400 1200 1000 800 600 400 200 0 All Corresp. Correct Corresp.- ConstraintM hi Matching F-Measure: 0.15 30 ContentM hi Matching 0.61 NameType 0.64 Content + NameType 0.82 Constraint+ C t t+N T Content+NameType 0.82 ` Instance-based matching between 4 web directories, limited to online shops Dmoz D 746 15,304 #Categories #Direct instances Google G l 728 15,082 Web W b 418 13,673 Dmoz Yahoo Y h 3,234 34,949 Yahoo Clothing Sports Sports Swimwear Water Sports Swimming and Diving Swimming and Diving Gear and Equipment Apparel URL =http://www.beachwear.net Name =The Beachwear Network Description =Selection Selection beachwear. URL =http://www http://www.skinzwear.com/ skinzwearofcom/ Name =Skinz Deep Description =Swimwear, bikinis and URL =http://www.ritchieswimwear.com/ streetwear. Name = Ritchie Swimwear Description p =Designer g brand for women, men and little girls. URL =www.skinzwear.com Name =Skinz Deep, p, Inc. Description =Bikinis, swimwear, URL =www.ritchieswimwear.com beachwear, and streetwear Name =Ritchie Swimwearfor men and women. Description =Offers bathing suits, beachware, and cover-ups for men, women, and children. Stores located throughout South Florida. * Massmann, S., Rahm, E.: Evaluating . Evaluating Instance-based Matching of Web Directories. Proc. WebDB 2008 31 ` ` Instances are shop websites Instance-based matching on 3 attributes: shop URL, name, description ¾ ` Use of directly and indirectly associated instances URL matcher based an value matching ¾ After URL preprocessing, equal URLs are needed (same shops in different directories) to find matching categories goodbooks.net/index.asp http://www.goodbooks.net Preprocessing Preprocessing goodbooks.net 32 GoodBooks.net Preprocessing ` Name matcher based on value matching ` Description matcher based on document matching ` Name / description p matching g do not need shared instances ` Six match tasks Æ six reference mappings # Corresp Dmoz ↔ Google Dmoz ↔ Web Dmoz ↔ Yahoo Google ↔ Web 729 218 436 211 (manually created) Google ↔ Web ↔ Yahoo Yahoo 416 235 ∑ 2245 33 ` Combination of three instance-based matchers (URL, name, description) and six metadata-based matchers minimum and maximum i values for the six match tas s tasks best single metadata-based matche 34 best single instance-based matcher Combination: all 3 instance-based and 3 metadata-based matchers (Path Name, Name Parent), Parent) (Path, average Fmeasure: 0.79 Ontologies ` Ontology Matching ` ¾ Problem ¾ Match ` techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ ¾ Constraint- / Content-based Matching ¾ Matching web directories ` Matching by Instance overlap ¾ Similarity measures ¾ Evaluation: Product catalogs, g , biomedical ontologies g Stability of ontology mappings ` Conclusions ` 35 ` Use of instance overlap for ontology matching: t two concepts t are related l t d / similar i il if th they share h a significant number of associated objects ` Different measures to determine the instance-based similarity ` ` Base-K; Dice, Min, Jaccard … Extensions: Consideration of indirect instance associations ` Combination with other match approaches ` Consideration C id ti off similar i il (b (butt non-identical) id ti l) objects bj t ` * Thor, A; Kirsten, T; Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. BTW, 2007 Kirsten, T, Thor, A; Rahm, E.: Instance-based matching of large life science ontologies. Proc. DILS, 2007 36 Amazon by Brands Microsoft Novell Softunity by Category Books DVD Kid & H Kids Home Software Software Languages Utilities & Tools Traveling B i Business &P Productivity d ti it Operating System Windows Burning Software Operating System Handheld Software Linux Id =158298302X EAN = "662644467122" Title = "SuSE Linux 10.1 (DVD)" Price = 49.99 Ranking = 180 Id = B0002423YK EAN = 0805529832282 Title = "Windows XP Home Edition incl. SP2" Price = 191.91 Ranking = 47 Id = ECD435127K EAN = 0662644467122 ProductName = "SuSE Linux 10.1" DateOfIssue = 02.06.2006 P i =Id59 Price 59.95 = 95 ECD851350K EAN = 0805529832282 ProductName = "WindowsXP Home" DateOfIssue = 15.10.2004 Price = 238.90 37 ` Baseline similarity SimBaseK ⎧1 , if N c1c2 >= K SimBaseK (c1 , c2 ) = ⎨ ⎩ 0 , if N c1c2 < K Dice similarity SimDice 2 ⋅ Nc c Sim Dice (c1 , c2 ) = Nc + Nc c1∈O1 Example: 4 3 I 1 2 1 ∩=2 c2∈O2 2 SimBase1 = SimBase2 = 1, SimBase3 = 0 S Dice = 2*2/(4+3) Sim * /( ) = 0.57 Minimum similarity SimMin SimMin (c1 , c2 ) = 38 SimMin = 2/3 = 0.67 N c1c2 min( N c1 , N c2 ) 0 ≤ SimDice ≤ SimMin ≤ SimBase1 ≤ 1 ` Computation of precision & recall needs a perfect mapping i ((reference f alignment) li t) ¾ Laborious for large ontologies ¾ Might not be well-defined well defined Syntactic measures to “approximate” recall / precision ` Match coverage: fraction of matched categories ` MatchCover ageO1 = ` | CO1 | ∈ [0...1] |C | + | CO 2− Match | InstMatchC overage = O1 − Match | CO1 − Inst | + | CO2 − Inst | Combined Match ratio: #correspondences per matched concept MatchRatioO1 = ` | CO1 − Match | | CorrO1−O 2 | ≥1 | CO1− Match | CombinedMatchRatio = 2⋅ | CorrO1−O 2 | ≥1 | CO1− Match | + | CO 2− Match | Goal: high Match Coverage with low Match Ratio 39 Amazon (AM) vs. Softunity (SU) ` Baseline1: max max. Match Coverage, high Match Ratios ` SimMin: good Match Coverage, moderate Match Ratios ` SimDice: low Match Coverage, low Match Ratios ` O1 CorrO1O2 O2 SU RatioO1 RatioO2 InstO1O2 5.4 2.1 AM # concepts (product categories) 470 1,856 # concepts having instances 170 1,723 (products)) # instances (p 2,576 , 18,024 , # direct associations 2,576 25,448 1 ≈ 1.4 ≈15 15 ≈15 15 # associations / # instances # Instances / #concepts SU I2 SU 1872 335 2.7 1.1 AM SU AM SU 849 71 1.2 1.0 AM 27% 80% Baseline1 40 AM 100% CoverageO1O2 I1 711 SU AM Min (100%) SU 535 AM Dice (50%) ` Ontologies ¾3 ¾ subontologies g of GeneOntology gy Genetic disorders of OMIM Instances: Ensembl proteins of 3 species, i.e. homo sapiens, mouse, rat Homo Sapiens Mus Musculus ` Only subset of concepts 3,018 2,810 288 110 h associated has i t d iinstances t 201 ` 2,452 Molecular Function Biological Process Cellular Component 77 Genetic Disorder 47 133 2 709 2,709 Rattus Norvegicus Ensembl Proteins of different species Number of associated Biological Processes (total # processes: 12,555) 41 Human 1,0 0,8 0,6 0,4 0,2 , 0,0 Rat Match Coverage M t h Ratios Match R ti per ontology t l MF - CC BP - CC SimKaappa Sim Dice Sim mMin SimB Base SimKaappa Sim Dice Sim mMin Sim mMin SimB Base MF - BP MF - BP 42 Mouse SimB Base ` SimKaappa ` SimBase: high Match Coverage (99%) w.r.t. concepts having instances, very high Match Ratios SimDice: low Coverage (< 20%) and low Match Ratios SimMin: good Coverage (60%-80%) with moderate Match Ratios Sim Dice ` MF BP Base 20.4 17.0 Min 4.4 4.0 Dice 1.3 1.2 (Match Ratios for Homo Sapiens, MF-BP task) Simple matcher on concept names Relatively low Match Coverage (however w.r.t. w r t all concepts including instance-free concepts) ` ` ` No correspondences for similarity ≥ 0.9 ` Low similarity thresholds (e.g.< 0.6) too imprecise Match Coverage p er ontology 0,5 0,6 0,7 0,8 0,50 0,45 0,40 0,35 0,30 0,25 0,20 0,15 0,10 , 0,05 0,00 MF - BP MF MF BP MF MF - BP CC BP MF - CC CC BP - CC Match Coverage per ontology BP 0.5 4.4 6.9 06 0.6 24 2.4 29 2.9 0.7 1.4 1.4 0.8 1.1 1.1 Match Ratios per ontology 43 Combinations between instance- (SimMin) and metadata-based match approach ` ` Union: U i Increased I dM Match t hC Coverage an M Match t hR Ratios ti ` Intersection: Low Match Coverage (<1%) Low overlap between instance- and metadatab based d mappings ` 0,5 Matc ch Coverage pe er Ontology 1,00 0,6 0,7 0,8 Match Coverage per ontology for combined mappings 0,80 0,60 MF - BP 0,40 MF BP ∪ 4.1 3.7 ∩ 1.0 1.0 0,20 0,00 MF BP MF - BP 44 Match Ratios per ontology (Name threshold 0.7) CC MF MF - CC BP CC BP - CC (SimMin = 1.0, Homo Sapiens) Automatically vs. manually assigned annotations ` Example: Annotations in Ensembl (July 2008) – 46,704 46 704 proteins ` Automatically assigned M Manually ll assigned i d Sum returns small mappings of likely improved quality all ¾ Restriction to manual annotations 82% 18% BP 57824 22951 80775 72% 28% |CorrBP_MF| Ontology mappings for Base3,Min man ` MF 82466 17729 100195 Base3 Min ∩ Base3 Base3 Min ∩ Base3 21386 3275 3835 758 |CBP| |CMF| 1939 1107 899 435 1393 1107 533 285 man all MCBP MCMF MRBP MRMF 45 Base3 0,13 0,17 11,0 15,4 Min ∩ Base3 0 08 0,08 0 13 0,13 30 3,0 30 3,0 Base3 0,06 0,06 4,3 7,2 Min ∩ Base3 0,03 0,03 1,7 2,7 Ontologies ` Ontology Matching ` ¾ Problem ¾ Match ` techniques and prototypes (e.g., GLUE) Instance-based matching in COMA++ ¾ Constraint- / Content-based Matching ¾ Matching web directories ` Matching by Instance overlap ¾ Similarity measures ¾ Evaluation: Product catalogs, g , biomedical ontologies g Stability of ontology mappings ` Conclusions ` 46 Continous evolution of ontologies (many versions) ` Evolution analysis of 16 life science ontologies: ` ¾ Average of 60% growth in last four years ¾ Deletes and changes also common Ontology NCI Thesaurus GeneOntology -- Biological Process -- Molecular Function -- Cellular Components ChemicalEntities size |C| start |C| last grow |C|, start, last large 35,814 17,368 8,625 7,336 1,407 10,236 63,924 25,995 15,001 8,818 2,176 18,007 1.78 1.50 1.74 1.20 1.55 1.76 www.izbi.de/onex Full period (May. 04 - Feb. 08) ##monthly thl changes: Ontology Add Del Obs adr NCI Thesaurus GeneOntology -- Biological Process -- Molecular Function -- Cellular Components ChemicalEntities 627 200 146 36 18 256 2 12 7 3 2 62 12 4 2 2 0 0 42.4 12.2 16.2 6.8 8.9 4.1 Last year (Feb. 07 - Feb. 08) add-frac del-frac obs-frac 1.3% 0.9% 1.2% 0.4% 1.0% 1.8% 0.0% 0.1% 0.1% 0.0% 0.1% 0.5% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% Add Del Obs 416 222 133 69 19 384 0 20 10 7 3 67 5 5 2 3 0 0 Hartung, M; Kirsten, T; Rahm, E.: Analyzing the Evolution of Life Science Ontologies and Mappings. Proc. 5th Intl. Workshop on Data Integration in the Life Sciences (DILS), 2008 47 ` High change rates in ¾ Ontologies O l i ¾ Instances ¾ Annotations ` (instance-concept associations) Ontology gy mappings pp g ((between versions of two ontologies) also change frequently, especially for instance-based match approaches ¾ correspondences versions ` 48 may disappear in newer mapping Consideration of instance overlap or metadatametadata bases similarity may not be sufficient for „good“ ontology determining g „g gy mappings pp g ` Standard match approaches only consider i f information ti about b t currentt ontology t l versions i and d ignore evolution history Is the black correspondence as good as the red one? ¾ Possible instabilities of match correspondences due to evolution l ti off ontologies t l i and/or d/ related data source Idea: Consider the evolution of a match correspondence to assess its stability/quality in the current version * Thor, A; Hartung, M; Gross, A; Kirsten, T; Rahm, E.:An Evolution-based Approach for Assessing Ontology Mappings - A Case Study in the Life Sciences. Proc. BTW, 2009 49 ` Average Stability 1 n −1 stabAvg n ,k (a, b, m ) = 1 − ⋅ ∑ simi +1 (a, b, m ) − simi (a, b, m ) ∈ [0,1] k i =n−k ` Weighted Maximum Stability ¾ Proximity of similarities in the last versions compared to the current version ⎡ simn (a, b, m ) − simn −i (a, b, m ) ⎤ stabWM n ,k (a, b, m ) = 1 − max ⎢ ⎥ ∈ [0,1] i =1...k i ⎣ ⎦ Correspondence similaritty 1 stabAvg 6,5 stabWM 6,5 0,8 0,6 0,4 0,2 0 1 50 2 3 4 Version 5 6 (a1,b1) 0.9 0.95 (a2,b2) 0.7 0.9 (a3,b b3 ) 09 0.9 06 0.6 Setting ` Mapping GO Biological Processes to Molecular Functions ` Instance based matching (using Ensembl source) ` Result: 2 2,497 497 correspondences (Base3 ∩ Min≥ 00.8) 8) which existed in the last 5 versions ` Selection of correspondences based on similarity and stability ` accepted 55% candidates 15% questionable 30% 51 ` Instance-based match approaches ` ` ` Important since instances reflect well semantics of categories Availability of usable instances may be restricted to subset of concepts (consideration of indirectly associated instances helpful) Need to be combined with metadata-based metadata based techniques Correct ontology mappings NOT limited to 1:1 correspondences ` High h change h rates for f ontologies/instances l may result l in unstable ontology mappings ` Matching atc g based on o s shared a ed instances sta ces ` ` ` ` Instance based matching in COMA++ Instance-based ` ` 52 Different similarity measures to consider instance overlap Especially applicable in bioinformatics (frequent annotations) 3 basic instance matchers (constraint-based, content-based) not requiring shared instances l bl combination b h many metadata-based d b d approaches h and d Flexible with different match strategies Evaluation and validation of large ontology mappings i ` Combined study of ontology matching and instance (entity) matching ` Correspondences based on instance similarity not equality ¾ Entity matching utilizing category similarity ¾ Automatic instance categorization ¾ Scalable instance match approaches based on machine hi learning l i ` Ontology Evolution ` Ontology O t l M Merging i ` 53 http://se-pubs.dbs.uni-leipzig.de 54 ` ` ` ` ` ` ` 55 Aumüller, D., Do, H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. Proc. ACM SIGMOD, 2005 Hartung M, Kirsten T., Rahm E.: Analyzing the evolution of life science ontologies and Mappings. Proc. DILS 2008. Springer LNCS 5109 Kirsten T., Thor A., Rahm E.: Instance-based matching of large life science ontologies. Proc. DILS 2007. Springer LNCS 4544 Massmann, S.; Rahm, E.: Evaluating Instance-based Instance based matching of web directories. Proc. 11th Int. Workshop on the Web and Databases (WebDB), 2008 Rahm, E., Bernstein, P.: A survey of approaches to automatic schema matching. matching The VLDB Journal Journal, 10(4): 334 334-350, 350 2001 2001. Thor A., Hartung M., Groß A., Kirsten T., Rahm E.: An evolution- based approach for assessing ontology mappings - A case study in the life sciences. Proc. 13th German Database Conf. (BTW), 2009 Thor, A., Kirsten, T., Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. 12th German Database Conf. (BTW), 2007