Object Matching for Improving Information Quality.
Transcription
Object Matching for Improving Information Quality.
Erhard Rahm http://dbs.uni-leipzig.de http://dbs.uni-leipzig.de/wdi-lab November 25, 2009 Innovation Lab at Univ. of Leipzig on semantic Web Data Integration ` Funded by BMBF (German ministry for research and education) ` ¾ 2009: initial phase ¾ Full funding starts in Jan. 2010 (10 full-time employees l + students) d ) ` Goals ¾ Semi-automatic, Semi automatic high quality data integration of heterogeneous (web) data ¾ Faster development p of data integration g solutions than with traditional integration approaches, e.g. data warehouses M k research h approaches h ready d for f the th market k t ¾ Make 2 ` Mashup/Workflow-like data integration ¾ Framework F k to specify if and d execute workflows kfl f data for d acquisition from web sources, data transformation, integration and analysis ¾ Support for dynamic (runtime) data integration ¾ Research prototype: iFuice + extensions BusinessIntelligence Analysis Query generation (e g for Google (e.g. product search) Data extraction and transformation Product list Object Matching Generation of web service requests (e.g. Amazon.com) Repository 3 ` Ontology and Schema matching ¾ Semi-automatic S i i generation i off mappings i b between related schemas (e.g., XML business schemas) or ontologies (e.g., product catalogs) ¾ Support for large schemas/ontologies ¾ Research prototype: COMA++ Shopping.Yahoo.com Electronics DVD Recorder Beamer Digital Cameras Digital Photography Camcorders 4 Amazon.com Electronics & Photo TV & Video DVD Recorder Projectors Camera & Photo Digital Cameras ` Object j Matching g ((Entity y resolution,, Deduplication) p ) ¾ Effective strategies for matching related objects (entities, instances) from one or several sources ¾ Offline ffl matching h ( (e.g. with h data d warehouse) h ) and d online l matching (e.g., within mashup applications) ¾ Research prototypes: MOMA MOMA, FEVER 5 ` Introduction (Object Matching) ` FEVER platform for object matching strategies ¾ Architecture ¾ Manually M ll specified ifi d match h strategies i ((operator trees)) ¾ Training-based learning of match strategies ¾ Evaluation ` Dynamic object matching in mashups ¾ OCS ` (Online (O li Citation Ci i Service) S i ) Instance-based ontology gy matching g ¾ Approaches ¾ Support ` 6 in COMA++ Conclusions Data Quality Problems Single-Source Problems Schema Level (Lack of integrity constraints, poor schema design) - Uniqueness U i - Referential integrity … Better schemas/ models Instance Level (Data entry errors) Multi-Source Problems Schema Level Instance Level (Heterogeneous data models and schema designs) (Overlapping, contradicting and inconsistent data) - Misspellings - Redundancy/duplicates - Contradictory values … - Naming conflicts -Structural conflicts … - Inconsistent aggregating - Inconsistent timing … Object matching Schema matching, conflict resolution Object matching (entity resolution, duplicate detection, reference reconcilation …)) * E. Rahm, H. H. Do: Data Cleaning: Problems and Current Approaches. IEEE Techn. Bull. Data Eng., Dec. 2000 7 ` Identify semantically equivalent (matching) objects within one data source or between different sources ¾ to integrate (merge) them, compare them, improve data quality, etc. ¾ ` Most previous work for structured (relational) data Source1: Customer Cno LastName FirstName Gender Address 23 Harley St, Chicago 24 Smith Christoph M IL, 60633-2394 493 Smith Source2: Client 8 Kris L. F 2 Hurley Place, South Fork MN, 48503-5998 Phone/Fax 333-222-6542 / 333-2226599 444-555-6666 CID Name Street City 11 Kristen Smith 2 Hurley Pl South Fork, MN 48503 24 Christian Smith Hurley St 2 S Fork MN Sex 0 1 9 Duplicates due to • Order of authors • Extraction error ((title,, author)) • Different titles! • Typos (author name) etc. etc 10 Object matching approaches Value-based Context-based ... Multiple attributes unsupervised p supervised p ... • Aggregation function with threshold • User-specified Rules: • Hernandez et al. (SIGMOD 1995) • Clustering • Monge, Elkan (DMKD 1997) • Mc Callum et al. (SIGKDD 2000) • Cohen, Richman (SIGKDD 2002) • decision trees • Verykios et al. (Inf.Sciences 2000) • Tejada et al. (Inf. Systems 2001) • Support vector machine • Bilenko, Mooney (SIGKDD 2003) • Minton et al. (2005) 11 http://dc-pubs.dbs.uni-leipzig.de 12 • Hierarchies: • Ananthakrishna et.al. (VDLB 2002) • Graphs: • Bhattacharya, Getoor (DMKD 2004) • Dong et al. (SIGMOD 2005) • Ontologies l ... Single attribute ` Support combination of several match techniques ` Manual construction of combined strategies ¾ BN, BN ` MOMA, MOMA SERF … Learning-based frameworks ¾ FEBRL, FEBRL ` MARLIN, MARLIN TAILOR TAILOR, Active Atlas … Problems ¾ Evaluation results not conclusive ¾ High tuning effort needed ¾ Dependency d on training data d f learning-based for l b d approaches * H. Köpcke, E. Rahm: Frameworks for Entity Matching: An Overview. Data and Knowledge Engineering, 2009 13 ` Introduction (Object Matching) ` FEVER platform for object matching strategies ¾ Architecture ¾ Manually M ll specified ifi d match h strategies i ((operator trees)) ¾ Training-based learning of match strategies ¾ Evaluation ` Dynamic object matching in mashups ¾ OCS ` (Online (O li Citation Ci i Service) S i ) Instance-based ontology gy matching g ¾ Approaches ¾ Support ` 14 in COMA++ Conclusions FEVER = Framework for EValuating Entity Resolution ¾ Platform for configuration and evaluation of entity resolution (object matching) algorithms and strategies ¾ Key K features: f t ¾ Flexible specification of object matching workflows ¾ Semi Semi-automatic automatic parameter configuration (e.g., (e g similarity thresholds) ¾ Support for training-based matching to reduce manual tuning effort ¾ Comparative evaluations of different match approaches ¾ Köpcke, H.; Thor, A.; Rahm, E.: Comparative evaluation of entity resolution approaches with FEVER. Demo, Proc. VLDB, 2009 FEVE ER Core Comp ponents Workflow Edittor Runtime Engiine 15 16 Parameter Setting Param’s Operator Tree Execution • Logging • Breakpoints ea po ts Data Specification Entities Mappings • Source • Target • Generated M. • Training M. • Perfect M. Evaluation • Prec., Recall, ... Result • Reduc. Ratio, .. • Runtime, u t e, ... Operator Tree Specification Operator Library • Blockingg • Matcher • Combination • ... Quality Analysis: Quality vs. • Param. Effort • Labeling abe g Effort ot Configuration Specification Configuration Strategies for • Non-Learn. Matchers • Learn.-based Matchers ` Match results are represented as instance mappings ( (correspondences) d ) between b t 2 sources ¾ Mappings ` can be stored for re-use Source1 Source2 Sim p1 p‘1 1 p2 p‘1 p 09 0.9 p3 p‘3 0.8 Matchers also operate on mappings ¾ Cartesian product between input sources ¾ Output of previously executed matchers/operators 17 ¾ Describe workflows implementing a match strategy ¾ Leaves: data sources ¾ inner nodes: operators (for blocking, matching etc.) ¾ Execution in post-order post order traversal sequence ¾ Match result = Result of root operator Merge Matcher 1 Matcher 2 Blocking Source1 18 Source2 ¾ Blocking: Sorted Neighborhood, Canopy Clustering, … 9 Necessary to reduce search space from Cartesian product to more likely matching object pairs ¾ Attribute matchers (on preselected pair of attributes): ¾ string similarity (TFIDF, Jaccard, Cosine, Trigram, …)) ¾ PPJoinPlus, EdJoin ¾ External implementations, e.g., Fuzzy Lookup (MS SQL Server) ¾ Context matchers (e.g., Neighborhood matcher) ¾ Combination of match results: Merge, Compose 19 map1 1. Merge A1 Titl Matcher Title M t h map1 A2 map2 Author Matcher map2 • Overcome short-comings (e.g., precision or recall) 2. Compose A1 map1 A3 map2 • Efficient re-use of mappings 20 dblp A2 p1 p2 p4 p‘‘1 pp‘‘2 p‘‘4 p‘1 pp‘2 p‘3 map2 B1 map1 A1 B2 p1 map3 A2 p2 ... pn v1 p‘2 p‘n ... pp‘1 dblp v‘1 ` C Combine bi match t h mappings i with ith generall object bj t relationships l ti hi ` Bibliographic example: Conference@DBLP - Conference@ACM ` Attribute matching suffers from highly different values ` ` „Two conferences are the same if they share a significant number of publications.“ ` Reuse of match result for publications Very effective in experiments Thor, A.; Rahm, E.: MOMA - A Mapping-based Object Matching System. Proc. CIDR, 2007 21 Configuration Specification Operator Tree Specification Model Application • Learner (Dec. (Dec Tree, Tree SVM SVM, ...)) • Matcher selection • No. of examples • Selection scheme (Ratio, Random) • Similarity threshold 22 Model Generation Blocking Training Data Selection Training Data Source1 Source2 ¾ Use of training data to find effective matcher combination and configuration (supervised learning) ¾ Learners for model generation in FEVER: ¾ Decision Tree, Logistic Regression, SVM ¾ Multiple learning approach Cosine(title) > 0.629 - + Trigram(venue) > 0.197 - + EditDistance(year) > 0.25 Non-match Trigram(authors) > 0.7 + ... ... ... match tree-based Decision tree based match strategy 23 ` Training data: set of object pairs with manually l b ll d match/mon-match labelled t h/ t h decisions d i i ¾# ` training pairs should be low (limit manual effort) Training pairs should be non-trivial non trivial ¾ Similarity ` above a certain threshold Selection approaches in FEVER ¾ RANDOM: randomly select n object pairs above a similarity threshold t for labeling ¾ RATIO: reduce n randomly selected pairs (above sim. threshold t) so that at least a fraction ratio (<=0.5) (< 0 5) of matching or non-matching pairs are in the training set 9 ratio 0.4: 40%/60% matches/non-matches / / ((or vice versa)) 9 Balances positive and negative training 24 7 real data sources: ` Bibliographic: g p DBLP,, ACM Digital g library, y, GoogleScholar (GS) E-commerce: Abt.com, Buy.com, Amazon.com, G Google le Product P d t Search Se h (GP), (GP) from 1,100 to 64,000 objects per source ¾ ¾ ¾ 4 match tasks ` ¾ ¾ publications: DBLP-ACM DBLP-GS E-Commerce: Abt-Buy Amazon - GP P f Perfect mapping: i ` ¾ ¾ manually determined for bibliographic tasks E commerce data use of UPCs for E-commerce 25 Use of MS Fuzzy Lookup for comparison ` Similarity on 1-2 attributes ` Default setting for similarity threshold vs. manually tuned settings (varying more than 1000 settings on test data d using FEVER)) ` 90% 70% Baseline F-measure results 1 attribute 2 attributes 2 attributes (tuned) 50% 30% Amaz .-GP ABTBuy DBL P-GS 26 DBL PACM 10% Amazon-GoogleBase 1.0 1.0 0.9 0.9 08 0.8 0.8 0.7 0.7 F-measu ure F-meas sure Abt-Buy y 0.6 0.5 0.4 03 0.3 Random Ratio baseline strategy 0.2 0.1 0.6 0.5 0.4 Random R ti 0 Ratio 0.4 4 baseline strategy 03 0.3 0.2 0.1 0.0 0.0 0.3 0.4 0.5 0.6 0.7 0.8 0.3 similarity threshold 0.4 0.5 0.6 0.7 0.8 similarity threshold E-commerce tasks, labeling effort 50 27 1.0 0.9 0.9 0.8 Decision Tree Logistic Regression SVM Multiple Learning baseline strategy 0.7 06 0.6 0.8 Decision Tree Logistic Regression SVM Multiple Learning baseline strategy 0.7 06 0.6 0.5 0.5 20 50 100 labeling effort 28 F-me easure 10 1.0 F-me easure DBLP-Scholar F-me easure F-me easure DBLP-ACM 500 20 50 100 500 labeling effort Bibliographic tasks, Ratio training selection Abt-Buy Abt Buy Amazon GoogleBase Amazon-GoogleBase 1.0 1.0 09 0.9 0.8 F-mea asure F-me easure 0.9 0.7 06 0.6 0.8 Decision Tree Logistic Regression SVM Multiple Learning baseline strategy 0.7 06 0.6 0.5 20 50 100 500 0.5 20 labelling effort 50 100 500 labeling effort E-commerce tasks, Ratio training selection 29 30 Decision Tree Logistic Regression SVM Multiple Learning baseline strategy ` Match configurations with several matchers are diffi lt to difficult t tune t manually ll with ith currentt implementations, e.g. MS Fuzzy Lookup ` Learning-based L i b d match h strategies i can clearly l l outperform manual match strategies even with small training data, data especially for challenging tasks ` Ratio is a simple and effective approach for training ` Multiple learning approach effectively combines selection providing a balanced number of matching and non-matching object pairs several basic learners ` Introduction (Object Matching) ` FEVER platform for object matching strategies ¾ Architecture ¾ Manually M ll specified ifi d match h strategies i ((operator trees)) ¾ Training-based learning of match strategies ¾ Evaluation ` Dynamic object matching in mashups ¾ OCS ` (Online (O li Citation Ci i Service) S i ) Instance-based ontology gy matching g ¾ Approaches ¾ Support ` in COMA++ Conclusions 31 ` On-demand citation service (OCS)* What are the most cited papers of conference X or author Y? ¾ Frequent changes, i.e., new publications & new citations ¾ ` Idea: Combine publication lists, e.g. from DBLP or Pubmed, with citation counts, e.g from Google Scholar Citeseer or Scopus Scholar, DBLP, Pubmed: high bibliographic data quality ¾ GS: large g coverage g of citations counts ¾ ` Query and match problem: Given a set of DBLP publications bli i → How H to effectively ff i l find fi d corresponding di GS publications? * http://labs.dbs.uni-leipzig.de/ocs 32 Select publications from DBLP (Author venue, (Author, venue …)) Query generation . for Google Scholar Publication list Object Matching (Query results vs. publications Iterative queryy refinement Automatic generation of search queries, e.g. on author, venue, title (pattern) ` Dynamic object matching for search results ` 33 Bibliographic data from DBLP Corresponding GS publications 34 Sum of GS citations 35 ` Interactive approach, i.e., user selects match th h ld thresholds Title Year Authors 80% +/- two years 50% 85% +/- one year 60% 90% equal year 70% 95% 80% 100% 90% relaxed restrictive i i 100% ` 36 Aggregated result is adjusted automatically based on match definition ` Introduction (Object Matching) ` FEVER platform for object matching strategies ¾ Architecture ¾ Manually M ll specified ifi d match h strategies i ((operator trees)) ¾ Training-based learning of match strategies ¾ Evaluation ` Dynamic object matching in mashups ¾ OCS ` (Online (O li Citation Ci i Service) S i ) Instance-based ontology gy matching g ¾ Approaches ¾ Support ` in COMA++ Conclusions 37 O1 pp g Mapping: Matching O2 O1, O2 iinstances t ` O1.e11, O2.e23, 0.87 O1.e13,O2.e27, 0.93 … further input, e.g. dictionaries Process of identif identifying ing semantic correspondences between 2 ontologies ¾ Result: ontology mapping ¾ Mostly equivalence mappings: correspondences specify equivalent ontology concepts ` 38 Variation of schema matching problem Amazon.com Shopping.Yahoo.com Electronics & Photo Electronics TV & Video DVD Recorder Beamer DVD Recorder Projectors Digital Cameras Digital Photography Camera & Photo Digital Cameras Camcorders ` Ontology mappings useful for IImproving i query results, l e.g. to fi find d specific ifi products d ` Automatic categorization of products in different catalogs ` Merging catalogs ` 39 semantics of a concept/category may be better p by y the instances associated to category g y expressed than by metadata (e.g. concept name, description) ` Categories with most similar instances should match ` ¾ Requires concepts shared or similar instances for most/all ? O1 O1 instances ? O2 O2 instances Two cases - ontologies share instances - ontologies do not share but have similar instances 40 Amazon by Brands Microsoft Novell Softunity by Category Books DVD Kid & H Kids Home Software Software Languages Utilities & Tools Traveling B i Business &P Productivity d ti it Operating System Windows Burning Software Operating System Handheld Software Linux Id =158298302X EAN = "662644467122" Title = "SuSE Linux 10.1 (DVD)" Price = 49.99 Ranking = 180 Id = B0002423YK Id = ECD435127K EAN = 0662644467122 ProductName = "SuSE Linux 10.1" DateOfIssue = 02.06.2006 P i =Id59 Price 59.95 = 95 ECD851350K EAN = 0805529832282 ProductName = "WindowsXP Home" DateOfIssue = 15.10.2004 Price = 238.90 EAN = 0805529832282 Title = "Windows XP Home Edition incl. SP2" Price = 191.91 Ranking = 47 Thor, A., Kirsten, T., Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. 12th BTW Conf., 2007 41 Dmoz Yahoo Clothing Sports Sports Swimwear Water Sports Swimming and Diving Swimming and Diving Gear and Equipment Apparel pp URL =http://www.beachwear.net Name =The Beachwear Network Description =Selection of beachwear. URL =http://www.skinzwear.com/ Name =Skinz Deepp Description =Swimwear, bikinis and URL =http://www.ritchieswimwear.com/ streetwear. Name = Ritchie Swimwear Description =Designer brand for women, men and little girls. 42 URL =www.skinzwear.com Name =Skinz Deep, Inc. Description =Bikinis, swimwear, URL =www.ritchieswimwear.com beachwear, and streetwear Name =Ritchie Swimwearfor men and women. Description =Offers bathing suits, beachware, and cover-ups for men, women, and children. Stores located throughout South Florida. * Massmann, S., Rahm, E.: Evaluating Instance-based Matching of Web Directories. Proc. WebDB 2008 Extends previous COMA prototype (VLDB2002) ` Matching of XML & rel rel. Schemas and OWL ontologies ` Several match strategies: Parallel (composite) and ` sequential q matching; g Instance-based matching; g Fragmentg based matching for large schemas; Reuse of previous match results Model M d l Pool P l Directed graphs S1 S2 Import, Load, Save Model Manipulation Match Strategy Component identification {s11, s12, ...}} {s21, s22, ...} Mapping Pool Matcher execution Similarity Combination Matcher 1 s11↔s21 s12↔s22 s13↔s23 Matcher 2 Matcher 3 Similarity cube Mapping s Mapping Nodes, ... Paths, ... Name, Children, Leaves, NamePath, … Aggregation, Direction, Selection, CombinedSim Diff, Intersect, Union, MatchCompose, Eval, ... Component Types Matcher Library Combination Library Mapping Manipulation *Schema and Ontology Matching with COMA++. Proc. SIGMOD 2005 43 ` Instance matchers introduced in 2006 Constraint-based matching ` Content-based matching: 2 variations ` ` Coma++ maintains instance value set per element XML schema instances ( Ontology instances 44 Instance constraints are assigned to schema elements ` General constraints: always applicable Example: average length and used characters (letters (letters, numeral numeral, special char.) Numerical constraints: for numerical instance values Example p : p positive or negative, g , integer g or float “M @ “[email protected]” il ” vs. Pattern constraints: “[email protected]” Example: Email and URL ¾ ¾ ¾ Use of constraint similarity matrix to determine element l t similarity i il it (lik (like d data t type t matching) t hi ) ` Simple and efficient approach ` Effectiveness depends on availability of constrained value ranges / pattern ¾ A Approach h does d nott require i shared h d iinstances t ` Compare Constraints c11↔c21 … c1h↔c2k Determine Constraints General constraints element1 Numerical constraints element2 Pattern constraints Similarity value Using U i Constraint C t i t Similarity Matrix 45 2 variations ` ¾ Value Matching: pairwise similarity comparison of instance ¾ Document (value set) matching: combine all instances into a ¾ ` values virtual document and compare documents Both approaches do not require shared instances Document matching ¾ ¾ 1 instance document per category or selected string category attribute (e.g. description) Document comparison p based on TF-IDF to focus on most significant g terms element1 element2 Compare virtual documents document1 document2 instance11 instance21 instance12 Similarity function instance22 … instance1n 46 based on TF-IDF … instance2m Similarity value Introduction (Object Matching) ` FEVER platform for object matching strategies ` ¾ Architecture ¾ Manually specified match strategies (operator trees) ¾ Training-based b d llearning off match h strategies ¾ Evaluation ` Dynamic object matching in mashups ¾ OCS ` (Online Citation Service) Instance-based ontology matching ¾ Approaches pp ¾ Support ` in COMA++ Conclusions 47 ` Object matching is a critical step for data quality and data integration ` ` Offline and online data integration Effective match strategies combining several matchers are hard to find and tune Very large number of possible combinations and configurations ` High quality vs. efficiency tradeoff ` Utilization of domain knowledge ` ` Learning-based approaches support semi-automatic generation of suitable match strategies Requires suitable training selection (e.g. Ratio approach) ` Multiple is robust and effective p Learning g approach pp (but slow) ` 48 ` Instance based matching of ontologies facilitated by Instance-based object matching Instances can reflect well semantics of categories Same/similar instances required in both ontologies ` ` ` Instance-based Instance based matching in COMA++ ` ` 3 basic instance matchers (constraint-based, content-based) not requiring shared instances Flexible combination with many metadata-based metadata based approaches ` Correct ontology mappings NOT limited to 1:1 correspondences ` Support for high efficiency and high effectiveness 49 ¾ Performance P f 50 techniques, h i e.g. parallel ll l object bj matching hi ` Evaluation and validation for larger datasets ` Self-Tuning Self Tuning of context matchers ` Scalable l bl instance-based b d ontology l match h approaches ` ` ` ` ` ` 51 Köpcke, H.; Thor, A.; Rahm, E.: Comparative evaluation of entity resolution approaches with FEVER. Proc. 35th Intl. Conference on Very Large Databases (VLDB), Demo, 2009 Köpcke, H., Rahm, E.: Frameworks for Entity Matching An Overview Data and Knowledge Engineering, Overview. Engineering 2009 Köpcke, H.; Rahm, E.: Training Selection for Tuning Entity Matching. Proc. 6th Int. Workshop on Quality in Databases and Management of Uncertain Data (QDB/MUD) (QDB/MUD), 2008 Massmann, S.; Rahm, E.: Evaluating Instance-based matching of web directories. Proc. 11th Int. Workshop on the Web and Databases (WebDB) (WebDB), 2008 Thor, A., Kirsten, T., Rahm, E.: Instance-based matching of hierarchical ontologies. Proc. 12th German Database Conf. (BTW) 2007 (BTW), Thor, A.; Rahm, E.: MOMA - A Mapping-based Object Matching System. Proc. of the 3rd Biennial Conf. on Innovative Data Systems Research (CIDR) (CIDR), 2007
Similar documents
Erhard Rahm Ontologies Ontology Matching Instance
ontology mapping ¾ Mostly equivalence mappings: correspondences specify equivalent ontology concepts
More information