Information Integration
Introduction
Winter term 2014/15
Melanie Herschel
melanie.herschel@ipvs.uni-stuttgart.de
Data Engineering - IPVS, University of Stuttgart

Chapter 1 Introduction
• Administrative details
• Information Integration
• Semester outlook

Schedule
• Regular slots: Tuesday, Thursday, Friday, 8:45 - 11:15 am
• Room 0.118 (Universitätsstr. 38)
• 4 SWS (2 lecture + 2 practical)
• Compact course over 7 weeks from 13/10/2014 to 28/11/2014
• Alternating lectures and practicals
• Odd weeks:
  ‣ Lectures on Tuesday 8:45 - 11:15 and Thursday 8:45 - 10:15
  ‣ Practicals on Thursday 10:30 - 11:15 and Friday 8:45 - 11:15
• Even weeks:
  ‣ Lectures on Tuesday 8:45 - 10:15 and Thursday 8:45 - 10:15
  ‣ Practicals on Tuesday and Thursday 10:30 - 11:15, and Friday 8:45 - 11:15

Information Integration | WS 2014/15 | Melanie Herschel | Universität Stuttgart

Schedule
Week   | Lectures                                   | Practicals
Week 1 | 14/10 (8:45 - 11:15), 16/10 (8:45 - 10:15) | 16/10 (10:30 - 11:15), 17/10 (8:45 - 11:15)
Week 2 | 21/10 (8:45 - 10:15), 23/10 (8:45 - 10:15) | 21/10 (10:30 - 11:15), 23/10 (10:30 - 11:15), 24/10 (8:45 - 11:15)
Week 3 | 28/10 (8:45 - 11:15), 30/10 (8:45 - 10:15) | 30/10 (10:30 - 11:15), 31/10 (8:45 - 11:15)
Week 4 | 4/11 (8:45 - 10:15), 6/11 (8:45 - 10:15)   | 4/11 (10:30 - 11:15), 6/11 (10:30 - 11:15), 7/11 (8:45 - 11:15)
Week 5 | 11/11 (8:45 - 11:15), 13/11 (8:45 - 10:15) | 13/11 (10:30 - 11:15), 14/11 (8:45 - 11:15)
Week 6 | 18/11 (8:45 - 10:15), 20/11 (8:45 - 10:15) | 18/11 (10:30 - 11:15), 20/11 (10:30 - 11:15), 21/11 (8:45 - 11:15)
Week 7 | 25/11 (8:45 - 11:15), 27/11 (8:45 - 10:15) | 27/11 (10:30 - 11:15), 28/11 (8:45 - 11:15)

Contact Information and Online Resources
Prof. Dr. Melanie Herschel
• Tel: +49 711 685 88444
• Email: melanie.herschel@ipvs.uni-stuttgart.de
• Web: http://www.ipvs.uni-stuttgart.de/abteilungen/as/abteilung/mitarbeiter/Melanie.Herschel
• Office hours: by appointment
http://www.ipvs.uni-stuttgart.de/abteilungen/as/abteilung/mitarbeiter/Melanie.Herschel_infos/teaching.html
Please visit regularly, as slides, news, etc. will be posted there.

About these slides
Slides and lecture are based on material kindly provided by Prof. Felix Naumann, Hasso-Plattner-Institut Potsdam.
• Quizzes
• Definition
• Examples
• Code snippets

Literature
• Ulf Leser and Felix Naumann. Informationsintegration. Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen. dpunkt Verlag. ISBN 3898644006. (The lecture is primarily based on this book.)
• AnHai Doan, Alon Halevy, Zachary Ives. Principles of Data Integration. Morgan Kaufmann (1st edition). ISBN 0124160441.
• Selected topics are also covered in:
  ‣ Stefan Conrad. Föderierte Datenbanksysteme.
  ‣ M. Tamer Özsu, Patrick Valduriez. Principles of Distributed Database Systems.
• Additional resources will be referenced in class.
• All scientific articles can be accessed via Google Scholar, DBLP, CiteSeer, ACM Digital Library, or on the authors' personal websites.

Questions & Feedback
• Please ask questions anytime!
  ‣ During the lecture
  ‣ By email or phone
• Feedback is highly welcome for future improvement!
  ‣ Slides
  ‣ Website
  ‣ ...

Chapter 1 Introduction
• Administrative Details
• Information Integration
• Semester outlook
Credit: Michael Marcol http://www.freedigitalphotos.net/images/view_photog.php?photogid=371

What is Information Integration?
Information integration is...
• ... the combination of data and content coming from different data sources to obtain a unified set of information.
• ... the correct, complete, and efficient unification of data and content from different, heterogeneous sources to obtain a unified and structured set of information that users and applications can interpret effectively.

Data Source Examples
• Excerpt of a SwissProt file
• Example of an HTML form
• Excerpt of a list of public Web Services on www.xmethods.net
• Further examples adapted from the Suchanek & Weikum tutorial at SIGMOD 2013

Where do we encounter Information Integration?
In a broad sense
• Business integration
• Application integration
• Process integration (workflow integration)
In a stricter sense (focus of this lecture)
• Databases and information systems
• Distributed
• Autonomous / heterogeneous

Integration = Abstraction
• In a single database (a single source), logical DB design abstracts from physical DB design.
  ‣ Data independence
  ‣ Queries: procedural vs. declarative
• Information integration in turn abstracts from logical DB design.
  ‣ Source independence (where data is stored)
  ‣ Data model and syntax independence
  ‣ Independence from semantic differences (hopefully!)

Information Integration Examples
• Mashups, see e.g. the mashup repository www.programmableweb.com
• Mashup Example 2
• Google Shopping

Application Areas
• Figures from [Halevy04]
• Figure adapted from the Suchanek & Weikum tutorial at SIGMOD 2013

Information Integration: An Old Problem
• On the research agenda for more than 50 years
  ‣ Early systems date back to the 1970s.
  ‣ Manual integration has been considered even earlier, of course.
• New problems
  ‣ Large number of data sources
  ‣ Heterogeneity
  ‣ New types of data (XML, RDF, GIS, OO, ...)
  ‣ New types of queries (search, UDFs, ...)
  ‣ New types of results (ranking, visualization, ...)
  ‣ New types of users (managers, admins, anybody, ...)
• Alon Halevy: "It's plain hard!" [Halevy04]

Why is it difficult?
[Halevy04]
• System aspects
  ‣ Different systems
  ‣ Query processing over multiple systems
• Social aspects
  ‣ Finding relevant data in companies (or on the Web)
  ‣ Accessing relevant data
  ‣ People need to be convinced to cooperate / sources must provide usable interfaces
• Logic-based reasons
  ‣ Schema and data heterogeneity
  ‣ These are independent of the chosen integration architecture

Example
Web Service A
• Location: Tübingen
• Operations:
  ‣ getMovieByActor(firstName, lastName)
  ‣ getMovieByTitle(title)
• Output:
  <movie>
    <Titel> Troy </Titel>
    <Actors>
      <Actor> Eric Bana </Actor>
      <Actor> Brad Pitt </Actor>
    </Actors>
  </movie>
• Schema:
  getMov
    movie
      title
      Actors
        Actor
Web Service B
• Location: Hamburg
• Operation: myMovies(Actor, Year)
• Output structure:
  <film>
    <name> Troy </name>
    <cast> Pitt & Cox </cast>
    <year> 2003 </year>
  </film>
• Schema:
  myMov
    film
      name
      cast
      year

Integrating Web Services A & B
1. User interfaces
2. Schema integration / schema mapping
3. Query rewriting
4. Runtime estimation (optimization)
5. Sending requests to both services
6. Getting answers
7. Entity resolution
8. Data fusion
   8.1 Resolving conflicts etc.
   8.2 Determining the integrated result
   8.3 Executing the integration
9. Visualization to the user

Step 1: User Interfaces
• Web Service A
• Web Service B

Step 2: Schema Integration / Mapping
Schema integration: schema of Web Service A + schema of Web Service B = integrated schema
• Web Service A: getMov > movie > (title, Actors > Actor)
• Web Service B: myMov > film > (name, cast, year)
• Integrated schema: myMov > movie > (title, year, Actors > Actor)
Schema mapping: correspondences between the source schemas (Web Service A, Web Service B) and the integrated schema.

Step 3: Query Rewriting
• The query is posed against the target representation and rewritten based on it.
  ‣ E.g. Concat(Firstname, Lastname) = Actor
• The query is transformed so that it directly queries the sources (Web Services A and B).

Step 4: Query Optimization
• Optimization focus? A quick response or a complete answer?
  ‣ Web Service A is in Tübingen (local).
  ‣ Web Service B is in Hamburg (remote).
  ‣ Web Service B has more attributes and more entities.
  ‣ Web Service A has fewer attributes.
• In addition:
  ‣ Search by year is only supported by Web Service B.
  ‣ Data transformations can be expensive to compute.

Step 5: Sending Requests
• The query w.r.t. the integrated schema (target representation) is sent as requests to the sources, Web Service A and B (1).
• The Web services send back results (2).

Step 6: Get answers
Query Title = “Troy” and year = “2003” returns the following results.
Web Service A
  <movie>
    <Titel> Troy </Titel>
    <Actors>
      <Actor> Eric Bana </Actor>
      <Actor> Brad Pitt </Actor>
    </Actors>
  </movie>
Web Service B
  <film>
    <name> Troy </name>
    <cast> Pitt & Cox </cast>
    <year> 2003 </year>
  </film>

Step 7: Entity Resolution
Is the movie returned by Web Service A the same movie as the one returned by Web Service B? To answer this question, we have to
(1) identify semantic equivalences among the result schemata (schema matching), and
(2) compare the data using a similarity measure.
Both steps operate on the two answers shown under Step 6.

Step 8.1: Resolving Conflicts
Comparing the two answers from Step 6:
• Identical titles ➙ no conflict
• Eric Bana, Cox, and 2003 exist in only one source ➙ uncertainty
• Different data ➙ conflict

Step 8.2: Determining Integrated Result
The answers of Web Service A and Web Service B are combined into the integrated result:
  <movie>
    <Titel> Troy </Titel>
    <Actors>
      <Actor> Bana </Actor>
      <Actor> Pitt </Actor>
      <Actor> Cox </Actor>
    </Actors>
    <year> 2003 </year>
  </movie>

Step 8.3: Executing Integration
• How to perform the data fusion?
• Declarative code?
  ‣ SQL, XQuery, XSLT
  ‣ Rarely possible
  ‣ Typically slow
• Procedural code?
  ‣ Java, C++
  ‣ Difficult to maintain
  ‣ Fast

Step 9: Visualization
Visualization of
• The result
• Data provenance
• Data quality
• Changed values
• Operators used
• ...
Example annotations on the integrated result from Step 8.2:
• Title: from Web Service A and B
• Actor Bana: from Web Service A, was "Eric Bana" before
• Conflict has been resolved.

Chapter 1 Introduction
• Administrative Details
• Information Integration
• Semester Outlook

Semester Outlook
Problem setting
• Introduction to Information Integration
• Distribution, Autonomy and Heterogeneity
Architectures
• Materialized and Virtual Integration
• 5-Layer Architecture
• Mediator/Wrapper Architecture
Mapping
• Schema Mapping
• Schema Matching
Modelling
• Global-As-View modelling
• Local-As-View modelling
Query processing
• Global-As-View query processing
• Containment and Local-As-View query processing
• Bucket algorithm
Data Integration
• Entity Resolution
• Data Fusion
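
As a closing illustration of the Web Service A and B walkthrough, here is a minimal procedural sketch of schema mapping, entity resolution (Step 7), and data fusion (Step 8) on the "Troy" answers. It is not the lecture's implementation. The helper names, the similarity weights (0.7 / 0.3), the threshold, and the rule of keeping only last names are assumptions made for this example, and the "&" in the Web Service B answer is written as `&amp;` so that the text parses as XML.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Answers from Step 6 ("&" escaped so the XML is well-formed).
ANSWER_A = ("<movie><Titel> Troy </Titel><Actors><Actor> Eric Bana </Actor>"
            "<Actor> Brad Pitt </Actor></Actors></movie>")
ANSWER_B = ("<film><name> Troy </name><cast> Pitt &amp; Cox </cast>"
            "<year> 2003 </year></film>")


def from_service_a(xml_text):
    """Map a <movie> answer of Web Service A to the integrated schema."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("Titel").strip(),
        "actors": [a.text.strip() for a in root.iter("Actor")],
        "year": None,  # Web Service A returns no year
    }


def from_service_b(xml_text):
    """Map a <film> answer of Web Service B to the integrated schema."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("name").strip(),
        "actors": [n.strip() for n in root.findtext("cast").split("&")],
        "year": root.findtext("year").strip(),
    }


def last_name(actor):
    """Assumed normalization: compare and keep actors by last name only."""
    return actor.split()[-1]


def similarity(a, b):
    """Hypothetical measure: weighted title similarity plus actor overlap."""
    title_sim = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    names_a = {last_name(x) for x in a["actors"]}
    names_b = {last_name(x) for x in b["actors"]}
    actor_sim = len(names_a & names_b) / len(names_a | names_b)
    return 0.7 * title_sim + 0.3 * actor_sim


def fuse(a, b):
    """Fuse two matching records: union of actors, first non-missing year."""
    actors = []
    for name in a["actors"] + b["actors"]:
        if last_name(name) not in actors:
            actors.append(last_name(name))
    return {"title": a["title"], "actors": actors, "year": a["year"] or b["year"]}


THRESHOLD = 0.7  # assumed cut-off for "same movie"

movie_a = from_service_a(ANSWER_A)
movie_b = from_service_b(ANSWER_B)
score = similarity(movie_a, movie_b)
integrated = fuse(movie_a, movie_b) if score >= THRESHOLD else None
print(round(score, 2), integrated)
```

With these assumptions the two answers are treated as the same movie, and the fused record contains the actors Bana, Pitt, and Cox and the year 2003, matching the integrated result of Step 8.2. Being procedural code, the sketch reflects the trade-off named in Step 8.3: every fusion rule is hard-coded and would have to be changed by hand for other sources.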