Information Integration

Introduction
Winter term 2014/15
Melanie Herschel
Data Engineering - IPVS, University of Stuttgart
Chapter 1
Introduction
• Administrative details
• Information Integration
• Semester outlook
Schedule
• Regular slots: Tuesday, Thursday, Friday, 8:45 - 11:15 am
• Room 0.118 (Universitätsstr. 38)
• 4 SWS (weekly semester hours): 2 lecture (VL) + 2 practical (Ü)
• Compact course over 7 weeks from 13/10/2014 to 28/11/2014
• Alternating lectures and practicals
• Odd weeks:
‣ Lectures on Tuesday, 8:45 - 11:15, and Thursday, 8:45 - 10:15
‣ Practicals on Thursday, 10:30 - 11:15, and Friday, 8:45 - 11:15
• Even weeks:
‣ Lectures on Tuesday and Thursday, 8:45 - 10:15
‣ Practicals on Tuesday and Thursday, 10:30 - 11:15, and Friday, 8:45 - 11:15
Information Integration | WS 2014/15 | Melanie Herschel | Universität Stuttgart
Schedule
Week | Lectures | Practicals
Week 1 | 14/10 (8:45 - 11:15), 16/10 (8:45 - 10:15) | 16/10 (10:30 - 11:15), 17/10 (8:45 - 11:15)
Week 2 | 21/10 (8:45 - 10:15), 23/10 (8:45 - 10:15) | 21/10 (10:30 - 11:15), 23/10 (10:30 - 11:15), 24/10 (8:45 - 11:15)
Week 3 | 28/10 (8:45 - 11:15), 30/10 (8:45 - 10:15) | 30/10 (10:30 - 11:15), 31/10 (8:45 - 11:15)
Week 4 | 4/11 (8:45 - 10:15), 6/11 (8:45 - 10:15) | 4/11 (10:30 - 11:15), 6/11 (10:30 - 11:15), 7/11 (8:45 - 11:15)
Week 5 | 11/11 (8:45 - 11:15), 13/11 (8:45 - 10:15) | 13/11 (10:30 - 11:15), 14/11 (8:45 - 11:15)
Week 6 | 18/11 (8:45 - 10:15), 20/11 (8:45 - 10:15) | 18/11 (10:30 - 11:15), 20/11 (10:30 - 11:15), 21/11 (8:45 - 11:15)
Week 7 | 25/11 (8:45 - 11:15), 27/11 (8:45 - 10:15) | 27/11 (10:30 - 11:15), 28/11 (8:45 - 11:15)
Contact Information and Online Resources
Prof. Dr. Melanie Herschel
Tel: +49 711 685 88444
Email: melanie.herschel@ipvs.uni-stuttgart.de
Web: http://www.ipvs.uni-stuttgart.de/abteilungen/as/abteilung/mitarbeiter/Melanie.Herschel
Office hours: by appointment
http://www.ipvs.uni-stuttgart.de/abteilungen/as/abteilung/mitarbeiter/Melanie.Herschel_infos/teaching.html
Please visit regularly, as slides, news, etc. will be posted there.
About these slides
Slides and lecture based on material kindly provided by
Prof. Felix Naumann, Hasso-Plattner-Institut Potsdam
Quizzes
Definition
Examples
Code snippets
Literature
• Ulf Leser and Felix Naumann.
Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen.
dpunkt Verlag. ISBN 3898644006.
(lecture primarily based on this book)
• AnHai Doan, Alon Halevy, Zachary Ives.
Principles of Data Integration.
Morgan Kaufmann (1st edition). ISBN 0124160441.
Literature
• Selected topics are also covered in
‣ Stefan Conrad. Föderierte Datenbanksysteme.
‣ M. Tamer Özsu, Patrick Valduriez. Principles of Distributed Database Systems.
• Additional resources will be referenced in class.
• All scientific articles can be accessed via Google Scholar, DBLP, CiteSeer, the ACM Digital Library, or the authors' personal websites.
Questions & Feedback
• Please ask questions anytime!
• During the lecture
• By email or phone
• Feedback is highly welcome for future improvement!
• Slides
• Website
• ...
Chapter 1
Introduction
• Administrative Details
• Information Integration
• Semester outlook
Credit: Michael Marcol
http://www.freedigitalphotos.net/images/view_photog.php?photogid=371
What is Information Integration?
Information integration is...
• ... the combination of data and content coming from different data sources to
obtain a unified set of information.
• ... the correct, complete, and efficient unification of data and content of
different, heterogeneous sources to obtain a unified and structured set of
information to be effectively interpreted by users and applications.
Data Source Examples
• Excerpt of a SwissProt file
• Example of an HTML form
• Excerpt of a list of public Web Services on www.xmethods.net
• Figure adapted from Suchanek & Weikum, tutorial at SIGMOD 2013
Where do we encounter Information Integration?
In a broad sense
• Business integration
• Application integration
• Process integration (workflow integration)
In a stricter sense (focus of this lecture)
• Databases and information systems
• Distributed
• Autonomous/heterogeneous
Integration = Abstraction
• In a single database (a single source), logical DB design abstracts from physical DB design.
• Data independence
• Queries: procedural vs. declarative
• Information integration in turn abstracts from logical DB design
• Source independence (where data is stored)
• Data model and syntax independence
• Independence from semantic differences (hopefully!)
Information Integration Examples
• Mashups, see e.g. the mashup repository www.programmableweb.com
• Mashup example 2
• Google Shopping
Application Areas [Halevy04]
Application Areas (adapted from Suchanek & Weikum, tutorial at SIGMOD 2013)
Information Integration: An old Problem
• On the research agenda for more than 50 years
• Early systems date back to the 1970s.
• Manual integration has been considered even earlier, of course.
• New problems
• Large number of data sources
• Heterogeneity
• New types of data (XML, RDF, GIS, OO,...)
• New types of queries (search, UDFs, ...)
• New types of results (ranking, visualization, ...)
• New types of users (managers, admins, anybody, ...)
• Alon Halevy: “It’s plain hard!” [Halevy04]
Why is it difficult? [Halevy04]
• System aspects
• Different systems
• Query processing over multiple systems
• Social aspects
• Find relevant data in companies (on the Web)
• Access relevant data
• People need to be convinced to cooperate / sources must provide some interfaces to be used
• Logic-based reasons
• Schema and data heterogeneity
• These are independent from the chosen integration architecture
Example
Web Service A
• Location: Tübingen
• Operations:
‣ getMovieByActor(firstName, lastName)
‣ getMovieByTitle(title)
• Output:
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Eric Bana </Actor>
    <Actor> Brad Pitt </Actor>
  </Actors>
</movie>
• Schema: getMov > movie > (title, Actors > Actor)
Web Service B
• Location: Hamburg
• Operation: myMovies(Actor, Year)
• Output structure:
<film>
  <name> Troy </name>
  <cast> Pitt & Cox </cast>
  <year> 2003 </year>
</film>
• Schema: myMov > film > (name, cast, year)
Integrating Web Services A & B
1. User Interfaces
2. Schema Integration / Schema Mapping
3. Query rewriting
4. Runtime estimation (optimization)
5. Sending requests to both services
6. Get answers
7. Entity resolution
8. Data fusion
8.1 Resolving conflicts etc.
8.2 Determining integrated result
8.3 Executing integration
9. Visualization to the user
Step 1: User Interfaces
Web Service A
Web Service B
Step 2: Schema Integration / Mapping
Schema integration: Web Service A + Web Service B = integrated schema
• Web Service A: getMov > movie > (title, Actors > Actor)
• Web Service B: myMov > film > (name, cast, year)
• Integrated schema: myMov > movie > (title, year, Actors > Actor)
Schema mapping: relates the elements of Web Services A and B to the integrated schema
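The correspondences between the source schemas and the integrated schema can be written down as plain data. The following is a minimal sketch, not the method used in the lecture: the element names come from the example, while the dictionaries `MAPPING_A` and `MAPPING_B`, the function `translate`, and the flat record layout are assumptions made for illustration.

```python
# Hypothetical correspondences from source elements to integrated-schema attributes.
# Element names are taken from the example; the dictionary layout is an assumption.
MAPPING_A = {"Titel": "title", "Actor": "actors"}
MAPPING_B = {"name": "title", "cast": "actors", "year": "year"}


def translate(record: dict, mapping: dict) -> dict:
    """Rename the attributes of a source record to integrated-schema attributes."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Flattened example records, one per web service.
record_a = {"Titel": "Troy", "Actor": ["Eric Bana", "Brad Pitt"]}
record_b = {"name": "Troy", "cast": "Pitt & Cox", "year": "2003"}

print(translate(record_a, MAPPING_A))
print(translate(record_b, MAPPING_B))
```

Note that renaming alone does not reconcile the values: the cast of Web Service B is still a single string, which is why later steps need data transformations.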
Step 3: Query Rewriting
Query rewriting based on the target representation
• E.g., Concat(Firstname, Lastname) = Actor
• The query against the target representation is transformed so that it directly queries the sources (Web Services A and B).
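As a rough illustration of query rewriting, the sketch below turns a query against the target representation into calls to the operations of the two services. The operation names come from the example; the class `GlobalQuery`, the two rewrite functions, the naive splitting of a full name at the first space, and the assumption that `myMovies` accepts a missing argument are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GlobalQuery:
    """A query against the integrated schema (attribute names are assumptions)."""
    title: Optional[str] = None
    actor: Optional[str] = None  # full name, i.e. Concat(Firstname, Lastname)
    year: Optional[str] = None


def rewrite_for_a(query: GlobalQuery) -> Optional[tuple]:
    """Rewrite to an operation of Web Service A, or None if A cannot answer."""
    if query.title is not None:
        return ("getMovieByTitle", query.title)
    if query.actor is not None:
        # Invert the concatenation: split the full name into first and last name.
        first_name, _, last_name = query.actor.partition(" ")
        return ("getMovieByActor", first_name, last_name)
    return None  # A offers no search by year


def rewrite_for_b(query: GlobalQuery) -> Optional[tuple]:
    """Rewrite to myMovies(Actor, Year) of Web Service B, or None if B cannot answer."""
    if query.actor is None and query.year is None:
        return None  # B offers no search by title
    return ("myMovies", query.actor, query.year)


query = GlobalQuery(title="Troy", year="2003")
print(rewrite_for_a(query))
print(rewrite_for_b(query))
```

Conditions that a source cannot evaluate itself (the year for A, the title for B) would have to be applied to the returned results afterwards.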
Step 4: Query Optimization
• Optimization focus: a quick response or a complete answer?
‣ Web Service A in Tübingen (local)
‣ Web Service B in Hamburg (remote)
‣ Web Service B has more attributes and more entities.
‣ Web Service A has fewer attributes.
• In addition:
‣ Search by year is only supported by Web Service B.
‣ Data transformations can be expensive to compute.
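The trade-off between a quick response and a complete answer can be sketched as a simple source selection. This is only an illustration: the capabilities follow the example, but the class `Source`, the function `plan`, and in particular the latency figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    latency_ms: int        # assumed cost estimate, not from the lecture
    searchable: frozenset  # attributes the source can be searched by


# Capabilities follow the example; the latency numbers are invented.
SERVICE_A = Source("A (Tübingen, local)", 20, frozenset({"title", "actor"}))
SERVICE_B = Source("B (Hamburg, remote)", 200, frozenset({"actor", "year"}))


def plan(sources, search_attrs, goal):
    """Select the sources to query for the given search attributes and goal."""
    usable = [s for s in sources if s.searchable & set(search_attrs)]
    if goal == "fast" and usable:
        # Quick response: ask only the cheapest source that can contribute.
        return [min(usable, key=lambda s: s.latency_ms)]
    # Complete answer: ask every source that can contribute.
    return usable


print([s.name for s in plan([SERVICE_A, SERVICE_B], {"title", "year"}, "fast")])
print([s.name for s in plan([SERVICE_A, SERVICE_B], {"title", "year"}, "complete")])
```

A real optimizer would also weigh the cost of data transformations and the expected result sizes, which this sketch ignores.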
Step 5: Sending Requests
• Send requests to Web Services A and B (1)
• Web services send back results (2)
• The query w.r.t. the integrated schema is posed on the target representation; requests (1) go to the sources (Web Services A & B), results (2) come back.
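Since the two services are independent, the requests (1) can be sent concurrently and the results (2) collected afterwards. A minimal sketch using Python's standard `concurrent.futures` module follows; `call_service_a` and `call_service_b` are hypothetical stand-ins that return canned XML instead of contacting a real service.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-ins for the real web service calls; the bodies just return canned XML.
def call_service_a(title):
    return "<movie><Titel>Troy</Titel></movie>"


def call_service_b(actor, year):
    return "<film><name>Troy</name><year>2003</year></film>"


def send_requests(title, actor, year):
    """(1) Send both requests concurrently, (2) collect the answers."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_a = pool.submit(call_service_a, title)
        future_b = pool.submit(call_service_b, actor, year)
        return {"A": future_a.result(), "B": future_b.result()}


answers = send_requests("Troy", None, "2003")
print(answers)
```

With real services, timeouts and failures of individual sources would need to be handled as well, for example by passing a timeout to `result()`.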
Step 6: Get answers
Query Title = “Troy” and year = “2003” returns the following results.
Web Service A:
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Eric Bana </Actor>
    <Actor> Brad Pitt </Actor>
  </Actors>
</movie>
Web Service B:
<film>
  <name> Troy </name>
  <cast> Pitt & Cox </cast>
  <year> 2003 </year>
</film>
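Before the answers can be compared, they have to be brought into a common representation. The sketch below parses both answers with Python's standard `xml.etree.ElementTree` module. The target record layout (`title`, `actors`, `year`) and the functions `parse_a` and `parse_b` are assumptions, and the `&` in the cast is written as `&amp;` so that the input is well-formed XML.

```python
import xml.etree.ElementTree as ET

ANSWER_A = """<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Eric Bana </Actor>
    <Actor> Brad Pitt </Actor>
  </Actors>
</movie>"""

# "&" is escaped as "&amp;" here so that the answer is well-formed XML.
ANSWER_B = """<film>
  <name> Troy </name>
  <cast> Pitt &amp; Cox </cast>
  <year> 2003 </year>
</film>"""


def parse_a(xml_text):
    """Turn an answer of Web Service A into a record of the integrated schema."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("Titel").strip(),
        "actors": [actor.text.strip() for actor in root.iter("Actor")],
    }


def parse_b(xml_text):
    """Turn an answer of Web Service B into a record of the integrated schema."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("name").strip(),
        # The cast is one string; split it into individual names at "&".
        "actors": [name.strip() for name in root.findtext("cast").split("&")],
        "year": root.findtext("year").strip(),
    }


print(parse_a(ANSWER_A))
print(parse_b(ANSWER_B))
```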
Step 7: Entity Resolution
Is the movie returned by Web Service A the same movie returned by Web Service B?
To answer this question, we have to (1) identify semantic equivalences among the result schemata (schema matching) and (2) compare the data using a similarity measure.
Web Service A:
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Eric Bana </Actor>
    <Actor> Brad Pitt </Actor>
  </Actors>
</movie>
Web Service B:
<film>
  <name> Troy </name>
  <cast> Pitt & Cox </cast>
  <year> 2003 </year>
</film>
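One possible way to compare the data is sketched below. It assumes that schema matching has already aligned the two results into records with `title` and `actors`; the choice of measures (difflib's sequence ratio for titles, Jaccard overlap of last names for actors), the weights, and the threshold are arbitrary choices for illustration, not the measure used in the lecture.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """String similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def actor_overlap(actors_a, actors_b):
    """Jaccard similarity of last names (last whitespace-separated token)."""
    last_a = {name.split()[-1].lower() for name in actors_a}
    last_b = {name.split()[-1].lower() for name in actors_b}
    if not last_a or not last_b:
        return 0.0
    return len(last_a & last_b) / len(last_a | last_b)


def same_movie(rec_a, rec_b, threshold=0.6):
    """Weighted combination; the weights and the threshold are arbitrary."""
    score = (0.7 * title_similarity(rec_a["title"], rec_b["title"])
             + 0.3 * actor_overlap(rec_a["actors"], rec_b["actors"]))
    return score >= threshold, score


rec_a = {"title": "Troy", "actors": ["Eric Bana", "Brad Pitt"]}
rec_b = {"title": "Troy", "actors": ["Pitt", "Cox"], "year": "2003"}
is_match, score = same_movie(rec_a, rec_b)
print(is_match, round(score, 2))
```

Here the titles are identical and one of three distinct last names is shared, so the two records are considered the same movie under the chosen threshold.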
Step 8.1: Resolving Conflicts
Web Service A:
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Eric Bana </Actor>
    <Actor> Brad Pitt </Actor>
  </Actors>
</movie>
Web Service B:
<film>
  <name> Troy </name>
  <cast> Pitt & Cox </cast>
  <year> 2003 </year>
</film>
• Identical titles ➙ no conflict
• Eric Bana, Cox & 2003 exist in only one source ➙ uncertainty
• Different data ➙ conflict
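The three cases (no conflict, uncertainty, conflict) can be detected mechanically once both results are in a common representation. The sketch below works at the level of whole attributes, which is coarser than the value-level distinction made in the example (where individual actors such as Eric Bana and Cox are uncertain); the record layout and the function `classify` are assumptions.

```python
def classify(rec_a, rec_b):
    """Label each attribute as 'no conflict', 'uncertainty' or 'conflict'."""
    labels = {}
    for attr in sorted(set(rec_a) | set(rec_b)):
        if attr not in rec_a or attr not in rec_b:
            labels[attr] = "uncertainty"  # value exists in only one source
        elif rec_a[attr] == rec_b[attr]:
            labels[attr] = "no conflict"  # identical values
        else:
            labels[attr] = "conflict"     # the sources disagree
    return labels


rec_a = {"title": "Troy", "actors": ["Eric Bana", "Brad Pitt"]}
rec_b = {"title": "Troy", "actors": ["Pitt", "Cox"], "year": "2003"}
print(classify(rec_a, rec_b))
```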
Step 8.2: Determining Integrated Result
Web Service A:
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Eric Bana </Actor>
    <Actor> Brad Pitt </Actor>
  </Actors>
</movie>
Web Service B:
<film>
  <name> Troy </name>
  <cast> Pitt & Cox </cast>
  <year> 2003 </year>
</film>
Integrated result:
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Bana </Actor>
    <Actor> Pitt </Actor>
    <Actor> Cox </Actor>
  </Actors>
  <year> 2003 </year>
</movie>
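The integrated result above can be reproduced with a few simple fusion rules. This is a sketch under assumptions: the record layout and the function `fuse` are hypothetical, and the rules (take the agreed title, unite the actors normalized to last names, take the year from whichever source has it) are just one strategy that happens to yield the result shown.

```python
def fuse(rec_a, rec_b):
    """Fuse two records that describe the same movie into one integrated record."""
    fused = {}
    # Title: the sources agree, so either value can be taken.
    fused["title"] = rec_a.get("title") or rec_b.get("title")
    # Actors: union over both sources, normalized to last names, order preserved.
    last_names = []
    for name in rec_a.get("actors", []) + rec_b.get("actors", []):
        last = name.split()[-1]
        if last not in last_names:
            last_names.append(last)
    fused["actors"] = last_names
    # Year: only one source provides it, so take the value that exists.
    year = rec_a.get("year") or rec_b.get("year")
    if year is not None:
        fused["year"] = year
    return fused


rec_a = {"title": "Troy", "actors": ["Eric Bana", "Brad Pitt"]}
rec_b = {"title": "Troy", "actors": ["Pitt", "Cox"], "year": "2003"}
print(fuse(rec_a, rec_b))
```

Normalizing to last names loses information (Eric Bana becomes Bana); keeping the longer name would be an equally valid rule.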
Step 8.3: Executing Integration
• How to perform the data fusion?
• Declarative code?
‣ SQL, XQuery, XSLT
‣ Rarely possible
‣ Typically slow
• Procedural code?
‣ Java, C++
‣ Difficult to maintain
‣ Fast
Step 9: Visualization
Visualization of
• The result
• Data provenance
• Data quality
• Changed values
• Operators used
• ...
Example: annotated integrated result
<movie>
  <Titel> Troy </Titel>
  <Actors>
    <Actor> Bana </Actor>
    <Actor> Pitt </Actor>
    <Actor> Cox </Actor>
  </Actors>
  <year> 2003 </year>
</movie>
• Title: from Web Services A and B
• Actor "Bana": from Web Service A, was "Eric Bana" before
• Conflict has been resolved.
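To visualize data provenance, the integrated result has to record where each value came from. A minimal sketch follows, assuming records in a common layout; the function `with_provenance` and the source labels "A" and "B" are hypothetical, and the sketch does not track changed values or the operators used.

```python
def with_provenance(rec_a, rec_b):
    """Map each attribute to (value, list of sources that provide the attribute)."""
    result = {}
    for attr in sorted(set(rec_a) | set(rec_b)):
        sources = [name for name, rec in (("A", rec_a), ("B", rec_b)) if attr in rec]
        # Prefer the value of Web Service A when both sources provide one.
        value = rec_a[attr] if attr in rec_a else rec_b[attr]
        result[attr] = (value, sources)
    return result


annotated = with_provenance({"title": "Troy"}, {"title": "Troy", "year": "2003"})
print(annotated)
```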
Chapter 1
Introduction
• Administrative Details
• Information Integration
• Semester Outlook
Semester Outlook
Problem setting
• Introduction to Information Integration
• Distribution, Autonomy and Heterogeneity
Architectures
• Materialized and Virtual Integration
• 5-Layer Architecture
• Mediator/Wrapper Architecture
Mapping
• Schema Mapping
• Schema Matching
Semester Outlook
Modelling
• Global-As-View modelling
• Local-As-View modelling
Query processing
• Global-As-View query processing
• Containment and Local-As-View query processing
• Bucket algorithm
Data Integration
• Entity Resolution
• Data Fusion