Slides - IfIS - Technische Universität Braunschweig

Peer-to-Peer
Data Management
Wolf-Tilo Balke
Sascha Tönnies
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
Overview
• Why Peer-to-Peer Databases?
– Federation
– Information integration
– Sensor networks
• P2P Databases
– Challenges
– Design Dimensions
• Existing P2P Database systems
– Edutella: focus on expressivity
– PIER: focus on scalability
– Piazza: focus on integration
– HiSbase: focus on scalability for spatial data
VDMS und P2P – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig
1 Motivation
• Peer-to-peer data management might need some
database-like functionality
– Complex queries over possibly large volumes of data
• Examples of applications include
– Federation of sources
– Information integration
– Sensor networks
– "New" internet
1.1 Federation of similar data providers
• Examples
– (Digital) Libraries
– Primary Scientific Data Providers
(Gene Databases)
– News Providers
• All nodes offer the same kind of information
• Homogeneous network (fixed schema)
• Non-P2P solutions exist, but not open/scalable
1.2 Information Integration
• Examples
– Find German professors having published at
least three papers at the Conference on
Very Large Databases
– Find introductory database book in German,
written by a German professor
– Find all recordings of Mozart's "Magic
Flute" with conductors who have also
conducted the Berliner Philharmoniker
• Very tedious to find with current search engines
• Needs database-like querying capabilities
• Heterogeneous network
– Information from several databases needs to be combined
1.3 Sensor Networks
• Examples
– Network Monitoring:
• network maps
• event detections
• ...
– Car Traffic Monitoring
• Huge number of nodes
• Small amount of data
• Homogeneous network
Screenshots from project PHI presentation, J. Hellerstein, Berkeley
2.1 Challenges of Schema-Based P2P Networks
• Multi-Dimensional Search Space
– DHTs only work for one dimension (one attribute)
• Schema Heterogeneity
– Sources use different database schemas for similar
information
• Potentially large result sets
– SELECT * FROM Firewalls.BlockedPackets ...
– Range and Aggregate Queries
• And the usual P2P challenges...
– Trust
– Network Churn
– Unbalanced Popularity
2.2 Design Dimensions
• Network Properties
– Data Placement
– Topology and Routing
• Data Access
– Data Model
– Query Language
• Integration Mechanism
– Mapping Representation
– Mapping Creation
– Integration Method
2.2 Data Placement
• Placement according to ownership
– Data stays at information source
– Full control of data by owner (access policy, availability, etc.)
– More autonomy of single nodes
• Placement according to search strategy
– Data is distributed according to later access mechanism
(e.g., DHT)
– No control over data access
– More freedom to optimize query routing
• Additional caching/replication possible
– Essential for load balancing
2.2 Topology and Routing (1)
• Unstructured Networks
– Flooding as routing algorithm
– Supports arbitrarily expressive queries
– Agnostic to schema heterogeneity
– Inefficient (filtered flooding can help)
• Short-cut networks
– Unstructured, but continuously optimize network connections
– Can develop into regular structures like Small-World networks
– Clustering & filtered flooding reduce query distribution traffic
– Fireworks routing
2.2 Topology and Routing (2)
• Super-peer networks
– Inherits advantages and disadvantages of unstructured
network
– Better efficiency and scaling (but still flooding)
– Good match to distributed databases (super-peers
become mediators)
• DHT Networks
– Create separate overlay for each attribute
• Or use Multidimensional DHTs, e.g. Mercury
– Limited query expressivity
– Suitable for homogeneous schema
– Not all queries are evaluated efficiently
2.2 Topology and Routing - Summary
• Local indexing
– No knowledge about other peers
• Central indexing
– One node holds complete index
– Doesn't scale
– Single point of control (and failure)
• Distributed indexing
– Distributed Hash Tables
– Filtered Flooding
– Short-cut networks
– Super-peer networks
2.2 Data Model
• Fixed set of attributes
– Allows for sophisticated topologies
– Inflexible
– Applicability: custom applications
• Relational model
– Usual database model
– Not designed for distribution
• XML
– Semi-structured data
• RDF
– Semantic Web exchange format
– Very suitable for distributed data
2.2 Query Language
• None
– Fixed set of parameterized queries
• Relational query language
– Always subset of SQL
• XML query language
– XPath or XQuery
• RDF Query Language
– SPARQL or its predecessors
– Logic language
2.2 Mapping Representation
• Declarative
– Translation between schema elements
– Distributed database approaches applicable
• Procedural
– Imperative description how to translate/transform queries and data
• Mapping characteristics
– Unidirectional or Bidirectional
– Simple (one-to-one) mapping or complex mappings
• Mapping of objects
– State equality of objects in different sources
2.2 Mapping Creation
• Manual
– Users create mappings
– Network distributes mappings and
uses them for translation
• Semi-automatic
– System proposes mappings, based on heuristics
• attribute name
• similar data
– User feedback used to validate created mappings
• Automatic
– E.g., probabilistic mapping
– Similar techniques as for semi-automatic mapping
2.2 Integration Mechanism
• Query Rewriting
– Query is translated to target schema
– Data is translated back to source schema
– Most common approach
• Data Rewriting
– Data is replicated to source schema
– Only feasible for small data sets
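The query rewriting steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the attribute names, the one-to-one mapping table, and the sample rows are hypothetical and not taken from any of the systems discussed.

```python
# Hypothetical one-to-one attribute mapping between a source and a target schema.
SOURCE_TO_TARGET = {"Title": "name", "Language": "lang", "Date": "year"}
TARGET_TO_SOURCE = {t: s for s, t in SOURCE_TO_TARGET.items()}

# Hypothetical rows stored at the remote peer, in the target schema.
TARGET_ROWS = [
    {"name": "Critique of Pure Reason", "lang": "en", "year": 1781},
    {"name": "Kritik der reinen Vernunft", "lang": "de", "year": 1781},
]


def rewrite_query(conditions):
    """Translate {source_attribute: value} conditions to the target schema."""
    return {SOURCE_TO_TARGET[attr]: value for attr, value in conditions.items()}


def rewrite_result(row):
    """Translate a target-schema row back to the source schema."""
    return {TARGET_TO_SOURCE[attr]: value for attr, value in row.items()}


def evaluate(conditions, rows):
    """Evaluate a conjunction of equality conditions on the remote rows."""
    return [r for r in rows if all(r.get(a) == v for a, v in conditions.items())]


def query_remote(conditions):
    """Query is translated to the target schema, data is translated back."""
    target_query = rewrite_query(conditions)
    return [rewrite_result(r) for r in evaluate(target_query, TARGET_ROWS)]


german_docs = query_remote({"Language": "de"})
```

The querying peer never sees the target schema: it asks in its own vocabulary and receives rows in its own vocabulary, which is why this approach needs no data movement ahead of time.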
2.2 Existing Systems - Typology
• Focus on network scalability
– homogeneous schema
– low query expressivity
– DHT as underlying network structure
• Focus on expressivity
– super-peer or unstructured
– unlimited query complexity
• Focus on integration
– typically unstructured
– query routing driven by mappings
2.2 Existing Systems – Overview
Name     | Topology       | Data Placement | Data Model         | Query Language
PIER     | DHT (Bamboo)   | Distributed    | Relational         | SQL subset
RDFPeers | DHT (MAAN)     | Distributed    | RDF                | -
Mercury  | DHT (Symphony) | Distributed    | Tuples             | -
SQPeer   | Super-peer     | Owner          | RDF                | RQL
PeerDB   | Unstructured   | Owner          | Relational         | SQL subset
Edutella | Super-peer     | Owner          | RDF                | datalog (SQL)
Piazza   | Unstructured   | Owner          | XML                | XQuery subset
GridVine | DHT (P-Grid)   | Distributed    | RDF                | -
DRAGO    | Unstructured   | Owner          | Description Logics | OWL subset
[Columns Scalability, Expressivity, and Integration: per-system ratings not preserved]
List not complete
3.1 Edutella: Introduction
• Initial Goal: Achieve interoperability between
heterogeneous metadata-driven (e-learning) systems
• Provides metadata only, not the resources
– Resources are fetched via HTTP
• Query Examples
– “Find software engineering course lecture notes for
undergraduates in German language”
– “Find an introduction to Enterprise Java Beans for
professionals”
– “Find a course in software requirements analysis from a
Swedish university”
3.1 Query Service
• Provides standardized query/retrieval of RDF
metadata stored in distributed RDF repositories
• Query Exchange Language
– Based on Datalog (allows expression of rules)
– RDF syntax
– For exchange only
• Adapters to enable QEL (query exchange
language) query processing on diverse backends
3.1 Query processing
[Figure: query processing between consumer application and provider back-end (repository). Labels: application-specific format, query parser, EQM, query formatter, QEL, Edutella consumer interface, P2P network, Edutella provider interface, repository-specific format]
• Parsers/Formatters convert between query languages
• Applications and backends are shielded from
communication layer
• Query messages are exchanged in RDF/XML format
• Wrappers available for SQL, RDQL, RQL, and others
3.1 Edutella Topology
• Super-Peers
• Content Providers
• Content Consumers
• Use filtered flooding in super-peer backbone
• HyperCuP topology for backbone
3.1 Cayley Graphs
• Graph representing a permutation group G,
described by a set of generators
– Regular, vertex-symmetric, recursively decomposable
– Optimal routing and broadcast algorithms exist
[Figure: two example Cayley graphs, a hypercube and a star graph]
3.1 Super-peer Topology: HyperCuP
• Super-peers are arranged as hypercube
• Broadcast needs n-1 messages, log2(n) hops
• High connectivity, resilient against node failures
[Figure: super-peers SP1 to SP8 arranged as a hypercube with edge dimensions 0 to 2, and the minimal spanning tree used for broadcast]
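The message and hop counts for the hypercube broadcast can be checked with a small simulation. This is a sketch under assumptions: nodes are numbered 0 to n-1 so that a dimension-d edge flips bit d, and a node forwards a message only along edge dimensions lower than the one it arrived on. The function name is hypothetical.

```python
def hypercube_broadcast(dim, origin=0):
    """Simulate a broadcast on a hypercube with 2**dim nodes.

    Returns (messages_sent, max_hops, nodes_reached).
    """
    hops = {origin: 0}
    messages = 0
    # Each entry is (node, dimension the message arrived on).
    # The originator behaves as if the message arrived on dimension `dim`,
    # so it forwards along every dimension.
    frontier = [(origin, dim)]
    while frontier:
        node, arrived = frontier.pop()
        for d in range(arrived):            # only lower edge dimensions
            neighbor = node ^ (1 << d)      # cross the dimension-d edge
            assert neighbor not in hops     # every node is reached exactly once
            hops[neighbor] = hops[node] + 1
            messages += 1
            frontier.append((neighbor, d))
    return messages, max(hops.values()), len(hops)


# Eight super-peers, as in the figure: 7 messages, 3 hops.
messages, max_hops, reached = hypercube_broadcast(3)
```

Because each node receives the message exactly once, no duplicate suppression is needed, which is the main advantage over plain flooding.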
3.1 Super-Peer-based Query Routing
• Database fragment summaries
• Index structure and maintenance
• Query Routing
3.1 Peer Fragment Summaries
Peer1.Doc
Identifier | Title                 | Date | Format | Language
521354021  | Csdoi sdofi sfi sfdsf | 1948 | Book   | de
593574021  | Deor aodfi sdfwe dls  | 1952 | Book   | de
534536021  | Toid sdofij cvcdova   | 1937 | Book   | de
528943021  | Csdo asofdi weor      | 1916 | Book   | de
529874521  | Epodsf csmieo mo      | 1924 | Book   | de
526983221  | Awer fzwe xhzpwf      | 1959 | Book   | de
Peer1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-1959]
Doc.Format[Book]
Doc.Language[de]
Peer2.Doc
Identifier | Title               | Date | Language | Coverage
1861978766 | Eoite odsifj woifj  | 1993 | en       | Scotland
1394875966 | Oewr svonwe         | 2005 | en       | Wales
1817305606 | Psadoifh sdafns dsf | 1999 | en       | York
1809239086 | Vsd sdfokj sfew     | 2001 | en       | West Midlands
1345398705 | Wdfj vspo sdfp dort | 1989 | en       | London
Peer2 Summary
Doc.Identifier
Doc.Title
Doc.Date[1989-2005]
Doc.Language[en]
Doc.Coverage[UK]
3.1 Super-peer / Peer Indices
• Peers forward summary to super-peer
[Figure: peers P1 to P4 attached to the super-peer backbone SP1 to SP4]
Peer1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-1959]
Doc.Format[Book]
Doc.Language[de]
Peer2 Summary
Doc.Identifier
Doc.Title
Doc.Date[1989-2005]
Doc.Language[en]
Doc.Coverage[UK]
Super-Peer1 SP/P Index
Doc.Identifier: P1, P2
Doc.Title: P1, P2
Doc.Date[1916-1959]: P1
Doc.Date[1989-2005]: P2
Doc.Format[Book]: P1
Doc.Language[de]: P1
Doc.Language[en]: P2
Doc.Coverage[UK]: P2
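How a super-peer might turn peer summaries into an SP/P index and use it as a filter can be sketched as follows. The data structures and function names are assumptions made for illustration; only the summary values come from the slide, and attributes without value restrictions (Doc.Identifier, Doc.Title) are left out for brevity.

```python
# Peer summaries as on the slide: a (low, high) tuple is a range summary,
# a set is a summary of distinct values.
PEER_SUMMARIES = {
    "P1": {"Doc.Date": (1916, 1959), "Doc.Format": {"Book"}, "Doc.Language": {"de"}},
    "P2": {"Doc.Date": (1989, 2005), "Doc.Language": {"en"}, "Doc.Coverage": {"UK"}},
}


def build_sp_p_index(summaries):
    """Invert the summaries: attribute -> list of (summary value, peer)."""
    index = {}
    for peer, summary in summaries.items():
        for attr, value in summary.items():
            index.setdefault(attr, []).append((value, peer))
    return index


def matches(summary_value, wanted):
    """Check a query value against a range summary or a value-set summary."""
    if isinstance(summary_value, tuple):
        low, high = summary_value
        return low <= wanted <= high
    return wanted in summary_value


def candidate_peers(index, conditions):
    """Peers whose summaries satisfy every condition of the query."""
    candidates = {peer for entries in index.values() for _, peer in entries}
    for attr, wanted in conditions.items():
        candidates &= {p for value, p in index.get(attr, []) if matches(value, wanted)}
    return candidates


sp1_index = build_sp_p_index(PEER_SUMMARIES)
german = candidate_peers(sp1_index, {"Doc.Language": "de"})
```

The index is a filter, not an answer: a peer listed as a candidate may still return no tuples, but a peer that is not listed cannot contribute any.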
3.1 Super-Peer Fragment Summaries
Super-Peer1: Doc
Identifier | Title                 | Date | Format | Language | Coverage
521354021  | Csdoi sdofi sfi sfdsf | 1948 | Book   | de       |
593574021  | Deor aodfi sdfwe dls  | 1952 | Book   | de       |
534536021  | Toid sdofij cvcdova   | 1937 | Book   | de       |
528943021  | Csdo asofdi weor      | 1916 | Book   | de       |
529874521  | Epodsf csmieo mo      | 1924 | Book   | de       |
526983221  | Awer fzwe xhzpwf      | 1959 | Book   | de       |
1861978766 | Eoite odsifj woifj    | 1993 |        | en       | Scotland
1394875966 | Oewr svonwe           | 2005 |        | en       | Wales
1817305606 | Psadoifh sdafns dsf   | 1999 |        | en       | York
1809239086 | Vsd sdfokj sfew       | 2001 |        | en       | West Midlands
1345398705 | Wdfj vspo sdfp dort   | 1989 |        | en       | London
SP1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-2005]
Doc.Format[Book]
Doc.Language[de, en]
Doc.Coverage[UK]
3.1 Super-peer/Super-peer Indices
SP1 Summary
Doc.Identifier
Doc.Title
Doc.Date[1916-2005]
Doc.Format[Book]
Doc.Language[de, en]
Doc.Coverage[UK]
[Figure: SP1 forwards its summary through the backbone SP1 to SP4 along every edge]
SP/SP Index at SP2 and at SP3
…
Doc.Language[de]: SP1
Doc.Language[en]: SP1
…
SP/SP Index at SP4
…
Doc.Language[de]: SP2, SP3
Doc.Language[en]: SP2, SP3
…
• Naive forwarding is not optimal
3.1 Super-peer/Super-peer Indices
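• Take edge dimension into account
• Forward SP/SP index entries only along lower edges
[Figure: backbone SP1 to SP4 with edge dimensions 0 and 1; each index entry records the neighbor and, in parentheses, the edge dimension]
SP/SP Index at SP3
…
Doc.Language[de]: SP1 (1)
Doc.Language[en]: SP1 (1)
…
SP/SP Index at SP2
…
Doc.Language[de]: SP1 (0)
Doc.Language[en]: SP1 (0)
…
SP/SP Index at SP4
…
Doc.Language[de]: SP3 (0)
Doc.Language[en]: SP3 (0)
…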
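The rule of forwarding index entries only along lower edges can be traced with a short sketch. The node numbering is an assumption chosen to fit the four-super-peer example: SP1 to SP4 are numbered 0 to 3 so that a dimension-d edge flips bit d. The function name is hypothetical.

```python
NAMES = {0: "SP1", 1: "SP2", 2: "SP3", 3: "SP4"}  # assumed numbering


def propagate_summary(origin, dim):
    """Spread origin's summary through a hypercube with 2**dim super-peers.

    Returns {receiver: (neighbor_it_came_from, edge_dimension)}, which is
    the routing information each receiver stores in its SP/SP index.
    """
    entries = {}
    frontier = [(origin, dim)]              # originator may use all dimensions
    while frontier:
        node, arrived = frontier.pop()
        for d in range(arrived):            # only lower edge dimensions
            neighbor = node ^ (1 << d)
            entries[neighbor] = (node, d)
            frontier.append((neighbor, d))
    return entries


raw = propagate_summary(origin=0, dim=2)
sp_sp_entries = {NAMES[n]: (NAMES[via], d) for n, (via, d) in raw.items()}
```

With this rule SP4 learns about SP1's data through SP3 only, so it holds a single entry instead of the two entries produced by naive forwarding.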
3.1 Query Routing
• Use SP/P and SP/SP indices as filters
– Example: SELECT * FROM Doc WHERE Language = "de" AND …
[Figure: the query is routed through the backbone SP1 to SP4 and on to the peers P1 to P4 using the indices below]
Super-Peer1 SP/P Index
…
Doc.Language[de]: P1
Doc.Language[en]: P2
…
SP/SP Index at SP3
…
Doc.Language[de]: SP1 (1)
Doc.Language[en]: SP1 (1)
…
Super-Peer4 SP/SP Index
…
Doc.Language[de]: SP3 (0)
Doc.Language[en]: SP3 (0)
…
3.1 Application: P2P Digital Library Network
• Large number of individual digital libraries (DLs)
• Autonomous institutions
• Users have to
– find relevant DLs
– search each DL they find separately
• Violates 4th law of Library Science
– “Save the time of the reader”
(Ranganathan, 1931)
3.1 DL Search Engine Solution
• Search engine approach
– "Crawl" DLs
– Copy content
– Offer unified collection
• Issues
– Search engine controls content
– Proprietary interface (or just Web crawl)
– Difficult to preserve metadata
– Single point of failure
3.1 Open Archives Initiative Solution
• Standardize metadata "crawling" interface
– OAI-PMH (Protocol for Metadata Harvesting)
• Harvesters
– collect metadata from DLs
– offer search facilities
• Issues
– No single entry point
– Harvesters control content
– Points of failure
– Incentive for harvesters?
3.1 From OAI to P2P
• Create "peer wrapper" for existing DLs
[Figure: digital libraries act as content providers and connect through their OAI-PMH interface to the super-peer backbone]
3.1 OAI-P2P – a Digital Library Network
• P2P approach:
– DLs form self-organized network
– User queries are distributed
• Advantages
– No dependency on service provider
– Each DL still controls its content
– No single point of failure
• 5th law of Library Science:
– “The library is a growing organism”
(Ranganathan, 1931)
3.1 Edutella – Discussion
• Efficiently limits query distribution to relevant peers
• Very good scalability in terms of data size
– No data movement required
– Little index maintenance efforts
• Flooding limits super-peer backbone scalability
– Will never scale to millions of peers
• Mainly query forwarding
– Initial extension to full query planning exists
• No load-balancing mechanisms
3.2 PIER
• P2P Relational Database
• Foundation: any DHT
• Extended hash interface
– put(namespace, key, value)
– get(namespace, key)
– namespace/key combination is used as hash value (DHT key)
• Extended network capabilities
– Exploit DHT structure for broadcast (required for joins and aggregate queries)
[Figure: DHT ring with nodes 0 to 15 and a broadcast spanning tree]
3.2 Application: Phi
• Phi: Public Health for the Internet
– Monitor IP network state world-wide
– Collect statistics
• Network traffic
• Latency
• …
– Malware alerts
3.2 Storing and Indexing Tuples
• Storing
– Every tuple needs a synthetic tuple key
– Choose combination of table name and tuple key as DHT
key
– Insert complete tuple into DHT using this key
• Indexing
– Additional attribute indexes are built by inserting
attribute value/tuple key pairs into the DHT
– Choose combination of attribute name and attribute value
as DHT key
– Insert tuple key as DHT value
3.2 Example
• Sample Database
– Doc(Id, Title, Date, Language)
– Author(DocId, PersonId)
– Person(Id, Name, Surname)
• Sample tuple: (456, "Critique of Pure Reason", 1781, "en")
• Storing
– put(Doc, 456, (456, "Critique...", 1781, "en"))
• Indexing on "Title" and "Date" attributes
– put(Doc.Title, "Critique...", 456)
– put(Doc.Date, "1781", 456)
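The storing and indexing steps can be tried out on a toy, single-process stand-in for a DHT. This is a sketch, not PIER's implementation: the class ToyDHT, the SHA-1 based node selection, and the helper insert_doc are assumptions made for illustration; only the put/get signatures and the key choices follow the slides.

```python
import hashlib


class ToyDHT:
    """Single-process stand-in for a DHT with a put/get(namespace, key) interface."""

    def __init__(self, num_nodes=16):
        # One dictionary per simulated node.
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, namespace, key):
        # The namespace/key combination is hashed to pick the responsible node.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, namespace, key, value):
        node = self.nodes[self._node_for(namespace, key)]
        node.setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        node = self.nodes[self._node_for(namespace, key)]
        return list(node.get((namespace, key), []))


def insert_doc(dht, doc, indexed=("Title", "Date")):
    """Store the complete tuple, then one index entry per indexed attribute."""
    dht.put("Doc", doc["Id"], doc)                     # table name + tuple key
    for attr in indexed:
        dht.put(f"Doc.{attr}", doc[attr], doc["Id"])   # attribute name + value


dht = ToyDHT()
insert_doc(dht, {"Id": 456, "Title": "Critique of Pure Reason",
                 "Date": "1781", "Language": "en"})
keys_for_1781 = dht.get("Doc.Date", "1781")
```

Every inserted tuple causes one put for the tuple itself plus one put per indexed attribute, which is why inserts become expensive as more indexes are added.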
3.2 PIER Query Plans
• DHT-Scan
– Use index to retrieve tuple key(s)
– Use key(s) to retrieve data tuple(s)
• Example
– SELECT Id, Title FROM Doc WHERE
Date = '1781' AND Lang = 'en'
– Query plan (bottom-up): dht-scan(Doc, Date='1781'), filter(Lang='en'), project({Id,Title})
• Each peer can create a query plan
• One DHT lookup per result tuple
• Filter has to be done on query originator
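The query plan above (DHT scan, then filter, then project) can be sketched on a dictionary-backed stand-in for the DHT. The class CountingDHT, the function names, and the sample tuples with Ids 457 and 458 are hypothetical; the attribute is called Language here, following the sample schema.

```python
class CountingDHT:
    """Dictionary-backed stand-in for a DHT that counts get() calls."""

    def __init__(self):
        self.store = {}
        self.lookups = 0

    def put(self, namespace, key, value):
        self.store.setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        self.lookups += 1
        return list(self.store.get((namespace, key), []))


def dht_scan(dht, table, attr, value):
    """Use the attribute index to find tuple keys, then fetch each tuple."""
    for tuple_key in dht.get(f"{table}.{attr}", value):   # one index lookup
        yield from dht.get(table, tuple_key)               # one lookup per tuple


def select_id_title(dht, date, language):
    """SELECT Id, Title FROM Doc WHERE Date = date AND Language = language."""
    scanned = dht_scan(dht, "Doc", "Date", date)
    filtered = (t for t in scanned if t["Language"] == language)  # at originator
    return [{"Id": t["Id"], "Title": t["Title"]} for t in filtered]


dht = CountingDHT()
for doc in [
    {"Id": 456, "Title": "Critique of Pure Reason", "Date": "1781", "Language": "en"},
    {"Id": 457, "Title": "Kritik der reinen Vernunft", "Date": "1781", "Language": "de"},
    {"Id": 458, "Title": "Critique of Practical Reason", "Date": "1788", "Language": "en"},
]:
    dht.put("Doc", doc["Id"], doc)
    dht.put("Doc.Date", doc["Date"], doc["Id"])

result = select_id_title(dht, "1781", "en")
```

In this run the German edition is fetched from the DHT and only then discarded by the filter, which illustrates why a selective index matters: the non-indexed condition cannot reduce the number of lookups.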
3.2 Aggregate and Range Queries
• Example
– SELECT COUNT(Id) FROM Doc WHERE
Date > '1780' AND Date < '1790'
• Use spanning tree for broadcast
• Aggregate on return
[Figure: partial counts are summed along the spanning tree on the way back to the query originator]
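Aggregating on the way back up the spanning tree can be sketched with a recursive count. The tree shape, the node names, and the local date values are hypothetical sample data; in a real network each recursive call would be a message to a child peer.

```python
# Hypothetical spanning tree rooted at the query originator "A".
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}

# Hypothetical Date values of the Doc tuples stored at each peer.
LOCAL_DATES = {
    "A": [1781],
    "B": [1785, 1795],
    "C": [1788],
    "D": [1779, 1783],
    "E": [1789],
    "F": [1790, 1784],
}


def count_in_range(node, low, high):
    """COUNT of tuples with low < Date < high in the subtree rooted at node."""
    local = sum(1 for date in LOCAL_DATES.get(node, []) if low < date < high)
    # Each child returns a single partial count instead of its tuples.
    return local + sum(count_in_range(child, low, high)
                       for child in CHILDREN.get(node, []))


total = count_in_range("A", 1780, 1790)
```

Only one number travels over each tree edge, so the result size stays constant regardless of how many tuples match.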
3.2 Join Queries
• Example
– Assume a Person tuple (789, "Kant", "Immanuel")
– SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id
AND Author.PersonId = 789
• Approach: Hierarchical Joins
– Use spanning tree for broadcast
– Do local select on peer table fragments
– Do local join on each peer
• Improves load balancing
– Forward table fragments and partial results to parent
– Repeat until query originator has all fragments
3.2 Hierarchical Joins
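[Figure: hierarchical join along the spanning tree; peers hold table fragments D1 to D3 and A1 to A3 and forward them, together with partial results, to their parents]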
3.2 PIER - Discussion
• Real query planning
• Very efficient access to individual tuples and small
result sets
• Very good scalability in terms of network size
• Degrades to broadcast for many types of queries
– Aggregate queries
– Joins
• INSERT operation expensive (see P2P Information
Retrieval)
• No load-balancing mechanisms
3.3 Piazza
• Tackles problem of "reconciling different models of
the world" (A. Halevy)
• Goal: provide a uniform interface to a set of
autonomous data sources
• New abstraction layer over multiple sources
• Introduce mappings between "world views"
– Mapping rules are specified manually by experts
– Don't need to be complete
3.3 Example – Publication Databases
[Figure: example publication database schemas, including UCSD]
3.3 Mapping Rules
• Datalog is used to specify mapping rules
Mapping from UW to UCSD:
UCSD:Member(projName, member) :-
UW:Member(_, pid, member, _),
UW:Project(pid, _, projName)
Mapping from UPenn to UCSD:
UCSD:Member(projName, member) :-
UPenn:Student(sid, member, _),
UPenn:ProjMember(pid, sid),
UPenn:Project(pid, projName, _)
UCSD:Member(projName, member) :-
UPenn:Faculty(sid, member, _),
UPenn:ProjMember(pid, sid),
UPenn:Project(pid, projName, _)
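Read operationally, the UW to UCSD rule is a join of UW:Member and UW:Project on pid, projected onto (projName, member). The sketch below evaluates it in plain Python; the sample rows and the function name are hypothetical, and "x" stands for column values the rule ignores.

```python
# Hypothetical UW relations. Column positions follow the rule body:
# UW:Member(_, pid, member, _) and UW:Project(pid, _, projName).
uw_member = [
    (1, "p1", "Alice", "x"),
    (2, "p1", "Bob", "x"),
    (3, "p2", "Carol", "x"),
    (4, "p9", "Dave", "x"),      # project p9 is unknown, so no result row
]
uw_project = [
    ("p1", "x", "Piazza"),
    ("p2", "x", "Tukwila"),
]


def ucsd_member_from_uw(member_rows, project_rows):
    """Evaluate UCSD:Member(projName, member) over the UW relations."""
    project_names = {pid: name for pid, _, name in project_rows}
    return {
        (project_names[pid], member)
        for _, pid, member, _ in member_rows
        if pid in project_names          # join condition on pid
    }


ucsd_member = ucsd_member_from_uw(uw_member, uw_project)
```

The result is a set because Datalog rules have set semantics; the two UPenn rules would simply add further tuples to the same UCSD:Member relation.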
3.3 Storing and Indexing
• Unstructured network (Gnutella-like)
• Peer keeps its database
– No exchange of data between peers
• Indexing
– Only on schema level
– Each peer maintains schema catalog of its neighbors
– Mappings stored in central catalog (hybrid system)
• could be replaced by DHT
– Replication of mappings to all relevant peers
3.3 Query Routing
• Query Flooding
– Peer translates query to schema of neighbor (if possible)
– Result tuples are converted on way back
• Queries answered by traversing semantic paths
[Figure: peers UCSD, UW, UPenn, Stanford, UC Berkeley, DBLP, and CiteSeer connected by mappings M(UCSD, UPenn), M(UW, UCSD), M(UW, Stanford), and M(Stanford, DBLP); a query Q is reformulated into Q1 to Q4 along these mappings]
3.3 Piazza - Discussion
• Supports a world of multiple schemas (more
realistic)
• Very expressive mapping mechanism
• Not scalable
– Gnutella-like topology and flooding
• Piazza mapping technique could be applied to
other network infrastructures
3.4 HiSbase
• Specialized for distributed spatial data
• Application: astronomy data
– Huge amounts of data (terabyte scale)
– Region-based queries
– Skewed data distribution
• Main ideas
– Distribute data on peers by region
– Use DHT for data access
– Use neighbor-preserving hash
function (space-filling curve)
3.4 Load Distribution
• Use Quad-Tree structure to split data space into
equally loaded regions
3.4 Data Hashing
• Use Z-Linearization for hashing coordinates
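Z-linearization (also called Z-order or Morton order) maps a two-dimensional coordinate to one number by interleaving the bits of x and y, so that points close in space tend to get nearby values. The sketch below assumes non-negative integer grid coordinates and a fixed bit width; how HiSbase derives such grid coordinates from its data is not shown here.

```python
def z_value(x, y, bits=16):
    """Interleave the lowest `bits` bits of x and y into one Z-order value.

    Bits of x go to the even positions, bits of y to the odd positions.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# The four cells of a 2x2 grid are visited in the characteristic "Z" shape.
z_order_2x2 = sorted(((x, y) for x in range(2) for y in range(2)),
                     key=lambda cell: z_value(*cell))
```

Using the Z-value as the DHT key keeps most spatially adjacent cells on the same or on neighboring peers, which is what makes region queries cheaper than with a uniform hash.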
3.4 Insertion into DHT
3.4 Query Processing
• Point query
– Simple DHT access
• Region query
– Route to arbitrary peer in range (e.g. using upper left
region boundary)
– This peer acts as coordinator
– Forward query to the peers of neighboring regions
• Until whole area is covered
– Collect results at coordinator
3.4 HiSbase - Discussion
• Very efficient for spatial queries
– But only spatial queries possible
• Not completely self-organizing
– Quad-Tree splitting needs central coordination
3. P2P Database Networks – Summary
• Challenges
– Multi-Dimensional Search Space
– Schema Heterogeneity
– Potentially large result sets
• Design Dimensions
– Network Properties (Data Placement, Topology and Routing)
– Data Access (Data Model, Query Language)
– Integration Mechanism (Mapping Representation/Creation/Usage)
• P2P Database Types
– Focus on high network scalability (e.g., PIER)
– Focus on high query expressivity (e.g., Edutella)
– Focus on information integration (e.g., Piazza)
– Focus on specific data structures (e.g., HiSbase)
3. Conclusion
• P2P Databases do already work
– although immature compared to traditional database
technology
• One size does not fit all
– Choose P2P database approach according to application
requirements
• Open problems
– Load Balancing (Replication/Caching)
– How to combine DHT and filtered flooding advantages
– Reliability (probabilistic guarantees)
– ...
