Slides - IfIS - Technische Universität Braunschweig
Transcription
Peer-to-Peer Data Management
Wolf-Tilo Balke, Sascha Tönnies
Institut für Informationssysteme, Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

Overview
• Why Peer-to-Peer Databases?
  – Federation
  – Information integration
  – Sensor networks
• P2P Databases
  – Challenges
  – Design Dimensions
• Existing P2P Database systems
  – Edutella: focus on expressivity
  – PIER: focus on scalability
  – Piazza: focus on integration
  – HiSbase: focus on scalability for spatial data

VDMS und P2P – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

1 Motivation
• Peer-to-peer data management might need some database-like functionality
  – Complex queries over possibly large volumes of data
• Examples of applications include
  – Federation of sources
  – Information integration
  – Sensor networks
  – 'New' internet

1.1 Federation of similar data providers
• Examples
  – (Digital) Libraries
  – Primary Scientific Data Providers (Gene Databases)
  – News Providers
• All nodes offer the same kind of information
• Homogeneous network (fixed schema)
• Non-P2P solutions exist, but they are not open/scalable

1.2 Information Integration
• Examples
  – Find German professors having published at least three papers at the Conference on Very Large Databases
  – Find an introductory database book in German, written by a German professor
  – Find all recordings of Mozart's 'Magic Flute' with conductors who also once conducted the Berliner Philharmoniker
• Very tedious to find with current search engines
• Needs database-like querying capabilities
• Heterogeneous network
  – Information from several databases needs to be combined

1.3 Sensor Networks
• Examples
  – Network Monitoring:
    • network maps
    • event detections
    • ...
  – Car Traffic Monitoring
• Huge number of nodes
• Small amount of data
• Homogeneous network
[Figure: screenshots from project PHI presentation, J. Hellerstein, Berkeley]

2.1 Challenges of Schema-Based P2P Networks
• Multi-Dimensional Search Space
  – DHTs only work for one dimension (one attribute)
• Schema Heterogeneity
  – Sources use different database schemas for similar information
• Potentially large result sets
  – SELECT * FROM Firewalls.BlockedPackets ...
  – Range and Aggregate Queries
• And the usual P2P challenges...
  – Trust
  – Network Churn
  – Unbalanced Popularity

2.2 Design Dimensions
• Network Properties
  – Data Placement
  – Topology and Routing
• Data Access
  – Data Model
  – Query Language
• Integration Mechanism
  – Mapping Representation
  – Mapping Creation
  – Integration Method

2.2 Data Placement
• Placement according to ownership
  – Data stays at information source
  – Full control of data by owner (access policy, availability, etc.)
  – More autonomy of single nodes
• Placement according to search strategy
  – Data is distributed according to later access mechanism (e.g., DHT)
  – No control over data access
  – More freedom to optimize query routing
• Additional caching/replication possible
  – Essential for load balancing

2.2 Topology and Routing (1)
• Unstructured Networks
  – Flooding as routing algorithm
  – Supports arbitrarily expressive queries
  – Agnostic to schema heterogeneity
  – Inefficient (filtered flooding can help)
• Short-cut networks
  – Unstructured, but continuously optimize network connections
  – Can develop into regular structures like Small-World networks
  – Clustering & filtered flooding reduce query distribution traffic
  – Fireworks routing

2.2 Topology and Routing (2)
• Super-peer networks
  – Inherit advantages and disadvantages of unstructured networks
  – Better efficiency and scaling (but still flooding)
  – Good match to distributed databases (super-peers become mediators)
• DHT Networks
  – Create separate overlay for each attribute
    • Or use multidimensional DHTs, e.g. Mercury
  – Limited query expressivity
  – Suitable for homogeneous schema
  – Not all queries are evaluated efficiently

2.2 Topology and Routing - Summary
• Local indexing
  – No knowledge about other peers
• Central indexing
  – One node holds complete index
  – Doesn't scale
  – Single point of control (and failure)
• Distributed indexing
  – Distributed Hash Tables
  – Filtered Flooding
  – Short-cut networks
  – Super-peer networks

2.2 Data Model
• Fixed set of attributes
  – Allows for sophisticated topologies
  – Inflexible
  – Applicability: custom applications
• Relational model
  – Usual database model
  – Not designed for distribution
• XML
  – Semi-structured data
• RDF
  – Semantic Web exchange format
  – Very suitable for distributed data

2.2 Query Language
• None
  – Fixed set of parameterized queries
• Relational query language
  – Always subset of SQL
• XML query language
  – XPath or XQuery
• RDF Query Language
  – SPARQL or its predecessors
  – Logic language

2.2 Mapping Representation
• Declarative
  – Translation between schema elements
  – Distributed database approaches applicable
• Procedural
  – Imperative description of how to translate/transform queries and data
• Mapping characteristics
  – Unidirectional or bidirectional
  – Simple (one-to-one) mappings or complex mappings
• Mapping of objects
  – State equality of objects in different sources

2.2 Mapping Creation
• Manual
  – Users create mappings
  – Network distributes mappings and uses them for translation
• Semi-automatic
  – System proposes mappings, based on heuristics
    • attribute name
    • similar data
  – User feedback used to validate created mappings
• Automatic
  – E.g., probabilistic mapping
  – Techniques similar to those for semi-automatic mapping

2.2 Integration Mechanism
• Query Rewriting
  – Query is translated to target schema
  – Data is translated back to source schema
  – Most common approach
• Data Rewriting
  – Data is replicated to source schema
  – Only feasible for small data sets

2.2 Existing Systems - Typology
• Focus on network scalability
  – homogeneous schema
  – low query expressivity
  – DHT as underlying network structure
• Focus on expressivity
  – super-peer or unstructured
  – unlimited query complexity
• Focus on integration
  – typically unstructured
  – query routing driven by mappings

2.2 Existing Systems – Overview
Name       Topology          Data Placement   Data Model      Query Language
PIER       DHT (Bamboo)      Distributed      Relational      SQL subset
RDFPeers   DHT (MAAN)        Distributed      RDF             -
Mercury    DHT (Symphony)    Distributed      Tuples          -
SQPeer     Super-peer        Owner            RDF             RQL
PeerDB     Unstructured      Owner            Relational      SQL subset
Edutella   Super-peer        Owner            RDF             datalog (SQL)
Piazza     Unstructured      Owner            XML             XQuery subset
GridVine   DHT (P-Grid)      Distributed      RDF             -
DRAGO      Unstructured      Owner            Descr. Logics   OWL subset
[The slide's focus columns (Scalability, Expressivity, Integration) are not recoverable from the transcript]
List not complete
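
The query-rewriting mechanism listed under 2.2 Integration Mechanism (translate the query to the target schema, translate the data back to the source schema) can be sketched in a few lines. This is only an illustration under strong assumptions: the mapping is a simple declarative one-to-one attribute mapping, queries are plain equality conditions, and all names (`SOURCE_TO_TARGET`, `rewrite_query`, `translate_back`, `REMOTE_TABLE`, the attribute names and the sample rows) are hypothetical and not taken from any of the systems above.

```python
# Hypothetical one-to-one mapping from the source schema to the target schema.
SOURCE_TO_TARGET = {"Title": "name", "Language": "lang", "Date": "year"}
TARGET_TO_SOURCE = {target: source for source, target in SOURCE_TO_TARGET.items()}

# Hypothetical data held by the remote peer, stored in the target schema.
REMOTE_TABLE = [
    {"name": "Critique of Pure Reason", "lang": "en", "year": 1781},
    {"name": "Kritik der reinen Vernunft", "lang": "de", "year": 1781},
]


def rewrite_query(conditions):
    """Translate equality conditions from the source to the target schema."""
    return {SOURCE_TO_TARGET[attr]: value for attr, value in conditions.items()}


def evaluate(conditions, table):
    """Stand-in for the remote peer: filter its local table."""
    return [row for row in table
            if all(row[attr] == value for attr, value in conditions.items())]


def translate_back(rows):
    """Translate result tuples from the target back to the source schema."""
    return [{TARGET_TO_SOURCE[attr]: value for attr, value in row.items()}
            for row in rows]


def query_remote(conditions):
    """Query rewriting: rewrite the query, evaluate remotely, convert results."""
    return translate_back(evaluate(rewrite_query(conditions), REMOTE_TABLE))
```

Calling `query_remote({"Language": "de"})` returns the matching row expressed in the source schema. Data rewriting would instead copy `REMOTE_TABLE` into the source schema up front, which is why the slides consider it feasible only for small data sets.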

3.1 Edutella: Introduction
• Initial goal: achieve interoperability between heterogeneous metadata-driven (e-learning) systems
• Provides metadata only, not the resources
  – Resources are fetched via HTTP
• Query Examples
  – "Find software engineering course lecture notes for undergraduates in German language"
  – "Find an introduction to Enterprise Java Beans for professionals"
  – "Find a course in software requirements analysis from a Swedish university"

3.1 Query Service
• Provides standardized query/retrieval of RDF metadata stored in distributed RDF repositories
• Query Exchange Language (QEL)
  – Based on Datalog (allows expression of rules)
  – RDF syntax
  – For exchange only
• Adapters enable QEL query processing on diverse backends

3.1 Query processing
• Parsers/Formatters convert between query languages
• Applications and backends are shielded from communication layer
• Query messages are exchanged in RDF/XML format
• Wrappers available for SQL, RDQL, RQL, and others
[Figure: query flow between consumer application, Edutella consumer interface, P2P network, Edutella provider interface, and provider back-end (repository); labels: app.-specific format, EQM, QEL, query parser, query formatter, repository-specific format]

3.1 Edutella Topology
• Super-Peers
• Content Providers
• Content Consumers
• Use filtered flooding in super-peer backbone
• HyperCuP topology for backbone

3.1 Cayley Graphs
• Graph representing a permutation group G, described by a set of generators
  – Regular, vertex-symmetric, recursively decomposable
  – Optimal routing and broadcast algorithms exist
[Figure: two Cayley graphs, a hypercube and a star graph]

3.1 Super-peer Topology: HyperCuP
• Super-peers are arranged as hypercube
• Broadcast needs n-1 messages, log2(n) hops
• High connectivity, resilient against node failures
[Figure: super-peers SP1 to SP8 arranged as a hypercube with edge dimensions 0, 1, 2, and the resulting minimal spanning tree]

3.1 Super-Peer-based Query Routing
• Database fragment summaries
• Index structure and maintenance
• Query Routing

3.1 Peer Fragment Summaries
Peer1.Doc (titles are placeholder strings on the slide and are omitted here)
Identifier   Date   Format   Language
521354021    1948   Book     de
593574021    1952   Book     de
534536021    1937   Book     de
528943021    1916   Book     de
529874521    1924   Book     de
526983221    1959   Book     de

Peer2.Doc (titles omitted likewise)
Identifier   Date   Language   Coverage
1861978766   1993   en         Scotland
1394875966   2005   en         Wales
1817305606   1999   en         York
1809239086   2001   en         West Midlands
1345398705   1989   en         London

Peer1 summary: Doc.Identifier, Doc.Title, Doc.Date[1916-1959], Doc.Format[Book], Doc.Language[de]
Peer2 summary: Doc.Identifier, Doc.Title, Doc.Date[1989-2005], Doc.Language[en], Doc.Coverage[UK]

3.1 Super-peer / Peer Indices
• Peers forward summary to super-peer
Super-Peer1 SP/P Index
Doc.Identifier             P1, P2
Doc.Title                  P1, P2
Doc.Date      [1916-1959]  P1
              [1989-2005]  P2
Doc.Format    [Book]       P1
Doc.Language  [de]         P1
              [en]         P2
Doc.Coverage  [UK]         P2
[Figure: peers P1 to P4 attached to super-peers SP1 to SP4; P1 and P2 send their summaries to SP1]

3.1 Super-Peer Fragment Summaries
• Super-Peer1 summarizes the union of the Peer1 and Peer2 fragments
SP1 summary: Doc.Identifier, Doc.Title, Doc.Date[1916-2005], Doc.Format[Book], Doc.Language[de, en], Doc.Coverage[UK]

3.1 Super-peer/Super-peer Indices
• SP1 forwards its summary to the other super-peers
Super-Peer2 SP/SP Index:  Doc.Language [de] SP1, [en] SP1
Super-Peer3 SP/SP Index:  Doc.Language [de] SP1, [en] SP1
Super-Peer4 SP/SP Index:  Doc.Language [de] SP2, SP3; [en] SP2, SP3
• Naive forwarding is not optimal

3.1 Super-peer/Super-peer Indices
• Take edge dimension into account
• Forward SP/SP index entries only along lower edges
Super-Peer2 SP/SP Index:  Doc.Language [de] SP1 (0), [en] SP1 (0)
Super-Peer3 SP/SP Index:  Doc.Language [de] SP1 (1), [en] SP1 (1)
Super-Peer4 SP/SP Index:  Doc.Language [de] SP3 (0), [en] SP3 (0)

3.1 Query Routing
• Use SP/P and SP/SP indices as filters
  – SELECT * FROM Doc WHERE Language = 'de' AND …
Super-Peer1 SP/P Index:   Doc.Language [de] P1, [en] P2
Super-Peer3 SP/SP Index:  Doc.Language [de] SP1 (1), [en] SP1 (1)
Super-Peer4 SP/SP Index:  Doc.Language [de] SP3 (0), [en] SP3 (0)
[Figure: the query is routed through the super-peer backbone (SP1 to SP4) to the matching peers among P1 to P4]

3.1 Application: P2P Digital Library Network
• Large number of individual DLs
• Autonomous institutions
• Users have to
  – find relevant DLs
  – search separately on every DL found
• Violates 4th law of Library Science
  – "Save the time of the reader" (Ranganathan, 1931)

3.1 DL Search Engine Solution
• Search engine approach
  – 'Crawl' DLs
  – Copy content
  – Offer unified collection
• Issues
  – Search engine controls content
  – Proprietary interface (or just Web crawl)
  – Difficult to preserve metadata
  – Single point of failure

3.1 Open Archives Initiative Solution
• Standardize metadata 'crawling' interface
  – OAI-PMH (Protocol for Metadata Harvesting)
• Harvesters
  – collect metadata from DLs
  – offer search facilities
• Issues
  – No single entry point
  – Harvesters control content
  – Points of failure
  – Incentive for harvester?

3.1 From OAI to P2P
• Create 'peer wrapper' for existing DLs
[Figure: digital libraries act as content providers and connect to the super-peer backbone through their OAI-PMH interface]

3.1 OAI-P2P – a Digital Library Network
• P2P approach:
  – DLs form self-organized network
  – User queries are distributed
• Advantages
  – No dependency on service provider
  – Each DL still controls its content
  – No single point of failure
• 5th law of Library Science:
  – "The library is a growing organism" (Ranganathan, 1931)

3.1 Edutella – Discussion
• Efficiently limits query distribution to relevant peers
• Very good scalability in terms of data size
  – No data movement required
  – Little index maintenance effort
• Flooding limits super-peer backbone scalability
  – Will never scale to millions of peers
• Mainly query forwarding
  – Initial extension to full query planning exists
• No load-balancing mechanisms
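
The HyperCuP claim from Section 3.1 (a broadcast needs n-1 messages and log2(n) hops) can be checked with a small simulation. The sketch below assumes one common forwarding rule: a node that received the message over an edge of dimension d forwards it only over edges of higher dimension, which yields a spanning tree of the hypercube. The node numbering, the forwarding direction, and the function name `hypercube_broadcast` are assumptions for illustration, not the Edutella implementation.

```python
from collections import deque


def hypercube_broadcast(dimensions, origin=0):
    """Simulate a broadcast over a hypercube with 2**dimensions nodes.

    Nodes are numbered 0 .. 2**dimensions - 1; two nodes are neighbors
    along dimension d if their ids differ exactly in bit d.
    Returns (messages sent, maximum hop count, number of nodes reached).
    """
    hops = {origin: 0}             # node id -> hop count at which it was reached
    messages = 0
    queue = deque([(origin, -1)])  # (node, dimension the message arrived on)
    while queue:
        node, arrived_on = queue.popleft()
        # Forward only along dimensions higher than the incoming one.
        for dim in range(arrived_on + 1, dimensions):
            neighbor = node ^ (1 << dim)
            messages += 1
            hops[neighbor] = hops[node] + 1
            queue.append((neighbor, dim))
    return messages, max(hops.values()), len(hops)
```

For the eight super-peers on the HyperCuP slide (`dimensions=3`), every node is reached exactly once, so no duplicate messages occur, in contrast to plain flooding.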

3.2 PIER
• P2P Relational Database
• Foundation: any DHT
• Extended hash interface
  – put(namespace, key, value)
  – get(namespace, key)
  – namespace/key combination is used as hash value (DHT key)
• Extended network capabilities
  – Exploit DHT structure for broadcast
  – Required for joins and aggregate queries
[Figure: DHT ring with nodes 0 to 15 and a spanning tree over the ring]

3.2 Application: Phi
• Phi: Public Health for the Internet
  – Monitor IP network state world-wide
  – Collect statistics
    • Network traffic
    • Latency
    • …
  – Malware alerts

3.2 Storing and Indexing Tuples
• Storing
  – Every tuple needs a synthetic tuple key
  – Choose combination of table name and tuple key as DHT key
  – Insert complete tuple into DHT using this key
• Indexing
  – Additional attribute indexes are built by inserting attribute value/tuple key pairs into the DHT
  – Choose combination of attribute name and attribute value as DHT key
  – Insert tuple key as DHT value

3.2 Example
• Sample Database
  – Doc(Id, Title, Date, Language)
  – Author(DocId, PersonId)
  – Person(Id, Name, Surname)
• Sample tuple: (456, 'Critique of pure Reason', 1781, 'en')
• Storing
  – put(Doc, 456, (456, 'Critique...', 1781, 'en'))
• Indexing on 'Title' and 'Date' attributes
  – put(Doc.Title, 'Critique...', 456)
  – put(Doc.Date, '1781', 456)

3.2 PIER Query Plans
• DHT-Scan
  – Use index to retrieve tuple key(s)
  – Use key(s) to retrieve data tuple(s)
• Example
  – SELECT Id, Title FROM Doc WHERE Date = '1781' AND Language = 'en'
  – Query plan (top to bottom):
      project({Id, Title})
      filter(Language = 'en')
      dht-scan(Doc, Date = '1781')
• Each peer can create a query plan
• One DHT lookup per result tuple
• Filter has to be done on query originator

3.2 Aggregate and Range Queries
• Example
  – SELECT COUNT(Id) FROM Doc WHERE Date > '1780' AND Date < '1790'
• Use spanning tree for broadcast
• Aggregate on return
[Figure: spanning tree in which partial counts are summed on the way back to the query originator]

3.2 Join Queries
• Example
  – Assume a Person tuple (789, 'Kant', 'Immanuel')
  – SELECT Id, Title FROM Doc, Author WHERE Author.DocId = Doc.Id AND Author.PersonId = 789
• Approach: Hierarchical Joins
  – Use spanning tree for broadcast
  – Do local select on peer table fragments
  – Do local join on each peer
    • Improves load balancing
  – Forward table fragments and partial results to parent
  – Repeat until query originator has all fragments

3.2 Hierarchical Joins
[Figure: table fragments (D1 to D3, A1 to A3) and partial join results (T11 to T33) are forwarded up the spanning tree]

3.2 PIER - Discussion
• Real query planning
• Very efficient access to individual tuples and small result sets
• Very good scalability in terms of network size
• Degrades to broadcast for many types of queries
  – Aggregate queries
  – Joins
• INSERT operation expensive (see P2P Information Retrieval)
• No load-balancing mechanisms
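
The PIER storing, indexing, and DHT-scan steps from Section 3.2 can be sketched with an in-memory stand-in for a DHT. Everything here is a simplifying assumption: `ToyDHT` is a hypothetical class that hashes the namespace/key combination to one of a fixed number of local dictionaries, there is no network or churn, and the second sample document is made up for the example. Only the `put(namespace, key, value)` and `get(namespace, key)` interface follows the slides.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a DHT; each 'node' is just a dict."""

    def __init__(self, num_nodes=16):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, namespace, key):
        # The namespace/key combination determines the responsible node.
        digest = hashlib.sha1(f"{namespace}/{key}".encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, namespace, key, value):
        self._node_for(namespace, key).setdefault((namespace, key), []).append(value)

    def get(self, namespace, key):
        return self._node_for(namespace, key).get((namespace, key), [])


def insert_doc(dht, doc):
    """Store a Doc tuple and index it on Title and Date."""
    doc_id, title, date, _language = doc
    dht.put("Doc", doc_id, doc)          # table name + tuple key -> tuple
    dht.put("Doc.Title", title, doc_id)  # attribute name + value -> tuple key
    dht.put("Doc.Date", date, doc_id)


def dht_scan(dht, table, attribute, value):
    """Use the attribute index to find tuple keys, then fetch the tuples."""
    tuples = []
    for tuple_key in dht.get(f"{table}.{attribute}", value):
        tuples.extend(dht.get(table, tuple_key))  # one lookup per result tuple
    return tuples


def run_example():
    dht = ToyDHT()
    insert_doc(dht, (456, "Critique of Pure Reason", 1781, "en"))
    insert_doc(dht, (457, "Kritik der reinen Vernunft", 1781, "de"))  # made-up row
    candidates = dht_scan(dht, "Doc", "Date", 1781)
    # filter(Language = 'en') and project({Id, Title}) run at the originator.
    return [(doc_id, title) for doc_id, title, _date, lang in candidates
            if lang == "en"]
```

`run_example()` mirrors the query plan on the slide: the index on `Doc.Date` yields two tuple keys, each costs one further lookup, and the language filter is applied only after the tuples have arrived at the originator.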

3.3 Piazza
• Tackles problem of "reconciling different models of the world" (A. Halevy)
• Goal: provide a uniform interface to a set of autonomous data sources
• New abstraction layer over multiple sources
• Introduce mappings between 'world views'
  – Mapping rules are specified manually by experts
  – Don't need to be complete

3.3 Example – Publication Databases
[Figure: publication databases of several peers, among them UCSD]

3.3 Mapping Rules
• Datalog is used to specify mapping rules
Mapping from UW to UCSD
  UCSD:Member(projName, member) :- UW:Member(_, pid, member, _), UW:Project(pid, _, projName).
Mapping from UPenn to UCSD
  UCSD:Member(projName, member) :- UPenn:Student(sid, member, _), UPenn:ProjMember(pid, sid), UPenn:Project(pid, projName, _).
  UCSD:Member(projName, member) :- UPenn:Faculty(sid, member, _), UPenn:ProjMember(pid, sid), UPenn:Project(pid, projName, _).

3.3 Storing and Indexing
• Unstructured network (Gnutella-like)
• Peer keeps its database
  – No exchange of data between peers
• Indexing
  – Only on schema level
  – Each peer maintains schema catalog of its neighbors
  – Mappings stored in central catalog (hybrid system)
    • could be replaced by DHT
  – Replication of mappings to all relevant peers

3.3 Query Routing
• Query Flooding
  – Peer translates query to schema of neighbor (if possible)
  – Result tuples are converted on the way back
• Queries answered by traversing semantic paths
[Figure: peers CiteSeer, UCSD, UW, UPenn, DBLP, Stanford, and UC Berkeley connected by mappings M(UCSD, UPenn), M(UW, UCSD), M(UW, Stanford), and M(Stanford, DBLP); a query Q is rewritten into Q1 to Q4 along these paths]

3.3 Piazza - Discussion
• Supports a world of multiple schemas (more realistic)
• Very expressive mapping mechanism
• Not scalable
  – Gnutella-like topology and flooding
• Piazza mapping technique could be applied to other network infrastructures

3.4 HiSbase
• Specialized in distributed spatial data
• Application: astronomy data
  – Huge amounts of data (terabyte scale)
  – Region-based queries
  – Skewed data distribution
• Main ideas
  – Distribute data on peers by region
  – Use DHT for data access
  – Use neighbor-preserving hash function (space-filling curve)

3.4 Load Distribution
• Use Quad-Tree structure to split data space into equally loaded regions

3.4 Data Hashing
• Use Z-Linearization for hashing coordinates

3.4 Insertion into DHT

3.4 Query Processing
• Point query
  – Simple DHT access
• Region query
  – Route to arbitrary peer in range (e.g. using upper left region boundary)
  – This peer acts as coordinator
  – Forward query to neighbors of the peer's region
    • Until whole area is covered
  – Collect results at coordinator

3.4 HiSbase - Discussion
• Very efficient for spatial queries
  – But only spatial queries possible
• Not completely self-organizing
  – Quad-Tree splitting needs central coordination

3. P2P Database Networks – Summary
• Challenges
  – Multi-Dimensional Search Space
  – Schema Heterogeneity
  – Potentially large result sets
• Design Dimensions
  – Network Properties (Data Placement, Topology and Routing)
  – Data Access (Data Model, Query Language)
  – Integration Mechanism (Mapping Representation/Creation/Usage)
• P2P Database Types
  – Focus on high network scalability (e.g., PIER)
  – Focus on high query expressivity (e.g., Edutella)
  – Focus on information integration (e.g., Piazza)
  – Focus on specific data structures (e.g., HiSbase)

3. Conclusion
• P2P Databases do already work
  – although immature compared to traditional database technology
• One size does not fit all
  – Choose P2P database approach according to application requirements
• Open problems
  – Load Balancing (Replication/Caching)
  – How to combine DHT and filtered flooding advantages
  – Reliability (probabilistic guarantees)
  – ...
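
As a supplement to 3.4 Data Hashing, the Z-linearization of two-dimensional coordinates can be sketched as plain bit interleaving. This is a generic Z-order (Morton) encoding on integer grid coordinates; how HiSbase discretizes sky coordinates and assigns quad-tree regions to peers is not specified in the slides, so the `region_for` helper and its sorted list of region start values are assumptions for illustration only.

```python
from bisect import bisect_right


def z_value(x, y, bits=16):
    """Interleave the bits of x and y (x on even, y on odd bit positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def region_for(z, region_starts):
    """Return the index of the region whose z-range contains z.

    region_starts is a sorted list of the smallest z-value of each region
    (hypothetical representation of the quad-tree leaves along the curve).
    """
    return bisect_right(region_starts, z) - 1
```

Because nearby cells tend to receive nearby z-values, a region query usually touches a small number of contiguous z-ranges, which is what makes the hash function neighbor-preserving. For example, on a 4 x 4 grid split into four quadrants with `region_starts = [0, 4, 8, 12]`, all cells of the lower-left quadrant map to region 0.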