Distributed Hash Tables (DHTs): Chord & CAN
CS 699/IT 818
Sanjeev Setia

Acknowledgements
The following slides are borrowed or adapted from talks by Robert Morris (MIT) and Sylvia Ratnasamy (ICSI)
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Robert Morris, Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan
MIT and Berkeley
SIGCOMM Proceedings, 2001

Chord Simplicity
- Resolution entails participation by O(log(N)) nodes
- Resolution is efficient when each node enjoys accurate information about O(log(N)) other nodes
- Resolution is possible when each node enjoys accurate information about 1 other node
  - "Degrades gracefully"
Chord Algorithms
- Basic Lookup
- Node Joins
- Stabilization
- Failures and Replication

Chord Properties
- Efficient: O(log(N)) messages per lookup
  - N is the total number of servers
- Scalable: O(log(N)) state per node
- Robust: survives massive failures
- Proofs are in paper / tech report
  - Assuming no malicious participants
Chord IDs
- Key identifier = SHA-1(key)
- Node identifier = SHA-1(IP address)
- Both are uniformly distributed
- Both exist in the same ID space
- How to map key IDs to node IDs?

Consistent Hashing [Karger 97]
- Target: web page caching
- Like normal hashing, assigns items to buckets so that each bucket receives roughly the same number of items
- Unlike normal hashing, a small change in the bucket set does not induce a total remapping of items to buckets
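To make the two slides above concrete, here is a minimal Python sketch (an illustration, not from the slides): keys and node addresses are hashed into one circular ID space with SHA-1, and a key is mapped to its successor, the node with the next-higher ID. The helper names and example addresses are assumptions.

import hashlib
from bisect import bisect_right

M = 160  # SHA-1 gives 160-bit identifiers

def chord_id(text: str) -> int:
    """Hash a key or an IP address into the circular ID space."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** M)

def successor(key_id: int, node_ids) -> int:
    """Return the node ID that follows key_id on the ring (with wrap-around)."""
    ring = sorted(node_ids)
    i = bisect_right(ring, key_id)
    return ring[i % len(ring)]

# Example: three nodes and one key share the same ID space.
nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
k = chord_id("my-file.txt")
print("key stored at node", successor(k, nodes))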
Consistent Hashing [Karger 97]
[Figure: circular 7-bit ID space with nodes N32, N90, N105 and keys K5 ("Key 5" maps to "Node 105"), K20, K80]
A key is stored at its successor: the node with the next-higher ID

Basic lookup
[Figure: on a ring with nodes N10, N32, N60, N90, N105, N120, node N10 asks "Where is key 80?"; the query is forwarded around the ring until the answer "N90 has K80" comes back]
Simple lookup algorithm

Lookup(my-id, key-id)
  n = my successor
  if my-id < n < key-id
    call Lookup(key-id) on node n  // next hop
  else
    return my successor            // done

- Correctness depends only on successors

"Finger table" allows log(N)-time lookups
[Figure: node N80's finger pointers reach 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128 of the way around the ring]
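A runnable sketch of this successor-only lookup (my illustration, not the authors' code), assuming each node stores only its own ID and a successor reference; the in_interval helper makes the slide's my-id < n < key-id test wrap correctly around the ring.

def in_interval(x: int, a: int, b: int) -> bool:
    """True if x lies in the half-open ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b  # the interval wraps past zero

class Node:
    def __init__(self, my_id: int):
        self.my_id = my_id
        self.successor = self   # set properly once the ring is built

    def lookup(self, key_id: int) -> "Node":
        n = self.successor
        if in_interval(key_id, self.my_id, n.my_id):
            return n            # done: the successor owns the key
        return n.lookup(key_id) # next hop: pass the query along the ring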
Lookup with fingers
Finger i points to successor of n + 2^(i-1)

Lookup(my-id, key-id)
  look in local finger table for
    highest node n s.t. my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n  // next hop
  else
    return my successor            // done

[Figure: the lookup follows finger pointers spanning 1/2, 1/4, ..., 1/128 of the ring; labels include N80, N120, 112]

Lookups take O(log(N)) hops
[Figure: node N32 issues Lookup(K19) on a ring with nodes N5, N10, N20, N40, N60, N80, N99, N110; the query reaches the key's successor in a few finger hops]
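A hedged sketch of the finger-based lookup, reusing Node and in_interval from the previous sketch; here the finger table is simply a list of node references ordered by distance, and closest_preceding_finger scans it from the farthest finger down.

class FingerNode(Node):
    def __init__(self, my_id: int):
        super().__init__(my_id)
        # fingers[i] ideally points to successor(my_id + 2**i)
        # (the 0-indexed form of the slide's "finger i -> successor of n + 2^(i-1)")
        self.fingers = []

    def closest_preceding_finger(self, key_id: int):
        # Highest finger that lies strictly between this node and the key.
        for n in reversed(self.fingers):
            if n.my_id != key_id and in_interval(n.my_id, self.my_id, key_id):
                return n
        return None

    def lookup(self, key_id: int) -> "Node":
        if in_interval(key_id, self.my_id, self.successor.my_id):
            return self.successor                  # done
        n = self.closest_preceding_finger(key_id)
        if n is None:
            return self.successor.lookup(key_id)   # fall back on the successor
        return n.lookup(key_id)                    # next hop roughly halves the distance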
Node Join - Linked List Insert
[Figure: new node N36 joins between N25 and N40; N40 currently stores keys K30 and K38]
1. Lookup(36)
Node Join (2)
2. N36 sets its own successor pointer
[Figure: N25, N36, N40; K30 and K38 still at N40]

Node Join (3)
3. Copy keys 26..36 from N40 to N36
[Figure: K30 now held by N36; K38 stays at N40]
Node Join (4)
4. Set N25's successor pointer
[Figure: N25 -> N36 -> N40; K30 at N36, K38 at N40]
- Update finger pointers in the background
- Correct successors produce correct lookups

Stabilization
- Case 1: finger tables are reasonably fresh
- Case 2: successor pointers are correct; fingers are inaccurate
- Case 3: successor pointers are inaccurate or key migration is incomplete
- Stabilization algorithm periodically verifies and refreshes node knowledge
  - Successor pointers
  - Predecessor pointers
  - Finger tables
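The join and stabilization steps can be sketched as follows (an illustration that reuses Node and in_interval from the lookup sketch; the stabilize/notify pattern follows the Chord paper, and folding the key hand-off into notify is a simplification of the slides' separate copy step).

class StabilizingNode(Node):
    def __init__(self, my_id: int):
        super().__init__(my_id)
        self.predecessor = None
        self.keys = {}        # key-id -> value, the data this node stores

    def join(self, existing: Node):
        # Node join: ask any existing node for our successor (the slides' Lookup(36)).
        self.successor = existing.lookup(self.my_id)

    def stabilize(self):
        # Run periodically: if our successor has meanwhile gained a closer
        # predecessor (e.g. a recent joiner), adopt it, then announce ourselves.
        x = getattr(self.successor, "predecessor", None)
        if x is not None and x is not self.successor and \
                in_interval(x.my_id, self.my_id, self.successor.my_id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate: "StabilizingNode"):
        # Successor side: adopt the candidate as predecessor if it is closer,
        # and hand over the keys it now owns (the slides' "copy keys" step).
        if self.predecessor is None or in_interval(
                candidate.my_id, self.predecessor.my_id, self.my_id):
            self.predecessor = candidate
            moved = {k: v for k, v in self.keys.items()
                     if not in_interval(k, candidate.my_id, self.my_id)}
            candidate.keys.update(moved)
            for k in moved:
                del self.keys[k]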
Failures and Replication
[Figure: ring with nodes N10, N80, N85, N102, N113, N120; after a failure, a Lookup(90) issued at N80 goes wrong]
N80 doesn't know its correct successor, so the lookup is incorrect

Solution: successor lists
- Each node knows its r immediate successors
- After a failure, a node will know the first live successor
- Correct successors guarantee correct lookups
- The guarantee is with some probability

Choosing the successor list length
- Assume 1/2 of the nodes fail
- P(successor list all dead) = (1/2)^r
  - i.e., P(this node breaks the Chord ring)
  - Depends on independent failure
- P(no broken nodes) = (1 - (1/2)^r)^N
- r = 2 log(N) makes this probability 1 - 1/N (a short derivation follows the next slide)

Chord status
- Working implementation as part of CFS
- Chord library: 3,000 lines of C++
- Deployed in small Internet testbed
- Includes:
  - Correct concurrent join/fail
  - Proximity-based routing for low delay
  - Load control for heterogeneous nodes
  - Resistance to spoofed node IDs
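A short derivation of the successor-list sizing claim above, assuming independent node failures (only the final result appears on the slide):

P(\text{successor list all dead}) = \left(\tfrac{1}{2}\right)^{r}
  = \left(\tfrac{1}{2}\right)^{2\log_2 N} = \frac{1}{N^{2}}

P(\text{no broken nodes}) = \left(1 - \tfrac{1}{N^{2}}\right)^{N}
  \approx 1 - \frac{N}{N^{2}} = 1 - \frac{1}{N}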
Experimental overview
- Quick lookup in large systems
- Low variation in lookup costs
- Robust despite massive failure
- See paper for more results
Experiments confirm theoretical results

Chord lookup cost is O(log N)
[Figure: average messages per lookup vs. number of nodes; the constant is 1/2]
Failure experimental setup
- Start 1,000 CFS/Chord servers
  - Successor list has 20 entries
- Wait until they stabilize
- Insert 1,000 key/value pairs
  - Five replicas of each
- Stop X% of the servers
- Immediately perform 1,000 lookups

Massive failures have little impact
[Figure: failed lookups (percent, 0 to 1.4) vs. failed nodes (percent, 5 to 50); (1/2)^6 is 1.6%]
Latency Measurements
- 180 nodes at 10 sites in US testbed
- 16 queries per physical site for random keys
- Maximum < 600 ms
- Minimum > 40 ms
- Median = 285 ms
- Expected value = 300 ms (5 round trips)

Chord Summary
- Chord provides peer-to-peer hash lookup
- Efficient: O(log(n)) messages per lookup
- Robust as nodes fail and join
- Good primitive for peer-to-peer systems
http://www.pdos.lcs.mit.edu/chord
A Scalable, Content-Addressable Network
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker

Content-Addressable Network (CAN)
- CAN: Internet-scale hash table
- Interface
  - insert(key,value)
  - value = retrieve(key)
- Properties
  - scalable
  - operationally simple
  - good performance
- Related systems: Chord/Pastry/Tapestry/Buzz/Plaxton ...
Problem Scope
- Design a system that provides the interface
  - scalability
  - robustness
  - performance
  - security
- Application-specific, higher level primitives
  - keyword searching
  - mutable content
  - anonymity

CAN: basic idea
[Figure: a hash table pictured as many (K,V) entries spread across the participating nodes]
CAN: basic idea
[Figure sequence over several slides: insert(K1,V1) enters the system and the pair (K1,V1) is stored at one of the nodes; a later retrieve(K1) locates that node and returns the value]
CAN: solution
- virtual Cartesian coordinate space
- entire space is partitioned amongst all the nodes
  - every node "owns" a zone in the overall space
- abstraction
  - can store data at "points" in the space
  - can route from one "point" to another
  - point = node that owns the enclosing zone

CAN: simple example
[Figure sequence: the coordinate space is split among nodes 1, 2, 3, 4, ... as they join, each node owning one zone]
CAN: simple example

node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route(K,V) -> (a,b)
(3) (a,b) stores (K,V)
[Figure: node I hashes K to the point (a,b) (x=a, y=b); the (K,V) pair is routed to and stored at the node whose zone contains (a,b)]

node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route "retrieve(K)" to (a,b)
[Figure: node J hashes K to the same point (a,b) and routes the request there to fetch (K,V)]

CAN
Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
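A minimal Python sketch of the insert/retrieve steps above; realizing hx and hy as the two halves of a SHA-1 digest mapped into [0, 1) is an illustrative assumption, and route is assumed to deliver a message to the node whose zone contains the given point (sketched after the routing slides below).

import hashlib

def hx(key: str) -> float:
    """Hash the key to an x coordinate in [0, 1)."""
    d = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(d[:10], "big") / 2 ** 80

def hy(key: str) -> float:
    """Hash the key to a y coordinate in [0, 1)."""
    d = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(d[10:], "big") / 2 ** 80

def insert(route, key: str, value):
    # (1) hash the key to a point, (2) route to that point,
    # (3) the owner of the enclosing zone stores (key, value).
    a, b = hx(key), hy(key)
    return route((a, b), ("store", key, value))

def retrieve(route, key: str):
    a, b = hx(key), hy(key)
    return route((a, b), ("fetch", key))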
CAN: routing table
[Figure: a node's routing table holds the coordinate zones of its immediate neighbors]

CAN: routing
[Figure: a message addressed to a point is forwarded zone by zone toward the node that owns it; labels (a,b) and (x,y) mark points in the space]
A node only maintains state for its immediate neighboring nodes

CAN: node insertion
1) Discover some node "I" already in CAN (via a bootstrap node)
2) Pick a random point (p,q) in the space
3) I routes to (p,q), discovers node J
4) Split J's zone in half… the new node owns one half
[Figure sequence: the new node learns of I from a bootstrap node; I routes to (p,q) and finds its owner J; J's zone is split between J and the new node]
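A hedged Python sketch of the pieces above: a rectangular zone, greedy forwarding toward a destination point, and the split-in-half step of node insertion. The Zone/CanNode classes and the rule "forward to the neighbor whose zone centre is closest to the destination" are illustrative assumptions, not the CAN implementation.

import math

class Zone:
    """An axis-aligned rectangle [x_lo, x_hi) x [y_lo, y_hi) in the unit square."""
    def __init__(self, x_lo, x_hi, y_lo, y_hi):
        self.x_lo, self.x_hi, self.y_lo, self.y_hi = x_lo, x_hi, y_lo, y_hi

    def contains(self, p):
        x, y = p
        return self.x_lo <= x < self.x_hi and self.y_lo <= y < self.y_hi

    def centre(self):
        return ((self.x_lo + self.x_hi) / 2, (self.y_lo + self.y_hi) / 2)

    def split(self):
        """Split in half along the longer side; return (kept, given_away)."""
        if self.x_hi - self.x_lo >= self.y_hi - self.y_lo:
            mid = (self.x_lo + self.x_hi) / 2
            return (Zone(self.x_lo, mid, self.y_lo, self.y_hi),
                    Zone(mid, self.x_hi, self.y_lo, self.y_hi))
        mid = (self.y_lo + self.y_hi) / 2
        return (Zone(self.x_lo, self.x_hi, self.y_lo, mid),
                Zone(self.x_lo, self.x_hi, mid, self.y_hi))

class CanNode:
    def __init__(self, zone: Zone):
        self.zone = zone
        self.neighbors = []   # nodes whose zones abut ours

    def route(self, point, message):
        """Greedy forwarding: hand the message to the neighbor whose zone
        centre is closest to the destination point."""
        if self.zone.contains(point):
            return self.deliver(point, message)
        nxt = min(self.neighbors,
                  key=lambda n: math.dist(n.zone.centre(), point))
        return nxt.route(point, message)

    def deliver(self, point, message):
        return (self, message)   # placeholder for store/fetch handling

    def accept_join(self, new_node: "CanNode"):
        """Node-insertion step 4: split our zone and give the new node half."""
        kept, given = self.zone.split()
        self.zone, new_node.zone = kept, given
        # (Updating the neighbor lists of both nodes and of the surrounding
        #  zones is omitted in this sketch.)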
CAN: node insertion
Inserting a new node affects only a single other node and its immediate neighbors

CAN: node failures
- Need to repair the space
- recover database
  - soft-state updates
  - use replication, rebuild database from replicas
- repair routing
  - takeover algorithm

CAN: takeover algorithm
- Simple failures
  - know your neighbor's neighbors
  - when a node fails, one of its neighbors takes over its zone
- More complex failure modes
  - simultaneous failure of multiple adjacent nodes
  - scoped flooding to discover neighbors
  - hopefully, a rare event

CAN: node failures
Only the failed node's immediate neighbors are required for recovery
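A sketch of the simple-failure takeover. The slides only say that one of the failed node's neighbors takes over its zone; picking the neighbor with the smallest zone volume is the rule assumed here, and zone_volume plus the CanNode class come from the earlier routing sketch.

def zone_volume(zone: Zone) -> float:
    return (zone.x_hi - zone.x_lo) * (zone.y_hi - zone.y_lo)

def take_over(failed: CanNode):
    """Simple-failure repair: a neighbor of the failed node takes over its zone.
    The two zones may not merge into one rectangle, so the heir simply manages
    the extra zone until some background reassignment cleans things up."""
    heir = min(failed.neighbors, key=lambda n: zone_volume(n.zone))
    heir.extra_zones = getattr(heir, "extra_zones", []) + [failed.zone]
    # "Know your neighbor's neighbors": the heir inherits the failed node's
    # neighbor list so routing around the vacated zone keeps working.
    for n in failed.neighbors:
        if n is heir:
            continue
        if n not in heir.neighbors:
            heir.neighbors.append(n)
        if failed in n.neighbors:
            n.neighbors.remove(failed)
        if heir not in n.neighbors:
            n.neighbors.append(heir)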
Design recap
- Basic CAN
  - completely distributed
  - self-organizing
  - nodes only maintain state for their immediate neighbors
- Additional design features
  - multiple, independent spaces (realities)
  - background load balancing algorithm
  - simple heuristics to improve performance

Evaluation
- Scalability
- Low-latency
- Load balancing
- Robustness
CAN: scalability
- For a uniformly partitioned space with n nodes and d dimensions
  - per node, number of neighbors is 2d
  - average routing path is (d/4)(n^(1/d)) hops (a small numeric check follows the next slide)
  - simulations show that the above results hold in practice
- Can scale the network without increasing per-node state
- Chord/Plaxton/Tapestry/Buzz: log(n) neighbors with log(n) hops

CAN: low-latency
- Problem
  - latency stretch = (CAN routing delay) / (IP routing delay)
  - application-level routing may lead to high stretch
- Solution
  - increase dimensions
  - heuristics
    - RTT-weighted routing
    - multiple nodes per zone (peer nodes)
    - deterministically replicate entries
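To make the scalability numbers concrete, a small Python check of the two formulas above (2d neighbors per node, (d/4)(n^(1/d)) hops on average) for a few illustrative network sizes:

# Per-node state and average path length in a uniformly partitioned d-dimensional space.
def can_neighbors(d: int) -> int:
    return 2 * d                      # two neighbors per dimension

def can_path_hops(n: int, d: int) -> float:
    return (d / 4) * n ** (1 / d)     # average routing path length

for n in (16_000, 131_000, 1_000_000):
    for d in (2, 10):
        print(f"n={n:>9}  d={d:>2}  neighbors={can_neighbors(d):>2}  "
              f"hops~{can_path_hops(n, d):6.1f}")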
CAN: low-latency
[Figure: latency stretch vs. number of nodes (16K, 32K, 65K, 131K), with and without heuristics, for #dimensions = 2 and #dimensions = 10]
CAN: load balancing
- Two pieces
  - Dealing with hot-spots
    - popular (key,value) pairs
    - nodes cache recently requested entries
    - overloaded node replicates popular entries at neighbors
  - Uniform coordinate space partitioning
    - uniformly spread (key,value) entries
    - uniformly spread out routing load

Uniform Partitioning
- Added check
  - at join time, pick a zone
  - check neighboring zones
  - pick the largest zone and split that one
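A sketch of this join-time check, reusing CanNode, zone_volume and accept_join from the earlier sketches; comparing zones by volume is the assumed criterion for "largest".

def join_with_uniform_check(landing: CanNode, new_node: CanNode):
    """Uniform-partitioning check: instead of splitting the zone that the
    random point (p,q) happens to fall in, compare it with the neighboring
    zones and split the largest one."""
    candidates = [landing] + list(landing.neighbors)
    victim = max(candidates, key=lambda n: zone_volume(n.zone))
    victim.accept_join(new_node)   # step 4 of node insertion, applied to the largest zone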
Uniform Partitioning
65,000 nodes, 3 dimensions
[Figure: histogram of the percentage of nodes vs. zone volume (V/16 up to 8V, where V = total volume / n), with and without the check]

CAN: Robustness
- Completely distributed
  - no single point of failure
- Not exploring database recovery
- Resilience of routing
  - can route around trouble
Routing resilience
[Figure sequence: a message travels from a source zone to a destination zone; when nodes on the greedy path fail, the message is routed around the failed region]

Routing resilience
Node X::route(D)
  if (X cannot make progress to D)
    - check if any neighbor of X can make progress
    - if yes, forward message to one such neighbor
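The recovery rule above, added to the earlier greedy routing sketch as an illustration; interpreting "X cannot make progress" as "no live neighbor of X is closer to the destination", and "a neighbor can make progress" as that neighbor having a live neighbor closer than X, is an assumption.

def resilient_route(node: CanNode, point, message, alive=lambda n: True):
    """Greedy forwarding with the slide's fallback: if no live neighbor makes
    progress, hand the message to a neighbor that itself can make progress."""
    if node.zone.contains(point):
        return node.deliver(point, message)
    here = math.dist(node.zone.centre(), point)
    live = [n for n in node.neighbors if alive(n)]
    closer = [n for n in live if math.dist(n.zone.centre(), point) < here]
    if closer:
        nxt = min(closer, key=lambda n: math.dist(n.zone.centre(), point))
        return resilient_route(nxt, point, message, alive)
    # No direct progress: forward to any live neighbor that has a live
    # neighbor closer to the destination than we are.
    for n in live:
        if any(alive(m) and math.dist(m.zone.centre(), point) < here
               for m in n.neighbors):
            return resilient_route(n, point, message, alive)
    raise RuntimeError("routing failed: no neighbor can make progress")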
Routing resilience
[Figure: Pr(successful routing) vs. number of dimensions (2 to 10), for a CAN of 16K nodes with Pr(node failure) = 0.25]

Routing resilience
[Figure: Pr(successful routing) vs. Pr(node failure) (0 to 0.75), for a CAN of 16K nodes with #dimensions = 10]
Summary
- CAN
  - an Internet-scale hash table
  - potential building block in Internet applications
- Scalability
  - O(d) per-node state
- Low-latency routing
  - simple heuristics help a lot
- Robust
  - decentralized, can route around trouble