How To Make Chord Correct

Transcription

How To Make Chord Correct
How To Make Chord Correct
Pamela Zave
AT&T Laboratories—Research
One AT&T Way, Room 4D148K
Bedminster, New Jersey 07921, USA
+1 908 901 2204
[email protected]
February 10, 2014
Abstract
The Chord distributed hash table (DHT) is well-known, frequently
used to implement peer-to-peer systems, and frequently used as a foundation for research on designing DHTs with additional properties. Despite claims of proven correctness, previous work has shown that the
Chord ring-maintenance protocol is not correct. It has not, however,
discovered whether Chord could be made correct. The principle contribution of this paper is to solve the problem by providing the first
specification of a correct version of Chord, and a proof of its correctness. In addition the paper provides a simple, necessary, and sufficient
inductive invariant, as well as interesting new insights into the workings of distributed systems that use ring-shaped data structures in
large identifier spaces.
This paper is a regular submission.
It is not eligible for a student paper award.
i
1
Introduction
Peer-to-peer systems are distributed systems featuring decentralized control,
self-organization of similar nodes, and scalability. A distributed hash table
(DHT) is a peer-to-peer system that implements a persistent key-value store.
It can be used for shared file storage, group directories, and many other
purposes.
The distributed hash table Chord was first presented in a 2001 SIGCOMM paper [20]. This paper has been the fourth-most-cited paper in
computer science for several years (according to Citeseer), and won the
2011 SIGCOMM Test-of-Time Award.
The nodes of a Chord network have identifiers in an m-bit identifier
space, and reach each other through pointers in this identifier space. Because
the pointer structure is based on adjacency in the identifier space, and 2m −1
is adjacent to 0, the structure of a Chord network is a ring.
The ring structure is disrupted when nodes join, leave, or fail. The
original Chord papers [20, 21] specify a ring-maintenance protocol whose
minimum correctness property is eventual reachability: given ample time
and no further disruptions, the ring-maintenance protocol can repair all
disruptions in the ring structure. If the protocol is not correct in this sense,
then some nodes of a Chord network can become permanently unreachable
from other nodes.
The introductions of the original Chord papers say, “Three features that
distinguish Chord from many other peer-to-peer lookup protocols are its
simplicity, provable correctness, and provable performance.” An accompanying PODC paper [12] lists invariants of the ring-maintenance protocol.
The claims of simplicity and performance are certainly true. The Chord
algorithms are far simpler and more completely specified than those of other
DHTs, such as Pastry [18], Tapestry [25], CAN [17], and Kademlia [14].
There is no attempt to specify or enforce fairness on distributed nodes.
There are no atomic operations involving multiple nodes.
The ease of implementing Chord is probably the reason for its popularity as a component of peer-to-peer systems. Its fundamental simplicity is
probably the reason for its popularity as a basis for building DHTs with
stronger guarantees and additional capabilities, such as protection against
malicious peers [2, 4, 16], key consistency [6], range queries [7], and atomic
access to replicated data [13, 15].
Unfortunately, the claim of correctness is not true. The original specification does not have eventual reachability, and none of the claimed invariants
is actually an invariant [23]. The principal contribution of this paper is to
1
solve the problems that were revealed in [23], by providing the first specification of a correct version of Chord, and a proof of its correctness.
These new results are achieved with lightweight modeling, a technology
that has advanced to maturity in recent years. Lightweight modeling consists
of building small formal models of key concepts of a system, and checking
properties of them by means of exhaustive enumeration of instances over a
bounded domain. This push-button analysis either yields a counterexample,
or proves that the property holds in the bounded domain. In this paper
lightweight models are written in the Alloy language and checked with the
Alloy Analyzer [8]; the reasons for this choice are discussed in [24].
The results in this paper are significant in three major ways:
(1) Many people implement Chord, use Chord as a component of their
distributed systems, or base their research on Chord. They should have a
correct version of Chord to use. They should also have a correct invariant
for Chord, as this is a design principle for enhancing DHT security [19].
(2) Many people reason about Chord behavior for the purposes of their
research. This reasoning should have a sound foundation. For example, the
performance analysis in [10] makes incorrect assumptions about Chord behavior [23]. The research on augmenting and strengthening Chord, as referenced above, relies on informal descriptions of Chord and informal reasoning
about its behavior. As automated proof checking increasingly becomes the
norm in distributed systems, attempts to prove properties of systems based
on original Chord will fail or yield unsound results.
(3) As explained in Section 4, efforts to find the best version of Chord and
the best invariant for a proof have led to interesting insights into how Chord
works. People who build on Chord should be aware of these properties so
as to preserve them. Some principles may be applicable to all systems that
use ring-shaped data structures in large identifier spaces (e.g., [1, 17]).
The paper begins with an overview of Chord using the revised, correct
ring-maintenance operations (Section 2), and a new specification of these operations (Section 3). Although the specification is informal for maximum accessibility, it is a paraphrase of a formal specification in Alloy. The complete
Alloy model, including specification, invariant, and all steps of the proof, can
be found at http://www2.research.att.com/~pamela/chord.html.
Correct operations are necessary but not sufficient. It is also necessary
to have an initialization condition and an invariant to use in constructing a
proof. Original Chord is initialized with a network of one node; this is not
correct, and Section 4 shows why. Chord must be initialized and maintained
with a ring containing a minimum of r + 1 nodes, where r is the length of
each node’s list of successors.
2
Section 4 shows that it is also necessary for r + 1 initial nodes to form a
“stable base,” in the sense that these nodes remain members of the network
throughout its lifetime. The stable base is sufficient to enforce a simple
invariant and allow a straightforward proof of correctness (Section 5).
The stable base is an interesting requirement, because the stable base has
few members and a Chord network can have millions of them. This means
that there are arbitrarily large segments of the ring that are far from any
base member, and that the base members can have no local effect on. The
effect of the stable base is global, preventing complications due to “wrapping
around the ring” in a sense explained in Section 4.
The stable base is also a fortunate requirement, because it is simple,
eliminates the need for additional implementation mechanisms, and has a
practical effect only when the ring is small. Once a Chord network reaches
a reasonable size, it can be ignored as an implementation requirement.
Although other researchers have found problems with Chord implementations [5, 9, 22], they have not discovered any problems with the specification of Chord. Other work on verifiable ring maintenance operations [11]
uses multi-node atomic operations, which are avoided by Chord.
2
Overview of correct Chord
Every member of a Chord network has an identifier (assumed unique) that
is an m-bit hash of its IP address. Every member has a successor list of
pointers to other members. The first element of this list is the successor,
and is always shown as a solid arrow in the figures. Figure 1 shows two Chord
networks with m = 6, one in the ideal state of a ring ordered by identifiers,
and the other in the valid state of an ordered ring with appendages. In
the networks of Figure 1, key-value pairs with keys from 31 through 37
are stored in member 37. While running the ring-maintenance protocol, a
member also acquires and updates a predecessor pointer, which is always
shown as a dotted arrow in the figures.
The ring-maintenance protocol is specified in terms of three operations,
each of which changes the state of at most one member. In executing an operation, the member queries another member or sequence of members, then
updates its own pointers if necessary. The specification of Chord assumes
that inter-node communication is bidirectional and reliable, so we are not
concerned with Chord behavior when inter-node communication fails.
A node becomes a member in a join operation. A member node is also
referred to as live. When a member joins, it contacts an existing member
3
63
53
50
10
48
16
37
9
62
62
10
48
16
37
30
30
Figure 1: Ideal (left) and valid (right) networks. Members are represented
by their identifiers. Solid arrows are successor pointers.
7
10
10
joins
19
7
7
10
10
stabilizes
10
19
rectifies
19
19
7
7
7
stabilizes
10
10
rectifies
19
Figure 2: A new node becomes part of the ring. A gray circle marks the
pointer updated by an operation, if any. Dotted arrows are predecessors.
and gets its own current successor from that member. (It also contacts the
current successor to get a full successor list.) The first stage of Figure 2
shows successor and predecessor pointers in a section of a network where 10
has just joined.
When a member stabilizes, it learns its successor’s predecessor. It adopts
the predecessor as its new successor, provided that the predecessor is closer
in identifier order than its current successor. Because a member must query
its successor to stabilize, this is also an opportunity for it to update its
successor list with information from the successor. Members schedule their
own stabilize operations, which should be periodic.
Between the first and second stages of Figure 2, 10 stabilizes. Because
its successor’s predecessor is 7, which is not a better successor for 10 than
its current 19, this operation does not change the successor of 10.
After stabilizing (regardless of the result), a node notifies its successor
of its identity. This causes the notified member to execute a rectify operation. The rectifying member checks whether its current predecessor is still a
member, and then adopts the notifying member as its new predecessor if the
4
19
10
notifying member is closer in identifier order than its current predecessor (or
if it has no live predecessor). In the third stage of Figure 2, 10 has notified
19, and 19 has adopted 10 as its new predecessor.
In the fourth stage of Figure 2, 7 stabilizes, which causes it to adopt
10 as its new successor. In the last stage 7 notifies and 10 rectifies, so
the predecessor of 10 becomes 7. Now the new member 10 is completely
incorporated into the ring, and all the pointers shown are correct.
The assumption of the protocol is that a member in good standing always
responds to queries in a timely fashion. A node ceases to become a member
in a fail event, which can represent failure of the machine, or the node’s
silently leaving the network. A member that has failed is also referred to
as dead. After a member fails, it no longer responds to queries from other
members. With these assumptions, members can detect the failure of other
members perfectly by noticing whether they respond to a query before a
timeout occurs. Another assumption about failure behavior is that successor
lists are long enough, and failures are infrequent enough, to ensure that a
member is never left with no live successor in its list.
Failures can produce gaps in the ring, which are repaired during stabilization. As a member attempts to query its successor for stabilization, it
may find that its successor is dead. In this case it attempts to query the
next member in its successor list and make this its new successor, continuing
through the list until it finds a live successor.
It is well known that a Chord network must preserve the following structural property. Defining a member’s best successor as its first successor
pointing to a live node (member):
• there must be a ring of best successors;
• there must be no more than one ring;
• on the ring of best successors, the nodes must be in identifier order;
• from each member in an appendage, the ring must be reachable through
best successors.
If any of these rules is violated, there is a disruption in the structure that
the ring-maintenance protocol cannot repair. The inevitable result is that
some members will be permanently unreachable from some other members.
A network is ideal when each pointer is globally correct. For example, on
the right of Figure 1, the globally correct successor of 48 is 50 because it is
the nearest member in identifier order. The minimum correctness criterion
for the ring-maintenance protocol is simple: In any execution state, if there
5
are no subsequent join or fail events, then eventually the network will become
ideal and remain ideal. This is not a particularly stringent requirement, as
it allows the protocol ample time and no further disruptions while it works
to repair the ring.
The Chord papers define the lookup protocol, which is not discussed
here. They also define the maintenance and use of finger tables, which
improve lookup speed by providing pointers that cross the ring like chords
of a circle. Finger tables are built from successor lists and the correctness of
ring maintenance does not depend on them, so they are not discussed here.
3
Specification of ring-maintenance operations
Appendix A contains pseudocode for the join, stabilize, and rectify operations, following the overview in the previous section. In format, it is much
more complete and explicit than the original specification of Chord, particularly with respect to communication between nodes.
There is a type Identifier which is a string of m bits. Implicitly,
whenever a member transmits the identifier of a member, it also transmits
its IP address so that the recipient can reach the identified member. The pair
is self-authenticating, as the identifier must be the hash of the IP address
according to a chosen function.
The Boolean function between is used to check the order of identifiers,
and will be referred to in Section 4. Because identifier order wraps around
at zero, it is meaningless to compare two identifiers—each precedes and
succeeds the other. This is why between has three arguments:
Boolean function between (n1, n2, n3: Identifier) {
if (n1 < n3) { n1 < n2 && n2 < n3 }
else
{ n1 < n2 || n2 < n3 }
}
It is important to note that, for all distinct x and y, between(x,y,x) is
always true, and between(x,x,y) and between(y,x,x) are always false.
The join, stabilize, and rectify operations are quite different from the
original pseudocode specification of Chord. There are two major changes.
First, multiple smaller operations are assembled into larger operations. This
ensures that the successor lists of members are always fully populated, rather
than having missing entries to be filled in by later operations. Second, before
incorporating a pointer to a node into its state, a member usually checks
that it is live. This prevents cases where a member replaces a pointer to a
live node with a pointer to a dead one. Both changes increase the amount
6
62
37
48
37
48
stabilizes,
62
rectifies
48
62
37
48
fails
62
Figure 3: Why the ring cannot be initialized at size 1. Dashed arrows are
second-successor pointers. Predecessor pointers are not shown in the last
two stages, as they are irrelevant.
of useful information available to members. In [23] there are numerous
examples of how the previous operations can result in fatal errors or violation
of desirable invariants.
Including both major and minor changes, the new specification has three
sources:
• pseudocode and text selected from the three papers [12, 20, 21];
• clarifications about the orginal implementation of Chord [3];
• bug fixes found with extensive Alloy modeling and analysis, as described further in the next two sections.
4
Initialization and invariant
Correct operations are necessary but not sufficient. We also need an initialization condition and an invariant to use in constructing a proof.
Original Chord initializes a network with a single member that is its own
successor, i.e., the initial network is a ring of size 1. This is not correct,
as shown in Figure 3. In this example, the length r of the successor list
is 2. The problem is that when the ring is too small, successor lists must
start with some equal entries (62 and 37 start with both list entries equal).
According to the operating assumptions, 48 can fail, leaving members 62
and 37 with insufficient information to find each other. Only when members
have r distinct entries in their successor lists do the operating assumptions
guarantee that each member will always have at least one live successor.
For members to have r distinct entries in their successor lists, a Chord
network must be initialized and maintained with a minimum ring size of r+1.
7
However, this is not sufficient. In addition, r + 1 of the initial ring members
must serve as a “stable base” of ring members that are highly available and
will continue to be members throughout the life of the network. The typical
range for r is 3-5, so the typical stable base would require 4 to 6 members.
The stable base is necessary for correctness. Briefly, we say that a member n1 skips another member n2 if n2 is not in the successor list of n1, yet
between(n1, n2, n1.lastSucc), where n1.lastSucc is the last member of
the successor list. Now consider a network whose ring has shrunk (through
failures) to its minimum size, and each remaining ring member skips the
next remaining ring member, having in its successor list the second-closest
remaining ring member. If the minimum size is odd, the result is a ring
whose members are not in identifier order. If the minimum size is even,
the result is two disjoint rings. The stable base prevents these situations
by ensuring that the remaining nodes are the nodes of the stable base,
and that—because they have always been members, rather than failing and
rejoining—they are not skipped in successor lists.
The good news about the stable base is that it is not only necessary, but
sufficient. The remainder of this section describes an invariant developed on
the idea of a stable base. It also explains how the invariant works and why
this is all good news. A Chord network can be initialized to any state that
satisfies this invariant.
The first conjuncts of the invariant, which is paraphrased completely here
and expressed in Alloy in Appendix B, are merely the formalizations of the
structural properties in Section 2. The conjunct StructuredSuccessorList
says that each node’s successor list has r distinct entries. Also, considering
node n’s extended successor list append(n, n.succList), for all contiguous
sublists (x, y, z), between(x, y, z) holds. The conjunct BaseNotSkipped
says that, for all contiguous sublists (x, y) of node n’s extended successor
list, and for all members b of the stable base, between(x, b, y) is false.
Note that x = b or y = b is allowed.
Figure 4 illustrates a failure that cannot occur when there is a stable
base, for r = 2. Although the initial phase violates the invariant (between
its first successor 3 and its second successor 45, 52 skips over 20 and 31,
at least one of which must be in the stable base of size 3), its individual
peculiarities are not impossible. 45 is an appendage that attaches to the
ring in the wrong place. The second successor of 52 points outside the ring.
Examples in [23] show how both peculiarities can occur (and still can, even
with the revised operations).
If we look at the extended successor lists (52, 3, 45) and (45, 3, 20), both
are well-formed in the sense that they satisfy StructuredSuccessorList.
8
45
45
3
52
20
3
fails
31
20
52
31
Figure 4: Without the stable base, failure can leave the ring disordered.
Only the relevant pointers are drawn.
However, (52, 3, 45) implies that 45 is close to and clockwise of 3, and (45,
3, 20) implies that 45 is close to and counterclockwise of 3. Both could only
be realistic if the ring were smaller than it really is. In other words, together
these two successor lists imply a tight “wrap around” of identifiers that is
not correct. When 3 fails, the ring becomes disordered.
The stable base works because it provides enough permanent spacers
around the ring to prevent such anomalies. Note that its beneficial effect
relies only on r, not on the distribution of base nodes around the ring. All
r + 1 base nodes could be adjacent, and it would work perfectly well.
The first advantage of this invariant is that it is simple. There are localized structural properties that were studied as possible parts of an invariant,
but they are far more complex (as well as being neither necessary nor sufficient). The full paper will discuss these properties, as they might be useful
for security (in the spirit of [19]).
The second advantage of this invariant is that it is powerful in eliminating the need for additional implementation mechanisms. For example,
some plausible localized properties can be falsified when a node rejoins after failing, if other nodes still have pointers to it from its last episode of
membership. Their obsolete pointers can be misleading. To prevent this
problem and enforce the plausible properties, it would be necessary to enforce a “blackout interval” during which a failed node cannot rejoin. This
mechanism would be both uncertain and inconvenient, and it is not necessary with a stable base.
The third and most important advantage of this invariant is that a stable
base has few nodes and a Chord network can have millions. This means
that there are arbitrarily large segments of the ring that are far from any
base member, and that the base members can have no local effect on. The
problems prevented by the stable base can only occur when the ring is small.
9
As a result, once a Chord network reaches a reasonable size, the effect of
the stable base is negligible and it can be ignored as an implementation
requirement.
5
Proof of correctness
The formal model uses shared memory communication between nodes to
simulate queries. Concurrency has an interleaving semantics. The interleaved atomic steps model the local computations performed by nodes between or after queries. The following proof steps can all be found in full at
the previously given URL.
(1) Check that the computation steps of the operations are consistent.
The join operation has two steps such that the first establishes a precondition
for the second. Since the precondition concerns the stable base, it cannot
be falsified by interleaved computations.
(2) Check that each computation step preserves the invariant.
(3) Define and validate effectiveness predicates for the stabilize and rectify operations. If EffectiveOpEnabled[n,t] for some node n and state t,
then the referenced operation is enabled for n in state t, and will change the
state of n if it occurs.
(4) Check that in any valid state that is not ideal, some effective stabilize
or rectify operation is enabled. Also check that in any ideal state, no effective
operation is enabled. This shows that stabilize and rectify can always change
a non-ideal state, and can never change an ideal state.
(5) Define a measure of the error in a Chord network, such that the
measure of an ideal network is 0 and the measure of a non-ideal network
is a positive integer. Show that every effective stabilize or rectify reduces
the measure. This proves that a network with no new joins or fails will
eventually become ideal.
Step 5 is manual, while all the others have been automated by the Alloy
Analyzer for r = 2 and up to 8 nodes. In general, the claim that this
constitutes a proof is based on the empirical “small scope hypothesis,” which
says that most bugs can be illustrated by small examples [8]. In particular,
there is no evidence that longer successor lists would exhibit new problems.
Concerning the number of nodes, although many new behaviors were found
by increasing the number of nodes from 5 to 6, no new behaviors were ever
found by increasing the number from 6 to 7; this makes 8 look safe as a limit.
Nevertheless, the current proof is straightforward, and open to improvement
with techniques that apply to successor lists and networks of all sizes.
10
References
[1] B. Awerbuch and C. Scheideler. The hyperring: A low-congestion deterministic data structure for distributed environments. In Proceedings
of SODA. ACM, 2004.
[2] B. Awerbuch and C. Scheideler. Towards a scalable and robust DHT.
In Proceedings of the 18th Annual ACM Symposium on Parallelism in
Algorithms and Architectures, pages 318–327. ACM, 2006.
[3] H. Balakrishnan and I. Stoica, 2013. Personal communication.
[4] A. Fiat, J. Sala, and M. Young. Making Chord robust to Byzantine
attacks. In Proceedings of the European Symposium on Algorithms,
pages 803–814. Springer LNCS 3669, 2005.
[5] M. J. Freedman, K. Lakshminarayanan, S. Rhea, and I. Stoica. Nontransitive connectivity and DHTs. In Proceedings of the 2nd Conference
on Real, Large, Distributed Systems, pages 55–60. USENIX, 2005.
[6] L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson.
Scalable consistency in Scatter. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles. ACM, October 2011.
[7] A. Gupta, D. Agrawal, and A. E. Abbadi. Approximate range selection queries in peer-to-peer systems. In Proceedings of the 1st Biennial
Conference on Innovative Data Systems Research (CIDR 2003), 2003.
[8] D. Jackson. Software Abstractions: Logic, Language, and Analysis.
MIT Press, 2006, 2012.
[9] C. Killian, J. A. Anderson, R. Jhala, and A. Vahdat. Life, death, and
the critical transition: Finding liveness bugs in systems code. In Proceedings of the 4th USENIX Symposium on Networked System Design
and Implementation, pages 243–256, 2007.
[10] S. Krishnamurthy, S. El-Ansary, E. Aurell, and S. Haridi. A statistical
theory of Chord under churn. In Peer-to-Peer Systems IV. Springer
LNCS 3640, 2005.
[11] X. Li, J. Misra, and C. G. Plaxton. Active and concurrent topology
maintenance. In Distributed Computing, pages 320–334. Springer LNCS
3274, 2004.
11
[12] D. Liben-Nowell, H. Balakrishnan, and D. Karger. Analysis of the evolution of peer-to-peer systems. In Proceedings of the 21st ACM Symposium on Principles of Distributed Computing, pages 233–242. ACM,
2002.
[13] N. Lynch, D. Malkhi, and D. Ratajczak. Atomic data access in distributed hash tables. In Proceedings of IPTPS, pages 295–305. Springer
LNCS 2429, 2002.
[14] P. Maymounkov and D. Mazi`eres. Kademlia: A peer-to-peer information system based on the XOR metric. In Proceedings of the 1st
International Workshop on Peer-to-Peer Systems, 2002.
[15] A. Muthitacharoen, S. Gilbert, and R. Morris. Etna: A fault-tolerant
algorithm for atomic mutable DHT data. MIT CSAIL Technical Report
2005-044, http://hdl.handle.net/1721.1/30555, 2005.
[16] K. Needels and M. Kwon. Secure routing in peer-to-peer distributed
hash tables. In Proceedings of the ACM Symposium on Applied Computing. ACM, 2009.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proceedings of ACM SIGCOMM.
ACM, August 2001.
[18] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object
location and routing for large-scale peer-to-peer systems. In Proceedings of the 18th IFIP/ACM International Conference on Distributed
Systems Platforms (Middleware), 2001.
[19] E. Sit and R. Morris. Security considerations for peer-to-peer distributed hash tables. In Proceedings of IPTPS. Springer LNCS 2429,
2002.
[20] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan.
Chord: A scalable peer-to-peer lookup service for Internet applications.
In Proceedings of ACM SIGCOMM. ACM, August 2001.
[21] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. F. Kaashoek,
F. Dabek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup
protocol for Internet applications. IEEE/ACM Transactions on Networking, 11(1), February 2003.
12
[22] M. Yabandeh, N. Kneˇzevi´c, D. Kosti´c, and V. Kuncak. CrystalBall:
Predicting and preventing inconsistencies in deployed distributed systems. In Proceedings of the 6th USENIX Symposium on Networked
Systems Design and Implementation. USENIX, April 2009.
[23] P. Zave. Using lightweight modeling to understand Chord. ACM SIGCOMM Computer Communication Review, 42(2), April 2012.
[24] P. Zave. A practical comparison of Alloy and Spin, 2013. Submitted
for publication.
[25] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and
J. D. Kubiatowicz. Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications,
22(1):41–53, January 2004.
Appendix A: Pseudocode specification of the ringmaintenance operations
There is a type Identifier which is a string of m bits. The parameter r is
the length of successor lists.
The function
Identifier function lookupSucc (joining: Identifier) { }
takes the identifier of a joining node and returns the identifier of its proper
successor. In other words, for two members n and lookupSucc(joining)
that are adjacent in the ring, between(n,joining,lookupSucc(joining)).
Each node has the following variables:
myIdent: Identifier;
known: Identifier;
pred: Identifier;
succList: list Identifier;
// length is r
where myIdent is the hash of its IP address, known is a member of the Chord
network known to the node when it joins, and pred is the node’s predecessor.
succList is its entire successor list; the head of this list is its first successor
or simply its successor.
To join, a node executes the following pseudocode.
13
// Join operation
newSucc: Identifier;
query known for lookupSucc(myIdent);
if (query returns before timeout) {
newSucc = lookupSucc(myIdent);
query newSucc for newSucc.succList;
if (query returns before timeout) {
succList =
append(newSucc,
butLast(newSucc.succList));
pred = null;
}
else retry Join later;
}
else retry Join later;
First, the node asks the known node to look up the node’s identifier and get
its proper successor, storing the value in newSucc. The node then queries
newSucc for its successor list. Finally the node constructs its own successor
list by concatenating newSucc and newSucc’s successor list, with the last
element of the list trimmed off to produce a result of length r. If either of
the queries fail the node has no choice but to retry again later.
To stabilize, a node executes the following pseudocode.
// Stabilize operation
newSucc: Identifier;
while (succList is not empty) {
query head(succList) for head(succList).pred and
head(succList).succList;
if (query returns before timeout) {
newSucc = head(succList).pred;
succList =
append(head(succList),
butLast(head(succList).succList));
if (between(myIdent,newSucc,head(succList))
{ query newSucc for newSucc.succList;
if (query returns before timeout)
14
succList =
append(newSucc,
butLast(newSucc.succList));
};
notify head(succList) of myIdent;
break;
}
else succList = tail(succList);
};
In the outer loop of this code, the node queries its successor for its successor’s
predecessor and successor list. If this query times out, then the node’s
successor is presumed dead. The node promotes its second successor to first
and tries again. Once it has contacted a live successor, it executes inner
code ending in a break out of the loop. The loop is guaranteed to terminate
before succList is empty, based on the assumption that successor lists are
long enough so that each list contains at least one live node.
Once it has contacted a live successor, the node first updates its successor
list with its successor’s list. It then checks to see if the new pointer it has
learned, its successor’s predecessor, is an improved successor. If so, and if
newSucc is live, it adopts newSucc as its new successor. Thus the stabilize
operation requires one or two queries for each traversal of the outer loop.
Whether or not there is a live improved successor, the node notifies its
successor of its own identity.
A node rectifies when it is notified, thereafter executing the following
pseudocode:
// Rectify operation
newPred: Identifier;
receive notification of newPred;
if (pred = null) pred = newPred;
else {
query pred to see if live;
if (query returns before timeout) {
if (between(pred,newPred,myIdent))
pred = newPred;
}
else pred = newPred;
};
15
When a node fails or leaves, it ceases to stabilize, notify, or respond to
queries from other nodes. When a node rejoins, it re-initializes its Chord
variables.
Appendix B: Alloy specification of the invariant
There is an implicit linear order on instances of type Node, which acts as
the identifier order.
For simplicity, this is a formalization for r = 2. Successor lists of arbitrary but bounded length would not be more difficult to model in Alloy, but
examples and counterexamples would be more difficult to read.
one sig Network { base: set Node } { # base = 3 }
sig Node {
succ: Node lone -> Time,
succ2: Node lone -> Time,
prdc: Node lone -> Time,
bestSucc: Node lone -> Time }
{
-- This defines bestSucc.
all t: Time |
(Member[succ.t,t] && Member[succ2.t,t] => bestSucc.t = succ.t)
&& (Member[succ.t,t] && NonMember[succ2.t,t] => bestSucc.t = succ.t)
&& (NonMember[succ.t,t] && Member[succ2.t,t] => bestSucc.t = succ2.t)
&& (NonMember[succ.t,t] && NonMember[succ2.t,t] => no bestSucc.t)
-- A node is unambiguously a member or not.
all t: Time | Member[this,t] || NonMember[this,t]
}
pred Member [n: Node, t: Time] { some n.succ.t }
pred NonMember [n: Node, t: Time] {
no n.succ.t && no n.prdc.t && no n.succ2.t }
pred Invariant [t: Time] {
OneOrderedRing[t]
&& ConnectedAppendages[t]
&& StructuredSuccessorList[t]
&& BaseNotSkipped[t]
16
}
pred OneOrderedRing [t: Time] {
let ringMembers = { n: Node | n in n.^(bestSucc.t) } |
Network.base in ringMembers
-- at least one ring containing base
&& (all disj n1, n2: ringMembers | n1 in n2.^(bestSucc.t) )
-- at most one ring
&& (all disj n1, n2, n3: ringMembers |
n2 = n1.bestSucc.t => ! Between[n1,n3,n2]
-- ordered ring
)
}
pred ConnectedAppendages [t: Time] {
let members = { n: Node | Member[n,t] } |
let ringMembers = { n: members | n in n.^(bestSucc.t) } |
all na: members - ringMembers |
-- na is in
some nc: ringMembers | nc in na.^(bestSucc.t)
-- an appendage
-- yet reaches ring
}
pred StructuredSuccessorList [t: Time] {
all n: Node | Member[n,t] => {
some n.succ2.t
-- the successor list is full
Between[n,n.succ.t,n.succ2.t]
-- successor list is ordered
-- guarantees that n != n.succ.t and n.succ.t != n.succ2.t
n != n.succ2.t
-- no self-successors
} }
pred BaseNotSkipped [t: Time] {
all n: Node |
Member[n,t] =>
{ no b: Network.base |
(Between[n, b, n.succ.t] && b ! in (n + n.succ.t))
no b: Network.base |
(Between[n.succ.t, b, n.succ2.t] && b ! in (n.succ.t + n.succ2.t))
}
}
17