Synchronization and Coordination
Coordination in Distributed Systems: Mutual Exclusion
Mutual Exclusion: Why Needed, Sources of Problems

The problem:
- N asynchronous processes; for simplicity, no failures
- guaranteed message delivery (reliable links)
- to execute a critical section (CS), each process calls:
  request(); resourceAccess(); exit()

Why mutual exclusion is needed:
- for resource sharing, to serialize concurrent updates of:
  - records in a database (record locking)
  - files (file locks in stateless file servers)
  - a shared bulletin board
- to agree on actions, e.g. whether to:
  - commit/abort a database transaction
  - agree on readings from a group of sensors
- to dynamically re-assign the role of master:
  - choose a primary time server after a crash
  - choose a coordinator after network reconfiguration

Requirements:
- At most one process is in the CS at any time.
- Requests to enter and exit are eventually granted.
- (Optional, stronger) Requests to enter are granted according to causality order.

Why Difficult?
- Centralized solutions are not appropriate: communications bottleneck.
- Fixed master-slave arrangements are not appropriate: process crashes.
- Varying network topologies (ring, tree, arbitrary) and connectivity problems.
- Failures must be tolerated if possible: link failures, process crashes.
- Impossibility results in the presence of failures, especially in the asynchronous model.

Mutual Exclusion: Two Requirements
- Safety: at most one process can be in the critical section.
- Liveness: a process requesting entry to the critical section will eventually succeed.

Safety vs. Liveness
- A safety property describes a property that always holds; informally, "nothing bad will happen."
- A liveness property describes a property that will eventually hold; informally, "something good will eventually happen."
- Exercise: to which kind do the following properties belong? deadlock freedom, mutual exclusion, bounded delay.

Coordination Problems
- Mutual exclusion: the distributed form of the critical-section problem; must use message passing.
- Leader election: needed after a crash failure has occurred, or after network reconfiguration.
- Consensus (also called Agreement): next lecture. Similar to the coordinated-attack problem; some solutions are based on multicast communication; variants exist depending on the type of failure, network, etc.

Some Solutions
- a centralized server
- Ricart and Agrawala's distributed algorithm
- token ring
- token-passing on a tree
- quorum systems

A Centralized Solution
- A centralized coordinator maintains a queue of requests, ordered by physical timestamps.
- A process wishing to enter the CS sends a request to the coordinator, and enters the CS when the coordinator grants it.
- Problems: the coordinator becomes a bottleneck and a single point of failure.

Token Rings
- Over a ring, mutual exclusion is trivial to solve by circulating a unique token: only the current token holder may enter the CS.

Token Rings: Discussion
- continuous use of network bandwidth
- delay to enter depends on the size of the ring
- causality order of requests is not respected (why?)

Ricart and Agrawala's Distributed Algorithm
1. A process requesting entry to the CS sends a request to every other process in the system, and enters the CS when it has obtained permission from every other process.
2. When does a process grant another process's request? Conflicts are resolved by logical timestamps on the requests.

How to Implement the Timestamps?
- Physical clocks? How would we synchronize them?
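The grant rule in step 2 of Ricart and Agrawala's algorithm can be sketched in code. This is a minimal sketch: the state names RELEASED/WANTED/HELD and the helper `should_defer` are illustrative assumptions, and requests are Lamport-style (timestamp, process id) pairs so that ties are broken deterministically.

```python
# Reply rule of Ricart and Agrawala's algorithm (sketch, not the full protocol).

RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def should_defer(state, own_request, incoming_request):
    """Decide whether a process defers its reply to an incoming CS request.

    A request is a (lamport_timestamp, process_id) pair; comparing the pairs
    lexicographically resolves conflicts, with ids breaking timestamp ties.
    """
    if state == HELD:
        return True                            # in the CS: always defer
    if state == WANTED:
        return own_request < incoming_request  # the earlier request wins
    return False                               # RELEASED: reply immediately

# p1 (request stamped (4, 1)) and p2 (request stamped (7, 2)) both want the CS:
assert should_defer(WANTED, (4, 1), (7, 2)) is True    # p1 defers its reply to p2
assert should_defer(WANTED, (7, 2), (4, 1)) is False   # p2 replies to p1 at once
assert should_defer(RELEASED, None, (4, 1)) is False
```

Since p2 replies to p1 but not vice versa, p1 enters the CS first; this is how logical timestamps resolve the conflict mentioned in step 2.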
- Will it work without a perfect clock synchronization scheme?
- Logical clocks? (See the logical-clock slides later in the deck.)

(A note on token rings: for an ordinary network, a logical ring has to be constructed, e.g., P0 → P1 → ... → P8 → P0, with the token circulating around it.)

Token-Based on Trees [Raymond 1989]
1. The tree is dynamically structured so that the root always holds the token.
2. Each process maintains a FIFO queue of requests for the token from its successors, and a pointer to its immediate predecessor.
3. A process requesting the token, or receiving a request from a successor, appends the request to its queue and then asks its predecessor for the token if it does not hold the token itself.

Example (P1 initially holds the token; P3's predecessor is P1; P5 and P6 are successors of P3):
1. P5 and P6 request the token from P3; suppose P5's request arrives first.
2. Since P3 does not hold the token, it requests the token from its predecessor P1.
3. P3 receives the token. It removes P5 from its queue and sends the token to P5, which becomes the new predecessor of P3. Since P3's queue is still not empty, it also sends a request to the new predecessor.

Quorum Systems [Garcia-Molina & Barbara, 1985]
A quorum system is a collection of sets of processes called quora. In resource allocation, a process must acquire a quorum (i.e., lock all the quorum members) in order to access a resource.
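The P1/P3/P5/P6 walk-through above can be replayed in code. This is a minimal sketch of the token movement only: the `Node` class and its method names are my own, and a token holder is assumed to hand the token over as soon as a request reaches it (critical-section occupancy is not modeled).

```python
# Sketch of Raymond's token-on-a-tree algorithm, replaying the slides' example.

class Node:
    def __init__(self, name, parent=None, has_token=False):
        self.name = name
        self.parent = parent    # edge pointing toward the current token holder
        self.queue = []         # FIFO queue of requesters (successors or self)
        self.has_token = has_token

    def request(self, requester):
        """`requester` wants the token (pass self for a local request)."""
        was_empty = not self.queue
        self.queue.append(requester)
        if self.has_token:
            self.release()
        elif was_empty:         # keep at most one outstanding upstream request
            self.parent.request(self)

    def release(self):
        """Pass the token to the head of the queue; reverse the tree edge."""
        head = self.queue.pop(0)
        if head is self:
            return              # the token stays here: enter the CS
        self.has_token = False
        head.has_token = True
        self.parent = head      # the new holder becomes our predecessor
        head.parent = None
        if self.queue:          # still have waiters: ask the new holder again
            self.parent.request(self)
        head.grant()

    def grant(self):
        if self.queue:
            self.release()

p1 = Node("P1", has_token=True)
p3 = Node("P3", parent=p1)
p5 = Node("P5", parent=p3)
p6 = Node("P6", parent=p3)

p5.request(p5)                  # P5 asks P3; P3 forwards the request to P1
assert p5.has_token and not p1.has_token
assert p3.parent is p5          # the edge from P3 now points at the holder

p6.request(p6)                  # P6's request travels P3 -> P5
assert p6.has_token and not p5.has_token
```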
Resource allocation algorithms that use quora typically have the following advantages:
- lower message complexity
- fault tolerance

Formal Definition
Let P = {p0, p1, p2, ..., pn−1} be a set of processes. A coterie C is a subset of 2^P such that:
- Intersection: ∀ Qi, Qj ∈ C, Qi ∩ Qj ≠ ∅
- Minimality: ∀ Qi, Qj ∈ C, Qi ≠ Qj ⇒ Qi ⊄ Qj
Each set in C is called a quorum.

Some Quorum Systems
- Majority
- Tree quora
- Grid (illustrated on a 5×5 array of processes p1,1, ..., p5,5)
- Finite projective planes

Projective Planes [Garcia-Molina & Barbara, 1985]
- A projective plane is a plane satisfying the following:
  - Any line has at least two points.
  - Two points are on precisely one line.
  - Any two lines meet.
  - There exists a set of four points, no three of which are collinear.
- A projective plane is said to be of order n if a line contains exactly n+1 points.
- A projective plane of order n has the following properties:
  - Every line contains exactly n+1 points.
  - Every point is on exactly n+1 lines.
  - There are exactly n²+n+1 points and exactly n²+n+1 lines.
- The Fano plane is the projective plane of order 2.

Fully Distributed Quorum Systems
A quorum system C = {Q1, Q2, ..., Qm} over P that additionally satisfies the following conditions:
- Uniform: ∀ 1 ≤ i, j ≤ m: |Qi| = |Qj|
- Regular: ∀ p, q ∈ P: |np| = |nq|, where np is the set {Qi | p ∈ Qi}, and similarly for nq.
E.g., finite projective planes of order p^k, where p is a prime.
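The Intersection and Minimality conditions are easy to check mechanically. Below they are tried on the 3-out-of-5 majority coterie; `is_coterie` is an illustrative helper name, not notation from the slides.

```python
# Checker for the two coterie conditions over a small process set.
from itertools import combinations

def is_coterie(quora):
    quora = [frozenset(q) for q in quora]
    intersection = all(q1 & q2 for q1, q2 in combinations(quora, 2))
    minimality = all(not (q1 < q2) and not (q2 < q1)
                     for q1, q2 in combinations(quora, 2))
    return intersection and minimality

majority = list(combinations(range(5), 3))  # every 3-element subset of 5 processes
assert is_coterie(majority)                 # any two 3-subsets of 5 must overlap

# Adding a superset of an existing quorum violates Minimality:
assert not is_coterie(majority + [tuple(range(5))])
# Two disjoint quora violate Intersection:
assert not is_coterie([{0, 1}, {2, 3}])
```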
Examples:
- A coterie over {1, 2, 3}: Q1 = {1, 2}, Q2 = {1, 3}, Q3 = {2, 3}.
- The Fano plane (the projective plane of order 2) as a coterie over {1, ..., 7}: Q1 = {1, 2, 3}, Q2 = {1, 4, 5}, Q3 = {1, 6, 7}, Q4 = {2, 4, 6}, Q5 = {2, 5, 7}, Q6 = {3, 4, 7}, Q7 = {3, 5, 6}.

Maekawa's Algorithm
- A process p wishing to enter the CS chooses a quorum Q and sends lock requests to all nodes of the quorum.
- It enters the CS only when it has locked all nodes of the quorum.
- Upon exiting the CS, p unlocks the nodes.
- A node can be locked by only one process at a time.
- Conflicting lock requests at a node are resolved by priorities (e.g., timestamps): the loser must yield the lock to the higher-priority request if it cannot successfully obtain all the locks it needs.

Message Complexity of Maekawa's Algorithm
Maekawa's algorithm needs 3c to 6c messages per entry to the CS, where c is the size of the quorum a process chooses.
- Best case (3c): request locks, grant locks, release locks.
- Worst case (6c): request locks, inquire, yield locks, grant locks, return locks, release locks.

Comparison (messages per entry/exit; delay before entry, in message times; drawbacks):
- Centralized: 3 messages; delay 2; coordinator is a single point of failure.
- Distributed voting (Ricart and Agrawala): 2(n−1) messages; delay 2 if multicast is supported, 2(n−1) otherwise; crash of any process is a problem.
- Tree: O(log n) messages; delay O(log n); token loss, process crash.
- Token ring: 1 to ∞ messages; delay 1 to n−1; token loss, process crash.
- Quorum (Maekawa): 3c to 6c messages, where c is the quorum size; need to determine a suitable coterie.

The Election Problem
Many distributed algorithms require one process to act as coordinator, initiator, sequencer, or otherwise perform some special role, and therefore one process must be elected to take the job, even if processes may fail.
Requirements:
- Safety: at most one process can be elected at any time.
- Liveness: some process is eventually elected.
Assumption: each process has a unique id.
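Of the quorum systems listed earlier, the grid is the simplest to construct: with n = k² processes arranged in a k×k grid, a quorum can be one full row plus one full column, so any two quora intersect and the quorum size c is 2k−1 = O(√n). The `grid_quorum` helper below is an illustrative sketch.

```python
# Grid quorums: a row plus a column of a k x k arrangement of processes.

def grid_quorum(k, row, col):
    """Quorum chosen by the process at (row, col): its row plus its column."""
    return {(row, j) for j in range(k)} | {(i, col) for i in range(k)}

k = 5                                        # 25 processes
quora = [grid_quorum(k, r, c) for r in range(k) for c in range(k)]

# Pairwise intersection: one quorum's row always crosses the other's column.
assert all(q1 & q2 for q1 in quora for q2 in quora)
# Quorum size is 2k - 1 = 9, far smaller than a majority of the 25 processes.
assert all(len(q) == 2 * k - 1 for q in quora)
```

So Maekawa's 3c to 6c messages amount to O(√n) messages per entry with a grid, versus 2(n−1) for Ricart and Agrawala.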
Read/Write Quorums
For database concurrency control:
- every read quorum must intersect every write quorum, and
- every two write quora must intersect.

The Bully Algorithm
When a process P notices that the current coordinator is no longer responding to requests, it initiates an election, as follows:
1. P sends an ELECTION message to every process with a larger id.
2. If no one responds within some timeout period, P wins the election and becomes the coordinator.
3. If one of the higher-ups answers, it takes over the election, and P's job is done. (A process must answer an ELECTION message if it is alive.)
4. When a process is ready to take over the coordinator's role, it sends a COORDINATOR message to every process to announce this.
5. When a previously crashed coordinator recovers, it reclaims the job by sending a COORDINATOR message to every process.

Message complexity of the bully algorithm: best case n−2; worst case O(n²).

A Ring-Based Algorithm
Assumptions:
1. Processes do not know each other's ids.
2. Each process can only communicate with its neighbors.
3. All processes remain functional and reachable.
(Figure: a ring of processes circulating an ELECTION message; participants record the maximum id seen so far.)
Complexity: if only a single process (say p) starts an election, then in the worst case n−1 messages are required to "wake up" the process with the largest id (which resides to p's right), plus another 2n messages for that process to elect itself as the coordinator.
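The bully algorithm's steps 1-3 can be condensed into a round-based simulation. `bully_election` and the `alive` set (an abstraction of the failure detector / timeout mechanism) are illustrative assumptions, not the slides' notation.

```python
# Round-based sketch of the bully algorithm with an abstracted failure detector.

def bully_election(ids, alive, initiator):
    """Return (coordinator, messages_sent) for an election begun by `initiator`."""
    messages = 0
    candidate = initiator
    while True:
        higher = [p for p in ids if p > candidate]
        messages += len(higher)            # ELECTION to every larger id
        responders = [p for p in higher if p in alive]
        if not responders:                 # timeout: the candidate wins
            break
        candidate = min(responders)        # a higher-up takes over the election
        # (in reality every responder starts its own election; one suffices here)
    messages += len([p for p in alive if p != candidate])  # COORDINATOR messages
    return candidate, messages

ids = [1, 2, 3, 4, 5]
coordinator, _ = bully_election(ids, alive={1, 2, 3, 4}, initiator=2)  # 5 crashed
assert coordinator == 4      # the largest live id always wins
```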
The algorithm:
1. Each process is either a participant or a non-participant in the election. Initially, all processes are non-participants.
2. When a process wishes to initiate an election, it marks itself as a participant, and then sends an ELECTION message (bearing its own id) to its left neighbor.
3. When a process P receives an ELECTION message, it compares its own id with the id in the message. If the message's id is larger, P forwards the message. Otherwise, if P is not yet a participant, it substitutes its own id in the message and forwards it to its left neighbor; if P is already a participant, it simply discards the message. On forwarding an ELECTION message, a process marks itself as a participant.
4. When a process P receives an ELECTION message bearing its own id, it becomes the coordinator. It announces this by sending a COORDINATOR message (bearing its id) to its left neighbor, and marks itself as a non-participant.
5. When a process other than the coordinator receives a COORDINATOR message, it also marks itself as a non-participant, and then forwards the message to its left neighbor.

When Processes May Fail…
Things get complicated when processes may fail in the ring-based algorithm above, as it relies on a topology that may be destroyed when processes fail. How can we cope with this problem?

Leader Election in Mobile Ad Hoc Networks
Assumptions:
- Each node has a unique ID.
- Nodes do not know the total number of nodes in the system.
- Nodes may move, fail, or rejoin the network.
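The ring-based election can be replayed for a single initiator. `ring_election` is an illustrative sketch that tracks only the id carried by the circulating ELECTION message and the total message count.

```python
# Single-initiator sketch of the ring-based election algorithm.

def ring_election(ring, initiator_pos):
    """`ring[i]`'s left neighbor is `ring[(i + 1) % n]`; ids are unique."""
    n = len(ring)
    messages = 0
    pos = initiator_pos
    best = ring[initiator_pos]           # id carried by the ELECTION message
    while True:
        pos = (pos + 1) % n              # forward to the left neighbor
        messages += 1
        if ring[pos] == best:            # the message came back bearing my id
            break
        best = max(best, ring[pos])      # a larger id is substituted en route
    messages += n                        # COORDINATOR travels the whole ring
    return best, messages

ring = [3, 7, 0, 1, 4, 6, 5, 2]          # ids in ring order
winner, msgs = ring_election(ring, initiator_pos=0)
assert winner == 7                       # the largest id becomes coordinator
# Here the largest id sits one hop away, so 1 message "wakes it up" and
# another 2n complete the election, matching the complexity note above:
assert msgs == 1 + 2 * len(ring)
```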
Goal: design an efficient distributed algorithm for the nodes to elect a leader, so that if the system is stable, then eventually there is a unique leader in every connected component, and for every other node in the component there is a (unique) path to the leader. The system needs to be self-stabilizing!

Mutual Exclusion [Dijkstra 1965]
Only one process can access a resource at a time.

k-Exclusion [Fischer, Lynch, Burns, & Borodin 1979]
At most k processes can be in the critical section at a time.

Solutions for k-Exclusion
- Token-based: an extension of Raymond's token-based mutual exclusion algorithm?
- Permission-based: an extension of Ricart and Agrawala's algorithm? A design of quorum systems? Can the definition of ordinary quorum systems be used, or is a new definition needed?

k-Coteries
A quorum system S for k-exclusion (called a k-coterie) is a collection of subsets of processes satisfying:
- Intersection: ∀ R ⊆ S with |R| = k+1, ∃ Qi, Qj ∈ R such that Qi ≠ Qj and Qi ∩ Qj ≠ ∅
- Minimality: ∀ Qi, Qj ∈ S, Qi ≠ Qj ⇒ Qi ⊄ Qj
Are the above conditions enough? We also need a non-intersection property! Examples: k-majority, cohorts, degree-k tree quora, ...

Group Mutual Exclusion (GME) [Joung 1998]
A resource can be shared by processes of the same group, but not by processes of different groups (think of a CD jukebox).
Requirements: mutual exclusion, lockout freedom, concurrent entering.
Variations: limit the number of processes that can be in the CS; increase the number of groups that can be in the CS.

Solutions for Group Mutual Exclusion
- Token-based: an extension of Raymond's token-based algorithm?
- Permission-based: an extension of Ricart and Agrawala's algorithm? A design of quorum systems? Can ordinary quorum systems or k-coteries be used, or is a new definition needed?
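The k-coterie Intersection condition can be verified by brute force for the k-majority system mentioned above: with quora of size ⌊n/(k+1)⌋ + 1, no k+1 quora can be pairwise disjoint. The helper names below are mine.

```python
# Brute-force check of the k-coterie Intersection condition for k-majority.
from itertools import combinations

def k_intersection(quora, k):
    """Among every k+1 quora, at least two must intersect."""
    return all(any(q1 & q2 for q1, q2 in combinations(group, 2))
               for group in combinations(quora, k + 1))

n, k = 6, 2
size = n // (k + 1) + 1                   # 3 processes per quorum
k_majority = [frozenset(c) for c in combinations(range(n), size)]
assert k_intersection(k_majority, k)      # 3 quora of size 3 over 6 processes
                                          # cannot all be pairwise disjoint

# Quora one process smaller fail: {0,1}, {2,3}, {4,5} are pairwise disjoint.
too_small = [frozenset(c) for c in combinations(range(n), size - 1)]
assert not k_intersection(too_small, k)
```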
Construction of Sm
(The construction is illustrated in figures, omitted here, showing S1, S2, S3, ... built over processes p0, p1, ..., and a grid of processes Pi,j.)

Group Quorum Systems
Let P = {1, 2, ..., n} be a set of nodes. An m-group quorum system is a tuple S = (C1, C2, ..., Cm), where each Ci ⊆ 2^P satisfies:
- Intersection: ∀ 1 ≤ i ≠ j ≤ m, ∀ Q1 ∈ Ci, ∀ Q2 ∈ Cj: Q1 ∩ Q2 ≠ ∅
- Minimality: ∀ 1 ≤ i ≤ m, ∀ Q1, Q2 ∈ Ci with Q1 ≠ Q2: Q1 ⊄ Q2
We call each Ci a cartel, and each Q ∈ Ci a quorum. The degree of a cartel C is the maximum number of pairwise disjoint quora in C.

The Surficial Group Quorum System Sm
- It is balanced, uniform, and regular.
- It minimizes each process's load by letting |np| = 2 for all p ∈ P.
- Each cartel has degree √(2n / (m(m−1))).
- Each quorum has size √(2n(m−1) / m).

Time
- Time is important in computer systems: every file is stamped with the time it was created, modified, and accessed; every email, transaction, etc. is likewise timestamped; and we rely on time for setting timeouts and measuring latencies.
- Sometimes we need precise physical time, and sometimes we only need relative time.

Cristian's Algorithm
- P sends a request for the time to a time server S, which replies "it's time t".
- When P receives the reply, it should set its time to t + Ttrans, where Ttrans is the time taken to transmit the reply; Ttrans ≈ Tround / 2, where Tround is the round-trip time.
- Accuracy: let min be the minimum time to transmit a message one way. P could receive S's reply at any time in [t + min, t + Tround − min], so the accuracy is ±(Tround / 2 − min).

Synchronizing Physical Time
Observations:
- E.g., a quartz crystal clock has a drift rate of 10⁻⁶ (ordinary), or 10⁻⁷ to 10⁻⁸ (high precision); cf. an atomic clock, with a drift rate of about 10⁻¹³.
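Cristian's rule and its accuracy bound, in numbers. The helper name and all figures below are made up for illustration.

```python
# Cristian's algorithm: set the clock to t + T_round/2; accuracy +/-(T_round/2 - min).

def cristian(server_time_t, t_round, t_min):
    estimate = server_time_t + t_round / 2
    accuracy = t_round / 2 - t_min
    # The reply actually arrives somewhere in [t + min, t + T_round - min],
    # so the estimate can be off by at most `accuracy` in either direction:
    assert server_time_t + t_min <= estimate <= server_time_t + t_round - t_min
    return estimate, accuracy

# Server says t = 1000.0 s; round trip took 20 ms; one-way minimum is 4 ms.
estimate, accuracy = cristian(server_time_t=1000.0, t_round=0.020, t_min=0.004)
assert estimate == 1000.010
assert abs(accuracy - 0.006) < 1e-12
```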
Synchronizing Physical Time (contd.)
- In some systems (e.g., real-time systems), actual time is important, and we typically equip every computer host with a physical clock.
Questions:
1. How do we synchronize computer clocks with real-world clocks?
2. How do we synchronize computer clocks with one another?

Compensation for Clock Drift
- A computer clock usually can be adjusted forward but not backward. (A typical example of what can go wrong with clock adjustment: the Y2K problem.)
- Common terminology:
  - Skew (offset): the instantaneous difference between (the readings of) two clocks.
  - Drift rate: the difference between the clock and a nominal perfect reference clock, per unit of time.
- Linear adjustment: let C be the software reading of a hardware clock H; the operating system usually produces C in terms of H as C(t) = αH(t) + β.

The Network Time Protocol (NTP)
- NTP provides a service enabling clients across the Internet to be synchronized accurately to UTC, despite the large and variable message delays encountered in Internet communication.
- NTP servers are connected in a logical hierarchy, where servers at level n synchronize directly with those at level n−1 (which have a higher accuracy). The hierarchy can be reconfigured as servers become unreachable or fail.
- NTP servers synchronize with one another in one of three modes (in order of increasing accuracy): multicast on high-speed local LANs; procedure-call mode (à la Cristian's algorithm); symmetric mode (for achieving the highest accuracy).

Symmetric Mode
A pair of servers exchange timing information:
- Server A sends message m at time Ti−3 (read from A's clock); B receives it at Ti−2 (read from B's clock).
- B sends message m′ at Ti−1; A receives it at Ti.
Assume m takes time t to transfer and m′ takes time t′, and let o be the offset between the two clocks, so that at any instant B's clock reads A's clock plus o. Then:
  Ti−2 = Ti−3 + t + o   and   Ti = Ti−1 − o + t′
Assuming t ≈ t′, the offset o can be estimated as:
  oi = (Ti−2 − Ti−3 + Ti−1 − Ti) / 2
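The symmetric-mode estimate can be checked against a synthetic exchange in which the true offset is known. All the timestamps below are fabricated for illustration; B's clock runs 2.5 s ahead of A's.

```python
# NTP symmetric mode: estimate the offset o from the four timestamps.

def estimate_offset(t_send_a, t_recv_b, t_send_b, t_recv_a):
    """Return (o_i, d_i) from Ti-3, Ti-2, Ti-1, Ti as defined on the slides."""
    o_i = (t_recv_b - t_send_a + t_send_b - t_recv_a) / 2
    d_i = (t_recv_b - t_send_a) + (t_recv_a - t_send_b)    # equals t + t'
    return o_i, d_i

true_offset = 2.5            # B's clock = A's clock + 2.5
t, t_prime = 0.030, 0.010    # deliberately asymmetric transfer times
ti3 = 100.0                               # A sends m (A's clock)
ti2 = ti3 + t + true_offset               # B receives m (B's clock)
ti1 = ti2 + 1.0                           # B replies 1 s later (B's clock)
ti = ti1 + t_prime - true_offset          # A receives m' (A's clock)

o_i, d_i = estimate_offset(ti3, ti2, ti1, ti)
assert abs(o_i - (true_offset + (t - t_prime) / 2)) < 1e-9   # o_i = o + (t - t')/2
assert o_i - d_i / 2 <= true_offset <= o_i + d_i / 2         # the accuracy bound
```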
Note also that computer clocks are unlikely to tick at exactly the same rate, whether or not they are of the 'same' physical construction.

Symmetric Mode (contd.)
Since Ti−2 − Ti−3 + Ti − Ti−1 = t + t′ (call this sum di), we have o = oi + (t′ − t) / 2. Given that t, t′ ≥ 0, the accuracy of the estimate oi is:
  oi − di/2 ≤ o ≤ oi + di/2
- The eight most recent pairs ⟨oi, di⟩ are retained; the value of oi that corresponds to the minimum di is chosen to estimate o.
- Timing messages are delivered using UDP.

Logical Clocks
A logical clock Cp of a process p is a software counter that is used to timestamp events executed by p so that the happened-before relation is respected by the timestamps. The rule for increasing the counter is as follows:
- LC1: Cp is incremented before each event issued at process p.
- LC2: When a process q sends a message m to p, it piggybacks on m the current value t of Cq; on receiving m, p advances its Cp to max(t, Cp).

Logical Time: Motivation
- Event ordering is linked with the concept of causality: saying that event a happened before event b is the same as saying that event a could have affected the outcome of event b.
- If events a and b happen at processes that do not exchange any data, their exact ordering is not important.
Observations:
- If two events occurred at the same process, then they occurred in the order in which the process observes them.
- Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving it.

Illustration of Timestamps
P1's events (timestamps): x := 1 (1); y := 0 (2); send(x) to P2 (3); receive(y) from P2 (7); x := x + y (8)
P2's events (timestamps): y := 2 (1); receive(x) from P1 (4); x := 4 (5); send(x+y) to P1 (6); x := x + y (7)
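Rules LC1 and LC2 can be replayed over the two-process example above; a receive is itself an event, so LC1 applies after taking the max. The `LamportClock` class is an illustrative sketch; the asserted values reproduce the illustration.

```python
# Lamport logical clocks, replaying the P1/P2 illustration.

class LamportClock:
    def __init__(self):
        self.c = 0
    def event(self):             # LC1: increment before each event
        self.c += 1
        return self.c
    def send(self):              # the message carries the sender's value
        return self.event()
    def receive(self, t):        # LC2: advance to max(t, C), then LC1
        self.c = max(t, self.c)
        return self.event()

p1, p2 = LamportClock(), LamportClock()
assert p1.event() == 1           # P1: x := 1
assert p1.event() == 2           # P1: y := 0
m1 = p1.send()                   # P1: send(x) to P2
assert m1 == 3

assert p2.event() == 1           # P2: y := 2
assert p2.receive(m1) == 4       # P2: receive(x) from P1
assert p2.event() == 5           # P2: x := 4
m2 = p2.send()                   # P2: send(x+y) to P1
assert m2 == 6
assert p2.event() == 7           # P2: x := x + y

assert p1.receive(m2) == 7       # P1: receive(y) from P2
assert p1.event() == 8           # P1: x := x + y
```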
Causal Ordering (the Happened-Before Relation)
1. If process p executes x before y, then x → y.
2. For any message m, send(m) → rcv(m).
3. If x → y and y → z, then x → z.
Two events a and b are said to be concurrent if neither a → b nor b → a.
(Figure: three processes p, q, r with events a through e connected by messages, omitted.)

Reasoning about Timestamps
- Consequence of the rules (the Clock Condition): if a → b, then C(a) < C(b).
- Does C(a) < C(b) imply a → b?
- The partial ordering can be made total by additionally considering process ids. Suppose event a is issued by process p, and event b by process q. Then the total ordering →t can be defined as follows: a →t b iff C(a) < C(b), or C(a) = C(b) and ID(p) < ID(q).

Total Ordering of Events
- Happened-before defines a partial ordering of events (arising from causal relationships).
- We can use logical clocks satisfying the Clock Condition to place a total ordering on the set of all system events: simply order the events by the times at which they occur, and, to break ties, use (as Lamport proposed) any arbitrary total ordering of the processes, e.g., by process id.
- Using this method, we can assign a unique timestamp to each event in a distributed system, providing a total ordering of all events. This is very useful in distributed systems, e.g., for solving the mutual exclusion problem.

Reasoning about Vector Timestamps
Partial orders ≤ and < on two vector timestamps u, v are defined as follows: u ≤ v iff u[k] ≤ v[k] for all k, and u < v iff u ≤ v and u ≠ v.
Property: e happened-before f if, and only if, vt(e) < vt(f).

Vector Timestamps
Each process Pi maintains a vector of clocks VTi such that VTi[k] represents a count of the events that have occurred at Pk and that are known at Pi. The vector is updated as follows:
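The total order →t amounts to lexicographic comparison of (C(e), process id) pairs, which Python tuples provide directly; a small sketch:

```python
# Lamport's total order ->t on events represented as (timestamp, process_id).

def totally_ordered_before(a, b):
    """a ->t b iff C(a) < C(b), or C(a) = C(b) and ID(a) < ID(b)."""
    return a < b                 # tuple comparison is exactly this rule

a = (4, 1)                       # timestamp 4 at process 1
b = (4, 2)                       # timestamp 4 at process 2 (concurrent with a)
assert totally_ordered_before(a, b)        # the tie is broken by process id
assert not totally_ordered_before(b, a)
assert totally_ordered_before((3, 9), b)   # a smaller timestamp always wins
```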
1. Each process Pi initializes its VTi to all zeros.
2. When Pi generates a new event, it increments VTi[i] by 1, and VTi is assigned as the timestamp of the event. Message-sending events are timestamped, and the timestamp is piggybacked on the message.
3. When Pj receives a message with timestamp vt, it updates its vector clock as follows: VTj[k] := max(VTj[k], vt[k]) for all k (and then timestamps the receive event as in rule 2).

Illustration of Vector Timestamps
P1's events (timestamps): x := 1 ⟨1,0⟩; y := 0 ⟨2,0⟩; send(x) to P2 ⟨3,0⟩; receive(y) from P2 ⟨4,4⟩; x := x + y ⟨5,4⟩
P2's events (timestamps): y := 2 ⟨0,1⟩; receive(x) from P1 ⟨3,2⟩; x := 4 ⟨3,3⟩; send(x+y) to P1 ⟨3,4⟩; x := x + y ⟨3,5⟩
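The three update rules can be replayed over the same two-process example; the asserted vectors reproduce the illustration above. `VectorClock` is an illustrative sketch (a receive is itself an event, so rule 2 applies after the component-wise max).

```python
# Vector clocks, replaying the P1/P2 illustration.

class VectorClock:
    def __init__(self, i, n):
        self.i = i                    # this process's own index
        self.vt = [0] * n             # rule 1: start from all zeros
    def event(self):                  # rule 2: increment the own entry
        self.vt[self.i] += 1
        return list(self.vt)
    def send(self):                   # sends are timestamped events
        return self.event()
    def receive(self, vt):            # rule 3: component-wise max, then rule 2
        self.vt = [max(a, b) for a, b in zip(self.vt, vt)]
        return self.event()

p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
assert p1.event() == [1, 0]           # P1: x := 1
assert p1.event() == [2, 0]           # P1: y := 0
m1 = p1.send()                        # P1: send(x) to P2
assert m1 == [3, 0]

assert p2.event() == [0, 1]           # P2: y := 2
assert p2.receive(m1) == [3, 2]       # P2: receive(x) from P1
assert p2.event() == [3, 3]           # P2: x := 4
m2 = p2.send()                        # P2: send(x+y) to P1
assert m2 == [3, 4]
assert p2.event() == [3, 5]           # P2: x := x + y

assert p1.receive(m2) == [4, 4]       # P1: receive(y) from P2
assert p1.event() == [5, 4]           # P1: x := x + y
```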