Quorum consensus in nested-transaction systems
Transcription
Quorum consensus in nested-transaction systems
Quorum Consensus systems KENNETH J. GOLDMAN Washington University in Nested Transaction and NANCY LYNICH Massachusetts Institute of Technology Giffords Quorum Consensus algorithm for data replication is studied in the context of nested transactions and transaction failures (aborts), and a fully developed reconfiguration strategy is presented. A formal description of the algorithm is presented using the Input/Output automaton model for nested transaction systems due to Lynch and Merritt. In this description, the algorithm itself is described in terms of nested transactions. The formal description M used to construct a complete proof of correctness that uses standard assertional techniques, is based on a natural correctness condition, and takes advantage of modularity that arises from describing the algorithm as nested transactions. The proof is accomplished hierarchically, showing that a fully replicated reconfigurable system “simulates” an intermediate replicated system, and that the intermediate system simulates an unreplicated system. The presentation and proof treat issues of data replication entirely separately from issues of concurrency control and recovery. Categories and Subject Descriptors: H.2.4 [Database distributed Management]: Systems—concurrency; systems General Terms: Algorithms, Theory, Verification Additional Key Words and Phrases: Concurrency control, data replication, hierarchical proofs, 1/0 automata, nested transactions, quorum consensus 1. INTRODUCTION In distributed database systems, data are often replicated in order to improve availability, reliability, and performance. Whenever replication is used, a replication algorithm is required to ensure that replication is transparent K. J. Goldman was supported in part by the National Science Foundation under Grant CCR-91of the 6th ACM Symposium 10029. A preliminary version of this paper appeared in Proceedings on Principles of D~stributed Computing. N, Lynch was supported in part by the National Science Foundation under Grant CCR-86-11442, in part by the Defense Advanced Research Projects Agency (DARPA) under Contract NOOO14-89-J-1988,and in part by tbe Office of Naval Research under Contract NOO014-85-0168. Authors’ addresses: K. J. Goldman, Department of Computer Science, Washington University, St. Louis, MO 63130; N. Lynch, Laboratory for Computer Science, MIT, Cambridge, MA 02139. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 01994 ACM 0362-5915/94/1200-0537 $03.50 ACM Transactions on Database Systems, Vol. 19, No. 4, December 1994, Pages 537-585 538 K. J. Goldman . and N. Lynch to the user programs. In understanding replication algorithms, it is convenient to think of each logical object as being implemented by a collection of replicas and logical access programs. The replicas retain state information, and write write the user programs the logical operations invoke logical access programs in order object. The logical access programs complete by accessing some subset of the replicas. to read these read or or One of the best-known replication algorithms is the quorum consensus algorithm of Gifford [1979]. Gifford’s work does not include transaction nesting, but permits the user to access the logical objects using single-level transactions. ideas of this Based on the majority-voting algorithm of Thomas [1979], the method underlie many of the more recent and sophisticated replication techniques Abbadi et replication al. [ 1985], algorithm replicas to ensure write operation. approach (where replica suffices), consensus assigned (e.g., Eager and Sevcik [ 1983], and El Abbadi and Toueg is to ensure that each read that it will return the value Herlihy [1984], El [ 1989]). The goal of a operation sees enough written by the “preceding” This can be guaranteed by a simple read-one/write-all every replica is updated on each write, and reading a single or by a read-majority /write-majority approach. The quorum algorithm a number generalizes of votes, these and associated version number. Each ration consisting of two integers, approaches retains logical called in its as follows. state a data Each replica is value with an object X has an associated read-quorum and write-quorum. configuIf k is the total number of votes assigned to replicas of X, then the configuration for X is constrained so that read-quorum + write-quorum > k. To read X, a logical access program collects the version numbers from enough replicas so votes; then it returns the value associated that it has at least read-quorum with the highest version number. To write X, a logical access program collects the version numbers from enough replicas so that it read-quorum votes; then it writes its value with a higher version collection of replicas with at least write-quorum votes. In our discussion use a configuration described above. In [1985], a configuration of quorum consensus algorithms for data has at least number to a replication, we strategy that is slightly more general than the one this strategy, justified by Barbara and Garcia-Molina consists of a set of read quorums and a set of write quorums. Each quorum is a set of replica names; thus, the voting strategy above corresponds to the special case where any set of replicas is a read quorum if and only if the total votes among its members exceeds the threshold read-quorum. To read a logical object, a logical access program accesses all the replicas in some read quorum and chooses the value with the highest version number. To write a logical object, a logical access program first discovers the highest version number written so far by accessing all the replicas in some read quorum; then the logical access increments number by one and writes the new value and version number replicas in some write quorum. Every read quorum must have intersection with every write quorum. This ensures that each logical object reads at least one of the replicas written by the logical write. ACM TransactIons on Database Systems, Vol. 19, No, 4, December 1994 that version to all the a nonempty read of the most recent Quorum Consensus in Nested Transaction In this paper, Gifford’s quorum consensus algorithm proven correct in the context of nested transaction systems to invoke accesses Transaction programs, rithms to nesting but also themselves This is because tions of the user access may logical turns for to describing (even the objects out access transactions, be written arbitrary useful and if the user logical using be not transactions and the is described that permit and users for the of the algo- are not nested). as subtransac- performed logical user replication be written tasks transactions. describing themselves may different as subtransactions 539 nested only understanding programs . Systems by a logical access transactions. We show how Gifford’s algorithm can be structured in this way. Once this is done, it is very natural to allow nesting of user transactions as well. The quorum consensus algorithm is well suited to systems in which transaction failures (aborts) may occur, and is therefore a natural choice for nested transaction systems in which one of the primary goals is to mask failures. For example, if the replica accesses performed by a logical access program are invoked as subtransactions of that program, then an operation to access a logical data item can complete even if some of its associated abort. Our presentation includes a fully the read- and write-quorums dynamically. figuration, Mutchner failures. was outlined by Gifford, [1990]. Reconfiguration In this paper, data replication of distributed paper. The reconfiguration. ting where uses Two a fixed Both algorithms transaction failures reconfiguration accesses and has also been studied by Jajodia is useful for coping with site or and link we describe and prove the correctness of two algorithms for using a general framework that is suitable for a wide range algorithms. first replica developed mechanism for changing This mechanism, known as recon- algorithm is that principal algorithms configuration, are presented are are whereas possible. the configuration described the in a nested An second in this supports transaction interesting aspect information itself cated and managed by the data replication algorithm. Separating the concerns of data replication from concurrency recovery has been an important goal in the formal treatment setof the is repli- control and of replicated data management algorithms (e.g., El Abbadi and Toueg [1989]). Our presentation achieves this separation in the nested transaction setting. We present the replication issues solely in the context of systems without concurrency. We prove that a system that includes the new replication algorithm and that is serial at the level of the individual data copies looks the same, to the user transactions, as a system that is serial at the level of the logical objects. The fact that both simplify the reasoning. systems involved in this simulation are serial systems helps to Of course, systems that are truly serial at the level of the data copies are of little practical interest. However, previous work on nested transaction concurrency control and recovery algorithms [Moss 198 1; Reed 1983; Lynch and Merritt 1986; Fekete et al. 1987] has produced several interesting algorithms that guarantee that a system appears to be serial, as far as the transactions can tell. Combining any of these algorithms (at the replica level) with the replication algorithm we present yields a combined algorithm that appears ACM TransactIons on Database Systems, Vol. 19, No. 4, December 1994. 540 K. J, Goldman . unreplicated and transactions can tell. that the serial and N, Lynch (at the In fact, replication logical our algorithm guarantees “serializability” serializable e at the logical case of nested transaction rency control theory: data results can item provide be combined at the copy level, data item level. Thus, systems, an important level), as far a careful with as the proof any of the algorithm user fact that to yield a system our work formalizes, claim from classical that is for the concur- quorum consensus works with any correct concurrency control algorlthm. As long as the algorithm produces serializable executions, quorum consensus will ensure that the effect is just like an execution on a single copy database [Bernstein et al. 1987]. We are able to accomplish this clean separation in the case of nested transactions because our correctness conditions are stated from the point of view of the transactions, instead of from the point of view of the database interacting with a particular scheduler. Furthermore, the classical serializability theory correctness conditions are stated in terms of executions of the same system, separate ‘I’here proof tation this while our specification have been of replicated and proof work correctness conditions data given is based algorithms. Most by Bernstein on classical presented et al. [ 1987] 6, we show rigorous among et al. [ 1990]. the correctness follows directly from these suggests possible directions terms presentation these of Gifford’s theory. in and is the presen- basic Their of a algorithm; approach, how- easily to the case where nesting and [1984] extends Gifford’s algorithm to and offers a correctness proof. Nested In Section In Section 4 we describe the by the reconfigurable algorithm that defined of the paper is organized as follows. In system model we use for nested transaction by Fekete configurations. tion, followed notable serializability ever, does not appear to generalize failures are allowed. Also, Herlihy accommodate abstract data types transactions are not considered. The remainder summarize the are of the allowable executions. some previous attempts at 3 we give a formal definition algorithm for a fixed in Section 5. Finally, of interesting results. Section for further work. Section 2 we systems, as of configurain Section concurrent replicated 7 summarizes our systems results and 2. BACKGROUND We adopt the model of nested transaction systems constructed by Lynch and Merritt [ 1986] with the 1/0 automaton model of Lynch and Tuttle [ 1987] as a foundation. We begin with a review of the 1/0 automaton model and then describe its use in modeling nested may be found in Fekete et al. [ 1990]. transaction systems. Complete details We note that whereas classical serializability theory is designed specifically to deal with database concurrency and replica control, the 1/0 automaton model is well suited for studying a wide range of distributed algorithms. The model provides a foundation for precise problem specifications, detailed algoACM TransactIons on Database Systems, Vol 19, No 4, December 1994 Quorum Consensus rithm descriptions, results. and 2,1 The Input / Output An 1/0 initial automaton states. careful Automaton in Nested TransactIon correctness proofs, a state as well 541 . as impossibility Model A has a set of states, Usually Systems is given some as an of which are designated assignment of values as to a collection of named typed variables. The automaton has actions, divided into input actions, output actions, and internal actions. We refer to both input and output actions as external actions. An automaton has a transition relation, which is a set of triples of the form (s’, is an action. This triple means that in perform action w and change to state is called a step of the automaton. The input actions model the actions T, s), where s‘ and s are states, and rr state s‘, the automaton can atomically s. An element of the transition relation that are triggered by the environment of the automaton, the output actions model the actions that are triggered the automaton itself and are potentially observable by the environment, the internal actions the environment. Given a state a state model changes s r and an action s for which of state that T, we say that are not directly T is enabled (s’, w, s) is a step. We require that by and detected by in s‘ if there each input action is m be enabled in each state s‘, i.e., that an 1/0 automaton must be prepared to receive any input action at any time. Rather than listing triples, we describe transition relations by associating a precondition and effect with each action. For a given automaton in state s‘, if the precondition for action m is true in s‘, then the automaton may perform m and go to state s, as defined by the effect. If T is enabled in all states, the precondition is omitted. An execution of A is an alternating sequence so nl Slrz “”” w.s. of states and actions of A such that occurs as a consecutive execution is the sequence so is a start state and each triple (s’, v, s ) that subsequence is a step of A. The schedule of actions that occur in that execution. The of an behav- ior of an execution of A is the sequence of external actions of A in the execution. A behavior represents the information that the environment can detect about the execution. Since the same action may occur several times in an execution, schedule, or behavior, we refer to a single occurrence of an action as an euent. We say that execution with a schedule (behavior) schedule (behavior) @‘ of A is a prefix reachable from ~‘ of schedule in P iff P can leave A in state s if there is some a and final state s. If schedule (behavior) (behavior) s is the final B of A, state we say that in an execution state s is a of A with schedule (behavior) ~ ‘y, where ~ ‘y is a prefix of /3. We describe systems as consisting of interacting components, each of which is an 1/0 automaton. It is also convenient and natural to view systems as 1/0 automata. Thus, we define a composition operation for 1/0 automata, to yield a new 1/0 automaton. A collection of 1/0 automata is said to be strongly compatible if (1) every internal action of every automaton is not an action of any other automaton ACM in the TransactIons collection, on Database’ . (2) every Systems, Vol. output 19, No. action 4, December of 1994. 542 K. J. Goldman . and N. Lynch each automaton is not an output action shared by infinitely many automata in collection system of strongly 9. A state compatible of the automata composed an internal action that of any and (3) no action is A (possibly infinite) be composed is a tuple to create of states, a one for the start states are tuples consisting action of the composed automaton is automata. It is an output of the system is an internal action of the system if it component. has an action may automaton which component automaton, and start states of the components. An action of a subset of the component it is an output of any component. It components of any other, the collection. During n- carries an action of an if is n- of &’, each of the out the action, while the remainder stay in the same state. If @ is a sequence of actions of a system with component A, then we denote by /3 IA the subsequence of fl containing all the actions finite of A. Clearly, behavior if @ is a finite of A. There behavior is an important of the converse compatible ~ be a strongly of the collection. Suppose LEMMA 2.1.1. Let {A,},. and let A be the composition external actions of A, such that Then ~ is a finite behavior of A. /3 IA, is a finite system, then ~ IA is a result. collection of automata, P is a finite sequence of behavior of Al for every i. Let A and to implement 1? be automata with the same external actions. Then A is said B if every finite behavior of A is a finite behavior of B. One way in which this that an automaton notion satisfy some specified B is also correct. 2.2 Serial Serial Systems systems can be used A is “correct,” property. is the following: in the Then sense if another that Suppose its automaton finite we can show behaviors B implements all A, and Correctness [Fekete et al. 1990] a transaction-processing system. object are used to characterize Serial automata systems consist communicating with the correctness of transaction a serial of au- tomata and automaton. serial scheduler Transaction Serial object automata represent code written by application programmers. automata serve as specifications for permissible behavior of data objects in the absence of concurrency. They describe the responses the objects should make to arbitrary sequences of operation invocations, assuming that later invocations wait for responses to previous invocations. The serial scheduler handles the communication among the transactions and serial objects, and thereby controls the order in which the transactions can take steps. It ensures that no two sibling transactions are active concurrently and decides whether each transaction commits or aborts. The serial scheduler a transaction to abort only if its parent has requested its creation, not actually transactions been created. are run serially, Thus, in a serial system, all sets of sibling and in such a way that no aborted transaction ever performs any steps. A serial system allows no concurrency only a very limited ability to cope with ACM TransactIons on Database Systems, can permit but it has Vol. 19, No, among sibling transactions, and has transaction failures. However, serial 4, December 1994 Quorum Consensus systems provide interesting useful systems, of replication specifications from the pattern those leaf, ancestor, descendant.) The the The accesses and set of return a transaction ualues name we say that the pair only The a system type, by a set Y– (A transaction is its own ancestor each tree element Additionally, are called of the the and accesses. partition system contains type (T, v) is an operation specifies a of X. transactions will actually take steps. We imagine structure is known in advance by all components in general, be infinite and have infinite branching. classical with TO such as can be thought of as a predefine naming scheme for all that might ever be invoked. In any particular execution, some of these that the tree The tree will, object. more the concerns for transactions (henceforth simply called ualues). If T is that is an access to the object name X and u is a value, The tree structure possible transactions however, of other, control. nesting, transaction-naming so that to a particular behavior 543 . into a tree by the mapping parent, tree, we use traditional terminology, of the are partitioned accesses Systems for separating of concurrency descendant. leaves correct models of transaction of transaction names, organized as the root. In referring to this child, for and as intermediate algorithms We represent in Nested TransactIon transactions of concurrency control theory of a system. (without nesting) appear in this model as the children of a “mythical” transaction TO, the root of the transaction tree. Transaction To models the environment in which the rest of the transaction system runs. It has actions that describe the invocation and return of the classical data, and so are distinguished below the root. The function is to create directly. A serial transaction serial are described system addition Input: of a given for automaton each for each object model transactions whose but not to access data type node name, and is the of the composition transaction a serial of a tree, scheduler. A nonaccess transaction T is modeled a These as a transaction AT, an 1/0 automaton with the following external to this, AT may have arbitrary internal actions.) actions. (In CREATE(T) REPORT_ COMMIT(7”, value u‘ for T‘ REPORT _ABORT(T’), Output: system nonaccess access the at any level below. 2.2.1 Transactions. automaton internal nodes of the tree and manage subtransactions, automaton object transactions. Only leaf transactions as “accesses.” Accesses may exist REQUEST_ REQUEST_ u ‘), for every child T‘ of T, and every return for every child T‘ of T CREATE(T ‘), for every child T‘ of T COMMIT(T, v), for every return value u for T. The CREATE input action (an output of the scheduler automaton we describe later) “wakes up” the transaction. The REQUEST_ CREATE output action is a request by T for the scheduler to create a particular child transaction. The REPORT_ COMMIT input action reports to T the successful completion of one of its children, and returns a value recording the results of that child’s execution. The REPORT _ABORT input action reports to T the ACM TransactIons on Database Systems, Vol. 19, No. 4, December 1994. 544 K. J. Goldman . unsuccessful information. and N, Lynch completion of one of its children, The REQUEST_ COMMIT action it has finished its work, and includes without returning is an announcement a return any other by T that value. We leave the executions of particular transaction automata largely unconstrained; the choice of which children to create and what value to return will depend on the particular implementation. For the purposes of the systems studied here, the transactions are “black boxes. ” Nevertheless, it is convenient to require that all transaction automata preservel certain syntactic as follows. A sequence ~ of conditions, called transaction well- forrnedness, external action the following of automaton conditions (1) The first CREATE event well-formed for T, provided in ~, if any, is a CREATE(T) event, and there are no other events. (2) There is at most T’ of T. report (3) Any CREATE(T’) (4) There A~’I is transaction hold: one REQUEST_ event in ~. is at most for child one report CREATE(T’ of T‘ event in T ) event is in preceded B for each child f? for each child by T‘ REQUEST_ of T. COMMIT event for T occurs in ~, then it is preceded by (5) If a REQUEST_ a report event for each child T‘ of T for which there is a REQUEST_ CREATE(T’) (6) If a REQUEST_ event in ~. in ~. COMMIT event for T occurs A transaction well-formed sequence is always starts with CREATE(T), ends with REQUEST_ tween has some interleaving of a collection REQUEST_ CREATE( T ‘) REPORT_ COMMIT( in ~, then it is the last a prefix of a sequence that COMMIT(T, u), and in beof two-element sequences T‘, u‘ ), for various children T‘ of T. 2.2.2 Serial Objects. Recall that transaction automata are associated with nonaccess transactions only, and that access transactions model abstract operations on shared data objects. We associate a single 1/0 automaton with each object name. The external actions for each object are just the CREATE and REQUEST_ COMMIT actions for all the corresponding access transactions. Although we give these actions the same kinds of names as the actions of nonaccesa transactions, it is helpful to think of the actions of access transactions in other terms also: a CREATE corresponds to an invocation of an operation on the object, while a REQUEST_ COMMIT corresponds to a response by the object to an invocation. Thus, we model the serial specification of an object X (describing lThe concept of preserumg a property means automaton does ACM that the TransactIons on Database not its activity of behaviors violate Systems, Vol the 19, in the absence is formally property No. unless 4, December defined Its of concurrency in Fekete environment 1994. et has al. and [ 1990] done and so first. failures) (In by a serial addition to this, Quorum Consensus in Nested Transaction object Sx with aqtomaton S’x may Input: CREATE(T), Output: REQUEST_ have arbitrary Systems the following internal . external 545 actions. actions.) for every access T to X COMMIT( l’, u), for every access to X and every return T value v for T. As with transactions, is convenient tic conditions. while serial-object well-formed where T* + Tj when preserve serial-object Read/Write has an COMMIT(T1 it that every of the form ) REQUEST_ serial object COM- automaton access to X), and and function T where kind(T) data (an element actiue of D). = do. The transition CREATE(T) REQUEST_ s.actiue : accesses(X) = “write” of two components: so ,data s.actiue kind + {“read”, “write”}, function data : T G czccesses(X) : kind(T) = “write”+ return values for each access T where kind(T) = “read” an access Effect: of a sequence , u ~) CREATE(T2 i + j. We require well-formedness. associated of S’x consists = “nil”, unconstrained, Objects. One especially important class of serial-object reaci/write serial objects. Each read/write serial object Sx domain of values, D, with a distinguished initial value do. and an associated The set of possible state are. left largely Serial automata are the has an associated D, while objects for X if it is a prefix CREATE(TI) REQUEST_ MIT(T2, u2)... SAY also specific to require tb at behaviors of serial objects satisfy certain syntacLet a be a sequence of external actions of S’x. We say that a is for = T has return (either The start relation state “OK”. The or the name of an so has so .actiue is as follows: COMMIT(T, kind(T) value “nil”, D. is u), = “read” Precondition: = T s’. actiue = T REQUEST_ COMMIT(T, = “write” for kind(T) Precondition: u), s’. data = u Effect: s.active = “nil” s’. actil~e = T ~ =~~o~~~ Effect: s.actiue s.data 2.2.3 Serial =“nil” = data(T) Scheduler. The third kind of component in a serial system is the serial scheduler. The transactions and serial objects have been specified to be any 1/0 automata whose actions and behavior satisfy simple restrictions. The serial scheduler, however, ia a fully ~pecified automaton, particular to each system type. It runs transactions according to a depth-first traversal of the transaction tree. The serial scheduler can choose nondeterministically ACM TransactIons on Database Systems, Vol. 19, No. 4, December 1994. 546 . K. J Goldman to abort any transaction the transaction Lynch whose parent has not actually is requested must overlapping and N be either its execution, has requested been created. aborted before or run REQUEST_ REQUEST_ Output: The CREATE(T) COMMIT(T), ABORT(T), and COMMIT(T, ABORT(T), mark no siblings of a transition can or abort has occurred. inputs are intended T # TO u) u ), T # TO T + TO CREATE and REQUEST_ COMMIT with the corresponding outputs of transaction and correspondingly for the CREATE, REPORT_ REPORT_ as creation T # To REPORT_ REQUEST_ with The result after the commit are as follows: as long of T whose T # To REPORT_ to be identified object automata, actions CREATE(T), COMMIT(T, its creation, child to commitment T can commit. be reported to its parent at any time The actions of the serial scheduler Input: Each ABORT output the point in time actions. where The COMMIT the decision and and serialCOMMIT, ABORT on the fate output of the transac- tion is irrevocable. Thus, for each transaction T‘ involved in an execution, we have a REQUEST_ CREATE from the parent, followed by either (1) an ABORT and REPORT_ scheduler, REPORT_ Each ABORT a REQUEST_ COMMIT state from COMMIT from the scheduler from or (2) a CREATE the transaction, from and a COMMIT the and the scheduler. s of the serial scheduler consists of six sets, denoted via record notation: s.create _requested, s.created, s.commit _requested, s.committed, s.aborted, and s.reported. The set s.commit _requested is a set of operations. The others are sets of transactions. There is exactly one start state, in which the set create _requested is {TO}, and the other sets are empty. The notation s.completed denotes s.committed u s.aborted. Thus, s.completed is not an actual variable in the state, but rather a deriued variable determined as a function of the actual state variables. REQUEST_ CREATE(T) Effect: s.create _ requested = whose value ABORT(T) Precondition: s’.create_requested T E s ‘.create_requested u {T} — s’.completed T @ s’.created siblings(T) n s’.created s’.completed Effect: REQUEST_ COMMIT(T, Effect: s.commit_requested s. aborted u) = s’.aborted = s’.commit_requested U {(T, u)} u {T} REPORT _COMMIT( T, Precondition: T E s ‘committed (T, U) E s’. commit ACM TransactIons on Database Systems, Vol. 19, No 4, December 1994 _requested U) c is Quorum Consensus in Nested Transaction CREATE(T) Precondition: . Systems 547 T @ s‘ reported Effect: T G s ‘.create_requested — s ‘created s.reported T E s ‘aborted siblings(T) Effect: = s‘. reported n s‘ created c s ‘completed REPORT_ V (T} ABORT(T) Precondition: s.created = s ‘created {T} u T = s‘ aborted T @ s‘. reported COMMIT(T) Precondition: Effect (T, u) E s ‘commit _requested s. reported = s’. reported for some v [T} U T @ s ‘completed Effect: s.comrnitted 2.2.4 sition Serial = s ‘committed Systems of a strongly and u {T) Serial compatible Behaviors. A serial set of automata system consisting is the compo- of a transaction automaton AT for each nonaccess transaction name T, a serial-object tomaton S’x for each object name X, and the serial-scheduler automaton the given system type. The discussion in the remainder of this fixed, system type and serial system, with automata, and behauiors S’x for the external actions actions are called as the system’s of the define transaction(n) automata. system. and paper assumes an arbitrary, but AT as the nonaccess transaction use The COMMIT(T) m is one of the serial term serial actions and u), where actions to the ABORT(T) CREATE(T), _COMMIT(T’, u’), u ), where T‘ is a child to be T. If w is a serial to be X. If ~ is a sequence the serial for T. REPORT COMMIT(T, _COMMIT(T, We We give the name actions name _CREATE(T’), ‘), or REQUEST_ or REQUEST object(w) serial completion If T is a transaction REQUEST ABORT(T serial-object behaviors. aufor action of the form T is an access of actions, REPORT_ of T, then we to X, CREATE(T) then T a transaction we define name, and X an object name, we define ~ IT to be the subsequence of ~ consisting of those serial actions w such that transaction(w) = T, and we define B 1X to be the subsequence of ~ consisting of those serial actions n such that object(m) = X. We define serial(~) If @ is a sequence to be the of actions subsequence of ~ consisting of serial actions. and T is a transaction name, we say T is an orphan in ~ if there is an ABORT(U) action in @ for some ancestor U of T. We say the T is liue in ~ if F contains a CREATE(T) event but does not contain a completion event for T. LEMMA 2.2.4.1. Let ~ be a serial behavior. (1) If T is live in ~ and T’ is an ancestor (2) If T and T‘ are transaction then either T is an ancestor 2.2.5 Serial ness condition Correctness. that ACM T’ is live in fl. names such that both T and T‘ are live in /3, of T‘ or T‘ is an ancestor of T. The we expect of T, then serial other, TransactIons system more is used efficient on Database to specify systems Systems, Vol the to satisfy. 19, No. correctWe say 4, December 1994. 548 K. J. Goldman . that a sequence provided interested ~ and N. Lynch of’ actions data objects, interacts with believe corresponds ought serially correct for transaction name serial to behave. all finite behaviors of these root transaction representing correctness precisely to Serial the to be a natural intuition correctness of how for T that ~ IT = y IT. We are representing replicated where each replica uses a concurrency control method a controller that passes information between transactions objects. We will show that correct for TO, the mythical We is that there is some serial behavior y such in studying particular systems of automata and and systems are serially the environment. notion of’ correctness nested transaction T is a condition that that systems guarantees to implementors of T that their code will encounter only situations that can arise in serial executions. Correctness for To is a special case that guarantees that the external world will encounter only situations that can arise in serial executions. 3. CONFIGURATIONS If S is an arbitrary (c. read, subsets with set, we define a configuration c.write), where each of c.read of S, and where every element every element of c.write. (We will c of S to be any and c.write is a nonempty of c. read has a nonempty sometimes refer to c.read as the sets of read quorums and write quorums, respectively, c.) We write corzfzgs(s) for the set of all configurations of S. For example, if the set S consists of three elements configuration c has c.read = {{x}, {y}, {z}} and c.write pair, collection of intersection and c.write of configuration x, y, and z, then one = {{x, y, z}}. This con- figuration corresponds to the read-one/write-all replication strategy. Another configuration has c.read = {{x, y}, {.x, z}, {y, z}} and c.write == {{x, y}, {x, z}, {y, z}}, corresponding 4, to a read-majority/write-majority FIXED-CONFIGURATION QUORUM strategy. CONSENSUS In this section, we present and prove the correctness of a fixed-configuration quorum consensus algorithm. For simplicity, we carry out the formal development in this paper for the replication of a single object, which we call X. The same arguments First, Section apply to any finite number of objects. 4.1 defines system M, a serial system that has X as one of its object names. In system .@, the object automaton associated with X is a read/write serial-object automaton Sx, representing the logical object to be replicated. System .w’ is used in order to define system correctness. Then Section 4.2 defines the replicated serial system @. System ,% is identical to d, except that logical object Sx is implemented as a collection, SY, Y E ~, of objects, and each logical read (or serial read/write objects that we call replica write) of X is replicated as a subtransaction that performs multiple read and write accesses to some subset of the replicas according to the quorum consensus algorithm. Section 4.3 shows that the algorithm is correct by demonstrating a relationship between W and .@; this proof is easy because both systems are serial. ACM TransactIons on Database Systems, Vol 19, No. 4, December 1994 Quorum Consensus 4.1 System We begin serial .W The Unreplicated by defining system Serial unreplicated having in Nested Transaction Sx associated and initial value with . 549 System serial a distinguished automaton Systems system object W. System name X is a read/write X, serial .@ is an arbitrary in which object the with object domain D do. We define la, and la,O to be sets containing the respective names of the la read and write accesses to X, in W, and define la = la, U la,,. (Here, stands for “logical access.”) Since each transaction name T = la, is a read access to Sx, the definition of a read/write serial object says that kind(T) = read for each such T. Also, each transaction name T = la LOhas kind(T) = write, and also has an associated value, data(T) G D. 4.2 System Q?: The Fixed-Configuration Here we describe system rithm with configuration. a fixed S’x is replicated; that .%, which data the Algorithm quorum A??is a serial as several system serial just one. The accesses to X called logical access automata the quorum consensus algorithm quorums, as long as each read logical Consensus represents System is, it is implemented SY, Y = ~, rather than subtransaction automata every write quorum. the replicas in some write-LA writes to property guarantees Quorum consensus algo- in which object objects (replicas), are implemented (LA’s). The LA’s to manage the replicas. quorum has a nonempty as use We allow arbitrary intersection with In order to read the logical object, a read-LA accesses all read quorum, while in order to write the logical object, a all the replicas in some write quorum. The intersection that each read-LA receives at least one copy of the latest value. Since reads of different replicas could return mechanism is needed to distinguish the most recent values. Thus, each nonnegative integer replica in the version-number different data values, a value from other possible quorum consensus algorithm maintains a in addition to its data value. Successive write-LA’s use successively larger version numbers, and a read-LA selects the data value associated with the largest version number it receives. This strategy requires each write-LA to learn what version number it is supposed to write, previously preliminary which used. read in In turn order requires to learn of a read-quorum it to learn the largest version number this number, each write-LA first does a of the replicas. Since both the read-LA and the write-LA programs must perform reads of a read quorum of the replicas, it is helpful to have a subtransaction that performs such a read. This subroutine can be invoked by a read-LA to perform nearly all of its work, and can also be invoked by a write-LA to perform its preliminary read. We can describe such a subroutine in the nested transaction framework by introducing two extra levels of nesting. Thus, read-LA and write-LA transactions will have children known as readcoordinator transactions, which will be responsible for reading from a read quorum of replicas. Write-LA transactions will also have children known as write-coordinator transactions, ACM which Transactions will be responsible on Database Systems, Vol for 19, No. writing 4, December to 1994 a 550 K. J. Goldman and N. Lynch . write quorum turn, have replicas. of replicas. children Read-coordinators that are and individual read write-coordinators and We begin by defining the type of 9, and then give appear in W but not in ~. The new automata are the coordinator and LA transaction automata. We define up,” starting with the replicas and the configuration 4.2.1 System Type. The type of system write will, accesses to in the the new automata that replica objects, and the these automata “bottom automata. .% is defined to be the same as that of M, with the following modifications. First, the object name X is now replaced by a new set of object names ~, representing the replicas. (The replicas will transaction store both names, co, a value and version number.) There are some new read- and write-coordinators, COW representing acca,, which are the read and write accesses to the and respectively, and ace, and replicas, respectively. We let co = co, U co,,, and ace = ace, U accu,. The transaction names in la are accesses in w-, but in # they are not; each that are in co,, while each transaction name in lar now has children transaction name in la ~, now has children in C07 and children in col~,. Also, each transaction name in co~ has children in acc~, and each transaction name in co,” has children in accU,. These are the only changes to the system type—for parents example, of the no new transaction names transaction names. other The than naming those indicated structure for are logical accesses and their descendants is shown in Figure 1. In system .&, some information is associated with some of the transaction names; for example, each T in la (that is, each access to X) has hind(T) indicating whether the access reads or writes, and write accesses have ciata(l”) indicating information from action name the value to be written. In Q?, we keep all such associated .@ and add more information as follows. First, every trans- T G COU has an associated version number uersion-number(T) for X and the associated version value data(T) G N. These denote, number to be written G D and an associated respectively, the value by the write-coordina- tor in the quorum consensus algorithm. Second, we associate tion with the accesses to the replicas. Namely, we assume that some informaevery transac- tion name T E ace, has kind(T) = read, and that every transaction name ● N X D, the version number T E ace,,, has kind(T) = write and data(T) and value to be written.z Throughout the rest of this section, we let c denote a fixed configuration of y. 4.2.2 Replica Automata. A replica automaton is defined for each object name Y ~ ~. Each replica is a read/write serial object that keeps a version number and a value for X. More formally, the object automaton for each Y = ~ in system Q? is a read/write serial object for Y with domain N x D 2The transaction names in czccWname accessesto read/write objects. Hence, to agree with the definitions in Section 2.2.2, the version number and data values are combined into a single data attribute. ACM Transactions on Database Systems, Vol. 19, No. 4, December 1994 Quorum Consensus in Nested Transachon Systems . 551 la, Cor Fig. 1. Logical accessesand their descendants. accr i CREATE(T) Effect: s.awake = true REPORT_ABORT(T’) Effect: s.mparted = e’.mported U {T’) v.wake = true s.reparted = REQUEST.CREATE(T’) Precondition: /awake = true ~ f? s’. rcquesfed Effect: s.requested = /.wque9tcd s.read = s’.read {2”) REQUEST_COMMIT(T,v) Precondition: s’. ou,ake = true s’. requested= a’. reported s’.read) (37? g c.read)(??s v = (9’. uersion-number, a’. value) Effect: a. awake = Jal$e U {T’) REPORT.COMMIT(T’,V) Effect: s.reporfed = s’.reported s’. repartedu U {~} U {object(T’)) if u.uervion-number > s’. uersion-number then 9.uer9i0n-number = u.uersion-number s.wdue = u,ualue s.reported = 9’. reported U {T’ ) s.read = s’. read U {object(T’)) if u.uersion-number > s’. uer9ion-number then s version-number = u.uersion-number s.ua I ue = v.ua 1ue Fig. 2. Transition relation for Read-Coordinators and initial value (O, do). For u = N x D, we use the record u.uersion-nurnber and v.ualue to refer to the components of U. 4.2.3 Coordinators. In this section, we define .27, the transaction automata that are replicas. We first define read-coordinators. is to determine the latest version number the coordinator notation automata in invoked by the LA’s and access the The purpose of a read-coordinator and value of X, on the basis of the data returned by the read accesses it invokes. A read-coordinator for transaction name T G co, has state components awake, value, version-number, requested, reported, and read, where awake is a Boolean variable, initially false; ualue G D, initially do; version-number EN, initially 0; requested and reported are subsets of children(T), initially empty; and read is a subset of ~, initially empty. The set of return values VT is equal to IV x D. The transition relation for read-coordinators is shown in Figare 2. Recall that c is a fixed configuration. This is used to determine when the coordinaACM TransactIons on Database Systems, Vol. 19, No. 4, December 1994. 552 tor K. J. Goldman and N. Lynch . has recent collected value enough information to be sure of X. A read-coordinator collects of having data from seen replicas the most for object names in ~, and keeps track of the value from the replica with the highest version number seen so far. Whenever the read-coordinator reaches a state in which (1) some read quorum, according to c, is a subset of the replicas it has seen (i.e., those in s‘. read) and (2) it has learned of the fates (i.e., commit or abort) of all of its requested request to commit We have written In any number practice, replica at a time, earlier access transactions, one and to that then the read-coordinator may its data. the read-coordinator ism. It may request active. child and return would then probably request replica. with a high degree of accesses to replicas, start another There is also at of nondetermin- at any time most one only one after a natural while access the failure trade-off in to it is each of an practice between request request two methods of choosing which replicas to access at first: one could initially only enough access to complete one read quorum, and then others if these fail to commit in reasonable time, or else one could request accesses to all replicas concurrently. The first method communication cost in the best case (when all accesses complete and the second reduces latency in the bad case where some Each of these mentation possibilities, and of the automaton many given others, here, could obtained minimizes successfully), accesses fail. be modeled by restricting as an implenondetermin- ism. We next write define a given write-coordinators. value to a write The purpose quorum write-coordinator for transaction requested, reported, and written, false; requested and reported are ~~ritten is a subset of ~, initially of a write-coordinator of replicas for object names is to in $’. A name T G CO,Ohas state components awake, where awake is a Boolean variable, initially subsets of children(T), initially empty; and empty. The set of return values VT is equal {“OK”}. to The transition relation for a write-coordinator Figure 3. When created, a write-coordinator replicas for object names in j’, overwriting at the After mit. replicas writing 4.2.4 with its to a write own named ( version-number(T) quorum, the T G co,,, is shown begins invoking write the version numbers and write-coordinator data(T), may in accesses to and values respectively). request to com- Access Automata. Now we define logical access automata, read-LAs. The purpose of a read-LA for a transaction name ia to perform a logical read access to X. A reaci-LA has state T E la, where awake components awake, L~alue, requested, reported, and value-read, and ualue-read are Boolean variables, initially false; ualue E D I., {nil), initially nil; and requested and reported are subsets of children(T ], initially beginning Logxal with empty. The set of return values VT is equal to D. The transition relation for read-LAs is shown in Figure 4. The read-LA invokes any number of read-coordinators. After receiving a REPORT_ COMMIT from at least one of these coordinators, and receiving reports of the completion of all that were invoked, the read-LA may request to commit, ACM TransactIons on Database Systems, Vol 19. No 4, December 1994 Quorum Consensus in Nested Transaction CREATE(T) Effect: a.awake = true 9.awake= .9.reported T’) 553 U {~} T,,J) 9’. awak-e = true s’. reque.rted = St. reported = (wersion-number( Z’), data(l”)) (3W Effect: v= s. requested= = s’. mporkd REQUEST.COMMIT( Precondition: T’ $ s’.reguested data(Z”) . REPORTABORT(~) Effect: s.rwported = s’. reported U {T’) true REQUEST.CR.EATE( Precondition: 8’. arvake = true Systems s’. requested u {T’} e c.write)(W ~ s’.writtar) “OK” Effect: s.riwcike = fa/9e REPORT. COMMIT(T’,W) Effect: .9.reparted = s’, m-ported U {T’) smitten = s’, written U {object(T’)} s.reported = .s’. reportedu s.written == s’. writfen Fig. 3. {T’} U {object(T”)) Transition relation for Write-Coordinators. CREATE(T) Effect: s.awake = b-we s.awwke = true REPQRTABORT(~) Effect: s.reportecf = s’.reported U {~} REQUEST.CREATE(~) Precondition: s~.awake = tr-ue F ~ s’. ~guested Effed : REQUEST’-COMMIT( Precondition: s’. awake = true s.requested s.mperted ~’.value-reod v = = s’. requested U {T’} s.ualue-read u {T’) T,V) s’. r-eqaested = s’. reported = true s’. volue Effect: s.arvake= false REPORT.COMMIT(V,V) Effect: s.reparted = s’. mporfed U {~} if s’. value-rwd = fake then s. value = v.value s.rqwrted = s’. reported = true = s’.reported U {T’} if s’. value-read = false then s.rralue = v. value s.value-rmd = true Fug. 4. Transition relation returning the value component of the data nator. We next define write-LA’s. The purpose a logical write name in la ~ is to perform for Read LA’s. returned by the first read-coordi- of a write-LA for a transaction access to X on behalf of a user transaction. A write-LA has state components awake, value-read, value-written, version-num her, requested, and reported, where awake, value-read, and inivalue-written are Booleans, initially false; version-number e N u {nil}, tially nil; and requested and reported are subsets of children(T), initially empty. The set of r-eturn values ACM VT is equal Transactions to {“OK”}. on Database Systems,Vol 19, NO 4, December 1994. 554 K. J. Goldman . and N. Lynch R.EPORT_COMMIT(~,u), CREATE(T) Effect: 9.awake = true Effect: a’. vo!ue-written = true s.reported = s’. reported a awake = true REQUEST-CREATE( Precondition: T’), T’ c CO. .9’. value-written REPORT.ABORT(T’) Effect: s.reported g’. awake = true T’ @ a’. requested Effect: s.requested = s’. reguestedu {T’} s.reported s.wlue.rwd s.reported 9.uer9i0n-number REQUEST-CREATE( Precondition: u {~) U {T’) T,U) a’. awake = true = s’. reported s’. ualue- written = true u = ‘OK” = s’. reporfedu s.ualue-read U {T’} = s’. reported = s’. reported a’. requested = true Effect: s.awake = ja/9e {T’) = false U {T’} = true REQUEST.COMMIT( Precondition: REPORT_ COMMIT (Z’’, v), T’ G COr Effect: s.reported = s’. reported u {T’] if s’, ualue-read = jalse then s.uer.sion-number = v.ucrs:on-number if s’. ualue-read P G COW = s’. rtported s.mported then = u.ucr9ion-number = true T’), T’ c COW s’, au, ake = true = true s’. wdue-rmd data(T’) uersion. = d.t.(T) nurnbcr(T’) = ~’. uersion-nutnber + 1 T’ ~ s’. requested Effect: s.reguested = s’. mquestedu {T’} Fig. 5. TransitIon relatlon The transition relation for write-LA’s invokes any number of read-coordinators. MIT from for Write-LA’s, is shown in Figure 5. A write-LA After receiving a REPORT_ COM- one of the read-coordinators, the write-LA remembers the version number returned. The write-LA then invokes any number of write-coordinators, using a version number one greater than that returned with the first REPORT_ COMMIT of a read-coordinator, along with its particular data value. In order for the write-LA to request to commit, it must receive at least one REPORT_ 4.3 COMMIT Correctness In this of a write-coordinator. Proof section, we prove the correctness of the quorum consensus algorithm given above as system 4?. That is, we show that @ is indistinguishable from .w?;to the transactions and objects the two systems have in common. We begin by stating some definitions that are useful for reasoning of logical accesses in an execution of W. In each definition, actions First, of @. the logical defined to REQUEST_ ACM Transactions access sequence of ~, denoted about the sequence @ is a sequence of logical-sequence( be a subsequence of ~ containing the CREATE(T) COMMIT(T, u) events, where T G la. That is, the logical on Database Systems, Vol 19, No 4, December 1994 ~), is and access Quorum Consensus sequence to is the sequence in Nested Transaction of requests and responses Systems for the 555 . logical accesses x. Next, if D is finite, REQUEST_ transaction name REQUEST_ the logical the logical Finally, carrying is in laW that COMMIT the is defined last occurs event among next correctness properties write logical-value(B) u) occurs to be either REQUEST_ COMMIT data(T) event if for a in logical-sequence( /3), or do if no such in ~). logical-sequence( In other words, value is the value of the last logical write (or the initial value of object if no such write occurs). if B is finite, then current-un( /3) is defined to be the highest version-number led to by /3. The then COMMIT(T, lemma is the states the key of all replica to the automata proof in the global of Theorem 4.3.7, state the main theorem. Condition (1) of the lemma, which enumerates some on the state led to by particular schedules of Y, is only needed for out the inductive quorum any replica in which with data. The more each read-LA argument. every the current important returns These replica version part the properties has the current number (a) there number also has the logical of the lemma value say that version is condition associated with the value (2), which previous is a and (b) as its says that logical write. (That is, each read-LA returns the logical-value of the logical object.) The schedules of J% which we consider are those in which no logical access to X is active.3 The fact that .%’ is a serial system makes the reasoning simple (though detailed). LEMMA 4.3.1. X is active Let (1) The following (a) /3 be a finite There with properties exists (2) hold a write obJ”ect name v.version-number (b) of 9?’ such that quorum ~ state WrE c.write if v is = current-vn( no logical access to led to by B: such the data that for any component replica of SY, COMMIT(T, is a serial SY then B ). if v is the data component of SY and D ) then v.value = logical-value( /3). If B ends in REQUEST_ value(~). Since in any global Y E Y/, For any replica SY, number = current-vn( PROOF. schedule in P. system, v) with Lemma T E la,, 2.2.4.1 then shows v.version- v = logical- that logical- sequence( B) consists of a sequence of pairs, each of the form CREATE(T) REQUEST_ COMMIT(T, v), where T G la. We proceed by induction on the number of such pairs. For the basis, suppose logical -sequence( ~ ) contains no such pairs. Then logical-value(B) = dO. Since @ contains no CREATE(T) events for T ~ la, it contains no REQUEST_ COMMIT events for accesses in acc; therefore, = dO, this since is also all replicas the case in initially any have global version-number state led to by = O and value ~. It follows that 3Recall that a transaction name is active in a sequence if the sequence contains a CREATE action but no REQUEST_ COMMIT action for that transaction. ACM Transactions on Database Systems, Vol. 19, No. 4, December 1994. 556 K. J. Goldman and N. Lynch . cw-rent-vn( ously. For /? ) = O. This the pairs, inductive and assume implies step, that condition suppose (1), and that the lemma condition (2) holds vacu- contains k >1 k – 1 pairs. That logical-sequence(~) holds for sequences with is, let /3 = /3’ y, where logical-seguence( y ) begins with the last CREATE event in logiccd-sequence( ~), and assume that the lemma holds for ~‘. (Note that Lemma 2.2.4.1 sequence(y) shows that LIf only in terms Any This action follows _ COMMIT(T,,, The following subtree access _COMMIT PROOF. type. of the transaction 4.3.2. REQUEST access to X is live )REQUEST of the appropriate and Claim no logical = CREATE(7’f or for some Tf G kz us to argue ❑ which in y is a descendant @ is a serial ~ ‘.) So, logical- for enables at T~. coordinator appears since observation rooted in ZJf ) a CREATE or of Tt. ❑ system. Now, since Tt has a REQUEST_ COMMIT in y, we know by the definition of Tt that there is at least one REl?ORT _ COMMIT event for a transaction in co, with the first name in co, in y. Let T‘ be the read-coordinator REPORT_ portion is in while COMMIT in y; by Claim of y up to and including la,, then the system if T~ is in la U,, the before receiving T’ type shows that code shows that a REPORT_ of any write-coordinator no REPORT_ COMMIT 4.3.2, G children(T~ the REPORT_ COMMIT it invokes Tt invokes ). Let event about Thus appear in y‘. Therefore, descendants of Tt that the states y‘ for be the T‘. If T~ no write-coordinator, no write-coordinator for read-coordinator. or its descendants events in y‘ for accesses. We now give two claims COMMIT of the automata no actions there are are write for T~ and T’. Claim 4.3.3. Let s be the state of the transaction automaton associated with T‘ in any global state reachable from ~‘ in /3 ‘y’. Then s.version -n umber and s.value contain the highest version-number and associated ualue among the states led to by ~‘ of the replicas whose names are in s. value-read. This PROOF, reported by claim a read REPORT_ COMMIT y‘, the version-number must be the same holds because access, T‘ retains the maximum with associated together its version-number value upon each of a read access. Again, since no write accesses occur in and value components of all replicas observed by T‘ during y‘ as in the state led to by ~‘. ❑ Claim 4.3.4. Let s be the state of the automaton associated with Tt in any global state reachable from ~ ‘y’ in ~. Then s.uersion-number = currentvn( /?’) and s.ualue = logical-value( ~ ‘). PROOF. MIT event Let s‘ be the state of T‘ just before it issues in y‘. By the definition of a read-coordinator, its REQUEST_ s‘. read must COMcontain a read quorum ~ = c.read. By l(a) of the inductive hypothesis, there is some for object names write quorum ZI” ● c.write such that the states of all replicas in W in any global state led to by ~‘ have version-number = current-vn( /3’). Since c is a configuration, .9? and W“ have a nonempty intersection. So s‘. read ACM TransactIons on Database Systems, Vol. 19. No, 4, December 1994 Quorum Consensus must contain at least number = current-un( s‘ value = logical-value( When T‘ nents Of T’ reports are one object ~ ‘). m Nested TransactIon name Therefore, Systems in %. So, by Claim by l(b) of the 557 . 4.3.3, inductive s‘ .uersiorzhypothesis, /?’). its commit to T’f, the version-number and value compoequal to s‘ version-number and s‘. value, respectively. BY set definition of T~, these ❑ 4.3.4 is proved. components We now return to the possible cases for Tf. main are never proof again of Lemma modified. 4.3.1, Therefore, and consider Claim the two Then logical-value(@) = logical-value( ~‘ ) by definition. Also, (1) Tf G la,. since Tf invokes only read-coordinators, which in turn invoke only read accesses, the uersion-number and value components of the states of the replicas in any global state led to by ~ are the same as in any global state led to by holds ~‘, and so current-un( ~ ) = current-vn( fl’ ). Therefore, By definition, coordinators its commit, Tf cannot request commits. Since the REQUEST_ to commit until Then (2) Tf ● la,.. about the information at least T‘ is the first read-coordinator COMMIT for Tf must occur D ‘Y’. When Tf requests to commit, it returns state. By Claim 4.3.4, this value is logical-value( condition (2) holds for ~. ing condition (1) for ~. logical-value( associated one of its read- child that reports at some point after the value component of its ~‘ ) = logical-value( B). Thus, /3 ) = data(Tt ). We first give two claims with the descendants of Tt invoked dur- y. Claim 4.3.5. nuntber(T) Let PROOF. event that + 1 and ordinator PROOF. T is s be the occurs data(T) until REQUEST_ Claim 4.3.4, Claim data(T) If = current-un( a write-coordinator + 1 and ~‘) state of T+ just in y. By definition, = data(T~ at least ). Also invoked data(T) before the by definition, The type @ version- CREATE(T invoke reports a write-co- its commit. So, all for write-coordinators in y occur after ~ ‘y’. = current-vn( ~ ‘). Thus, Claim 4.3.5 holds. is constrained follows from so that Claim ) in $Y invoked T is invoked By ❑ in y, then by 4.3.5 and the definition some of a ❑ By definition, REPORT _CONIMIT write-coordinator be the portion cannot request write accesses of The result then = s.version-number T~ cannot 4.3.6. If T is a write access for an object = (current-vn( ~‘) + 1,data(Tf )). write-coordinator. write-coordinator. Tf, REQUEST_ version-number(T) one of its read-coordinators CREATE events s.uersion-number by = data(T~). Tf cannot request to commit until from at least one of its write-coordinators. it Let receives a TU, be the child of Tf that has the first COMMIT event in y, and let 5 of y up to and including COMMIT(T,O ). By definition, TU, to commit until it has received REPORT_ COMMIT events for to a write quorum of replicas. Therefore, by Claim 4.3.6, it ACM TransactIons on Database Systems, Vol. 19, No. 4, December 1994 K, J. Goldman 558 . follows that any global and N. Lynch current-vn( state f?’8 ) = current-un( We now show that condition (1) still By Claim 4.3.6, any write-coordinators propagate the new value and version may execute ,6’) + 1, so condition (1) holds in led to by ~‘ 8. in y after 8 cannot holds in any global state led to by /3. that may execute in y after 8 merely number. Any read-coordinators that change the values at the replicas, since they do not invoke write accesses. Therefore, condition (1) holds after ~. Since T~ does not name a read-LA, condition (2) holds vacuously. Thus in both cases, the lemma holds. (This ends our proof of Lemma 4.3.1.) ❑ Now we distinguish prove that @ is between replicated tem .@. Here, respectively. THEOREM schedule correct serial let 7A and %~ denote 4.3.7. y of& Let ~ such that be a finite ylT = PIT for each transaction (2) YIY = P IY for each object We construct the transaction schedule the following (1) PROOF, by showing that transactions cannot system E? and unreplicated serial sys- tion names y by removing T in co U ace. Clearly, It is easy to see that, H. object names Then there of @, exists a hold: T e Y< – la, and Y G .7A. from CREATE(T), REQUEST_ COMMIT(7’, REPORT_ COMMIT(T, u), and REPORT theorem hold. It remains to show it suffices to show that y projects of two conditions name name and ~ all REQUEST _CREATE(T), u), COMMIT(T), _ABORT( T ) events the two conditions ABORT(T), for all transac- in the statement that y is a schedule of d. By Lemma to yield a schedule of each component for each nonaccess schedule of the transaction automaton Y E %~ – {X}, y IY is a schedule of the transaction name for T, and for object automaton of the 2.1.1, of w“. T of M, y IT is a each for object name Y. Since we construct y by removing all the actions associated with a specific set of transactions, and since none of these transactions have children with actions in -y, the fact that ~ is a schedule of the serial scheduler implies that y is as well. It remains to consider the automaton for object name X: we must show that y IX is a schedule of S’x, where Sx is the serial object associated with X in x?. We proceed by induction on the length of P. The basis, where /3 is of length O, is trivial. Suppose that ~ = ~ ‘m, where m is a single event and the result is true for /3’. The only case that is not immediate is the case where m is a REQUEST _COMMIT event for a transaction name in la. So suppose that m = REQUEST_ COMMIT(T, u), where T G la. Let y = y ‘w; by the inductive hypothesis, y‘ IX is a schedule of Sx. Let s‘ be the state of S’x led to by y’lx. One precondition for REQUEST_ COMMIT(T, U) as an output of Sx is that T E s ‘created. By transaction well-formedness of ~, CREATE(T) occurs in /3’. By the construction, CREATE(T) occurs in y‘, so that this precondition is satisfied. We consider the two possible cases for T in order to show that the ACM TransactIons on Database Systems, Vol. 19, No. 4, December 1994. Quorum Consensus remaining clauses of the in Nested Transaction precondition for Systems REQUEST_ 559 . COMMIT(T, u) are satisfied. (1) T ● laW. which Then satisfies the definition the remaining (2) T G la,. Then by of a write-LA implies that u = “OK”, precondition. the construction, y ‘IX= there is a REQUEST_ COMMIT for a write last such write access. Then s‘ data is equal logical-sequence( /3 ‘). If access to Sx in y‘, let T‘ be the to datcz(T ‘). By the construction, we know that T‘ is also the name of the logical write access in U? having the last REQUEST_ COMMIT in ~‘. Therefore, data(T’) = Jogical-value( ~ ‘). So the data component of the state of Sx led to by -y’ is logical-value( @‘). Sx On the other in y‘, then hand, if there is no REQUEST_ COMMIT for a write s ‘data is dO, which is logical-vahze( /3 ‘). Therefore, component of the state of Sx led to by y‘ is again logical-value( Thus, in either case, logical-value( ~‘) is the data component Sx in any global state value v = logical-value( Sx led to REQUEST_ Therefore, by y’, which implies that the COMMIT(T, v) in Sx is satisfied. ❑ y is a schedule of ~. QUORUM Now we present implementations algorithm are our place as the that a transaction object of in that the holds remaining precondition for CONSENSUS with ultimate must interact correctly of a logical access. Thus configuration /3 ‘). of the state led to by y‘. By Lemma 4.3.1, we know that the return ~ ‘). Thus, v is the data component of the state, s‘, of 5. RECONFIGURABLE configuration processing access to the data reconfiguration. For the concurrent goal, the activity of changing the with each reading the reconfiguration transaction nesting the current configuration during needs to take its structure, configuration. This accessing might a be done with reconfiguration transactions as children of TO; however, the correctness arguments work equally well no matter which existing transaction is taken as parent of each reconfiguration transaction. Thus we allow each reconfiguration to occur at an arbitrary location in the transaction tree. The presentation and proof are carried out in stages. First, in Section 5.1, we define system ~, a serial and that also has another automaton Sx associated The object simply automaton receives system Sz associated requests that has X as one of its object from with Z is a special transactions to change dummy the tion and responds to them with a trivial acknowledgment. M is used to define correctness. Then in Section 5.2, we define replicated serial system identical Section automaton names4 special object name Z. As in Section 4, the object with X is a read/write serial object automaton. object, current As before, system System $3’ is to .@, except that logical object Sx is replicated as in system 4, and also dummy object Sz is replaced by a configuration L?? of object Sx, which is a read/write object containing ~. which con&ura- the current con figura- ~In this section, we redefine certain notation from Section 4, such as the system names d and .@. ACM Transactions on Database Systems, Vol. 19, No 4, December 1994 560 K. J. Goldman . tion of replicas object and N, Lynch of X. In ~, Say. In the course the configuration reconfiguration of performing object S-y is read requests involve each logical to determine read a single or write the current write of access to X, configuration. In Section 5.3, we show that the new algorithm is correct by demonstrating relationship between @ and .&. The proof is analogous to that in Section the changes involve the handling of the new configuration object. Next, in Section except that the 5.4, we define configuration serial object system ‘3’, which is replicated. That is the is, the same a 4.3; as &?’ configuration object SF from @ is replaced in % by a collection of serial read/write objects that we call configuration replica objects, and each access to a configuration object is implemented by a subtransaction of the configuration in replicas. an interesting Section turn way on the 5.5, we prove implies 5.1 System We begin The data connection defining the performs object with Sz domain II associated below. As before, we define Z System with serial la, value Dummy 3, in which in system Reconfiguration .&. System dummy ~“ is an object names, X and Z, in X is a read/write serial dO, and in which is a special and %“ and Finally, & and x7. unreplicated and initial with is based scheme. between arbitrary serial system having two distinguished which the object automaton Sx associated with ton management management between Serial accesses to a subset replica relationship .c#: The Unreplicated by replica a simulation a similar that configuration the object object la,,, to be the respective automa- automaton, names defined of the read and la,,, to be the set of names of write accesses to X. Now we also define As before, accesses to the dummy object Z. We define la = la, U la,,, U la,,,.5 each transaction name T E la, has kind(T) = read, and each transaction name T = laW has kind(T) = write and also has an associated value, data(T’) e D. Recall that system .@ is used to define defined at the transaction same transaction interfaces correctness, and that correctness is boundary. Thus, it is desirable that we use the in ,cf as we do in the later systems, Q? and %. Since those systems will permit arbitrary transaction automata to invoke reconfiguration operations, we also allow them to do so in w’. However, since X is not replicated in .v”, these operations will not do anything interesting in .&”. We simply make them operations on the dummy object, which merely responds “OK” in all cases.G 5Notc that umon of GIt this la, may seem transaction ration is different artificial to automaton approach, which where augmentation than settle ACM application instead Transactmns and .&” would be defined, the the place interface to users invisible .w’. In this from defimtion in Section 4, where la was just defined to be the in the lau. and would the in invocations .W; an apphcatlon programmers be defined represent just made We leave for the simpler alternative on Database Systems, reconfiguration would by not the of m with details 19, No. of this having added to make them approach 1994, these appear database explicitly in .cY”’would operations, system to interested requests reconfigu- system reconfiguration of the reconfiguration 4, December transactions be 4. An auxiliary by programmers of including Vol the approach as it is m Section an augmentation is presumably programmers. of alternative m C/ readers, rather and Quorum Consensus CREATE(T) Effect: s.actiue = T s.acfiue = T in Nested Transaction REQUEST.COMMIT( Precondition: Systems 561 . T,W) T = a’. actiue v = “OK” Effect: s.actit, e = nil Fig. 6, 5.1.1 Dummy Dummy Object Automata. object More automaton. precisely, we define a single dummy object automaton Sz, which is a serial-object automaton having a single state U {nil}, initially nil. The set of return values component, actiue G accesses(Z) VT for accesses to Z is equal to {“OK”}. The (trivial) transition relation for Sz is shown in Figure 6. 5.2 System .%: The Reconfigurable with a Centralized In this system Y ~ ~, section, Quorum Consensus Algorithm Configuration we define replicated serial system ~. This is similar to .2? of Section 4 in that object Sx is implemented by replicas SY, and the accesses to X are implemented as subtransaction automata called read-LA’s and write-LA’s, using the same basic strategy as in the fixed-configuration algorithm. Additionally, in the new M, the dummy object Sz is implemented by a configuration object Sx, which is a read/write serial object that maintains a configuration, and the accesses to Z are implemented as subtransactions the configuration LA’s will invoke We first in S tion define but reconfigure-LA’s the type these and that of K?, and then not in .ti. The new automata object, define called invoke accesses to S4Y. (In %’, objects themselves will be replicated, and the reconfigurecoordinator subtransactions to read and write them.) the automata coordinator “bottom and give the new automata are the replica LA objects, transaction that appear the configura- automata, We again up.” 5.2.1 System Type. The type of system @ is defined to be the same as that of .&, with the following modifications. As in Section 4, the object name X is now replaced by a new set of object names ~, and there are new read- and write-coordinators, transaction names, co, and COU,,representing respectively, and ace, and accW, which are the read and write accesses to the replicas, object respectively. name ~, and Additionally, there object are new are the read and write accesses the configuration accesses in this coordinator automata.) We let co The transactions names in la name transaction Z is now names replaced CO,C and by a new CO,U ~, which to object name ~, respectively. (We denote way, because in % they will be replaced by = co, u COW U corC U col~)~. are accesses in JYj but in @ they are not; each transaction name in lar now has children that are in CO,Cand co,; each transaction name in laU, now has children in COrC, Cor, and COU,; and each now has &il&n in CO,C, COW,, CO,, and cow,, &o, transaction name in la,,, as in Section 4, each transaction name in Cor has children in ace,, and each in accu. These are the only changes to transaction name in COU,has children ACM Transactions on Database Systems, Vol 19, No 4, December 1994 562 K, J, Goldman and N. Lynch . la, law 3 Com Core Cor co Occr co r )# accw accr % Fig. 7. Logical accessesand their descendants. the system type—for example, no transaction names other than those indi- cated are parents of the new transaction names. The naming structure for logical accesses and their descendants is shown in Figure 7. In defining .@, we associated information with some of the transaction names. In S, we keep all such information from .@ and add more information as follows. First, to denote we associate with the new every configuration transaction for each logical name T G la,,, in conf’’gs(~), the set of all configurations Second, to denote the configuration that we assume that every transaction name of j?’. is used reconfigure a configuration access, config(T ) by each read-coordinator, T G co, has an associated config-ura- tion config(T ) E corzfigs(~ ). Similarly, every transaction name T G co. has an associated value data(T) G D, an associated version number uersionnumher(T) G N, and an associated configuration config(T) G corzfigs(~ ). These denote the value for X written by the write-coordinator, the associated version number in the quorum consensus algorithm, and the configuration to be used for writing the new value and version number. Third, we associate some information with the accesses to both the conjuration objects and the replicas. We assume that every transaction name T G CO,Chas kind(T) kind(T) = write and = read, data(T) and that every G configs(j2’), tion. Also, as before, we assume that every kind(T) = read and that every transaction ● N X D. write and data(T) transaction representing name the T E CO,O.has new confiWra- transaction name T G ace, has name T G accw has kind(T) = 5.2.2 Replica and Configuration Automata. As in Section 4, we define a replica automaton for each object name Y E ~. This is a read/write serial object for Y with domain N X D and initial value (O, do). As before, for u.version-number and v.value to refer to the compov ● PJ x D, we write nents of v. For the rest of this paper, let C = configs(~), and let co ~ C be a distin- guished initial configuration. The configuration automaton is defined to be a read/write serial object for ~, with domain C and initial value Co. (Recall that its read accesses are the elements of corC, and its write accesses are the elements of COWC.) ACM Transactions on Database Systems, Vol. 19, No. 4, December 1994. Quorum Consensus CREATE(T) Effect: s.awake s.awake in Nested Transaction REPORT_ABORT(T’) Effect: s, reported= = true = true s.mported REQUEST-CREATE(2”) Precondition: s’. au,ake = trwe 7“ @s’. requested Effect: s.reque$ted = s’, reported = s’. reporfed U {T’) u {T’) s’. au,oke = true 9’. wgue9ted = s’.rcported (3?2 c conjig(T). rmd)(7? ~ s’. read) v = (.qt. vcr.rion tlwnber, .qf.ualue) U {T’} Effect: 9.awOke = ja19e T’,V) s’. reported s’. mad .563 . RIXJUEST.COMMIT(T,U) Precondition: = s’. requested REPOR.T.COIVIMIT( Effect: 9 reported= s.read Systems u {T’} U {object(Z”)} if v.uersion-nurnber > s’. uersion-nu!nfmr- then s.uersion-nomber .= u.ver9ion-number s.ualue = u.urilue s.r,-ported s-.rcad = s’. reported U {’l”] s’. read u {object(T’)) = if v.uer.. !on-namber > s’. ucrsion-nt{?nbc-r then s.ucr~ion-nornber s.ualue = u ucraion-ftumber = u.ualue Fig. 8. Transition relation for Read-Coordinators. 3. 5.2.3 Coordinators. These transaction In this section, automata are we define the coordinator automata invoked by the LA’s and access in the replicas, We first define read-coordinators; these are the same as those in Section 4, except that instead of using the fixed configuration, the new read-coordinator uses its own associated configuration. A read-coordinator for transaction ber, initially name requested, false; T ● co, has state reported, value G D, and read, initially components where do; awake, awake is version-number quested and reported are subsets of children(T), a subset of ~, initially empty. The set of return The transition relation for read-coordinators value, version-num- a Boolean G N, variable, initially code is identical determine when assured of having to that in Figure 2, except that config(T) is the coordinator has collected enough information seen the most recent value of X. We next define Section 4, except write-coordinators. Again, these are the same that instead of using the fixed configuration, write-coordinator uses its own associated O; re- initially empty; and read is values VT is equal to N X D. is shown in Figure 8. This configuration. used to to be as those in the new A write-coordinator for transaction name T G COW has state components awake, requested, reis a Boolean variable, initially false; ported, and written, where awake requested and reported are subsets of children(T), initially empty; and written is a subset of ~, initially empty. The set of return values VT is equal to {“OK”}. The transition relation for a write-coordinator named T G cow is shown in Figure 9. This code is identical to that in Figure 3, except that config(T) is used to determine when the coordinator has written to enough replicas. ACM Transactions on Database Systems, Vol. 19, No. 4, December 1994. .564 K. J, Goldman . CREATE(T) Effect: s.awake s.otwke and N. Lynch REPORT_ABORT(~) Effect: s.reported = s’. rcportedu = true = &rue s.reported RXQUEST.CREATE(Z’) Precondition: REQUEST_COMMIT( Precondition: T,u) s’. uwake = true 9’. au,ake = true T’ $$s’. reque~ted s’. requested dafa(T’) = (uersion-number( T), data(T)) (3W E Eifect: = s’, reguestedu REPORT-COMMIT(T’,V) Effect: s.reported = s’, wrvtten 9’. wrifte”) s.turvtten {T’} EtTect: s awake = jalse reported U {T’} = s’. wr:tten U {object) 9.reported = s’. repot tedu = s’, wr:tten Fig 5.2.4 = s’ reported cOnjig(T)tur:te)(W ~ V=” OIY” s,rcquested 9 {T’] repor~edu {~) = s’ Logical {T’) U {object(T’)} 9. Access Transltmn relation Automata. for Write-Coordinators. Now we define logical access beginning with read-LA’s. These are the same as those in Section that the new read-LA first invokes a read-con figuration-coordinator automata, 4, except in order to determine the current configuration. It then invokes only read-coordinators whose associated configuration is the same as the determined current configuration. A read-LA value-read, Boolean u {nil}, has state and variables, initially initially empty. The transition the code REQUEST_ taining components config-read, initially nil; and awake, where con fig, awake, false; config requested and value, l}alue-read, requested, and initially G C U {nil}, reported are subsets reported, config-read are nil; value G D of children(T), The set of return values VT is equal to D. relation for read-LA’s is shown in Figure 10. The changes in Figure CREATE responses 4 and from tional state components: returned, and config-read, read. Also, there are the are as REPORT follows. _COMMIT configuration-coordinators. config, a flag extra to First, there are additional actions for invoking and obThere are two addi- which keeps track of the configuration that records that a configuration has been clauses in the precondition for the REQUEST_ CREATE events for read-coordinators, which ensure that only read-coordinators with the current configuration are invoked. We next define write-LAs. These are the same as those in Section 4, except that the new write-LA first invokes a read-con figuration-coordinator in order to determine the current configuration, and then invokes only read-coordinators and write-coordinators having the proper associated configuration. A write-LA has state components awake, config, config-read, value-read, version-number, requested, and reported, where awake, value-written, are Booleans, initially false; con fig config-read, value-read, and value-written initially nil; version-number ● N U {nil}, initially nil; and reG C u {nil}, quested and reported are subsets of children(T), return values VT is equal to {“OK”}. initially ACM 1$394 Transactions on Database Systems, Vol 19, No 4. December empty. The set of Quorum Consensus in Nested Transaction REQUEST_CREATE(~), Precondition: T’ c COm s.value-read s. s’. awake = true reported = T’ @ s’. requested s.ualue-read = s’, reguestedu s.conf7g-mad = true s.reported = s’. reported U {T’} read 9.config = false = u {T’} REPORT_A130RT(Z”) Eflect: s.reported = s.reported = 9’. reported = s’. reporfed REQUEST.COMMIT( P~econdition: s’, awake = true then sf. requested frue u {r} U {T’) T,U) = s’. reported s’. ualue-read v = al,”al”e v s.config-reacf = true {T’] REPORT..COMMIT( T’,V), T’ E cm,. Effect: a.reported = .’. reported u {T’] if s’. conjig-read = false then s.conjig = v s’. config. true = s’. reporteti if s’. ua?ue.read = false then .9.value = ti. ualue Effect: s.reguested 565 . REPORT.COMMIT(~, v), ~ E CO, Effect: s. reported= s’.reportedu {’~) if s’. uafue-rwad = jalne then s.twlue = v.ualue CREATE(T) Effect: s.cmmke = trw 9.awake = frue if Systems = true Effect: REQUEST-CREATE(T), Precondition: s’. awake = true 9’. c0nfig-read = true config(’1”) = s’. config T’ @s’. requested Effect: s.reqaested ~ G CO, = 3’.reque3&edU Fig. 10. The transition the code in relation 5 Transition are = fake {~] for write-LRs Figure a.awake as relation for Read is shown follows. LA’s, in Figure First, REQUEST_ CREATE and REPORT_ COMMIT actions taining responses from the configuration-coordinators, components the for keeping precondition for track the of the results. REQUEST_ CREATE There 11. The changes there are for invoking and oband associated state are also extra events to additional for clauses in read-coordinators and write-coordinators, ensuring that only coordinators with the configuration are invoked. Finally, we define recon@-ure-LA’s. The purpose of a reconfigure-LA current is to change the current configuration to a given target configuration. The main thing it does in order to accomplish this change is to invoke a write access to the configuration object. However, this alone is not enough; transaction must also maintain the crucial property that the recont3gure-LA all the members of some write quorum, according to the current configuration, must haue the latest version number and data value. (The importance of this property can be seen in the proof of Lemma 4.3.1.) An arbitrary configuration change might cause this property to become violated. Thus, extra activity is required on the part of the reconfigure-LA to ensure that the latest data gets configuration. In particular, ACM propagated to some write quorum of the new the reconfigure-LA must first obtain this latest Transactions on Database Systems, Vol 19, No 4, December 1994 566 K. J. Goldman . and N. Lynch REQUEST.-CREATE( Precondition: s’. awake = true CREATE(T) Effect: 9. awake = true $.awake = true a’.ualue-rcad REQUEST.CREATE(T’) Precondition: , ?“ c CO,c conjig(~) T’), T’ c COUI = true = 9’. conjig rfata(T’) = riata(T) uernion-number(T’) Y $?3’. requested a’. awake = true T’ @ s’, requested = s’. uersion-number Effect: Effect: s.mquested s.requested = s’. requestedu REPORT.COMMIT(T’, REP C)RT. CO MMIT(T’,V), T’ 6 CO,, Effect: s.reported = s’. reportedu {T’) Effect: s.config-read if 9 c~njig s.reported REPORT_4BORT(T’) Effect: s.reported then v s.config-read 9 = s’ reported = s’. rt;,orted u {T’} u {T’] T,u) a’. awoke = true s’. oumke = true s’, config-read s’. requested = true = s’, conjig v= = true “OK” Effect: s ~wake = jalse T’ $ s’.requcsted = s’. reguc~tedu REPORT-COMMIT(7”, s.repo,-ted = s’ reported sf. ualue-rurilten Effect Effect: reported REQUEST_COMMIT( Precondition: CO. Precondition: s.r{quested u {T’} = true = true REQUEST_ CREATE (T’), T’ ~ conjig(T’) COW u {T’} = true = s’. re~rted s’, ualue-written u {T’) = false = U), T’ c = s’. reported = true = s’.reportcd s’. co~~jig-rmd s.reperted s’. uolue-ruritten if 9’ conjig-reed = \a/se then 9 config = u s.reported {T’} = s’. requested u {T’} {T’} U)) T’ < = .s’. reportedu CO, {T’) if s’, ualue-mad = jalse then .s.uersion-nurnber = v,uersion-number s.ualoe-recrd s,rcported = true = s’, reportcdu {T’] = jalse then if s’. ua)ue-read s.uers:on-number = v.uer9ton-number s.ualue-read= true Fig. 11. Transition relation for Write LA’s. data, which requires that it first obtain the value of the old configuration from a read-configuration-coordinator and then the latest data value from a Once it has the latest data, the read-coordinator with the old configuration. reconfigure-LA must cause this data to be written to some write quorum, according to the new configuration. It accomplishes this by invoking write-coordinators whose corresponding configuration is the new configuration. (Note that the reconfigure-LA can invoke the write-coordinators and the write accesses to the configuration object concurrently, even though in the serial systems considered here, the subtransactions themselves will be run serially. This concurrent invocation will be useful in the concurrent systems we study later, where these activities can run in parallel to reduce latency.) ACM TransactIons on Database Systems, Vol 19, No. 4, December 1994. + 1 Quorum Consensus A reconfigure-LA awake, for config, value-read, is in version-number subsets and u {nil}, c {“OK”}. The transition read- config-written, nil; value and for reconfigure LA’s state components config-read, variables, G D U {nil), first and values is shown 567 config-read, awake, requested a reconfigure-LA . reported, The set of return empty. write-LA’s, has Boolean are nil; initially Systems where config-written initially relation and T G la,,, requested, initially ~ N U {nil}, of children(T), for the and value-written, config name version-number, value-written, value-read, false; transaction value, in Nested Transaction initially initially nil; reported are VT is equal in Figure to 12. Just determines the as current configuration with a read access to the configuration object. Then the reconfigure-LA invokes read-coordinators using that configuration; when the first read-coordinator reports its commit, the reconfigm-e-LA remembers the value and version number number returned. Then of write-coordinators, value and version using number the current value as needed. Concurrently, returned to every replica the the its reconfigure-LA new may configuration, invoke along by the read-coordinator. This any with the propagates in a write quorum of the new configuration, invokes at least one write LA access to the configuration object in order to record the new configuration. In order to request to commit, the reconfigure-LA must receive REPORT_ COMMIT and at least one write access responses for at least one write-coordinator to x. 5.3 Correctness In this begin %. Proof section, with First, exactly we prove the some definitions. the logical as in Section correctness of system In each definition, access sequence of ~, 4 to be the subsequence ~. As in Section ~ is a sequence logical-sequence( ~), of D containing 4, we of actions of defined is the CREATE(T) and REQUEST_ COMMIT(~, v ) events, where tion is based on the new definition of la, which T E la. (But now this definiincludes the names in la,, C.) Also, current-un( if before, ~ is finite, to be the then value logical-value(B) of the last and logical write (or the /3 ) are defined initial value logical object if no such write occurs), and the highest version-number automata in any global state led the local states of all replica as of the among to by ~, respectively. We require one new definition, analogous to logical-value( ~ ). finite, then logical-config( /3 ) is defined to be either REQUEST_ COMMIT(T, v) is the last REQUEST_ COMMIT or transaction name in la~,c that occurs in logical-sequence(@), is REQUEST_ is the value COMMIT event of the last logical no such reconfigure The next lemma correctness inductive (c) says theorem. argument. that the occurs. In other words, the logical configuration reconfigure access (or the initial configuration access occurs). is the key to As in Parts Lemma the 4.3.1, proof of Theorem condition (a) and (b) of condition configuration ACM Namely, if ~ config(T) if event for a co if no such object Transactions holds on Database the (1) 5.3.11, is only (1) are as before, logical Systems, Vol. the needed main for the and part configuration. 19, No. if 4, December As 1994. 568 K, J. Goldman s and N. Lynch REQUEST..CREATE( CREATE(T) Effect: s’. awake = true ~’. armrke = true s’ wdue-mmd = true 3’. awake = true REQUEST. CREATE(T’), T’ ~ data(T’) Core = .d. wdue uersion-nu Precondition: true T’ $ s’. r-equeated st. awake T’), T’ 6 co. Precondition: niber(T’) = s’ uersion-aumber config(T’) = co.fig(T) T’ $ ~’.requested = Effect: Effect: s.requestea’ s.reguested = s’. requestedu REPORT-COMMIT REPORT-COMMIT Effect: if (T’,rJ), s.reported 9’ T’ C COW = s’. reportedu {T’] s.conjig-read s.mported = REQUEST .CREATE(T’), s’. t,olue-rcild = true data(T’) == conjig(T) T’), T’ 6 co. T’ @ s’ requested Effect: s.requested REPORT. COMMIT(T’,V), Effect: s reported Effect: s reported= s.reparfed $.rmrsion-number s’. reque~ted = s’. reported s’. ualue- written = true u.version-number s(. conjlg-wrxtten v=” OK” 12. important returns 7’,V) s’. au,ake = true = true Fig, Transition relation = true = jolse for Reconfigure LA’s. pax-t of the lemma is condition (2), which tells us that the value associated with the previous logical write. ~ be a finite schedule of U? such that no logical properties hold in any global state led to by P: There exists a write quorum YI ~ logical-config( /3 ).write any replica SY for an obJ”ect name Y E 7?;-,if v is the data Sy, then v.version-number = current-vn( B ). Transactions access to in ~. (1) The following ACM u {T’) {T’) Precondition: EtTect: s.awake (a) s’. reperted = s’. reportedu REQUEST.COMMIT( U {T’) = Let U {T’) = v.uer9ion-number if s’. uafue-wad = }rd.se then s.ualue = v value 5.3.1. {T’] = true REPORTABORT(T’) = true = s’. reported s.uatue-read T’ ~ COwc s reported = s’. rvprfed s.config-written = true = s’. requested U {T’) s.uersion-number LEMMA {T’} = s’. reporftdu s.config-written s.tla!ue-rmd X is active = g’. reguesfedu = true if s’. ualue-mad = false then ir. t,alue = v.mlue the T’ G COvJc s’. awake = true = true REPORT. COMMIT(T’,V), T’ C co, Effect: s.reparted = s’. reported U {T’] each read-LA U {T’} = true Precondition: config(l”) = 3’ conjig T’ $ s’. wquested before, = s’. reported 9’. ualue-wrttten v REQUEST-CREATE( Precondition: s’ atiuke = true s.reporfed U {T’] = true = true 9.conjig-read s’. conjig-r.md T’ 6 COW = s’. reported s’. uolue-written s.reported = s’. rcportedu {T’] if s’. conjlg-read = ~ahe then 9.cOflJig (T’,rJ), s.re~rted Effect: config-mad = /a/se then 3 conjig = u Effect: s.regue~ted {T’} = s’. regue9ted U {T’) on Database Systems, Vol. 19, No 4, December 1994 such that component for of Quorum Consensus (b) Systems if v is the data component of SY and ~ ) then v.value = logical-value( P). The data of the configuration ~ ends value( component in REQUEST_ COMMIT(T, object U) with 569 . For any replica SY, number = current-vn( (c) (2) If m Nested TransactIon v.version is logical-config( T G la,, then - ~ ). v = logical- P). As before, PROOF. of the form logical-sequence(~) CREATE(7’ ) REQUEST_ ceed by induction on the logical-sequence( ~ ) contains consists of a sequence COMMIT(T, of pairs, each T ● la. We PrO- v), where number of such pairs. For the no such pairs. Then logical-config( basis, suppose ~) = co. Parts (a) and (b) follow as before, with logical-config( P ) used in place of the fixed configuration. For part (c), since ~ contains no CREATE(T) events for T G la, it contains no REQUEST_ COMMIT events for accesses to the configuration object; therefore, since the data component of the configuration object is initially co, this is also the case in any global state led to by /3. This yields part (c), and condition For the inductive the last CREATE holds for COMMIT(Tf, (2) holds step, let event in vacuously. ~ = /3 ‘y, where ~), and assume f?’. So, logical-sequence(y) = CREATE(T’f) Vf) for some T+ = la and Vf of the appropriate the fact that the system is serial shows T~ have actions in y; in particular: Claim REQUEST_ is at least that only begins that with the lemma REQUEST_ type. As before, descendants or ancestors and if CREATE(T) is in ace, U ace,,, v) occurs in y, then T G descendants(Tt ). Also, 5.3.2. If T COMMIT(T, is in co and if CREATE(T) T = children(T~). There logical-sequence(y) logical-sequence( or REQUEST_ one REPORT_ COMMIT(T, COMMIT event v ) occurs of or if T in y, then for a transaction name in co, in y; let T‘ be the read-coordinator in co, with the first REPORT_ COM). Let y‘ be the portion of Y UP to MIT in y. By claim 5,3.2, T’ ● children(Tf and including the REPORT_ COMMIT shows that no LA requests the creation event for T‘. Then, because the code of any write-coordinator (for replicas or configuration objects) until after a read-coordinator we see that there are no REPORT_ COMMIT events T~ that are write accesses. Also, since -y must contain a REQUEST_ CREATE name in REPORT_ CO,C with up co,, the COMMIT the to and claims first including about PROOF. REPORT this the states We have accesses event for a transaction definition of T~ implies that there is at least one event in y‘ for an access in CO,C;let T“ be the access in _COMMIT REPORT_ of the automata Claim 5.3.3. Let s be the with Tf in any global state logical-conjig( ~ ‘). y‘ for write has returned a value, in y‘ for descendants of in y‘, and let COMMIT event. for Tf and y“ We be the now portion give of y several T‘. state of the transaction reachable from ~ ‘y” in automaton ~. Then associated s .config = argued that there are no REPORT_ COMMIT events in that are descendants of Tf. So, by Claim 5.3.2, no write ACM TransactIons on Database Systems, Vol 19, No. 4, December 1994 570 K. J. Goldman . and N Lynch accesses occur in y‘. Therefore, logical-corz~ig( returns as its value by part (c) of the P ‘). By definition nently by T“. set to the value Claim 5.3.4. Let returned s be the state Therefore, of the inductive hypothesis, T“ of Tf, s.config is permathe claim transaction ❑ holds. automaton associated with T‘ in any global state reachable from ~‘ in ~ ‘y’. Then s.version-nurnber and s.value contain the highest version-number and associated value among the states led to by P‘ of the replicas whose names are in s. value-read. Claim with 5.3.5. Let s be the T~ in any global = current-un( Let PROOF. MIT ~‘) event state and state of the reachable s.value s‘ be the state from transaction automaton ~ ‘y’ in ~. Then = logical-z] alue( ~ ‘). of T‘ just it issues in y‘. By the definition before associated s.version-nurnber its REQUEST_ of a read-coordinator, s‘. read COM- must contain a read quorum .%?G config(T ‘). read. Claim 5.3.3 and the defhition imply that config(T’ ) = logical-con fig( ~‘ ), so the read quorum Y of Tf is in logical-config( ~‘ ). read. By part (a) of the inductive hypothesis, there is some ~‘ ).write such that the states of all replicas write quorum 7Z-● logical-config( for object names in Win any global state led to by P‘ have version-number = current-vn( ~‘ ). Since logical-config( ~‘ ) is a configuration, M? and W- must have a nonempty intersection. So s‘ ,read must contain at least one object name in Z? So, by Claim 5.3.4, s ‘.uersion-number = czLrrent-un( ~ ‘), Therefore, by part When T‘ (b) of the inductive reports its commit hypothesis, to T~, the s‘ .ualue = logical-ualue( version-number and ~‘ ). value compo- nents of Tf are set equal to s‘ version-n urn ber and s‘ value, respectively. By definition of T~, these components are never again modified. Therefore, Claim ❑ 5.3.5 is proved. We now return to the main proof. We consider the three possible cases for T~ . (1) T~ G la,. The logical-value( /3) = logical -value( ~ ‘). Since T~ invokes only read-configuration-coordinators and read-coordinators, which in turn invoke only read accesses, the version-number and value components of the states global of the replicas in any global state state led to by ~‘, so current-vn( led to by ~ are the same as in any ~ ) = czu-rent-un( ~ ‘). Likewise, we have logical-config( ~ ) = logical-config( P ‘), and configuration object is the same in any global Therefore, this time condition (1) on Claim based holds for ~. The proof the data component of the state led to by ~ and ~‘. for condition (2) is as before, 5.3.5. Then logical-ualue( ~ ) = data(T~ ) and logical-config( P) = (2) T~ ● la,,,. logical-conflg( ~ ‘). We prove two claims about the information associated with the descendants of T~ invoked in y. Claim 5.3.6. If T is a write-coordinator invoked = data(T~ ), ber(T) = current-vn( ~‘) + 1, data(T) config( ~‘ ). ACM TransactIons on Database Systems, Vol. 19, No. 4, December by T~ then version-numand config(T) = logical- 1994 Quorum Consensus Let PROOF. event that s be the occurs + 1, data(T) cannot invoke reports and by of T~ just in y. By definition, = data(T~), and a write-coordinator its commit. in y occur state after So all REQUEST_ 5.3.5, before the CREATE by Claim s.uersion-number = current-vn( 5.3.7. If T is a write access for an object = (current-un( /3’) + 1, data(T~ )). As before, 5.3.6, T~ cannot and the definition request to commit quorum of replicas, logical-coizfig( quorum rent-un( in logical-config( P‘ 8 ) = eurren,t-vn( until to conflg(TU receive Then Claim 5.3.8. ). By REPORT_ config( Claim logical-value( fi ‘), Analogous 5.3.9. data(T) invoked 5.3.6 in y, then ❑ a REPORT_ COM- request to commit accesses to a write Claim 5.3.6, COMMIT Claim corzfig(TU, ) = events 5.3.7, ~ ) = logical-ualue( data(T) Claim data(T) If to that about invoked for a write condition ~‘) and (1) still logical-con- the information by T~ then = logical-ualue( 5.3.10. of Claim T is a write = (current-un( By Claim PROOF. @‘), associ- uersion-num- and config(T) = If access invoked /3’), logical-ualue( 5.3.8 and the definition T ❑ 5.3.6. is a write access by a write-coordinator It follows By the definition from T~ cannot Cowc invoked by T~ in y, Claim request of T~. then 5.3.9 that to commit ❑ current-un( until ~ ) = current-un( a REPORT_ COMMIT ~ ‘). By definifor at least of its write-coordinators occurs. Let TW be any write-coordinator child 5.3.8, uersion-number(T,U that has a COMMIT event in y. By Claim current-un( ~ ‘); data(T,u ) = logical-ualue( P ‘); and config(T,U ) = config(T~ definition y, ❑ of a write-coordinator. in in P ‘)). = config(T~). PROOF. tion, Claim T~ ). PROOF. then in ~ fl ‘), ~ ‘). Therefore, by Claim 5.3.7, it follows that cur~‘) + 1, so condition (1) holds in any global state If T is a write-coordinator = current-un( /3 ‘). Thus, it receives fig(P) = conflg(Tt ). We first give three claims ated with descendants of T~ invoked in y. ber(T) T~ Let T(O be the write-coordinaevents in y, and let ~ be the led to by ~ ‘8. As before, but this time using holds for ~. Condition (2) holds vacuously. (3) T~ G lar,C. = logical-config( COMMIT( TU,). TW cannot COMMIT events for write according ~ ‘), so T,,, must by definition, read-coordinators of a write-coordinator. MIT from at least one of its write-coordinators. tor child of T~ that has the first COMMIT portion of y up to and including until it has received REPORT_ CREATE(T) for write-coordinators s.config Claim data(T) 571 . = s.uersion-number events 5.3.3, ❑ By Claim REQUEST_ = s.config. Also, at least one of its holds. PROOF. Systems uersion-number(T) config(T) until ~ ‘y’. Therefore, Claim in Nested Transaction of a write-coordinator ACM Transactions and Claim on Database 5.3.9, Systems, Vol. just 19, No. prior to 4, December one of T~ ) = ). By the 1994, 572 K J Goldman . REQUEST_ config( COMMIT T~[,).zurite sion-number and N Lynch for such Tu that in y, there all replicas = uersion-number( must be for object T~~,). By the a write names equalities %7”● quorum in 7?< have above, data .c’er- Yfl- is a write quorum in logical-config( ~ ), and all replicas for object names in %- have data-uersio?z-number = current-vn( ~ ). This shows part (a). Claim 5.3.9 implies that all write accesses in -y have data = ( currelz tZ.ln( @‘), Zogical-zjalue( ~ ‘)), which is equal to (current-vn( ~), logical-ualzte( ~ )). Since part (b) holds for ~‘, it also holds for ~. Claim 5.3.10 and the definition of a reconfigure-LA imply that the data component of the state of the configuration object in any global state led to by ~ is equal to corzfig(T~), and hence to logical-config( ~). read-LA, condition (2) holds three ❑ Now denote we give the main correctness theorem for @. Again, let the transaction names and object names of d, respectively. Let 5.3.11. (3 be a finite that = PIT for each transaction (2) ylY = ~lYfor The each object proof is analogous by removing from @ all REQUEST _COMNIIT(T, REPORT_ COMMIT( T, u), and tion names T in end of the of d. two conditions name name The schedzde the followilzg (1) ylT PROOF. holds. T~ does not in all y of d such lemma (c). Since 5.3.1. schedule the part Thus THEOREM cases, This shows vacuously. proof Then name for Lemma Y~l and there a -2Pi exists a hold: T ● Y> – la and Y •;2~. to that for Theorem 4.3.7. We construct y REQUEST_ CREATE(T), CREATE(T), v), COMMIT(T), ABORT(T), REPORT _ABORT(T ) events for all transac- co U ace, U ace,,,. Clearly, the two conditions hold. It re- mains to show that y is a schedule of .W to do this, it suffices to show that y projects to yield a schedule of each component of M. It is easy to see that, for each nonaccess transaction automaton T, and for each object for name ule of the object automaton CREATE(T) and REQUEST T of s?’, y IT is a schedule name of the transaction Y = 7A – {X, Z}, y IY is a sched- for Y. The sequence y IZ is exactly the sequence of Since _COMMIT(T, z)) actions in y for T = la,,,. reconfigure-LA’s preserve transaction well-formedness and always return value u = “OK”, y IZ is a schedule of the dummy object automaton. Also, as before, y is a schedule of the serial scheduler, It remains to show that y IX is a schedule of S’x. The proof is by induction on the length of /3. The basis, when ~ is of length O, is trivial. Suppose that 6 = B ‘n, where m is a single event and the result true for @‘. The onlY case that is not immediate is the case where n is a REQUEST_ COMMIT la, U laU,, so suppose that n = event for a transaction name in REQUEST_ COMMIT(T, u), where T G la, U /aW. The proof that the preconditions of w in S4Y are satisfied as in Theorem 4.3.7, but this time using ❑ Lemma 5.3.1 in place of Lemma 4.3.1. Therefore, y is a schedule of W. ACM TransactIons on Database Systems, Vol 19, No. 1, December 1994 Quorum Consensus 5.4 System ~’: The Reconfigurable with Replicated Configurations. It is possible implementing interesting avoid point to manage in Nested Transaction Quorum Consensus configurations in Systems Algorithm a centralized the algorithm described by system ~ to consider replicating the configuration 573 . fashion, directly above. However, it is also information. This may the risk of the configuration storage becoming a bottleneck or a single One way of doing this is using the of vulnerability in the system. fixed-configuration ration that quorum describes configuration. A configuration consensus algorithm: define quorums and quorums read write read-con figuration-coordinator replicas, and returns ation number, which configuration-coordinator both a fixed reads metaconfigu- of replicas a read of the quorum of the latest configuration and a generto a version number. A write- is analogous writes the new configuration to a write quorum of configuration replicas; it does this using the generation number obtained from a read-configuration-coordinator. Except for the need to manage the generation-number information, the LAs are the same in this algorithm as in @. It is also possible to manage metaconfigurations, similar All issues of this avoid arise could an infinite tion that be either the configuration i.e., to reconfigure the for the implementation continue regress, for any we must replicas configuration using number of steps. stop at some point However, in order does not require further configuration. Such a stopping point a centralized implementation or a fixed-quorum algorithm. same configurations are used the same configurations for then a special both stages. are used to manage to and use an implementa- case of There is another interesting alternative, however, in which one-to-one correspondence between the replicas at the last two stages, But of metaconfigurations. that a centralized implementation is just quorums, where there is only one replica.) the changing replicas! That using fixed there stages, is, at the the data might (Note replicas last is a and two and the copies of the configurations themselves. This means that the read-configuration-coordinators will need to read a set of replicas of the configuration objects that form a read-quorum, without first knowing what the current read-quorum is! It turns out that they can simply start reading configuration replicas and use the generation numbers and configurations stored in them to determine In this when a read-quorum for the current configuration has been read. subsection, we will describe this strategy. For simplicity, we con- sider the special case of which objects and configurations—and there are only two kinds of replicas—of logical both are managed using the same configura- tions. The system we define is called give the new automata that the configuration replica write-configuration-coordinator slightly modified from objects those %“. We first appear define the type of %, and then in K but not in .@. The new automata are and the read-con figuration-coordinator transaction automata. Also, the LA’s and are of W. 5.4.1 System Type. The type of system 8’ is defined to be the same as that of .$?, with the following modifications: The object name ~ is now replaced by ACM Transactions on Database Systems, Vol 19, No 4, December 1994. 574 K. J. Goldman . Fig. 13. and N Lynch Reconfiguration coordinators a new set of object names, ~, assume that there is a bijection are new accesses U acc,C but transaction names, in %“ they are not; changes to the those indicated representing the configuration replicas. We i, from the names in ~ to those in j?’. There ace,, and acc,,, C, which each transaction name are the name in CO,C now in COU, C has children read and has write coordinators with In ~, we keep all such information from in are the names other The naming and their some ~ children in acc,,)C. These system type—for example, no transaction are parents of the new transaction names. structure for the read and write configuration are shown in Figure 13. In defining ~, we associated information names. children to the configuration replicas, respectively. We let acc = ace, u accU, U accUC. The transaction names in CO,C and COW,Care accesses in J%’, acc,C, and each transaction only than and their children of the transaction and add more information as follows. We now assume that every transaction name T ● COW~ has associated values old-con fig(T) G C and generation-number(T) G N, in addition to data(T) as before. Also, we assume that every transaction name T = acc,C has hind(T) = write = read, and also and that data(T) every transaction name T = accU,C has kind(T) G N x C.7 5.4.2 Configuration Replica Automata. We define a configuration replica automaton for each object name Y E ~. ‘I’his is a read/write serial object for Y with domain N x C and initial value (O, co). For v ● N x C, we write u .generation-number 5.4.3 and Coordinators. v con fig In this to refer section, to the components we define of v. the coordinator automata %’. The read- and write-coordinators are as in S, so we need only read-con figuration-coordinators and write-con figuration-coordinators. define read-configuration-coordinators. coordinator is to determine The the latest purpose generation in define the We first of a read-configuration- number and configuration, on the basis of the data returned by the read accesses it invokes on con!@uration replica objects. A read-configuration-coordinator for transaction name T G Co,c has state components awake, config, generation-number, requested, reported, and read, CO; where awake is a Boolean variable, initially false; config ● C, initially ● N, initially O; requested and reported are subsets of generation number 7As with the transaction read/write objects, number ACM names agreement and configuration TransactIons on into Database in accu,, with a single Systems, since that the transaction definition data Vol. compels names the attribute. 19, No 4, December in acc W, name combination 1994 of the accesses generation to Quorum Consensus in Nested Transaction REPORT.ABORT(T) Effect: a.mported CREATE(T) Effect: s. awake = true g.awake = true s.reported REQUEST.CREATE(T’) Precondition: Systems = s’.reportrd = s’.mported REQUEST.COMMIT( Precondition: sf. awake = true s’. awake = true s’. requested REPORT. Eflect: COMMIT(T’,U) s. reported= s’.reported s.read if = s’. requested U {T’) u {T’) u {T’} = .9’. reported (3 R)(t(R) s.requested 575 T,U) T’ $ s’. requested Effect: . E s’.config.rcad A?? ~ s’. read) v = (s’. generation-number, s’. config) Effect: U {T’} s.awake = jalse = s’. read U {objwt(T’)} v.gene-ntion-number > a’. generation-number then s.generation-number = rr.generrrtion-number s. config = v. config s.reported = s’. reportedu {T’] s.r-ead = s’. read u {object(T’)} if v.generation-number > a’. generation-number then 9.generation-number = v.generrrtion- number s.config = v.mnfig Fig. children(T), values The Figure Transition initially relation empty; and for Read-Configuration-Coordinators. read g ~, initially empty. The set of return VT is equal to N X C. transition relation for read-configuration-coordinators 14. A read-con figuration-coordinator invokes accesses replicas. On receiving returned own 14. a REPORT_ generation state. updates If its the own returned number with returned COMMIT, the values. The number and interesting replica and ~) commit, associated names that corresponds is part of larger, a the the of its coordinator components to the read-configuration- its REQUEST_ s‘ read contains (using compares component configuration coordinator is the set of preconditions for the coordinator reaches a state s‘ in which ration coordinator generation-number generation generation-number the is shown in to configuration COMMIT. When a set of configu- the correspondence i between ~ to a read quorum, according to s‘ .config, then it may request to returning the highest generation number it has seen, along with the configuration. uration-coordinators One and the should note the read-coordinators similarity of the in the way the read-configmost current configuration and generation number are obtained, and also the seemingly “circular” use of replicated configuration data in order keep track of the current configuration of the configuration replicas themselves. Of course, it remains to show that this method of determining the current configuration is correct. The key intuition is that the write-configuration coordinators, defined next and as in system g, write the new configuration and generation number to a write-quorum of the old configuration, which of ACM Transactions on Database Systems, Vol. 19, No. 4, December 1994. 576 K J, Goldman . and N. Lynch CREATE(T) Effect: s.awake = true 9 awake = true REPORT_ABORT(T”) Effect: .s.reported REQUEST.CREATE(T’) Precondition: REQUEST_ s reported s’. awake = true COMMIT(T,W) st. ou,ake = true = (genemtion-nwnker(7’), data(l’)) s’. requested = 9’. reported (3 W)(1(W) G old-conjig(T). A~ Effect: 9.reque.9ted u {T’) U {~} Precondition: T’ @ g’. requested dotu(T’) = s’. report.d = s’. repOrted = s’. rquesfed u = U {T’} ~ write s’. written) “OK” Effect: REPORT-COMMIT(T’, Effect: s.written = 9’. repOrted s.wr:tten = Fig. course will configuration 3, = s’. reported U awoke = false {T’} = s’. wrvtten U {object(T’)} s.reported quorum occurred ZJ) s.r-eported s’. written 15. U {T’) {object(T’)} U Transition relatlon for Write-C’onfiguration-Coordinators. intersect all read-quorums of the old configuration. replica’s generation number is the largest found according to that replica’s and the given configuration configuration, is the current then one. no Hence, if a in reading a such write has As promised, we next define write-con figuration-coordinators. A write-configuration-coordinator for transaction name T ● co.,, has state components awake, initially requested, reported, and written, false; requested and reported empty; and VT is equal The Figure written is a subset of ~, where awake is a Boolean are subsets of children(T), initially empty. variable, initially The set of return values to {“OK”}. transition 15. Recall relation for write-con figuration-coordinators is shown in that a write-configuration-coordinator T has an associated old configuration, old-config( T ), and a new configuration, data(T), as well as a generation number, genera tion-number(T ). The purpose of a write-configuration-coordinator is to write its generation number and new configuration to a write quorum in its old configuration. It does this by invoking write accesses receiving to the configuration REPORT_ COMMIT replicas, actions cas in some set corresponding and can only request to commit for accesses to all configuration to a write quorum, according after repli- to old-con fig(T ). A read-LA in W is identical to a read-LA 5.4.4 Logical Access Automata. in W, except for the REPORT_ COMMIT input action for children that are read-configuration-co ordinators. The only difference is that the read-configuration-coordinator returns not only a configuration, but a pair consisting of a generation number and a configuration. The read-LA simply ignores the generation number. Formally, we have: REPORT_ COMMIT(T ‘ , u), T ‘ E CO,, Effect: s.reported = s‘ report u if s‘ .config-read = false then s.config = v.config s.config-read = true ACM TransactIons on Database Systems, {T’} Vol. 19, No 4, December 1994 Quorum Consensus s.i-eported Systems 577 . U {T’} = s‘ reported if s‘ .config-read in Nested Transaction then = false s.confi,g = u.config s.config-read A write-LA described = true in %’ is identical to a write-LA in W, except for the same change above for read-LA’s. Unlike the read- and write-LA’s, the reconfigure-LA’s of % make generation numbers returned by the read-configuration-coordinator, age the configuration from those of 3 replicas. than do configuration-coordinator and a configuration, used to determine, LA, the allowable @, except more read- number These are name for the T G la,,, addition has the same state of the component compon- generation- and has initial value nil. The which takes on values in N U {nil} is the same, except that the type of the return values for read-con- figuration-coordinators is now N x L’. The actions REPORT– COMMIT( Effect: s. reported if s‘. config-read s.config T ‘ , u ), T ‘ G = s‘ reported COrc U {T’} then = false = u.con fig s.generation -number = v.generation-n s.config-read umber = true = s ‘reported U {T’1 ifs’ .conflg-read = false then s.config = v.config. s.generation -nwnber — Lj.generation-number s con fig-read = true s. reported 5.5 Correctness In this of % differ Here, a a pair consisting of a generation are saved by the reconfigure-LA. for transaction as it does in number, signature returns of which reconfigure-LA’s and write-LA’s. for the write-con figuration-coordinators T‘ invoked by the values for old-con fig(T’) and generation-number( T’). A reconfigure-LA ents both Thus, the the read- use of the to man- section, that change are: REQUEST– CREATE(T ‘), T ‘ G COU,L Precondition: s’. awake = true s’. value-read = true data(T’) = config(T ) old-con fig(T’ ) = s‘ .confzg genera tion-nuntber(T’ ) — s ‘generation-number + I T‘ g s‘ requested Effect: s.requested = s ‘requested u {T’] Proof we prove the correctness of system ~. Once again, we begin with definitions, this time for a sequence /3 of actions of %’. Specifically, we extend the definitions of logical-sequence( P ) and logical-config( D ) to apply the sequences of actions of %’. We also require one new definition, analogous to current-un( ~). Namely, if ~ is finite, then current-gn( P ) is defined to be the highest generation-number among the states of all configuration replica automata in any global state led to by ~. We prove correctness of C by means of a correspondence between %’ and ,%. If ~ is a sequence of actions of %, then we define f( ~ ) to be the sequence of actions of .’@ that results from (1) removing all REQUEST_ MIT(T, u), COMMIT(T), ACM CREATE(T), CREATE(T), REQUEST_ COMABORT(T), REPORT_ COMMIT(T, u), and RETransactions on Database Systems, VO1. 19. No. 4, December 1994. 578 K, J. Goldman . PORT _ABORT(T) and N. Lynch events for all transaction names T = czcc,C u accWC, and (2) replacing all REQUEST_ COMMIT(T, u ) and REPORT_ REQUEST _COMMIT(T, where T G CO,C, with REPORT_ COMMIT(T, u.config ), respectively. The following lemma is the key to the proof COMMIT(T, u.con~ig) of Theorem 5.6, u) and the main correctness theorem for %. As in Lemma 5.3.1, condition (1) is only needed the inductive argument. Part (a) says that any configuration replica either has the current generation number, or has a configuration all configuration replicas in some write quorum of that generation numbers higher than the one held by S’y. Part configuration replica logical configuration. holding the current generation Condition (2), the important that yields our construction LEMMA active Let 5.5.1. a schedule ~ be a finite such configuration (b) says that for SF that hold every number also holds the part of the lemma, says of Q?. schedule of I% such that no logical access is in ~. (1) The following (a) properties hold in any global For any configuration replica nent of Sy and v.generation-nu ists a write quorum state led to by B: automaton SF, if v is the data compomber < current-gn( ~), then there ex- Y’ E v.config.write with the following property: any configuration replica SF for an object name ~ such that i(~) component of SF then v‘ .generation-num if v’ is the data u.generation-nu (b) mber. For any configuration v.generation -number config( for E %‘, ber > replica SF, if v is the data = current-gn( ~ ) then component v.config of SF and = logical- B ). (2) The sequence f(~) is a schedule of W. PROOF. We proceed by induction on the number of pairs of the form CREATE(T) REQUEST_ COMMIT(T, v) in logical-sequence( /3 ). For the basis, suppose logical-sequence( ~ ) contains no such pairs. Then logical- config( /3 ) = co. Since all configuration replicas initially have generationnumber == O and config = co, the same is true in any global state led to by ~. Thus, current-gn( ~ ) = O. This implies that condition (l), and condition (2) hold because the two systems behave identically if no LAs are invoked. For the inductive step, let ~ = /3 ‘y, where logical-sequence(y) begins with the last lemma CREATE event holds for P’. in logical-sequence(~), Then logical-sequence(y) and assume that = CREATE(Tf) the REQUEST_ COMMIT(Tf, Vf ) for some T~ 6 la and Vf of the appropriate type. As before, we can restrict attention to T~ and its descendants. For any LA to issue a REQUEST_ COMMIT, it must first receive a REPORT_ COMMIT for a read-configuration-coordinator and a read-coordinator. Let T‘ be the read-coordinator in co, with the first REPORT_ COMACM TransactIons on Database Systems, Vol. 19, No 4, December 1994 Quorum Consensus MIT in ~, and let REPORT_ COMMIT in Nested TransactIon Systems . 579 y‘ be the portion of y up to and including the event. Also, let T“ be the read-configuration-coordinator given in corC with the first REPORT_ COMMIT in y, and let y“ be the portion up to and including this REPORT_ COMMIT event. Claim 5.2.2. (current-gn( lar proof contain COMMIT(T”, of the automaton its REQUEST_ of Claim 4.3.3, the highest v) occurs in y, then with T“ when u = B ‘)). s be the state issues to the s.config REQUEST_ B ‘), logiccd-config( Let PROOF. automaton If of y COMMIT associated event. we can show generation-number Using that an argument that simi- s.generation-number and associated and config among the configuration replicas whose names are in s.read. Suppose (for contradiction) that this highest generation-number is not equal to cument-gn( P ‘); then part (a) of the inductive hypothesis implies that there is some set ‘7/” iCZ~) ● s.config. with write such that all names in 7?7-have generation-number tion of a read-configuration-coordinator, i(%) ● s.config. and onto, i(~) read. Since and s.config i(7Y”) have configuration replicas > s.generation-number. s.read must contain is a configuration, a nonempty and intersection; for object By the definia set W with i is one-to-one therefore, some configuration replica for an object name in ~, and hence in s.read, lies in 77 and hence has its generation-number component strictly greater than s.generation-number. This is a contradiction to the claim that s.generation-number contains automata the highest generation number among the configuration whose names are in s. read. It follows that s.generation-nz~ current-gn( P ‘). Then it follows s.config = logical-config( B ‘). Now w-e consider three by part (b) of the inductive replica mber = hypothesis cases. It is easy to see that condition (1) (1) Tf G la,. descendants of Tf are write accesses to configuration is preserved, since no replicas. Claim 5.5.2 implies that the value returned by T“ in y has as its configuration nent the value logical-config( /3’). By the definition of logica l-config, the same value that ASO, the automaton that would be returned associated with by the configuration Tf in & behaves compothis is automaton in <Z. identically to that associated with Tf in B except that, in %’, it receives and ignores the generation-number component of the return value of T“, exactly as the correspondence f requires. Since ~ is a schedule of %, it follows that f( /3) is a schedule of &?, which shows condition (2). (2) Tf G law. (3) Tf G la,,,. information The argument We first associated with is similar give several to the one for the previous claims the descendants about the state case. of Tf and Claim 5.5.3. If s is a state of Tf in a global state reachable from B, s.generation-number = current-gn( ~‘) and s.config = logical-config( PROOF. of a read By Claim configuration 5.5.2, since y“ coordinator. ACM Transactions ends the it invokes. with the first REPORT_ D ‘Y” in P ‘). COMMIT ❑ on Database Systems, Vol 19, No 4, December 1994. 580 K. J, Goldman . Claim then 5.5.4. data(T) If T is a write-con = config(T~ old-con fig(T) Claim 5.5.5. tion-coordinator invoked by Tt in y (3’) + 1, and = current-gn( P‘ ). 5.5.3 and the definition ❑ of T~. If T is a write access in acc&,C invoked by a write-configurain y then data(T) = ( current-gn( @‘) + 1,config(T+ )). By Claim PROOF. figuration-coordinator ), generation-number(T) = logical-config( By Claim PROOF. and N, Lynch 5.5.4 and the definition of a write-configuration coordi- ❑ nator. By definition, T~ cannot configuration-coordinators request to commit has committed; until at least one of its write- let Tl,) be any write-configuration- coordinator child of T~ that has a COMMIT event in y. Moreover, at least one write access to a configuration replica must commit before a write-cont3guration-coordinator can request to commit. From Claim 5.5.5, we know that any write-configuration access T that is a descendant of T~ has data(T) = current-gn( ~ ) = cz~rrent-gn( ~‘) (current-gn( B‘) + 1,config(T~ )). Therefore, + 1. Furthermore, by the config( B ). Therefore, invoked by a write-con gn( ~), logical-config( To show that part three definition of logical-config, we can conclude that figuration-coordinator ~ )). (a) of the inductive cases for each configuration led to by ~. (By the preceding replica arguments, co~zfig( Tt ) = logLcal - all write accesses T in acct~,C in y have data(T ) = ( current- hypothesis ~ with these holds data for /3, we consider z), in any global state cases are exhaustive.) (1) v.generation-number < current-gn( ~ ‘). ~hen ~ is not updated by any descendant of Tt in ~. That is, v is the data in Y in any global state led to by ~‘. Therefore, that there configuration it follows immediately exists a set ~“ with replica S~, for an component of S’y, in any global > u.generation-number. During held in ST, is unchanged, greater (a) holds than u.generation-n from part (a) of the inductive hypothesis i(7#’) G z).con fig. zurite such that for any object name ~‘ E w“, if v‘ is the data state led to by ~‘, then u‘ .generation-nurnber y, either the genera tion-n urn ber in the data or else it is set to current-generation(9) umber) by a descendant (which of T~. In either is case, part for D. (2) u.generation-num ber = current-gn( ~ ‘). Then by part (b) of the inductive hypothesis, we know that u.confzg = Zogical-config( /3 ‘). By definition of a write-configuration-coordinator, just prior to the REQUEST_ COMMIT for ).zurite such that T,,, in y, there must be a set W“ with i(~”’) ● old-config(T,,, write accesses to all configuration replicas for object names in w have committed, so by our arguments above, all of these configuration replicas have generation-number = current-gn( ~ ) in any global state led to by ~. By Claim 5.5.4, old-config(TU) ) = logical-config( ~‘ ). Therefore, 7’ satisfies i(~’) E u.config.zorite, and all configuration replicas for object names in 77” have generation-number = currenf-gn( state led to by /3. Therefore, part ACM TransactIons on Database Systems, Vol. ~ ) > v.generation-number (a) holds. 19, No 4, December 1994 in any global Quorum Consensus (3) u.generation-number in Nested Transaction = current-grz( /3 ). Then Systems part (a) is trivial. So in all three cases, part (a) holds for ~. We now show part (b). By the definition of current-gn, the data state any component led to by of the /?’ then configuration state of a configuration u‘ .generation-nurn replica that has, ber in any we know replica < that if u‘ Therefore, /? r ). state led to by /3, a generation number equal to current-gn( ~ ) (and so larger than current-gn( must be written after ~‘. The writer of such a configuration replica must descendent coordinator which of Tf, and therefore must be a child of some write T,,, ~ invoked by Tt. Since all such TU,~ have data(TU by definition Condition ~), part The lemma above immediately THEOREM yields the transaction Let 5.5.6. ~ of y of @ such and object schedule the following (1) ylT = ~lT (2) ylY = (31Y for each object for each transaction name name THEOREM (1) y\T Let 5.5,7. y such of.& ~ that be a (2) YIY = D 1~ for each object finite T G y; 6. CONCURRENT REPLICATED .%. Let P’. Then there exists a hold; U COZ,,C) and – (co,, and 5.5.6 to prove a relationship the transaction names and object schedule of two conditions %“. Then there exists a hold: T ● y< – la and name name of F’ and of .%’, respectively. Y G :zj. the following = BIT for each transaction between names two conditions Now we can combine Theorems 5.3.11 between % and .<. Let ti~ and YA denote names of <w, respectively. schedule configuration) = config( TI ), (b) holds. a relationship names be a finite that B‘ )) be a ❑ (2) is straightforward. Y= and :Z~ denote schedule is logical-config( is SY in any global current-gn( global 581 . Y = 2A. SYSTEMS So far, this paper has dealt exclusively with serial systems. However, a useful nested transaction system must allow concurrency and the possibility of aborting a transaction after it has begun running. In order to simplify the programming effort, it is best for a system not to permit arbitrary concurrency, serial. but rather to make it seem (to the programs) as if the Because we have been discussing several different serial extend the which serial system, concept system we say that of serial is correctness considered a sequence to given define ~ of actions earlier to mention correctness: is serially system were systems, we if correct Y with explicitly is a serial respect to & for transaction name T, provided that there is some behavior y of= such that ~lT = ylT. Many concurrency control mechanisms are known that enable a concurrent a system to appear like a serial one. For example, in Fekete et al. [1990] generic system is defined as a system containing transaction automata (just as in the serial system), generic objects that accept concurrent accesses and ACM Transactions on Database Systems, Vol 19, No. 4, December 1994 582 . also receive K. J, Goldman and N Lynch information controller that algorithms about passes are the commit information given for or abort between constructing the of transactions, them. In generic that objects and paper from a several the serial objects of 5-, in ways that ensure that every execution of the generic system is serially correct with respect to S for each nonorphan transaction. We use J& and %’ to denote the systems defined in Section 5. A concurrent system may be constructed by applying the methods discussed in Fekete et al. [1990] to the system %’. This can be described as applying concurrency control to each copy separately. Thus, for example, we consider a transactionprocessing system ~ 1981] to give a generic show that nonorphan names all behaviors transaction of ~ are serially correct names, in particular, for in YZ — la. That the goal of building and that uses Moss’ read-update locking algorithm [Moss object for each object of %. The results of Fekete et al. replication is, 9 should the results of this paper. To be precise, we have COROLLARY with Let 6.1. PROOF. a serial replicated system That is that is, one wants correct with respect to .&, the case for g, a fact that follows the following general ~ be a sequence to %, for transaction respect like be transparent. behaviors that are serially system. This is in fact the respect looks a transaction-processing with respect all nonorphan name to %’, for all transaction system. both However, concurrency a system serial from to have unreplicated the above by result: of actions that T G Y< – la. Then is serially correct /3 is also serially with correct to .& for T. By Theorem ❑ 5.5.7. Corollary 6.1 implies that all behaviors of Q are serially correct with respect to w for all nonorphan transaction names in Y; — la. This says that system ~, which combines concurrency control techniques based on locking with the replication strategies of this paper, looks the same as the serial, unreplicated system .@, to the nonorphan transaction names in Y; — la (in particular, to TO). Corollary processing consensus rency 6.1 also system protocol control shows that if is constructed to manage the algorithms verified a concurrent, replicated transaction- by (1) using the reconfigurable quorum copies and (2) applying any of the concurin Fekete et al. [1990] to each copy individu- ally, then the concurrent replicated system appears serial and unreplicated. Similar conclusions can be drawn for a system using the multiversion timestamp algorithms of Reed [ 1983] and Herlihy [ 1987], as modeled in Aspnes et al. [1988]. Also, similar conclusions can be drawn when % is replaced by system @’ of Section 5, or by @ of Section 4. In general, any concurrency control algorithm that provides serial correctness at the level of the replicas may be combined with any of our replication algorithms to produce a correct system. Finally, we note that our techniques allow combination of algorithms for orphan management (as described in Herlihy et al. [1987]) with algorithms for concurrency control and for replication. For example, consider the concurACM TransactIons on Database Systems, Vol. 19, No 4, December 1994 Quorum Consensus rent system ~ described constructed from The of Herlihy results respect ~ to $3, for all all transaction correct with %, which respect techniques transaction above, and let one of the et al. [1987] transaction names combines just by adding m Nested Transaction % be another orphan imply names Systems that system management orphans), that is algorithms. % is serially (including 583 o correct with in particular, for in Y; – la. Then Corollary 6.1 implies that % is serially to .@ for all transaction names in 9A – la. Thus, system concurrency control techniques with the replication strategies of this names in 7A – la, and in particular, and orphan paper, looks to TO. management like&to all the 7. CONCLUSION We have presented a precise description and rigorous correctness proof for Gifford’s data replication algorithm in the context of nested transactions and transaction failures. The algorithm was decomposed into simple modules that were arranged naturally in a tree structure. This use of nesting as a modeling tool enabled us to use standard assertional techniques to prove properties of transactions Each based module upon use of nondeterminism. nondeterministic, the proof. That further the properties was described is, restricts the of their in terms Although an nondeterminism correctness actual adds proof of the proof accomplished simulated strategy lated hierarchically, an intermediate an obviously we presented rency control system correct a theorem algorithm made extensive for any not be to our implementation that choices. permitted of replication from those of concurrency exclusively with serial systems in order was that implementation would a degree of generality holds the nondeterministic The modularity children. of an automaton showing and unreplicated us to separate the concerns control and recovery. We could deal to simplify our reasoning. The proof that that the the system. fully replicated intermediate Then, stating that the combination with the replication algorithm system system to complete simu- the proof, of any correct concuryields a correct sys- tem. This work has identified a general framework for proving data replication algorithms in nested transaction constructing a formal description of the algorithm the correctness of systems. One begins by in terms of a nested transaction system built from 1/0 automata. Then, one uses the appropriate definitions to show that each logical read access returns the proper value. Next, one constructs a corresponding serial system without replication, and proves that the user transactions in that system have the same executions as the user the correctness automata in the of the replicated concurrency system. control analogous to Corollary 6.1 to show that It may be possible to use this general Finally, algorithm, one proves and applies separately a result the combined system is correct. technique to add transaction nesting to other, more complicated, data replication schemes, and to prove the resulting algorithms correct. Such algorithms include the “Virtual Partition” approach of Abbadi and Toueg [ 1989], and Herlihy’s “General Quorum Consensus” [1984]. An interesting question is whether the techniques presented ACM Transactions on Database Systems, Vol 19, No 4. December 1994. 584 . K. J. Goldman and N. Lynch here can be extended to accommodate these algorithms when transaction nesting is added. Several of these algorithms do not provide atomicity as we define it. Rather, order, between they allow the transactions Therefore, additional algorithms can be verified serialization that run definitions and using order in separate theory to differ from partitions will the of the be required true network. before these our techniques. ACKNOWLEDGMENTS We thank Alan Fekete, Michael Merritt, William Weihl, and the anonymous referees for their useful comments on this material and for their help in improving its presentation. REFERENCES ASPNES, J., based FEKETE, A Conference B~R&u?A, , LYNCH, EA~ER, on Very Large Data Bases VLDB S,yst Trans ARB.4DI, K. Database Systems A , L~NcH, nested transactions. Diego, CA, D HERLIHY, tion M. HERLIHY, MIT 1987. M. M., Sc,ence, tency L~-NrH, of replicated Sept. ) 27 S–305. Expanded of York, 137–151 Computer ACM 6th TransactIons 8–12 on MIT May. systems. ACM replicated In Trans. databases. York, protocol Svmposium for of on Pn nclples 215-229 Commutativlty-based Nested locking Expanded Science, for New available Cambridge, data, In time-stampmg Systems (San as Tech. Mass Proceedings York, and read/write of Database version MIT, replicated ACM, transactions on Prlnclples of Memo , April the 7th ACM 150-162. protocols 1986. types. , May, to exploit in type informa- voting Introduction to in International algorithms Syst. version Cambridge, the theory Systems, Mass., Vol Laboratory- for as Tech No. for maintaining Com Computer of the consis- 230-280, nested on Database proofs of Dzstrzbuted Rep, transactions. Theory 4, December for distributed Computation. MIT/LCS/TR-387, Apr 19, of orphan Theoret. (Rome, Italy, July. correctness on Prznczples avadable correctness on Fault-Tolerant 15 (June), Conference m MIT/LCS/TR-367 urn the Svmposwm MIT Database Hierarchical On Rep. MIT/LCS/TR-319, J. ACM. Dynamic Trans. Tech. 1987. IEEE MIT/LCS,ZTM-329, To appear ACM data Mass W, of the 17th 1990 1987. Database Recocery fault-tolerant ACM 1990 for abstract Also, Syrnposl Expanded Science, efficient 1987. M , ANII WEIHL, version ACM and 65–156. Cambridge, Proceedings Also LYNCH, N., AND TUTTLE, M, Proceedings methods databases M. for Science, D. Set. 62, 123-185. New W. Prznczples. MERRITT, Mass., N , AND MERRITT, Conlput. sys- N.J Control in partitioned 4th W. 97– 111 multlversion Rephcat]on Cambridge, distributed Princeton, database An Synlposlum for Computer for Computer S., AND MUTCHLER, International C“-36, 4 (Apr.). elimination algorithms In ,Dutzng IEEE, New York, JAJODIA, ACM, Scl. (Aug.) York, System N,, distributed of the M., AND WEIHL, Extending LYNCH, of tlmestamp- 14th m partitioned Umv., Concurrency 1985 AND WEIHL, Syst voting Comput 1984. Laboratory HERLIHY, Welghtcd Trans F. Mar.). M., New on Operating’ IEEE excluslon avadabihty proceedings Oregon, Laboratory 1979. Symposusm In of the 6th ACM ACM. MIT/LCS/TM-324, GJWWRD, theory 264-290. AND CRISTMN, N,, MERRITT, Mm.). A of the 43-444. 1987 in Mamtammg Comput. Proceedings 1988. Mass. Robustness MERRITT, J W. Proceedings Princeton N. Reading, 14, 2 (June) (Portland, A., L~mcH, In D., N,, Mutual Science, GOODMAN, 1989. management FEKIZTE, In 354-381. Syst. SKREN, data Iockmg. S, Database A., AiiD 1983 8, 3 (Sept.) replicated FEKETE, V., WEIHL, Endowment, 1985, Computer Addison-Wesley, EL ARADDI, A , TOUEG, ACM Dept. HAIMILAC(OS, AND SEvcIIi, D., LINA, H. Rep. TR-346, Systems Database AND transactions. P., Dataha.w M., nested Tech. BERNSTEIN, , MERRITT, for D , ANI) GARCIA-M• tems. EL N control concurrency 1994. algorithms, In ACM, New Laboratory for Quorum Consensus Moss, ,J. E. Ph.D. Science, REED, THOMAS, 1981. MIT, MIT, D. Syst. R. databases. transactions: Also, Mass. pubhshed Implementing 1, 1979. ACM August Nested Cambridge, Apr. 1983. Comput. Received B. thesis, 1 (Feb.) 1992; Tech. by MIT atomic An approach to reliable Rep. MIT/LCS/TR-260, Press, actions Mar. on Systems distributed Laboratory . 585 computing. for Computer 1985. decentralized data. 1983. ACM !i!’rans. 3-23. A majority Trans. in Nested Transaction consensus Database revised Syst. March ACM approach 4, 2 1993 Transactions and to concurrency control for multiple COPY (June) 180-209. September on Database 1993; Systems, accepted Vol. February 19, No 1994 4, December 1994