Quorum consensus in nested-transaction systems

Transcription

Quorum consensus in nested-transaction systems
Quorum Consensus
systems
KENNETH
J. GOLDMAN
Washington
University
in Nested Transaction
and
NANCY
LYNICH
Massachusetts
Institute
of Technology
Giffords Quorum Consensus algorithm for data replication is studied in the context of nested
transactions and transaction failures (aborts), and a fully developed reconfiguration strategy is
presented. A formal description of the algorithm is presented using the Input/Output
automaton
model for nested transaction systems due to Lynch and Merritt. In this description, the
algorithm itself is described in terms of nested transactions. The formal description M used to
construct a complete proof of correctness that uses standard assertional techniques, is based on a
natural correctness condition, and takes advantage of modularity that arises from describing the
algorithm as nested transactions. The proof is accomplished hierarchically, showing that a fully
replicated reconfigurable system “simulates” an intermediate replicated system, and that the
intermediate system simulates an unreplicated system. The presentation and proof treat issues
of data replication entirely separately from issues of concurrency control and recovery.
Categories and Subject Descriptors: H.2.4 [Database
distributed
Management]:
Systems—concurrency;
systems
General Terms: Algorithms, Theory, Verification
Additional Key Words and Phrases: Concurrency control, data replication, hierarchical proofs,
1/0 automata, nested transactions, quorum consensus
1. INTRODUCTION
In distributed
database
systems,
data are often replicated
in order to improve availability,
reliability,
and performance.
Whenever
replication
is used,
a replication
algorithm
is required
to
ensure
that
replication
is transparent
K. J. Goldman was supported in part by the National Science Foundation under Grant CCR-91of the 6th ACM Symposium
10029. A preliminary version of this paper appeared in Proceedings
on Principles
of D~stributed
Computing.
N, Lynch was supported in part by the National Science
Foundation under Grant CCR-86-11442, in part by the Defense Advanced Research Projects
Agency (DARPA) under Contract NOOO14-89-J-1988,and in part by tbe Office of Naval Research
under Contract NOO014-85-0168.
Authors’ addresses: K. J. Goldman, Department of Computer Science, Washington University, St.
Louis, MO 63130; N. Lynch, Laboratory for Computer Science, MIT, Cambridge, MA 02139.
Permission to copy without fee all or part of this material is granted provided that the copies are
not made or distributed for direct commercial advantage, the ACM copyright notice and the title
of the publication and its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or
specific permission.
01994 ACM 0362-5915/94/1200-0537
$03.50
ACM
Transactions
on Database
Systems,
Vol.
19, No.
4, December
1994,
Pages
537-585
538
K. J. Goldman
.
and N. Lynch
to the user programs.
In understanding
replication
algorithms,
it is convenient to think
of each logical
object as being implemented
by a collection
of
replicas
and logical
access programs.
The replicas
retain
state information,
and
write
write
the
user
programs
the logical
operations
invoke
logical
access
programs
in
order
object. The logical
access programs
complete
by accessing some subset of the replicas.
to read
these
read
or
or
One of the best-known
replication
algorithms
is the quorum
consensus
algorithm
of Gifford
[1979].
Gifford’s
work
does not include
transaction
nesting,
but permits
the user to access the logical objects using single-level
transactions.
ideas of this
Based on the majority-voting
algorithm
of Thomas
[1979], the
method
underlie
many
of the more recent
and sophisticated
replication
techniques
Abbadi
et
replication
al. [ 1985],
algorithm
replicas
to ensure
write
operation.
approach
(where
replica suffices),
consensus
assigned
(e.g.,
Eager
and
Sevcik
[ 1983],
and El Abbadi
and Toueg
is to ensure
that
each read
that
it will
return
the
value
Herlihy
[1984],
El
[ 1989]).
The goal of a
operation
sees enough
written
by the
“preceding”
This
can be guaranteed
by a simple
read-one/write-all
every replica is updated
on each write, and reading
a single
or by a read-majority
/write-majority
approach.
The quorum
algorithm
a number
generalizes
of
votes,
these
and
associated
version number.
Each
ration
consisting
of two integers,
approaches
retains
logical
called
in its
as follows.
state
a data
Each
replica
is
value
with
an
object X has an associated
read-quorum
and write-quorum.
configuIf
k
is the total number
of votes assigned to replicas
of X, then the configuration
for X is constrained
so that read-quorum
+ write-quorum
> k. To read X, a
logical access program
collects the version
numbers
from enough replicas
so
votes;
then it returns
the value associated
that it has at least read-quorum
with
the highest
version
number.
To write
X, a logical
access program
collects
the version
numbers
from enough
replicas
so that it
read-quorum
votes; then it writes its value with a higher version
collection
of replicas
with at least write-quorum
votes.
In our discussion
use a configuration
described
above. In
[1985],
a configuration
of quorum
consensus
algorithms
for data
has at least
number
to a
replication,
we
strategy
that
is slightly
more general
than
the one
this strategy,
justified
by Barbara
and Garcia-Molina
consists of a set of read quorums
and a set of write
quorums.
Each quorum
is a set of replica
names; thus, the voting
strategy
above corresponds
to the special
case where
any set of replicas
is a read
quorum
if and only if the total
votes among
its members
exceeds the
threshold
read-quorum.
To read a logical
object, a logical
access program
accesses all the replicas
in some read quorum
and chooses the value with the
highest
version
number.
To write
a logical
object, a logical
access program
first discovers the highest
version number
written
so far by accessing all the
replicas in some read quorum;
then the logical access increments
number
by one and writes
the new value
and version
number
replicas
in some write
quorum.
Every read quorum
must have
intersection
with
every write
quorum.
This ensures
that each
logical
object reads at least one of the replicas
written
by the
logical write.
ACM
TransactIons
on
Database
Systems,
Vol.
19,
No,
4, December
1994
that version
to all the
a nonempty
read of the
most recent
Quorum Consensus
in Nested Transaction
In this paper,
Gifford’s
quorum
consensus
algorithm
proven correct in the context of nested transaction
systems
to
invoke
accesses
Transaction
programs,
rithms
to
nesting
but
also
themselves
This
is because
tions
of the user
access may
logical
turns
for
to
describing
(even
the
objects
out
access
transactions,
be written
arbitrary
useful
and
if the user
logical
using
be
not
transactions
and the
is described
that permit
and
users
for
the
of the
algo-
are not nested).
as subtransac-
performed
logical
user
replication
be written
tasks
transactions.
describing
themselves
may
different
as subtransactions
539
nested
only
understanding
programs
.
Systems
by a logical
access transactions.
We show how Gifford’s
algorithm
can be structured
in this way. Once this is
done, it is very natural
to allow nesting
of user transactions
as well.
The quorum
consensus
algorithm
is well suited to systems in which transaction failures
(aborts) may occur, and is therefore
a natural
choice for nested
transaction
systems in which one of the primary
goals is to mask failures.
For
example,
if the replica
accesses performed
by a logical
access program
are
invoked
as subtransactions
of that program,
then an operation
to access a
logical
data
item
can complete
even if some of its associated
abort. Our presentation
includes
a fully
the read- and write-quorums
dynamically.
figuration,
Mutchner
failures.
was outlined
by Gifford,
[1990].
Reconfiguration
In this paper,
data replication
of distributed
paper.
The
reconfiguration.
ting
where
uses
Two
a fixed
Both
algorithms
transaction
failures
reconfiguration
accesses
and has also been studied by Jajodia
is useful
for coping
with
site or
and
link
we describe and prove the correctness
of two algorithms
for
using a general
framework
that is suitable
for a wide range
algorithms.
first
replica
developed
mechanism
for changing
This mechanism,
known as recon-
algorithm
is that
principal
algorithms
configuration,
are presented
are
are
whereas
possible.
the configuration
described
the
in a nested
An
second
in
this
supports
transaction
interesting
aspect
information
itself
cated and managed
by the data replication
algorithm.
Separating
the concerns
of data replication
from concurrency
recovery
has been an important
goal in the formal
treatment
setof the
is repli-
control
and
of replicated
data management
algorithms
(e.g., El Abbadi
and Toueg [1989]). Our presentation
achieves this separation
in the nested transaction
setting.
We present
the replication
issues solely in the context
of systems without
concurrency.
We prove that a system that includes
the new replication
algorithm
and that
is serial at the level of the individual
data copies looks the same, to the user
transactions,
as a system that is serial at the level of the logical objects. The
fact that
both
simplify
the reasoning.
systems
involved
in this
simulation
are serial
systems
helps
to
Of course, systems that are truly serial at the level of the data copies are of
little
practical
interest.
However,
previous
work on nested transaction
concurrency
control
and recovery
algorithms
[Moss 198 1; Reed 1983; Lynch and
Merritt
1986; Fekete et al. 1987] has produced
several interesting
algorithms
that guarantee
that a system appears to be serial, as far as the transactions
can tell. Combining
any of these algorithms
(at the replica
level) with the
replication
algorithm
we present
yields a combined
algorithm
that appears
ACM
TransactIons
on Database
Systems,
Vol.
19, No.
4, December
1994.
540
K. J, Goldman
.
unreplicated
and
transactions
can tell.
that
the
serial
and N, Lynch
(at
the
In fact,
replication
logical
our
algorithm
guarantees
“serializability”
serializable e at the logical
case of nested transaction
rency control theory:
data
results
can
item
provide
be combined
at the copy level,
data item level. Thus,
systems, an important
level),
as far
a careful
with
as the
proof
any
of the
algorithm
user
fact
that
to yield
a system
our work formalizes,
claim from classical
that
is
for the
concur-
quorum consensus works with any correct concurrency control algorlthm.
As
long as the algorithm
produces serializable
executions, quorum consensus will
ensure that the effect is just like an execution on a single copy database
[Bernstein et al. 1987].
We are able to accomplish
this
clean
separation
in the case of nested
transactions
because our correctness
conditions
are stated from the point of
view of the transactions,
instead
of from the point of view of the database
interacting
with a particular
scheduler.
Furthermore,
the classical
serializability
theory correctness
conditions
are stated in terms of executions
of the
same
system,
separate
‘I’here
proof
tation
this
while
our
specification
have been
of replicated
and proof
work
correctness
conditions
data
given
is based
algorithms.
Most
by Bernstein
on classical
presented
et al. [ 1987]
6, we show
rigorous
among
et al. [ 1990].
the
correctness
follows
directly
from these
suggests possible directions
terms
presentation
these
of Gifford’s
theory.
in
and
is the presen-
basic
Their
of a
algorithm;
approach,
how-
easily to the case where
nesting
and
[1984]
extends
Gifford’s
algorithm
to
and offers
a correctness
proof.
Nested
In Section
In Section 4 we describe the
by the reconfigurable
algorithm
that
defined
of the paper
is organized
as follows.
In
system
model
we use for nested
transaction
by Fekete
configurations.
tion, followed
notable
serializability
ever, does not appear
to generalize
failures
are allowed.
Also, Herlihy
accommodate
abstract
data types
transactions
are not considered.
The remainder
summarize
the
are
of the allowable
executions.
some previous
attempts
at
3 we give
a formal
definition
algorithm
for a fixed
in Section 5. Finally,
of interesting
results.
Section
for further
work.
Section
2 we
systems,
as
of
configurain Section
concurrent
replicated
7 summarizes
our
systems
results
and
2. BACKGROUND
We adopt the model of nested transaction
systems constructed
by Lynch and
Merritt
[ 1986] with the 1/0 automaton
model of Lynch and Tuttle
[ 1987] as a
foundation.
We begin with a review
of the 1/0
automaton
model and then
describe
its use in modeling
nested
may be found in Fekete et al. [ 1990].
transaction
systems.
Complete
details
We note that whereas classical serializability
theory is designed specifically
to deal with database
concurrency
and replica
control,
the 1/0
automaton
model is well suited for studying
a wide range of distributed
algorithms.
The
model provides
a foundation
for precise problem
specifications,
detailed
algoACM
TransactIons
on
Database
Systems,
Vol
19,
No
4, December
1994
Quorum Consensus
rithm
descriptions,
results.
and
2,1 The Input / Output
An
1/0
initial
automaton
states.
careful
Automaton
in Nested TransactIon
correctness
proofs,
a state
as well
541
.
as impossibility
Model
A has a set of states,
Usually
Systems
is
given
some
as an
of which
are designated
assignment
of values
as
to
a
collection
of named typed variables.
The automaton
has actions, divided
into
input actions, output actions, and internal
actions. We refer to both input and
output
actions as external
actions.
An automaton
has a transition
relation,
which is a set of triples
of the form (s’,
is an action. This triple means that in
perform
action w and change to state
is called a step of the automaton.
The input actions model the actions
T, s), where s‘ and s are states, and rr
state s‘, the automaton
can atomically
s. An element
of the transition
relation
that
are triggered
by the environment
of the automaton,
the output
actions model the actions that are triggered
the automaton
itself and are potentially
observable
by the environment,
the internal
actions
the environment.
Given
a state
a state
model
changes
s r and an action
s for which
of state
that
T, we say that
are not directly
T is enabled
(s’, w, s) is a step. We require
that
by
and
detected
by
in s‘ if there
each input
action
is
m be
enabled
in each state s‘, i.e., that an 1/0
automaton
must be prepared
to
receive any input action at any time.
Rather than listing
triples,
we describe transition
relations
by associating
a
precondition
and effect with each action. For a given automaton
in state s‘, if
the precondition
for action m is true in s‘, then the automaton
may perform
m
and go to state s, as defined
by the effect. If T is enabled
in all states, the
precondition
is omitted.
An execution
of A is an alternating
sequence
so nl Slrz “”” w.s. of states
and
actions
of A such
that
occurs as a consecutive
execution
is the sequence
so is a start
state
and
each triple
(s’, v, s ) that
subsequence
is a step of A. The schedule
of actions that occur in that execution.
The
of an
behav-
ior of an execution
of A is the sequence
of external
actions
of A in the
execution.
A behavior
represents
the information
that the environment
can
detect about the execution.
Since the same action may occur several times in
an execution,
schedule,
or behavior,
we refer to a single occurrence
of an
action
as an euent.
We say that
execution
with
a schedule (behavior)
schedule (behavior)
@‘ of A is a prefix
reachable
from
~‘
of schedule
in
P iff
P can leave A in state s if there is some
a and final state s. If schedule (behavior)
(behavior)
s is the final
B of A,
state
we say that
in an execution
state
s is
a of A with
schedule (behavior)
~ ‘y, where ~ ‘y is a prefix of /3.
We describe systems as consisting
of interacting
components,
each of which
is an 1/0
automaton.
It is also convenient
and natural
to view systems
as
1/0 automata.
Thus, we define a composition
operation
for 1/0 automata,
to
yield
a new 1/0
automaton.
A collection
of 1/0
automata
is said to be
strongly
compatible
if (1) every internal
action of every automaton
is not an
action
of any
other
automaton
ACM
in the
TransactIons
collection,
on Database’
.
(2) every
Systems,
Vol.
output
19, No.
action
4, December
of
1994.
542
K. J. Goldman
.
and N. Lynch
each automaton
is not an output
action
shared
by infinitely
many
automata
in
collection
system
of strongly
9.
A state
compatible
of the
automata
composed
an internal
action
that
of any
and (3) no action is
A (possibly
infinite)
be composed
is a tuple
to
create
of states,
a
one for
the start states are tuples
consisting
action of the composed
automaton
is
automata.
It is an output
of the system
is an internal
action of the system if it
component.
has an action
may
automaton
which
component
automaton,
and
start states of the components.
An
action of a subset of the component
it is an output
of any component.
It
components
of any other,
the collection.
During
n- carries
an action
of
an
if
is
n- of &’, each of the
out the action,
while
the remainder
stay in the same state.
If @ is a sequence
of actions
of a system
with
component
A, then we denote by /3 IA the subsequence
of fl containing
all the
actions
finite
of A. Clearly,
behavior
if
@ is a finite
of A. There
behavior
is an important
of the
converse
compatible
~ be a strongly
of the collection.
Suppose
LEMMA
2.1.1.
Let {A,},.
and let A be the composition
external
actions
of A, such that
Then ~ is a finite
behavior
of A.
/3 IA,
is a finite
system,
then
~ IA is a
result.
collection
of automata,
P is a finite sequence of
behavior
of Al
for
every
i.
Let A and
to implement
1? be automata
with the same external
actions. Then A is said
B if every finite
behavior
of A is a finite
behavior
of B. One
way
in which
this
that
an automaton
notion
satisfy some specified
B is also correct.
2.2
Serial
Serial
Systems
systems
can be used
A is “correct,”
property.
is the following:
in the
Then
sense
if another
that
Suppose
its
automaton
finite
we can show
behaviors
B implements
all
A,
and Correctness
[Fekete
et al. 1990]
a transaction-processing
system.
object
are used to characterize
Serial
automata
systems
consist
communicating
with
the correctness
of transaction
a serial
of
au-
tomata
and
automaton.
serial
scheduler
Transaction
Serial object
automata
represent
code written
by application
programmers.
automata
serve as specifications
for permissible
behavior
of data
objects in the absence of concurrency.
They describe the responses the objects
should make to arbitrary
sequences of operation
invocations,
assuming
that
later invocations
wait for responses to previous
invocations.
The serial scheduler handles
the communication
among the transactions
and serial objects,
and thereby
controls
the order in which the transactions
can take steps. It
ensures that no two sibling
transactions
are active concurrently
and decides
whether
each transaction
commits
or aborts. The serial scheduler
a transaction
to abort only if its parent
has requested
its creation,
not actually
transactions
been created.
are run serially,
Thus,
in a serial
system,
all sets of sibling
and in such a way that no aborted
transaction
ever performs
any steps.
A serial system allows no concurrency
only a very limited
ability
to cope with
ACM
TransactIons
on
Database
Systems,
can permit
but it has
Vol.
19,
No,
among sibling transactions,
and has
transaction
failures.
However,
serial
4, December
1994
Quorum Consensus
systems
provide
interesting
useful
systems,
of replication
specifications
from
the pattern
those
leaf,
ancestor,
descendant.)
The
the
The
accesses
and
set of return
a transaction
ualues
name
we say that
the pair
only
The
a system
type, by a set Y–
(A transaction
is its own ancestor
each
tree
element
Additionally,
are called
of the
the
and
accesses.
partition
system
contains
type
(T, v) is an operation
specifies
a
of X.
transactions
will
actually
take
steps. We imagine
structure
is known
in advance by all components
in general,
be infinite
and have infinite
branching.
classical
with TO
such as
can be thought
of as a predefine
naming
scheme for all
that might ever be invoked.
In any particular
execution,
some of these
that the tree
The tree will,
object.
more
the concerns
for transactions
(henceforth
simply called ualues). If T is
that is an access to the object name X and u is a value,
The tree structure
possible transactions
however,
of other,
control.
nesting,
transaction-naming
so that
to a particular
behavior
543
.
into a tree by the mapping
parent,
tree, we use traditional
terminology,
of the
are partitioned
accesses
Systems
for separating
of concurrency
descendant.
leaves
correct
models
of transaction
of transaction
names, organized
as the root. In referring
to this
child,
for
and as intermediate
algorithms
We represent
in Nested TransactIon
transactions
of concurrency
control
theory
of a system.
(without
nesting)
appear in this model as the children
of a “mythical”
transaction
TO, the root
of the transaction
tree. Transaction
To models the environment
in which the
rest of the transaction
system runs. It has actions that describe the invocation and return
of the classical
data, and so are distinguished
below the root. The
function
is to create
directly.
A
serial
transaction
serial
are described
system
addition
Input:
of a given
for
automaton
each
for each
object
model
transactions
whose
but not to access data
type
node
name,
and
is the
of the
composition
transaction
a serial
of a
tree,
scheduler.
A nonaccess
transaction
T is modeled
a
These
as a transaction
AT,
an 1/0
automaton
with
the following
external
to this, AT may have arbitrary
internal
actions.)
actions.
(In
CREATE(T)
REPORT_
COMMIT(7”,
value u‘ for T‘
REPORT _ABORT(T’),
Output:
system
nonaccess
access the
at any level
below.
2.2.1 Transactions.
automaton
internal
nodes of the tree
and manage
subtransactions,
automaton
object
transactions.
Only leaf transactions
as “accesses.” Accesses may exist
REQUEST_
REQUEST_
u ‘), for every child T‘ of T, and every return
for every child T‘ of T
CREATE(T ‘), for every child T‘ of T
COMMIT(T,
v), for every return value u for T.
The CREATE
input
action
(an output
of the scheduler
automaton
we
describe later) “wakes up” the transaction.
The REQUEST_
CREATE
output
action
is a request
by T for the scheduler
to create
a particular
child
transaction.
The REPORT_
COMMIT
input action reports to T the successful
completion
of one of its children,
and returns
a value recording
the results
of
that child’s execution.
The REPORT _ABORT
input
action reports
to T the
ACM
TransactIons
on Database
Systems,
Vol.
19, No.
4, December
1994.
544
K. J. Goldman
.
unsuccessful
information.
and
N, Lynch
completion
of one of its children,
The REQUEST_
COMMIT
action
it has finished
its work,
and includes
without
returning
is an announcement
a return
any other
by T that
value.
We leave the executions
of particular
transaction
automata
largely
unconstrained;
the choice of which children
to create and what value to return
will
depend on the particular
implementation.
For the purposes
of the systems
studied
here, the transactions
are “black
boxes. ” Nevertheless,
it is convenient
to require
that all transaction
automata
preservel
certain
syntactic
as follows.
A sequence
~ of
conditions,
called
transaction
well- forrnedness,
external
action
the following
of automaton
conditions
(1) The first
CREATE
event
well-formed
for
T, provided
in ~, if any, is a CREATE(T)
event,
and there
are no other
events.
(2) There is at most
T’ of T.
report
(3) Any
CREATE(T’)
(4) There
A~’I is transaction
hold:
one REQUEST_
event
in ~.
is at most
for
child
one report
CREATE(T’
of
T‘
event
in
T
) event
is
in
preceded
B for each child
f? for each child
by
T‘
REQUEST_
of T.
COMMIT
event for T occurs in ~, then it is preceded by
(5) If a REQUEST_
a report
event
for
each
child
T‘
of
T
for
which
there
is a
REQUEST_
CREATE(T’)
(6) If a REQUEST_
event in ~.
in ~.
COMMIT
event
for
T occurs
A transaction
well-formed
sequence
is always
starts with CREATE(T),
ends with REQUEST_
tween
has some interleaving
of a collection
REQUEST_
CREATE(
T ‘) REPORT_
COMMIT(
in
~, then
it is the
last
a prefix
of a sequence
that
COMMIT(T,
u), and in beof two-element
sequences
T‘, u‘ ), for various
children
T‘
of T.
2.2.2 Serial Objects.
Recall that transaction
automata
are associated
with
nonaccess
transactions
only, and that
access transactions
model
abstract
operations
on shared data objects. We associate a single 1/0 automaton
with
each object name. The external
actions for each object are just the CREATE
and REQUEST_
COMMIT
actions
for all the corresponding
access transactions. Although
we give these actions the same kinds of names as the actions
of nonaccesa
transactions,
it is helpful
to think
of the actions
of access
transactions
in other terms also: a CREATE
corresponds
to an invocation
of
an operation
on the object, while
a REQUEST_
COMMIT
corresponds
to a
response by the object to an invocation.
Thus, we model the serial specification
of an object
X (describing
lThe concept of
preserumg
a property
means
automaton
does
ACM
that
the
TransactIons
on
Database
not
its activity
of behaviors
violate
Systems,
Vol
the
19,
in the absence
is formally
property
No.
unless
4, December
defined
Its
of concurrency
in
Fekete
environment
1994.
et
has
al.
and
[ 1990]
done
and
so first.
failures)
(In
by a serial
addition
to this,
Quorum Consensus
in Nested Transaction
object
Sx with
aqtomaton
S’x may
Input:
CREATE(T),
Output:
REQUEST_
have
arbitrary
Systems
the following
internal
.
external
545
actions.
actions.)
for every access T to X
COMMIT(
l’, u), for every access
to X and every return
T
value v for T.
As with
transactions,
is convenient
tic conditions.
while
serial-object
well-formed
where
T* + Tj when
preserve
serial-object
Read/Write
has
an
COMMIT(T1
it
that
every
of the form
) REQUEST_
serial
object
COM-
automaton
access to X),
and
and
function
T where
kind(T)
data
(an
element
actiue
of D).
= do. The transition
CREATE(T)
REQUEST_
s.actiue
: accesses(X)
= “write”
of two components:
so ,data
s.actiue
kind
+ {“read”,
“write”},
function
data : T G czccesses(X) : kind(T)
= “write”+
return
values for each access T where kind(T)
= “read”
an access
Effect:
of a sequence
, u ~) CREATE(T2
i + j. We require
well-formedness.
associated
of S’x consists
= “nil”,
unconstrained,
Objects.
One especially
important
class of serial-object
reaci/write
serial objects. Each read/write
serial object Sx
domain
of values, D, with a distinguished
initial
value do.
and an associated
The set of possible
state
are. left largely
Serial
automata
are the
has an associated
D, while
objects
for X if it is a prefix
CREATE(TI)
REQUEST_
MIT(T2, u2)...
SAY also
specific
to require
tb at behaviors
of serial objects satisfy certain
syntacLet a be a sequence of external
actions of S’x. We say that a is
for
= T
has return
(either
The
start
relation
state
“OK”.
The
or the name
of an
so has
so .actiue
is as follows:
COMMIT(T,
kind(T)
value
“nil”,
D.
is
u),
= “read”
Precondition:
= T
s’. actiue = T
REQUEST_ COMMIT(T,
= “write”
for kind(T)
Precondition:
u),
s’. data = u
Effect:
s.active = “nil”
s’. actil~e = T
~ =~~o~~~
Effect:
s.actiue
s.data
2.2.3
Serial
=“nil”
= data(T)
Scheduler.
The third
kind
of component
in a serial
system
is
the serial scheduler.
The transactions
and serial objects have been specified
to be any 1/0
automata
whose actions and behavior
satisfy
simple restrictions. The serial scheduler,
however,
ia a fully ~pecified automaton,
particular
to each system type. It runs transactions
according
to a depth-first
traversal
of the transaction
tree. The serial scheduler
can choose nondeterministically
ACM
TransactIons
on Database
Systems,
Vol.
19, No.
4, December
1994.
546
.
K. J Goldman
to abort
any transaction
the transaction
Lynch
whose
parent
has not actually
is requested
must
overlapping
and N
be either
its execution,
has requested
been created.
aborted
before
or run
REQUEST_
REQUEST_
Output:
The
CREATE(T)
COMMIT(T),
ABORT(T),
and
COMMIT(T,
ABORT(T),
mark
no siblings
of a transition
can
or abort
has occurred.
inputs
are intended
T # TO
u)
u ),
T # TO
T + TO
CREATE
and
REQUEST_
COMMIT
with
the corresponding
outputs
of transaction
and correspondingly
for the CREATE,
REPORT_
REPORT_
as
creation
T # To
REPORT_
REQUEST_
with
The result
after the commit
are as follows:
as long
of T whose
T # To
REPORT_
to be identified
object automata,
actions
CREATE(T),
COMMIT(T,
its creation,
child
to commitment
T can commit.
be reported
to its parent
at any time
The actions of the serial scheduler
Input:
Each
ABORT
output
the point
in time
actions.
where
The
COMMIT
the decision
and
and serialCOMMIT,
ABORT
on the fate
output
of the transac-
tion is irrevocable.
Thus, for each transaction
T‘ involved
in an execution,
we
have a REQUEST_
CREATE
from
the parent,
followed
by either
(1) an
ABORT
and REPORT_
scheduler,
REPORT_
Each
ABORT
a REQUEST_
COMMIT
state
from
COMMIT
from
the scheduler
from
or (2) a CREATE
the transaction,
from
and a COMMIT
the
and
the scheduler.
s of the serial
scheduler
consists
of six sets, denoted
via record
notation:
s.create _requested,
s.created,
s.commit _requested,
s.committed,
s.aborted,
and s.reported.
The set s.commit _requested
is a set of operations.
The others are sets of transactions.
There is exactly one start state, in which
the set create _requested
is {TO}, and the other sets are empty. The notation
s.completed
denotes
s.committed
u s.aborted.
Thus,
s.completed
is not an
actual
variable
in the state, but rather
a deriued
variable
determined
as a function
of the actual state variables.
REQUEST_ CREATE(T)
Effect: s.create _ requested
=
whose
value
ABORT(T)
Precondition:
s’.create_requested
T E s ‘.create_requested
u {T}
—
s’.completed
T @ s’.created
siblings(T) n s’.created
s’.completed
Effect:
REQUEST_ COMMIT(T,
Effect: s.commit_requested
s. aborted
u)
= s’.aborted
= s’.commit_requested
U {(T,
u)}
u {T}
REPORT _COMMIT(
T,
Precondition:
T E s ‘committed
(T,
U)
E
s’. commit
ACM
TransactIons
on
Database
Systems,
Vol.
19,
No
4, December
1994
_requested
U)
c
is
Quorum Consensus
in Nested Transaction
CREATE(T)
Precondition:
.
Systems
547
T @ s‘ reported
Effect:
T G s ‘.create_requested
— s ‘created
s.reported
T E s ‘aborted
siblings(T)
Effect:
= s‘. reported
n s‘ created
c s ‘completed
REPORT_
V (T}
ABORT(T)
Precondition:
s.created = s ‘created
{T}
u
T = s‘
aborted
T @ s‘. reported
COMMIT(T)
Precondition:
Effect
(T, u) E s ‘commit
_requested
s. reported
= s’. reported
for some v
[T}
U
T @ s ‘completed
Effect:
s.comrnitted
2.2.4
sition
Serial
= s ‘committed
Systems
of a strongly
and
u {T)
Serial
compatible
Behaviors.
A serial
set of automata
system
consisting
is the compo-
of a transaction
automaton
AT for each nonaccess
transaction
name T, a serial-object
tomaton
S’x for each object name X, and the serial-scheduler
automaton
the given system type.
The discussion
in the remainder
of this
fixed, system type and serial system, with
automata,
and
behauiors
S’x
for the
external
actions
actions
are called
as the
system’s
of
the
define
transaction(n)
automata.
system.
and
paper assumes an arbitrary,
but
AT as the nonaccess transaction
use
The
COMMIT(T)
m is one of the serial
term
serial
actions
and
u), where
actions
to the
ABORT(T)
CREATE(T),
_COMMIT(T’,
u’),
u ), where T‘ is a child
to be T. If w is a serial
to be X. If ~ is a sequence
the
serial
for T.
REPORT
COMMIT(T,
_COMMIT(T,
We
We give the name
actions
name
_CREATE(T’),
‘), or REQUEST_
or REQUEST
object(w)
serial
completion
If T is a transaction
REQUEST
ABORT(T
serial-object
behaviors.
aufor
action
of the form
T is an access
of actions,
REPORT_
of T, then we
to X,
CREATE(T)
then
T a transaction
we define
name,
and
X
an object name, we define ~ IT to be the subsequence
of ~ consisting
of those
serial actions w such that transaction(w)
= T, and we define B 1X to be the
subsequence
of ~ consisting
of those serial actions n such that object(m)
= X.
We define serial(~)
If @ is a sequence
to be the
of actions
subsequence
of ~ consisting
of serial actions.
and T is a transaction
name, we say T is an
orphan
in ~ if there is an ABORT(U)
action in @ for some ancestor
U of T.
We say the T is liue in ~ if F contains
a CREATE(T)
event but does not
contain
a completion
event for T.
LEMMA 2.2.4.1.
Let
~ be a serial
behavior.
(1) If T is live in ~ and T’ is an ancestor
(2)
If T and T‘ are transaction
then either T is an ancestor
2.2.5 Serial
ness condition
Correctness.
that
ACM
T’
is live in fl.
names such that both T and T‘ are live in /3,
of T‘ or T‘ is an ancestor of T.
The
we expect
of T, then
serial
other,
TransactIons
system
more
is used
efficient
on Database
to specify
systems
Systems,
Vol
the
to satisfy.
19, No.
correctWe say
4, December
1994.
548
K. J. Goldman
.
that
a sequence
provided
interested
~
and N. Lynch
of’ actions
data objects,
interacts
with
believe
corresponds
ought
serially
correct
for
transaction
name
serial
to behave.
all finite
behaviors
of these
root transaction
representing
correctness
precisely
to
Serial
the
to be a natural
intuition
correctness
of how
for
T
that
~ IT = y IT. We are
representing
replicated
where
each replica
uses a concurrency
control
method
a controller
that passes information
between
transactions
objects. We will show that
correct for TO, the mythical
We
is
that there is some serial behavior
y such
in studying
particular
systems of automata
and
and
systems
are serially
the environment.
notion
of’ correctness
nested
transaction
T is a condition
that
that
systems
guarantees
to
implementors
of T that their
code will encounter
only situations
that can
arise in serial executions.
Correctness
for To is a special case that guarantees
that the external
world will encounter
only situations
that can arise in serial
executions.
3. CONFIGURATIONS
If
S is an arbitrary
(c. read,
subsets
with
set,
we define
a configuration
c.write),
where each of c.read
of S, and where every element
every
element
of c.write.
(We will
c of S to be any
and c.write is a nonempty
of c. read has a nonempty
sometimes
refer
to c.read
as the sets of read quorums
and write quorums,
respectively,
c.) We write corzfzgs(s) for the set of all configurations
of S.
For example,
if the set S consists of three elements
configuration
c has c.read = {{x}, {y}, {z}} and c.write
pair,
collection
of
intersection
and
c.write
of configuration
x, y, and z, then one
= {{x, y, z}}. This con-
figuration
corresponds
to the read-one/write-all
replication
strategy.
Another
configuration
has c.read = {{x, y}, {.x, z}, {y, z}} and c.write
== {{x, y}, {x, z},
{y, z}}, corresponding
4,
to a read-majority/write-majority
FIXED-CONFIGURATION
QUORUM
strategy.
CONSENSUS
In this section, we present and prove the correctness
of a fixed-configuration
quorum
consensus algorithm.
For simplicity,
we carry out the formal development in this paper for the replication
of a single object, which we call X. The
same arguments
First, Section
apply to any finite number
of objects.
4.1 defines system M, a serial system that
has X as one of its
object names.
In system
.@, the object automaton
associated
with
X is a
read/write
serial-object
automaton
Sx, representing
the logical object to be
replicated.
System
.w’ is used in order to define
system
correctness.
Then
Section 4.2 defines the replicated
serial system @. System
,% is identical
to
d, except that logical object Sx is implemented
as a collection,
SY, Y E ~, of
objects,
and each logical read (or
serial read/write
objects that we call replica
write) of X is replicated
as a subtransaction
that performs
multiple
read and
write
accesses to some subset
of the replicas
according
to the quorum
consensus
algorithm.
Section
4.3 shows that
the algorithm
is correct
by
demonstrating
a relationship
between
W and .@; this proof is easy because
both systems are serial.
ACM
TransactIons
on
Database
Systems,
Vol
19,
No.
4, December
1994
Quorum Consensus
4.1
System
We begin
serial
.W The Unreplicated
by defining
system
Serial
unreplicated
having
in Nested Transaction
Sx associated
and initial
value
with
.
549
System
serial
a distinguished
automaton
Systems
system
object
W. System
name
X is a read/write
X,
serial
.@ is an arbitrary
in
which
object
the
with
object
domain
D
do.
We define
la, and la,O to be sets containing
the respective
names of the
la
read and write
accesses to X, in W, and define
la = la, U la,,. (Here,
stands for “logical
access.”)
Since each transaction
name T = la, is a read access to Sx, the definition
of a read/write
serial object says that kind(T)
= read for each such T. Also,
each transaction
name T = la LOhas kind(T)
= write, and also has an associated value, data(T)
G D.
4.2
System
Q?: The Fixed-Configuration
Here
we describe
system
rithm
with
configuration.
a fixed
S’x is replicated;
that
.%, which
data
the
Algorithm
quorum
A??is a serial
as several
system
serial
just one. The accesses to X
called logical access automata
the quorum
consensus
algorithm
quorums,
as long as each read
logical
Consensus
represents
System
is, it is implemented
SY, Y = ~, rather
than
subtransaction
automata
every write quorum.
the replicas in some
write-LA
writes
to
property
guarantees
Quorum
consensus
algo-
in which
object
objects
(replicas),
are implemented
(LA’s). The LA’s
to manage the replicas.
quorum
has a nonempty
as
use
We allow arbitrary
intersection
with
In order to read the logical object, a read-LA
accesses all
read quorum,
while in order to write the logical object, a
all the replicas
in some write
quorum.
The intersection
that each read-LA
receives at least one copy of the latest
value.
Since reads
of different
replicas
could return
mechanism
is needed to distinguish
the most recent
values. Thus, each
nonnegative
integer
replica
in the
version-number
different
data values,
a
value from other possible
quorum
consensus
algorithm
maintains
a
in addition
to its data value. Successive
write-LA’s
use successively
larger version numbers,
and a read-LA
selects the
data value
associated
with
the largest
version
number
it receives.
This
strategy
requires
each write-LA
to learn what version number
it is supposed
to write,
previously
preliminary
which
used.
read
in
In
turn
order
requires
to learn
of a read-quorum
it to learn
the largest
version
number
this number,
each write-LA
first does a
of the replicas.
Since both the read-LA
and the write-LA
programs
must perform
reads of a
read quorum
of the replicas,
it is helpful
to have a subtransaction
that
performs
such a read. This subroutine
can be invoked
by a read-LA
to
perform
nearly
all of its work,
and can also be invoked
by a write-LA
to
perform
its preliminary
read. We can describe
such a subroutine
in the
nested
transaction
framework
by introducing
two extra levels of nesting.
Thus, read-LA
and write-LA
transactions
will have children
known
as readcoordinator
transactions,
which will be responsible
for reading
from a read
quorum
of replicas.
Write-LA
transactions
will also have children
known
as
write-coordinator
transactions,
ACM
which
Transactions
will
be responsible
on Database
Systems,
Vol
for
19, No.
writing
4, December
to
1994
a
550
K. J. Goldman and N. Lynch
.
write
quorum
turn,
have
replicas.
of replicas.
children
Read-coordinators
that
are
and
individual
read
write-coordinators
and
We begin by defining
the type of 9, and then give
appear in W but not in ~. The new automata
are the
coordinator
and LA transaction
automata.
We define
up,” starting
with the replicas
and the configuration
4.2.1
System
Type.
The
type
of system
write
will,
accesses
to
in
the
the new automata
that
replica objects, and the
these automata
“bottom
automata.
.% is defined
to be the
same
as
that of M, with the following
modifications.
First, the object name X is now
replaced
by a new set of object names
~, representing
the replicas.
(The
replicas
will
transaction
store
both
names,
co,
a value
and version
number.)
There are some new
read- and write-coordinators,
COW representing
acca,, which are the read and write accesses to the
and
respectively,
and ace, and
replicas,
respectively.
We let co = co, U co,,, and ace = ace, U accu,.
The transaction
names in la are accesses in w-, but in # they are not; each
that
are in co,, while
each
transaction
name
in lar now has children
transaction
name in la ~, now has children
in C07 and children
in col~,. Also,
each transaction
name in co~ has children
in acc~, and each transaction
name in co,” has children
in accU,. These are the only changes to the system
type—for
parents
example,
of the
no
new
transaction
names
transaction
names.
other
The
than
naming
those
indicated
structure
for
are
logical
accesses and their descendants
is shown in Figure
1.
In system .&, some information
is associated
with some of the transaction
names; for example,
each T in la (that is, each access to X) has hind(T)
indicating
whether
the access reads
or writes,
and write
accesses have
ciata(l”)
indicating
information
from
action
name
the value to be written.
In Q?, we keep all such associated
.@ and add more information
as follows.
First, every trans-
T G COU has an associated
version number
uersion-number(T)
for X and the associated
version
value
data(T)
G N. These denote,
number
to be written
G D and an associated
respectively,
the value
by the write-coordina-
tor in the quorum
consensus
algorithm.
Second, we associate
tion with the accesses to the replicas.
Namely,
we assume that
some informaevery transac-
tion name T E ace, has kind(T)
= read, and that every transaction
name
● N X D, the version
number
T E ace,,, has kind(T)
= write
and data(T)
and value to be written.z
Throughout
the rest of this section, we let c denote a fixed configuration
of y.
4.2.2 Replica
Automata.
A replica
automaton
is defined
for each object
name Y ~ ~. Each replica
is a read/write
serial object that keeps a version
number
and a value for X. More formally,
the object automaton
for each
Y = ~ in system Q? is a read/write
serial object for Y with domain
N x D
2The transaction names in czccWname accessesto read/write objects. Hence, to agree with the
definitions in Section 2.2.2, the version number and data values are combined into a single data
attribute.
ACM
Transactions
on Database
Systems,
Vol.
19,
No.
4, December
1994
Quorum Consensus
in Nested Transachon
Systems
.
551
la,
Cor
Fig. 1. Logical accessesand their descendants.
accr
i
CREATE(T)
Effect: s.awake = true
REPORT_ABORT(T’)
Effect: s.mparted = e’.mported U {T’)
v.wake = true
s.reparted =
REQUEST.CREATE(T’)
Precondition:
/awake
= true
~ f? s’. rcquesfed
Effect:
s.requested = /.wque9tcd
s.read
= s’.read
{2”)
REQUEST_COMMIT(T,v)
Precondition:
s’. ou,ake = true
s’. requested= a’. reported
s’.read)
(37? g c.read)(??s
v = (9’. uersion-number, a’. value)
Effect:
a. awake = Jal$e
U {T’)
REPORT.COMMIT(T’,V)
Effect: s.reporfed = s’.reported
s’. repartedu
U {~}
U {object(T’))
if u.uervion-number > s’. uersion-number then
9.uer9i0n-number = u.uersion-number
s.wdue = u,ualue
s.reported = 9’. reported U {T’ )
s.read
= s’. read U {object(T’))
if u.uersion-number
> s’. uer9ion-number then
s version-number = u.uersion-number
s.ua I ue = v.ua 1ue
Fig. 2. Transition relation for Read-Coordinators
and initial
value
(O, do). For
u = N x D, we use the record
u.uersion-nurnber
and v.ualue to refer to the components
of U.
4.2.3
Coordinators.
In this
section,
we define
.27, the transaction
automata
that are
replicas.
We first define read-coordinators.
is to determine
the latest
version
number
the coordinator
notation
automata
in
invoked
by the LA’s and access the
The purpose of a read-coordinator
and value
of X, on the basis
of the
data returned
by the read accesses it invokes. A read-coordinator
for transaction name
T G co, has state
components
awake,
value,
version-number,
requested,
reported,
and read, where
awake is a Boolean
variable,
initially
false; ualue G D, initially
do; version-number
EN, initially
0; requested
and
reported
are subsets of children(T),
initially
empty;
and read is a subset of
~, initially
empty. The set of return
values VT is equal to IV x D.
The transition
relation
for read-coordinators
is shown in Figare
2. Recall
that c is a fixed configuration.
This is used to determine
when the coordinaACM
TransactIons
on Database
Systems,
Vol.
19, No.
4, December
1994.
552
tor
K. J. Goldman and N. Lynch
.
has
recent
collected
value
enough
information
to be sure
of X. A read-coordinator
collects
of having
data
from
seen
replicas
the
most
for
object
names in ~, and keeps track of the value from the replica
with the highest
version number
seen so far. Whenever
the read-coordinator
reaches a state in
which (1) some read quorum,
according
to c, is a subset of the replicas
it has
seen (i.e., those in s‘. read) and (2) it has learned
of the fates (i.e., commit
or
abort)
of all of its requested
request
to commit
We have
written
In
any number
practice,
replica
at a time,
earlier
access
transactions,
one
and
to that
then
the read-coordinator
may
its data.
the read-coordinator
ism. It may request
active.
child
and return
would
then
probably
request
replica.
with
a high
degree
of accesses to replicas,
start
another
There
is also
at
of nondetermin-
at any time
most
one only
one
after
a natural
while
access
the failure
trade-off
in
to
it is
each
of an
practice
between
request
request
two methods
of choosing which replicas
to access at first: one could
initially
only enough access to complete
one read quorum,
and then
others if these fail to commit
in reasonable
time, or else one could
request
accesses
to
all
replicas
concurrently.
The
first
method
communication
cost in the best case (when all accesses complete
and the second reduces
latency
in the bad case where
some
Each
of these
mentation
possibilities,
and
of the automaton
many
given
others,
here,
could
obtained
minimizes
successfully),
accesses fail.
be modeled
by restricting
as an implenondetermin-
ism.
We next
write
define
a given
write-coordinators.
value
to a write
The purpose
quorum
write-coordinator
for transaction
requested,
reported,
and written,
false; requested
and reported
are
~~ritten is a subset of ~, initially
of a write-coordinator
of replicas
for object
names
is to
in $’. A
name T G CO,Ohas state components
awake,
where awake is a Boolean variable,
initially
subsets of children(T),
initially
empty; and
empty.
The set of return
values VT is equal
{“OK”}.
to
The transition
relation
for a write-coordinator
Figure 3. When created, a write-coordinator
replicas
for object names in j’, overwriting
at the
After
mit.
replicas
writing
4.2.4
with
its
to a write
own
named
( version-number(T)
quorum,
the
T G co,,, is shown
begins invoking
write
the version
numbers
and
write-coordinator
data(T),
may
in
accesses to
and values
respectively).
request
to com-
Access Automata.
Now we define logical
access automata,
read-LAs.
The purpose
of a read-LA
for a transaction
name
ia to perform
a logical
read access to X. A reaci-LA
has state
T E la,
where awake
components
awake, L~alue, requested,
reported,
and value-read,
and ualue-read
are Boolean
variables,
initially
false;
ualue
E D I., {nil),
initially
nil; and requested
and reported
are subsets of children(T
], initially
beginning
Logxal
with
empty. The set of return
values VT is equal to D.
The transition
relation
for read-LAs
is shown in Figure
4. The read-LA
invokes
any number
of read-coordinators.
After receiving
a REPORT_
COMMIT from at least one of these coordinators,
and receiving
reports
of the
completion
of all that were invoked,
the read-LA
may request
to commit,
ACM
TransactIons
on
Database
Systems,
Vol
19.
No
4, December
1994
Quorum Consensus
in Nested Transaction
CREATE(T)
Effect: a.awake = true
9.awake=
.9.reported
T’)
553
U {~}
T,,J)
9’. awak-e = true
s’. reque.rted = St. reported
= (wersion-number(
Z’), data(l”))
(3W
Effect:
v=
s. requested=
= s’. mporkd
REQUEST.COMMIT(
Precondition:
T’ $ s’.reguested
data(Z”)
.
REPORTABORT(~)
Effect: s.rwported = s’. reported U {T’)
true
REQUEST.CR.EATE(
Precondition:
8’. arvake = true
Systems
s’. requested u {T’}
e c.write)(W
~
s’.writtar)
“OK”
Effect:
s.riwcike
= fa/9e
REPORT.
COMMIT(T’,W)
Effect: .9.reparted = s’, m-ported U {T’)
smitten
=
s’, written
U {object(T’)}
s.reported = .s’. reportedu
s.written
== s’. writfen
Fig. 3.
{T’}
U {object(T”))
Transition
relation
for Write-Coordinators.
CREATE(T)
Effect: s.awake = b-we
s.awwke = true
REPQRTABORT(~)
Effect: s.reportecf = s’.reported U {~}
REQUEST.CREATE(~)
Precondition:
s~.awake = tr-ue
F ~ s’. ~guested
Effed :
REQUEST’-COMMIT(
Precondition:
s’. awake = true
s.requested
s.mperted
~’.value-reod
v =
= s’. requested U {T’}
s.ualue-read
u {T’)
T,V)
s’. r-eqaested = s’. reported
= true
s’. volue
Effect:
s.arvake= false
REPORT.COMMIT(V,V)
Effect: s.reparted = s’. mporfed U {~}
if s’. value-rwd
= fake then
s. value = v.value
s.rqwrted
= s’. reported
= true
= s’.reported
U {T’}
if s’. value-read = false then
s.rralue = v. value
s.value-rmd
= true
Fug. 4.
Transition
relation
returning
the value component
of the data
nator.
We next define write-LA’s.
The purpose
a logical
write
name in la ~ is to perform
for
Read
LA’s.
returned
by the first
read-coordi-
of a write-LA
for a transaction
access to X on behalf
of a user
transaction.
A write-LA
has state components
awake, value-read,
value-written, version-num
her, requested,
and reported,
where awake, value-read,
and
inivalue-written
are Booleans,
initially
false; version-number
e N u {nil},
tially
nil; and requested
and reported
are subsets
of children(T),
initially
empty.
The set of r-eturn
values
ACM
VT is equal
Transactions
to {“OK”}.
on Database
Systems,Vol 19, NO 4, December 1994.
554
K. J. Goldman
.
and N. Lynch
R.EPORT_COMMIT(~,u),
CREATE(T)
Effect: 9.awake = true
Effect:
a’. vo!ue-written
= true
s.reported
= s’. reported
a awake = true
REQUEST-CREATE(
Precondition:
T’),
T’ c CO.
.9’. value-written
REPORT.ABORT(T’)
Effect: s.reported
g’. awake = true
T’ @ a’. requested
Effect:
s.requested = s’. reguestedu {T’}
s.reported
s.wlue.rwd
s.reported
9.uer9i0n-number
REQUEST-CREATE(
Precondition:
u {~)
U {T’)
T,U)
a’. awake = true
= s’. reported
s’. ualue- written
= true
u = ‘OK”
= s’. reporfedu
s.ualue-read
U {T’}
= s’. reported
= s’. reported
a’. requested
= true
Effect:
s.awake = ja/9e
{T’)
= false
U {T’}
= true
REQUEST.COMMIT(
Precondition:
REPORT_ COMMIT
(Z’’, v), T’ G COr
Effect: s.reported = s’. reported
u {T’]
if s’, ualue-read = jalse then
s.uer.sion-number = v.ucrs:on-number
if s’. ualue-read
P G COW
= s’. rtported
s.mported
then
=
u.ucr9ion-number
= true
T’), T’ c
COW
s’, au, ake = true
= true
s’. wdue-rmd
data(T’)
uersion.
= d.t.(T)
nurnbcr(T’)
= ~’. uersion-nutnber
+ 1
T’ ~ s’. requested
Effect:
s.reguested
= s’. mquestedu
{T’}
Fig. 5.
TransitIon
relatlon
The transition
relation
for write-LA’s
invokes
any number
of read-coordinators.
MIT
from
for Write-LA’s,
is shown in Figure
5. A write-LA
After receiving
a REPORT_
COM-
one of the read-coordinators,
the write-LA
remembers
the version
number
returned.
The write-LA
then invokes
any number
of write-coordinators, using a version
number
one greater
than that returned
with the first
REPORT_
COMMIT
of a read-coordinator,
along with
its particular
data
value. In order for the write-LA
to request to commit,
it must receive at least
one REPORT_
4.3
COMMIT
Correctness
In this
of a write-coordinator.
Proof
section,
we prove
the correctness
of the quorum
consensus
algorithm
given above as system 4?. That is, we show that @ is indistinguishable
from
.w?;to the transactions
and objects the two systems have in common. We begin
by stating
some definitions
that are useful for reasoning
of logical accesses in an execution
of W. In each definition,
actions
First,
of @.
the logical
defined
to
REQUEST_
ACM
Transactions
access
sequence
of
~,
denoted
about the sequence
@ is a sequence of
logical-sequence(
be a subsequence
of ~ containing
the
CREATE(T)
COMMIT(T,
u) events, where T G la. That is, the logical
on
Database
Systems,
Vol
19,
No
4, December
1994
~),
is
and
access
Quorum Consensus
sequence
to
is the
sequence
in Nested Transaction
of requests
and
responses
Systems
for the
555
.
logical
accesses
x.
Next,
if D is finite,
REQUEST_
transaction
name
REQUEST_
the logical
the logical
Finally,
carrying
is
in laW that
COMMIT
the
is defined
last
occurs
event
among
next
correctness
properties
write
logical-value(B)
u)
occurs
to be either
REQUEST_
COMMIT
data(T)
event
if
for
a
in logical-sequence(
/3), or do if no such
in
~).
logical-sequence(
In
other
words,
value is the value of the last logical write (or the initial
value of
object if no such write occurs).
if B is finite,
then
current-un(
/3) is defined
to be the highest
version-number
led to by /3.
The
then
COMMIT(T,
lemma
is
the
states
the
key
of all replica
to
the
automata
proof
in the global
of Theorem
4.3.7,
state
the
main
theorem.
Condition
(1) of the lemma,
which
enumerates
some
on the state led to by particular
schedules
of Y, is only needed for
out the inductive
quorum
any replica
in which
with
data.
The more
each
read-LA
argument.
every
the current
important
returns
These
replica
version
part
the
properties
has the current
number
(a) there
number
also has the logical
of the lemma
value
say that
version
is condition
associated
with
the
value
(2), which
previous
is a
and (b)
as its
says that
logical
write.
(That is, each read-LA
returns
the logical-value
of the logical
object.) The
schedules of J% which we consider are those in which no logical access to X is
active.3
The fact that
.%’ is a serial
system
makes
the reasoning
simple
(though
detailed).
LEMMA 4.3.1.
X is active
Let
(1) The following
(a)
/3 be a finite
There
with
properties
exists
(2)
hold
a write
obJ”ect name
v.version-number
(b)
of 9?’ such that
quorum
~
state
WrE c.write
if
v is
= current-vn(
no logical
access to
led to by B:
such
the
data
that
for any
component
replica
of SY,
COMMIT(T,
is a serial
SY
then
B ).
if v is the data component
of SY and
D ) then v.value = logical-value(
/3).
If B ends in REQUEST_
value(~).
Since
in any global
Y E Y/,
For any replica
SY,
number
= current-vn(
PROOF.
schedule
in P.
system,
v) with
Lemma
T E la,,
2.2.4.1
then
shows
v.version-
v = logical-
that
logical-
sequence( B) consists
of a sequence
of pairs, each of the form CREATE(T)
REQUEST_
COMMIT(T,
v), where T G la. We proceed by induction
on the
number
of such pairs. For the basis, suppose logical -sequence( ~ ) contains
no
such pairs. Then
logical-value(B)
= dO. Since @ contains
no CREATE(T)
events for T ~ la, it contains
no REQUEST_
COMMIT
events for accesses in
acc; therefore,
= dO, this
since
is also
all replicas
the
case in
initially
any
have
global
version-number
state
led
to by
= O and
value
~. It follows
that
3Recall that a transaction name is active in a sequence if the sequence contains a CREATE
action but no REQUEST_ COMMIT action for that transaction.
ACM
Transactions
on Database
Systems,
Vol.
19, No.
4, December
1994.
556
K. J. Goldman and N. Lynch
.
cw-rent-vn(
ously.
For
/? ) = O. This
the
pairs,
inductive
and assume
implies
step,
that
condition
suppose
(1), and
that
the lemma
condition
(2) holds
vacu-
contains
k >1
k – 1 pairs.
That
logical-sequence(~)
holds
for sequences
with
is, let /3 = /3’ y, where logical-seguence(
y ) begins with the last CREATE
event
in logiccd-sequence(
~), and assume that the lemma holds for ~‘. (Note that
Lemma
2.2.4.1
sequence(y)
shows
that
LIf
only
in terms
Any
This
action
follows
_ COMMIT(T,,,
The following
subtree
access
_COMMIT
PROOF.
type.
of the transaction
4.3.2.
REQUEST
access to X is live
)REQUEST
of the appropriate
and
Claim
no logical
= CREATE(7’f
or
for
some
Tf G kz
us to argue
❑
which
in y is a descendant
@ is a serial
~ ‘.) So, logical-
for
enables
at T~.
coordinator
appears
since
observation
rooted
in
ZJf )
a
CREATE
or
of Tt.
❑
system.
Now, since Tt has a REQUEST_
COMMIT
in y, we know by the definition
of Tt that there is at least one REl?ORT _ COMMIT
event for a transaction
in co, with
the first
name in co, in y. Let T‘ be the read-coordinator
REPORT_
portion
is in
while
COMMIT
in
y; by Claim
of y up to and including
la,, then the system
if T~ is in la U,, the
before
receiving
T’
type shows that
code shows that
a REPORT_
of any write-coordinator
no REPORT_
COMMIT
4.3.2,
G children(T~
the REPORT_
COMMIT
it invokes
Tt invokes
). Let
event
about
Thus
appear in y‘. Therefore,
descendants
of Tt that
the states
y‘
for
be the
T‘.
If T~
no write-coordinator,
no write-coordinator
for read-coordinator.
or its descendants
events in y‘ for
accesses. We now give two claims
COMMIT
of the automata
no actions
there are
are write
for T~ and
T’.
Claim
4.3.3.
Let
s be the
state
of the
transaction
automaton
associated
with T‘ in any global state reachable
from ~‘ in /3 ‘y’. Then s.version -n umber
and s.value contain
the highest
version-number
and associated
ualue among
the states led to by ~‘ of the replicas
whose names are in s. value-read.
This
PROOF,
reported
by
claim
a read
REPORT_
COMMIT
y‘, the version-number
must
be the same
holds
because
access,
T‘ retains
the maximum
with
associated
together
its
version-number
value
upon
each
of a read access. Again,
since no write accesses occur in
and value components
of all replicas
observed by T‘
during
y‘
as in the state
led to by ~‘.
❑
Claim 4.3.4.
Let s be the state of the automaton
associated
with Tt in any
global
state reachable
from
~ ‘y’ in ~. Then
s.uersion-number
= currentvn( /?’) and s.ualue = logical-value(
~ ‘).
PROOF.
MIT
event
Let s‘ be the state of T‘ just before it issues
in y‘. By the definition
of a read-coordinator,
its REQUEST_
s‘. read must
COMcontain
a read quorum
~ = c.read. By l(a) of the inductive
hypothesis,
there is some
for object names
write quorum
ZI” ● c.write such that the states of all replicas
in W in any global state led to by ~‘ have version-number
= current-vn(
/3’).
Since c is a configuration,
.9? and W“ have a nonempty
intersection.
So s‘. read
ACM
TransactIons
on
Database
Systems,
Vol.
19.
No,
4, December
1994
Quorum Consensus
must
contain
at least
number
= current-un(
s‘ value
= logical-value(
When T‘
nents Of T’
reports
are
one object
~ ‘).
m Nested TransactIon
name
Therefore,
Systems
in %. So, by Claim
by
l(b)
of the
557
.
4.3.3,
inductive
s‘ .uersiorzhypothesis,
/?’).
its commit
to T’f, the version-number
and value compoequal to s‘ version-number
and s‘. value, respectively.
BY
set
definition
of T~, these
❑
4.3.4 is proved.
components
We now return
to the
possible cases for Tf.
main
are never
proof
again
of Lemma
modified.
4.3.1,
Therefore,
and
consider
Claim
the
two
Then logical-value(@)
= logical-value(
~‘ ) by definition.
Also,
(1) Tf G la,.
since Tf invokes
only read-coordinators,
which
in turn
invoke
only read
accesses, the uersion-number
and value
components
of the states
of the
replicas in any global state led to by ~ are the same as in any global state led
to by
holds
~‘,
and
so current-un(
~ ) = current-vn(
fl’ ). Therefore,
By definition,
coordinators
its commit,
Tf cannot
request
commits.
Since
the REQUEST_
to commit
until
Then
(2) Tf ● la,..
about the information
at least
T‘ is the first read-coordinator
COMMIT
for Tf must occur
D ‘Y’. When
Tf requests
to commit,
it returns
state. By Claim 4.3.4, this value is logical-value(
condition
(2) holds for ~.
ing
condition
(1)
for ~.
logical-value(
associated
one of its
read-
child that reports
at some point after
the value component
of its
~‘ ) = logical-value(
B). Thus,
/3 ) = data(Tt ). We first give two claims
with the descendants
of Tt invoked
dur-
y.
Claim
4.3.5.
nuntber(T)
Let
PROOF.
event
that
+ 1 and
ordinator
PROOF.
T
is
s be the
occurs
data(T)
until
REQUEST_
Claim 4.3.4,
Claim
data(T)
If
= current-un(
a write-coordinator
+ 1 and
~‘)
state
of T+ just
in y. By definition,
= data(T~
at least
). Also
invoked
data(T)
before
the
by definition,
The
type
@
version-
CREATE(T
invoke
reports
a write-co-
its commit.
So, all
for write-coordinators
in y occur after ~ ‘y’.
= current-vn(
~ ‘). Thus, Claim 4.3.5 holds.
is
constrained
follows
from
so that
Claim
)
in $Y invoked
T
is
invoked
By
❑
in y, then
by
4.3.5 and the definition
some
of a
❑
By
definition,
REPORT _CONIMIT
write-coordinator
be the portion
cannot request
write
accesses
of
The result
then
= s.version-number
T~ cannot
4.3.6.
If T is a write access for an object
= (current-vn(
~‘) + 1,data(Tf )).
write-coordinator.
write-coordinator.
Tf,
REQUEST_
version-number(T)
one of its read-coordinators
CREATE
events
s.uersion-number
by
= data(T~).
Tf
cannot
request
to
commit
until
from at least one of its write-coordinators.
it
Let
receives
a
TU, be the
child of Tf that has the first COMMIT
event in y, and let 5
of y up to and including
COMMIT(T,O
). By definition,
TU,
to commit until it has received REPORT_
COMMIT
events for
to a write
quorum
of replicas.
Therefore,
by Claim
4.3.6, it
ACM
TransactIons
on Database
Systems,
Vol.
19, No.
4, December
1994
K, J. Goldman
558
.
follows
that
any global
and N. Lynch
current-vn(
state
f?’8 ) = current-un(
We now show that condition
(1) still
By Claim 4.3.6, any write-coordinators
propagate
the new value and version
may
execute
,6’) + 1, so condition
(1) holds
in
led to by ~‘ 8.
in y after
8 cannot
holds in any global state led to by /3.
that may execute in y after 8 merely
number.
Any read-coordinators
that
change
the values
at the replicas,
since they
do not invoke write accesses. Therefore,
condition
(1) holds after ~.
Since T~ does not name a read-LA,
condition
(2) holds vacuously.
Thus
in both
cases, the lemma
holds.
(This
ends our proof
of Lemma
4.3.1.)
❑
Now we
distinguish
prove that
@ is
between
replicated
tem .@. Here,
respectively.
THEOREM
schedule
correct
serial
let 7A and %~ denote
4.3.7.
y of&
Let
~
such that
be a finite
ylT
= PIT for each transaction
(2)
YIY = P IY for each object
We construct
the
transaction
schedule
the following
(1)
PROOF,
by showing
that
transactions
cannot
system E? and unreplicated
serial sys-
tion
names
y by removing
T in co U ace. Clearly,
It is easy to see that,
H.
object
names
Then
there
of @,
exists
a
hold:
T e Y< – la, and
Y G .7A.
from
CREATE(T),
REQUEST_
COMMIT(7’,
REPORT_
COMMIT(T,
u), and REPORT
theorem
hold. It remains
to show
it suffices to show that y projects
of
two conditions
name
name
and
~ all REQUEST
_CREATE(T),
u),
COMMIT(T),
_ABORT(
T ) events
the two conditions
ABORT(T),
for all transac-
in the statement
that y is a schedule
of d. By Lemma
to yield a schedule of each component
for each nonaccess
schedule
of the transaction
automaton
Y E %~ – {X},
y IY is a schedule
of the
transaction
name
for T, and for
object automaton
of the
2.1.1,
of w“.
T of M, y IT is a
each
for
object name
Y. Since we
construct
y by removing
all the actions
associated
with
a specific
set of
transactions,
and since none of these transactions
have children
with actions
in -y, the fact that ~ is a schedule of the serial scheduler
implies
that y is as
well.
It remains
to consider
the
automaton
for
object
name
X:
we must
show
that y IX is a schedule of S’x, where Sx is the serial object associated
with X
in x?. We proceed by induction
on the length
of P. The basis, where /3 is of
length
O, is trivial.
Suppose that ~ = ~ ‘m, where m is a single event and the
result is true for /3’. The only case that is not immediate
is the case where m
is a REQUEST
_COMMIT
event for a transaction
name in la. So suppose
that
m = REQUEST_
COMMIT(T,
u), where
T G la. Let y = y ‘w; by the
inductive
hypothesis,
y‘ IX is a schedule of Sx. Let s‘ be the state of S’x led
to by y’lx.
One precondition
for REQUEST_
COMMIT(T,
U) as an output
of Sx is that
T E s ‘created.
By transaction
well-formedness
of ~, CREATE(T)
occurs in
/3’. By the construction,
CREATE(T)
occurs in y‘, so that this precondition
is
satisfied.
We consider
the two possible cases for T in order to show that the
ACM
TransactIons
on
Database
Systems,
Vol.
19,
No.
4, December
1994.
Quorum Consensus
remaining
clauses
of the
in Nested Transaction
precondition
for
Systems
REQUEST_
559
.
COMMIT(T,
u) are
satisfied.
(1) T ● laW.
which
Then
satisfies
the
definition
the remaining
(2) T G la,.
Then
by
of a write-LA
implies
that
u = “OK”,
precondition.
the
construction,
y ‘IX=
there is a REQUEST_
COMMIT
for a write
last such write access. Then s‘ data
is equal
logical-sequence(
/3 ‘).
If
access to Sx in y‘, let T‘ be the
to datcz(T ‘). By the construction,
we know that T‘ is also the name of the logical write access in U? having the
last REQUEST_
COMMIT
in ~‘. Therefore,
data(T’)
= Jogical-value(
~ ‘). So
the data component
of the state of Sx led to by -y’ is logical-value(
@‘).
Sx
On the other
in y‘, then
hand, if there is no REQUEST_
COMMIT
for a write
s ‘data
is dO, which is logical-vahze(
/3 ‘). Therefore,
component
of the state of Sx led to by y‘ is again logical-value(
Thus, in either case, logical-value(
~‘) is the data component
Sx in any global state
value v = logical-value(
Sx
led
to
REQUEST_
Therefore,
by y’,
which
implies
that
the
COMMIT(T,
v) in Sx is satisfied.
❑
y is a schedule of ~.
QUORUM
Now we present
implementations
algorithm
are our
place
as
the
that
a transaction
object
of
in
that
the
holds
remaining
precondition
for
CONSENSUS
with
ultimate
must interact
correctly
of a logical
access. Thus
configuration
/3 ‘).
of the state
led to by y‘. By Lemma
4.3.1, we know that the return
~ ‘). Thus, v is the data component
of the state, s‘, of
5. RECONFIGURABLE
configuration
processing
access to
the data
reconfiguration.
For the concurrent
goal, the activity
of changing
the
with
each
reading
the
reconfiguration
transaction
nesting
the current
configuration
during
needs to take its
structure,
configuration.
This
accessing
might
a
be done
with reconfiguration
transactions
as children
of TO; however,
the correctness
arguments
work equally
well no matter
which existing
transaction
is taken
as parent of each reconfiguration
transaction.
Thus we allow each reconfiguration to occur at an arbitrary
location
in the transaction
tree.
The presentation
and proof are carried
out in stages. First, in Section 5.1,
we define
system
~,
a serial
and that also has another
automaton
Sx associated
The object
simply
automaton
receives
system
Sz associated
requests
that
has
X as one of its object
from
with
Z is a special
transactions
to change
dummy
the
tion and responds
to them with a trivial
acknowledgment.
M is used to define correctness.
Then in Section
5.2, we define replicated
serial system
identical
Section
automaton
names4
special object name Z. As in Section 4, the object
with
X is a read/write
serial object automaton.
object,
current
As before,
system
System
$3’ is
to .@, except that logical object Sx is replicated
as in system
4, and also dummy
object Sz is replaced
by a configuration
L?? of
object
Sx, which
is a read/write
object
containing
~.
which
con&ura-
the current
con figura-
~In this section, we redefine certain notation from Section 4, such as the system names d
and .@.
ACM
Transactions
on Database
Systems,
Vol.
19, No
4, December
1994
560
K. J. Goldman
.
tion
of replicas
object
and N, Lynch
of X. In ~,
Say. In the course
the configuration
reconfiguration
of performing
object
S-y is read
requests
involve
each logical
to determine
read
a single
or write
the current
write
of
access to X,
configuration.
In
Section 5.3, we show that the new algorithm
is correct by demonstrating
relationship
between
@ and .&. The proof is analogous
to that in Section
the changes involve
the handling
of the new configuration
object.
Next,
in Section
except
that
the
5.4, we define
configuration
serial
object
system
‘3’, which
is replicated.
That
is the
is, the
same
a
4.3;
as &?’
configuration
object SF from @ is replaced
in % by a collection
of serial read/write
objects
that we call configuration
replica objects, and each access to a configuration
object
is implemented
by a subtransaction
of the configuration
in
replicas.
an interesting
Section
turn
way
on the
5.5, we prove
implies
5.1
System
We
begin
The
data
connection
defining
the
performs
object
with
Sz
domain
II
associated
below.
As before,
we define
Z
System
with
serial
la,
value
Dummy
3,
in
which
in
system
Reconfiguration
.&. System
dummy
~“ is an
object names, X and Z, in
X is a read/write
serial
dO, and in which
is a special
and
%“ and
Finally,
& and x7.
unreplicated
and initial
with
is based
scheme.
between
arbitrary
serial system having
two distinguished
which
the object automaton
Sx associated
with
ton
management
management
between
Serial
accesses to a subset
replica
relationship
.c#: The Unreplicated
by
replica
a simulation
a similar
that
configuration
the object
object
la,,, to be the respective
automa-
automaton,
names
defined
of the read
and
la,,,
to be the set of names
of
write
accesses to X. Now we also define
As before,
accesses to the dummy
object Z. We define la = la, U la,,, U la,,,.5
each transaction
name T E la, has kind(T)
= read, and each transaction
name T = laW has kind(T)
= write and also has an associated
value, data(T’)
e D.
Recall
that
system
.@ is used to define
defined
at the transaction
same transaction
interfaces
correctness,
and that
correctness
is
boundary.
Thus, it is desirable
that we use the
in ,cf as we do in the later systems,
Q? and %.
Since those systems
will permit
arbitrary
transaction
automata
to invoke
reconfiguration
operations,
we also allow them to do so in w’. However,
since
X is not replicated
in .v”, these operations
will not do anything
interesting
in
.&”. We simply
make them operations
on the dummy
object, which
merely
responds
“OK” in all cases.G
5Notc
that
umon
of
GIt
this
la,
may
seem
transaction
ration
is
different
artificial
to
automaton
approach,
which
where
augmentation
than
settle
ACM
application
instead
Transactmns
and
.&” would
be defined,
the
the
place
interface
to users
invisible
.w’. In this
from
defimtion
in
Section
4,
where
la
was
just
defined
to
be
the
in
the
lau.
and
would
the
in
invocations
.W; an
apphcatlon
programmers
be defined
represent
just
made
We leave
for the
simpler
alternative
on
Database
Systems,
reconfiguration
would
by not
the
of m with
details
19,
No.
of this
having
added
to
make
them
approach
1994,
these
appear
database
explicitly
in
.cY”’would
operations,
system
to interested
requests
reconfigu-
system
reconfiguration
of the
reconfiguration
4, December
transactions
be
4. An auxiliary
by programmers
of including
Vol
the
approach
as it is m Section
an augmentation
is presumably
programmers.
of
alternative
m C/
readers,
rather
and
Quorum Consensus
CREATE(T)
Effect: s.actiue = T
s.acfiue = T
in Nested Transaction
REQUEST.COMMIT(
Precondition:
Systems
561
.
T,W)
T = a’. actiue
v = “OK”
Effect:
s.actit, e = nil
Fig. 6,
5.1.1 Dummy
Dummy
Object Automata.
object
More
automaton.
precisely,
we define
a single
dummy
object automaton
Sz, which is a serial-object
automaton
having a single state
U {nil},
initially
nil. The set of return
values
component,
actiue G accesses(Z)
VT for accesses to Z is equal to {“OK”}.
The (trivial)
transition
relation
for Sz
is shown in Figure
6.
5.2
System
.%: The Reconfigurable
with a Centralized
In
this
system
Y ~ ~,
section,
Quorum
Consensus
Algorithm
Configuration
we
define
replicated
serial
system
~.
This
is similar
to
.2? of Section
4 in that object Sx is implemented
by replicas
SY,
and the accesses to X are implemented
as subtransaction
automata
called read-LA’s
and write-LA’s,
using
the same basic strategy
as in the
fixed-configuration
algorithm.
Additionally,
in the new M, the dummy
object
Sz is implemented
by a configuration
object Sx, which is a read/write
serial
object that maintains
a configuration,
and the accesses to Z are implemented
as subtransactions
the configuration
LA’s will invoke
We first
in S
tion
define
but
reconfigure-LA’s
the type
these
and
that
of K?, and then
not in .ti. The new automata
object,
define
called
invoke
accesses
to S4Y. (In
%’,
objects themselves
will be replicated,
and the reconfigurecoordinator
subtransactions
to read and write them.)
the
automata
coordinator
“bottom
and
give the new automata
are the replica
LA
objects,
transaction
that
appear
the configura-
automata,
We
again
up.”
5.2.1 System Type.
The type of system
@ is defined
to be the same as
that of .&, with the following
modifications.
As in Section 4, the object name
X is now replaced
by a new set of object names
~, and there
are new
read- and write-coordinators,
transaction
names,
co, and COU,,representing
respectively,
and ace, and accW, which are the read and write accesses to the
replicas,
object
respectively.
name
~,
and
Additionally,
there
object
are new
are the read and write
accesses
the configuration
accesses in this
coordinator
automata.)
We let co
The transactions
names in la
name
transaction
Z is now
names
replaced
CO,C and
by a new
CO,U
~, which
to object name ~, respectively.
(We denote
way, because in % they will be replaced
by
= co, u COW U corC U col~)~.
are accesses in JYj but in @ they are not;
each transaction
name in lar now has children
that are in CO,Cand co,; each
transaction
name in laU, now has children
in COrC, Cor, and COU,; and each
now has &il&n
in CO,C, COW,, CO,, and cow,, &o,
transaction
name in la,,,
as in Section 4, each transaction
name in Cor has children
in ace,, and each
in accu. These are the only changes to
transaction
name in COU,has children
ACM
Transactions
on Database
Systems,
Vol
19, No
4, December
1994
562
K, J, Goldman and N. Lynch
.
la,
law
3
Com
Core
Cor
co
Occr
co
r
)#
accw
accr
%
Fig. 7. Logical accessesand their descendants.
the
system
type—for
example,
no transaction
names
other
than
those
indi-
cated are parents
of the new transaction
names. The naming
structure
for
logical accesses and their descendants
is shown in Figure
7.
In defining
.@, we associated
information
with
some of the transaction
names.
In S,
we keep all such information
from
.@ and add more
information
as follows.
First,
to denote
we associate
with
the
new
every
configuration
transaction
for each logical
name
T G la,,,
in conf’’gs(~),
the set of all configurations
Second, to denote the configuration
that
we assume
that
every
transaction
name
of j?’.
is used
reconfigure
a configuration
access,
config(T
)
by each read-coordinator,
T G co, has an associated
config-ura-
tion config(T
) E corzfigs(~
). Similarly,
every transaction
name T G co. has
an associated
value
data(T)
G D, an associated
version
number
uersionnumher(T)
G N, and an associated
configuration
config(T)
G corzfigs(~
).
These denote the value for X written
by the write-coordinator,
the associated
version
number
in the quorum
consensus
algorithm,
and the configuration
to
be used for writing
the new value and version number.
Third, we associate some information
with the accesses to both the conjuration
objects
and the replicas.
We assume
that
every transaction
name
T G CO,Chas
kind(T)
kind(T)
= write
and
= read,
data(T)
and that
every
G configs(j2’),
tion. Also, as before, we assume that every
kind(T)
= read and that every transaction
● N X D.
write and data(T)
transaction
representing
name
the
T E CO,O.has
new
confiWra-
transaction
name T G ace, has
name T G accw has kind(T)
=
5.2.2 Replica
and Configuration
Automata.
As in Section 4, we define a
replica
automaton
for each object name Y E ~. This is a read/write
serial
object for Y with
domain
N X D and initial
value
(O, do). As before,
for
u.version-number
and v.value
to refer to the compov ● PJ x D, we write
nents of v.
For the rest
of this
paper,
let
C = configs(~),
and let
co ~ C be a distin-
guished
initial
configuration.
The configuration
automaton
is defined to be a
read/write
serial object for ~, with domain
C and initial
value Co. (Recall
that its read accesses are the elements
of corC, and its write accesses are the
elements
of COWC.)
ACM
Transactions
on
Database
Systems,
Vol.
19,
No.
4, December
1994.
Quorum Consensus
CREATE(T)
Effect: s.awake
s.awake
in Nested Transaction
REPORT_ABORT(T’)
Effect: s, reported=
= true
= true
s.mported
REQUEST-CREATE(2”)
Precondition:
s’. au,ake = trwe
7“ @s’. requested
Effect:
s.reque$ted
=
s’, reported
= s’. reporfed
U {T’)
u {T’)
s’. au,oke = true
9’.
wgue9ted
= s’.rcported
(3?2 c conjig(T). rmd)(7? ~ s’. read)
v = (.qt. vcr.rion tlwnber, .qf.ualue)
U {T’}
Effect:
9.awOke = ja19e
T’,V)
s’. reported
s’. mad
.563
.
RIXJUEST.COMMIT(T,U)
Precondition:
= s’. requested
REPOR.T.COIVIMIT(
Effect: 9 reported=
s.read
Systems
u {T’}
U {object(Z”)}
if v.uersion-nurnber > s’. uersion-nu!nfmr- then
s.uersion-nomber .= u.ver9ion-number
s.ualue = u.urilue
s.r,-ported
s-.rcad
=
s’. reported
U {’l”]
s’. read u {object(T’))
=
if v.uer.. !on-namber
> s’. ucrsion-nt{?nbc-r then
s.ucr~ion-nornber
s.ualue
= u ucraion-ftumber
= u.ualue
Fig. 8. Transition relation for Read-Coordinators.
3.
5.2.3 Coordinators.
These transaction
In this section,
automata
are
we define the coordinator
automata
invoked
by the LA’s and access
in
the
replicas,
We first define read-coordinators;
these are the same as those in
Section
4, except that
instead
of using
the fixed
configuration,
the new
read-coordinator
uses its own associated
configuration.
A read-coordinator
for
transaction
ber,
initially
name
requested,
false;
T ● co, has state
reported,
value
G D,
and
read,
initially
components
where
do;
awake,
awake
is
version-number
quested and reported
are subsets of children(T),
a subset of ~, initially
empty. The set of return
The transition
relation
for read-coordinators
value,
version-num-
a Boolean
G N,
variable,
initially
code is identical
determine
when
assured of having
to that
in Figure
2, except
that
config(T)
is
the coordinator
has collected
enough
information
seen the most recent value of X.
We next define
Section
4, except
write-coordinators.
Again,
these are the same
that
instead
of using
the fixed
configuration,
write-coordinator
uses its
own
associated
O; re-
initially
empty; and read is
values VT is equal to N X D.
is shown in Figure
8. This
configuration.
used
to
to
be
as those in
the new
A write-coordinator
for transaction
name T G COW has state components
awake,
requested,
reis a Boolean
variable,
initially
false;
ported,
and written,
where
awake
requested
and reported
are subsets
of children(T),
initially
empty;
and
written
is a subset of ~, initially
empty. The set of return
values VT is equal
to {“OK”}.
The transition
relation
for a write-coordinator
named T G cow is shown in
Figure
9. This code is identical
to that in Figure
3, except that config(T)
is
used to determine
when the coordinator
has written
to enough replicas.
ACM
Transactions
on Database
Systems,
Vol.
19, No.
4, December
1994.
.564
K. J, Goldman
.
CREATE(T)
Effect: s.awake
s.otwke
and N. Lynch
REPORT_ABORT(~)
Effect: s.reported = s’. rcportedu
= true
= &rue
s.reported
RXQUEST.CREATE(Z’)
Precondition:
REQUEST_COMMIT(
Precondition:
T,u)
s’. uwake = true
9’. au,ake = true
T’ $$s’. reque~ted
s’. requested
dafa(T’)
= (uersion-number(
T), data(T))
(3W E
Eifect:
= s’, reguestedu
REPORT-COMMIT(T’,V)
Effect: s.reported = s’,
wrvtten
9’. wrifte”)
s.turvtten
{T’}
EtTect:
s awake = jalse
reported
U {T’}
= s’. wr:tten U {object)
9.reported
= s’. repot tedu
= s’, wr:tten
Fig
5.2.4
= s’ reported
cOnjig(T)tur:te)(W ~
V=” OIY”
s,rcquested
9
{T’]
repor~edu {~)
= s’
Logical
{T’)
U {object(T’)}
9.
Access
Transltmn
relation
Automata.
for Write-Coordinators.
Now
we define
logical
access
beginning
with read-LA’s.
These are the same as those in Section
that the new read-LA
first invokes
a read-con figuration-coordinator
automata,
4, except
in order
to determine
the current
configuration.
It then invokes only read-coordinators
whose associated
configuration
is the same as the determined
current
configuration.
A read-LA
value-read,
Boolean
u {nil},
has state
and
variables,
initially
initially
empty.
The transition
the
code
REQUEST_
taining
components
config-read,
initially
nil;
and
awake,
where
con fig,
awake,
false; config
requested
and
value,
l}alue-read,
requested,
and
initially
G C U {nil},
reported
are subsets
reported,
config-read
are
nil; value G D
of children(T),
The set of return
values VT is equal to D.
relation
for read-LA’s
is shown in Figure
10. The changes
in
Figure
CREATE
responses
4
and
from
tional
state
components:
returned,
and config-read,
read.
Also,
there
are
the
are
as
REPORT
follows.
_COMMIT
configuration-coordinators.
config,
a flag
extra
to
First,
there
are
additional
actions for invoking
and obThere
are
two
addi-
which
keeps track
of the configuration
that records that a configuration
has been
clauses
in
the
precondition
for
the
REQUEST_
CREATE
events for read-coordinators,
which
ensure that only
read-coordinators
with the current
configuration
are invoked.
We next define write-LAs.
These are the same as those in Section 4, except
that the new write-LA
first invokes a read-con figuration-coordinator
in order
to determine
the current
configuration,
and then invokes only read-coordinators and write-coordinators
having
the proper
associated
configuration.
A
write-LA
has state
components
awake,
config,
config-read,
value-read,
version-number,
requested,
and
reported,
where
awake,
value-written,
are Booleans,
initially
false; con fig
config-read,
value-read,
and value-written
initially
nil; version-number
● N U {nil},
initially
nil; and reG C u {nil},
quested and reported
are subsets of children(T),
return
values VT is equal to {“OK”}.
initially
ACM
1$394
Transactions
on Database
Systems,
Vol
19,
No
4. December
empty.
The
set of
Quorum Consensus
in Nested Transaction
REQUEST_CREATE(~),
Precondition:
T’ c COm
s.value-read
s.
s’. awake = true
reported
=
T’ @ s’. requested
s.ualue-read
= s’, reguestedu
s.conf7g-mad = true
s.reported = s’. reported U {T’}
read
9.config
= false
=
u {T’}
REPORT_A130RT(Z”)
Eflect: s.reported =
s.reported
=
9’. reported
= s’. reporfed
REQUEST.COMMIT(
P~econdition:
s’, awake = true
then
sf. requested
frue
u {r}
U {T’)
T,U)
= s’. reported
s’. ualue-read
v = al,”al”e
v
s.config-reacf
= true
{T’]
REPORT..COMMIT(
T’,V), T’ E cm,.
Effect: a.reported = .’. reported u {T’]
if s’. conjig-read = false then
s.conjig = v
s’. config.
true
= s’. reporteti
if s’. ua?ue.read = false then
.9.value = ti. ualue
Effect:
s.reguested
565
.
REPORT.COMMIT(~,
v), ~ E CO,
Effect: s. reported= s’.reportedu {’~)
if s’. uafue-rwad = jalne then
s.twlue = v.ualue
CREATE(T)
Effect: s.cmmke = trw
9.awake = frue
if
Systems
= true
Effect:
REQUEST-CREATE(T),
Precondition:
s’. awake = true
9’. c0nfig-read = true
config(’1”) = s’. config
T’ @s’. requested
Effect:
s.reqaested
~ G CO,
= 3’.reque3&edU
Fig. 10.
The transition
the
code
in
relation
5
Transition
are
= fake
{~]
for write-LRs
Figure
a.awake
as
relation
for Read
is shown
follows.
LA’s,
in Figure
First,
REQUEST_
CREATE
and REPORT_
COMMIT
actions
taining
responses
from the configuration-coordinators,
components
the
for keeping
precondition
for
track
the
of the results.
REQUEST_
CREATE
There
11. The changes
there
are
for invoking
and oband associated
state
are also extra
events
to
additional
for
clauses
in
read-coordinators
and write-coordinators,
ensuring
that
only coordinators
with
the
configuration
are invoked.
Finally,
we define recon@-ure-LA’s.
The purpose
of a reconfigure-LA
current
is to
change the current
configuration
to a given target
configuration.
The main
thing it does in order to accomplish
this change is to invoke a write access to
the configuration
object. However,
this alone is not enough;
transaction
must also maintain
the crucial
property
that
the recont3gure-LA
all the members of
some write quorum,
according
to the current
configuration,
must haue the
latest version number and data value. (The importance
of this property
can be
seen in the proof of Lemma
4.3.1.) An arbitrary
configuration
change might
cause this property
to become violated.
Thus, extra activity
is required
on the part of the reconfigure-LA
to ensure
that
the latest
data gets
configuration.
In particular,
ACM
propagated
to some write
quorum
of the new
the reconfigure-LA
must first obtain this latest
Transactions
on Database
Systems,
Vol
19, No
4, December
1994
566
K. J. Goldman
.
and N. Lynch
REQUEST.-CREATE(
Precondition:
s’. awake = true
CREATE(T)
Effect: 9. awake = true
$.awake = true
a’.ualue-rcad
REQUEST.CREATE(T’)
Precondition:
, ?“ c CO,c
conjig(~)
T’), T’ c COUI
= true
= 9’. conjig
rfata(T’) = riata(T)
uernion-number(T’)
Y $?3’. requested
a’. awake = true
T’ @ s’, requested
= s’. uersion-number
Effect:
Effect:
s.mquested
s.requested
= s’. requestedu
REPORT.COMMIT(T’,
REP C)RT. CO MMIT(T’,V), T’ 6 CO,,
Effect: s.reported = s’. reportedu {T’)
Effect:
s.config-read
if
9 c~njig
s.reported
REPORT_4BORT(T’)
Effect: s.reported
then
v
s.config-read
9
= s’ reported
= s’. rt;,orted
u {T’}
u {T’]
T,u)
a’. awoke = true
s’. oumke = true
s’, config-read
s’. requested
=
true
= s’, conjig
v=
=
true
“OK”
Effect:
s ~wake = jalse
T’ $ s’.requcsted
= s’. reguc~tedu
REPORT-COMMIT(7”,
s.repo,-ted
= s’ reported
sf. ualue-rurilten
Effect
Effect:
reported
REQUEST_COMMIT(
Precondition:
CO.
Precondition:
s.r{quested
u {T’}
= true
= true
REQUEST_ CREATE (T’), T’ ~
conjig(T’)
COW
u {T’}
= true
= s’. re~rted
s’, ualue-written
u {T’)
= false
=
U), T’ c
= s’. reported
= true
= s’.reportcd
s’. co~~jig-rmd
s.reperted
s’. uolue-ruritten
if 9’ conjig-reed = \a/se then
9 config = u
s.reported
{T’}
= s’. requested u {T’}
{T’}
U)) T’ <
= .s’. reportedu
CO,
{T’)
if s’, ualue-mad = jalse then
.s.uersion-nurnber = v,uersion-number
s.ualoe-recrd
s,rcported
= true
= s’, reportcdu
{T’]
= jalse then
if s’. ua)ue-read
s.uers:on-number
= v.uer9ton-number
s.ualue-read=
true
Fig.
11.
Transition
relation
for Write
LA’s.
data, which
requires
that it first obtain
the value of the old configuration
from a read-configuration-coordinator
and then the latest data value from a
Once it has the latest data, the
read-coordinator
with the old configuration.
reconfigure-LA
must cause this data to be written
to some write
quorum,
according
to the new configuration.
It accomplishes
this by invoking
write-coordinators
whose corresponding
configuration
is the new configuration.
(Note
that
the reconfigure-LA
can invoke
the write-coordinators
and the write
accesses to the configuration
object concurrently,
even though
in the serial
systems considered
here, the subtransactions
themselves
will be run serially.
This concurrent
invocation
will be useful in the concurrent
systems we study
later, where these activities
can run in parallel
to reduce latency.)
ACM
TransactIons
on
Database
Systems,
Vol
19,
No.
4, December
1994.
+ 1
Quorum Consensus
A reconfigure-LA
awake,
for
config,
value-read,
is in
version-number
subsets
and
u {nil},
c
{“OK”}.
The transition
read-
config-written,
nil;
value
and
for reconfigure
LA’s
state
components
config-read,
variables,
G D U {nil),
first
and
values
is shown
567
config-read,
awake,
requested
a reconfigure-LA
.
reported,
The set of return
empty.
write-LA’s,
has
Boolean
are
nil;
initially
Systems
where
config-written
initially
relation
and
T G la,,,
requested,
initially
~ N U {nil},
of children(T),
for the
and
value-written,
config
name
version-number,
value-written,
value-read,
false;
transaction
value,
in Nested Transaction
initially
initially
nil;
reported
are
VT is equal
in Figure
to
12. Just
determines
the
as
current
configuration
with a read access to the configuration
object. Then the reconfigure-LA
invokes
read-coordinators
using that configuration;
when the first
read-coordinator
reports its commit,
the reconfigm-e-LA
remembers
the value
and
version
number
number
returned.
Then
of write-coordinators,
value
and version
using
number
the current
value
as needed.
Concurrently,
returned
to every
replica
the
the
its
reconfigure-LA
new
may
configuration,
invoke
along
by the read-coordinator.
This
any
with
the
propagates
in a write
quorum
of the new configuration,
invokes
at least
one write
LA
access
to the
configuration
object in order to record
the new configuration.
In order to
request
to commit,
the reconfigure-LA
must
receive
REPORT_
COMMIT
and at least one write
access
responses
for at least one write-coordinator
to x.
5.3
Correctness
In this
begin
%.
Proof
section,
with
First,
exactly
we prove
the
some definitions.
the
logical
as in Section
correctness
of system
In each definition,
access
sequence
of
~,
4 to be the subsequence
~.
As in Section
~ is a sequence
logical-sequence(
~),
of D containing
4, we
of actions
of
defined
is
the CREATE(T)
and REQUEST_
COMMIT(~,
v ) events, where
tion is based on the new definition
of la, which
T E la. (But now this definiincludes
the names in la,, C.)
Also,
current-un(
if
before,
~ is finite,
to be the
then
value
logical-value(B)
of the
last
and
logical
write
(or the
/3 ) are defined
initial
value
logical object if no such write occurs), and the highest
version-number
automata
in any
global
state
led
the local states of all replica
as
of the
among
to
by
~,
respectively.
We require
one new definition,
analogous
to logical-value(
~ ).
finite,
then
logical-config(
/3 ) is defined
to be either
REQUEST_
COMMIT(T,
v) is the last REQUEST_
COMMIT
or
transaction
name in la~,c that occurs in logical-sequence(@),
is
REQUEST_
is the value
COMMIT
event
of the last logical
no such reconfigure
The next lemma
correctness
inductive
(c)
says
theorem.
argument.
that
the
occurs. In other words, the logical configuration
reconfigure
access (or the initial
configuration
access occurs).
is the key to
As
in
Parts
Lemma
the
4.3.1,
proof
of Theorem
condition
(a) and (b) of condition
configuration
ACM
Namely,
if ~
config(T)
if
event
for a
co if no such
object
Transactions
holds
on Database
the
(1)
5.3.11,
is only
(1) are as before,
logical
Systems,
Vol.
the
needed
main
for
the
and part
configuration.
19, No.
if
4, December
As
1994.
568
K, J. Goldman
s
and N. Lynch
REQUEST..CREATE(
CREATE(T)
Effect: s’. awake = true
~’. armrke = true
s’ wdue-mmd = true
3’. awake = true
REQUEST.
CREATE(T’),
T’ ~
data(T’)
Core
= .d. wdue
uersion-nu
Precondition:
true
T’ $ s’. r-equeated
st. awake
T’), T’ 6 co.
Precondition:
niber(T’)
= s’ uersion-aumber
config(T’)
= co.fig(T)
T’ $ ~’.requested
=
Effect:
Effect:
s.requestea’
s.reguested
= s’. requestedu
REPORT-COMMIT
REPORT-COMMIT
Effect:
if
(T’,rJ),
s.reported
9’
T’ C COW
= s’. reportedu
{T’]
s.conjig-read
s.mported
=
REQUEST
.CREATE(T’),
s’. t,olue-rcild = true
data(T’)
== conjig(T)
T’),
T’ 6 co.
T’ @ s’ requested
Effect:
s.requested
REPORT.
COMMIT(T’,V),
Effect:
s
reported
Effect:
s reported=
s.reparfed
$.rmrsion-number
s’. reque~ted = s’. reported
s’. ualue- written = true
u.version-number
s(. conjlg-wrxtten
v=”
OK”
12.
important
returns
7’,V)
s’. au,ake = true
= true
Fig,
Transition
relation
= true
= jolse
for Reconfigure
LA’s.
pax-t of the lemma
is condition
(2), which tells us that
the value associated
with the previous
logical write.
~ be a finite
schedule
of U? such that
no logical
properties
hold
in any global
state
led to by P:
There exists a write quorum
YI ~ logical-config(
/3 ).write
any replica SY for an obJ”ect name Y E 7?;-,if v is the data
Sy, then v.version-number
= current-vn(
B ).
Transactions
access to
in ~.
(1) The following
ACM
u {T’)
{T’)
Precondition:
EtTect:
s.awake
(a)
s’. reperted
= s’. reportedu
REQUEST.COMMIT(
U {T’)
=
Let
U {T’)
= v.uer9ion-number
if s’. uafue-wad = }rd.se then
s.ualue = v value
5.3.1.
{T’]
= true
REPORTABORT(T’)
= true
= s’. reported
s.uatue-read
T’ ~ COwc
s reported = s’. rvprfed
s.config-written
= true
= s’. requested U {T’)
s.uersion-number
LEMMA
{T’}
= s’. reporftdu
s.config-written
s.tla!ue-rmd
X is active
= g’. reguesfedu
= true
if s’. ualue-mad = false then
ir. t,alue = v.mlue
the
T’ G COvJc
s’. awake = true
= true
REPORT. COMMIT(T’,V),
T’ C co,
Effect: s.reparted = s’. reported U {T’]
each read-LA
U {T’}
= true
Precondition:
config(l”)
= 3’ conjig
T’ $ s’. wquested
before,
= s’. reported
9’. ualue-wrttten
v
REQUEST-CREATE(
Precondition:
s’ atiuke = true
s.reporfed
U {T’]
= true
= true
9.conjig-read
s’. conjig-r.md
T’ 6 COW
= s’. reported
s’. uolue-written
s.reported = s’. rcportedu
{T’]
if s’. conjlg-read
= ~ahe then
9.cOflJig
(T’,rJ),
s.re~rted
Effect:
config-mad = /a/se then
3 conjig = u
Effect:
s.regue~ted
{T’}
= s’. regue9ted U {T’)
on
Database
Systems,
Vol.
19,
No
4, December
1994
such that
component
for
of
Quorum Consensus
(b)
Systems
if v is the data component
of SY and
~ ) then v.value = logical-value(
P).
The data
of the configuration
~ ends
value(
component
in REQUEST_
COMMIT(T,
object
U) with
569
.
For any replica
SY,
number
= current-vn(
(c)
(2) If
m Nested TransactIon
v.version
is logical-config(
T G la,,
then
-
~ ).
v = logical-
P).
As before,
PROOF.
of the form
logical-sequence(~)
CREATE(7’
) REQUEST_
ceed by induction
on the
logical-sequence(
~ ) contains
consists
of a sequence
COMMIT(T,
of pairs,
each
T ● la. We PrO-
v), where
number
of such pairs.
For the
no such pairs. Then logical-config(
basis,
suppose
~) = co. Parts
(a) and (b) follow as before, with logical-config(
P ) used in place of the fixed
configuration.
For part
(c), since
~ contains
no CREATE(T)
events
for
T G la, it contains
no REQUEST_
COMMIT
events for accesses to the configuration
object; therefore,
since the data component
of the configuration
object
is initially
co, this is also the case in any global state led to by /3. This yields
part
(c), and condition
For the inductive
the last
CREATE
holds
for
COMMIT(Tf,
(2) holds
step,
let
event
in
vacuously.
~ = /3 ‘y, where
~),
and
assume
f?’.
So,
logical-sequence(y)
= CREATE(T’f)
Vf) for some T+ = la and Vf of the appropriate
the fact that the system is serial shows
T~ have actions in y; in particular:
Claim
REQUEST_
is at least
that
only
begins
that
with
the lemma
REQUEST_
type. As before,
descendants
or ancestors
and
if
CREATE(T)
is
in
ace, U ace,,,
v) occurs in y, then T G descendants(Tt
). Also,
5.3.2.
If
T
COMMIT(T,
is in co and if CREATE(T)
T = children(T~).
There
logical-sequence(y)
logical-sequence(
or REQUEST_
one REPORT_
COMMIT(T,
COMMIT
event
v ) occurs
of
or
if T
in y, then
for a transaction
name
in
co, in y; let T‘ be the read-coordinator
in co, with the first REPORT_
COM). Let y‘ be the portion
of Y UP to
MIT in y. By claim
5,3.2, T’ ● children(Tf
and including
the REPORT_
COMMIT
shows that no LA requests
the creation
event for T‘. Then, because the code
of any write-coordinator
(for replicas
or configuration
objects) until
after a read-coordinator
we see that there are no REPORT_
COMMIT
events
T~ that are write accesses.
Also, since -y must contain
a REQUEST_
CREATE
name
in
REPORT_
CO,C with
up
co,, the
COMMIT
the
to and
claims
first
including
about
PROOF.
REPORT
this
the states
We have
accesses
event
for a transaction
definition
of T~ implies
that
there
is at least
one
event in y‘ for an access in CO,C;let T“ be the access in
_COMMIT
REPORT_
of the automata
Claim
5.3.3.
Let s be the
with
Tf in any global
state
logical-conjig(
~ ‘).
y‘ for write
has returned
a value,
in y‘ for descendants
of
in y‘,
and let
COMMIT
event.
for Tf and
y“
We
be the
now
portion
give
of y
several
T‘.
state of the transaction
reachable
from
~ ‘y”
in
automaton
~. Then
associated
s .config =
argued that there are no REPORT_
COMMIT
events in
that are descendants
of Tf. So, by Claim 5.3.2, no write
ACM
TransactIons
on Database
Systems,
Vol
19, No.
4, December
1994
570
K. J. Goldman
.
and
N
Lynch
accesses occur in y‘. Therefore,
logical-corz~ig(
returns
as its value
by part (c) of the
P ‘). By definition
nently
by T“.
set to the value
Claim
5.3.4.
Let
returned
s be the
state
Therefore,
of the
inductive
hypothesis,
T“
of Tf, s.config
is permathe claim
transaction
❑
holds.
automaton
associated
with T‘ in any global state reachable
from ~‘ in ~ ‘y’. Then s.version-nurnber
and s.value contain
the highest
version-number
and associated
value among
the states led to by P‘ of the replicas
whose names are in s. value-read.
Claim
with
5.3.5.
Let
s be the
T~ in any global
= current-un(
Let
PROOF.
MIT
~‘)
event
state
and
state
of the
reachable
s.value
s‘ be the state
from
transaction
automaton
~ ‘y’ in ~. Then
= logical-z] alue(
~ ‘).
of T‘ just
it issues
in y‘. By the definition
before
associated
s.version-nurnber
its REQUEST_
of a read-coordinator,
s‘. read
COM-
must
contain
a read quorum
.%?G config(T
‘). read. Claim
5.3.3 and the defhition
imply
that
config(T’
) = logical-con fig( ~‘ ), so the read quorum
Y
of Tf
is in
logical-config(
~‘ ). read. By part (a) of the inductive
hypothesis,
there is some
~‘ ).write such that the states of all replicas
write quorum
7Z-● logical-config(
for object names in Win
any global state led to by P‘ have version-number
=
current-vn(
~‘ ). Since logical-config(
~‘ ) is a configuration,
M? and W- must
have a nonempty
intersection.
So s‘ ,read must contain
at least one object
name in Z? So, by Claim
5.3.4, s ‘.uersion-number
= czLrrent-un( ~ ‘), Therefore,
by part
When
T‘
(b) of the inductive
reports
its
commit
hypothesis,
to T~, the
s‘ .ualue
= logical-ualue(
version-number
and
~‘ ).
value
compo-
nents of Tf are set equal to s‘ version-n
urn ber and s‘ value,
respectively.
By
definition
of T~, these components
are never again modified.
Therefore,
Claim
❑
5.3.5 is proved.
We now return
to the main
proof.
We consider
the three
possible
cases for
T~ .
(1) T~ G la,.
The logical-value(
/3) = logical -value( ~ ‘). Since T~ invokes
only read-configuration-coordinators
and read-coordinators,
which
in turn
invoke only read accesses, the version-number
and value components
of the
states
global
of the replicas
in any global state
state led to by ~‘, so current-vn(
led to by ~ are the same as in any
~ ) = czu-rent-un( ~ ‘). Likewise,
we
have logical-config(
~ ) = logical-config(
P ‘), and
configuration
object is the same in any global
Therefore,
this
time
condition
(1)
on Claim
based
holds
for
~.
The
proof
the data component
of the
state led to by ~ and ~‘.
for
condition
(2)
is
as before,
5.3.5.
Then
logical-ualue(
~ ) = data(T~ ) and logical-config(
P) =
(2) T~ ● la,,,.
logical-conflg(
~ ‘). We prove two claims
about
the information
associated
with the descendants
of T~ invoked
in y.
Claim
5.3.6.
If T is a write-coordinator
invoked
= data(T~ ),
ber(T)
= current-vn(
~‘) + 1, data(T)
config( ~‘ ).
ACM
TransactIons
on
Database
Systems,
Vol.
19,
No.
4, December
by T~ then version-numand
config(T)
= logical-
1994
Quorum Consensus
Let
PROOF.
event
that
s be the
occurs
+ 1, data(T)
cannot invoke
reports
and
by
of T~ just
in y. By definition,
= data(T~),
and
a write-coordinator
its commit.
in y occur
state
after
So all REQUEST_
5.3.5,
before
the
CREATE
by Claim
s.uersion-number
= current-vn(
5.3.7.
If T is a write access for an object
= (current-un(
/3’) + 1, data(T~ )).
As before,
5.3.6,
T~ cannot
and the definition
request
to commit
quorum
of replicas,
logical-coizfig(
quorum
rent-un(
in logical-config(
P‘ 8 ) = eurren,t-vn(
until
to
conflg(TU
receive
Then
Claim
5.3.8.
). By
REPORT_
config(
Claim
logical-value(
fi ‘),
Analogous
5.3.9.
data(T)
invoked
5.3.6
in y, then
❑
a REPORT_
COM-
request to commit
accesses to a write
Claim
5.3.6,
COMMIT
Claim
corzfig(TU, ) =
events
5.3.7,
~ ) = logical-ualue(
data(T)
Claim
data(T)
If
to that
about
invoked
for a write
condition
~‘)
and
(1) still
logical-con-
the information
by T~ then
= logical-ualue(
5.3.10.
of Claim
T is a write
= (current-un(
By Claim
PROOF.
@‘),
associ-
uersion-num-
and
config(T)
=
If
access invoked
/3’), logical-ualue(
5.3.8 and the definition
T
❑
5.3.6.
is a write
access
by a write-coordinator
It follows
By the definition
from
T~ cannot
Cowc
invoked
by
T~ in
y,
Claim
request
of T~.
then
5.3.9 that
to commit
❑
current-un(
until
~ ) = current-un(
a REPORT_
COMMIT
~ ‘). By definifor at least
of its write-coordinators
occurs. Let TW be any write-coordinator
child
5.3.8, uersion-number(T,U
that
has a COMMIT
event
in y. By Claim
current-un(
~ ‘); data(T,u ) = logical-ualue(
P ‘); and config(T,U ) = config(T~
definition
y,
❑
of a write-coordinator.
in
in
P ‘)).
= config(T~).
PROOF.
tion,
Claim
T~ ).
PROOF.
then
in ~
fl ‘),
~ ‘). Therefore,
by Claim
5.3.7, it follows
that cur~‘)
+ 1, so condition
(1) holds in any global state
If T is a write-coordinator
= current-un(
/3 ‘). Thus,
it receives
fig(P)
= conflg(Tt
). We first give three claims
ated with descendants
of T~ invoked
in y.
ber(T)
T~
Let T(O be the write-coordinaevents in y, and let ~ be the
led to by ~ ‘8. As before, but this time using
holds for ~. Condition
(2) holds vacuously.
(3) T~ G lar,C.
= logical-config(
COMMIT(
TU,). TW cannot
COMMIT
events for write
according
~ ‘), so T,,, must
by definition,
read-coordinators
of a write-coordinator.
MIT from at least one of its write-coordinators.
tor child of T~ that has the first COMMIT
portion
of y up to and including
until it has received REPORT_
CREATE(T)
for write-coordinators
s.config
Claim
data(T)
571
.
= s.uersion-number
events
5.3.3,
❑
By Claim
REQUEST_
= s.config.
Also,
at least one of its
holds.
PROOF.
Systems
uersion-number(T)
config(T)
until
~ ‘y’. Therefore,
Claim
in Nested Transaction
of
a
write-coordinator
ACM
Transactions
and
Claim
on Database
5.3.9,
Systems,
Vol.
just
19, No.
prior
to
4, December
one
of T~
) =
). By
the
1994,
572
K J Goldman
.
REQUEST_
config(
COMMIT
T~[,).zurite
sion-number
and N Lynch
for
such
Tu
that
in
y,
there
all replicas
= uersion-number(
must
be
for object
T~~,). By
the
a write
names
equalities
%7”●
quorum
in 7?< have
above,
data .c’er-
Yfl- is
a write
quorum
in logical-config(
~ ), and all replicas
for object names in %- have
data-uersio?z-number
= current-vn(
~ ). This shows part (a).
Claim
5.3.9 implies
that
all write
accesses in -y have data = ( currelz tZ.ln( @‘), Zogical-zjalue(
~ ‘)), which is equal to (current-vn(
~), logical-ualzte(
~ )).
Since part (b) holds for ~‘, it also holds for ~. Claim 5.3.10 and the definition
of a reconfigure-LA
imply
that
the data component
of the state
of the
configuration
object in any global state led to by ~ is equal to corzfig(T~),
and
hence to logical-config(
~).
read-LA,
condition
(2) holds
three
❑
Now
denote
we give the main correctness
theorem
for @. Again,
let
the transaction
names and object names of d, respectively.
Let
5.3.11.
(3 be a finite
that
= PIT for each transaction
(2) ylY
= ~lYfor
The
each object
proof
is analogous
by
removing
from
@ all
REQUEST
_COMNIIT(T,
REPORT_
COMMIT(
T, u), and
tion
names
T
in
end
of the
of
d.
two conditions
name
name
The
schedzde
the followilzg
(1) ylT
PROOF.
holds.
T~ does not
in all
y of d such
lemma
(c). Since
5.3.1.
schedule
the
part
Thus
THEOREM
cases,
This shows
vacuously.
proof
Then
name
for
Lemma
Y~l and
there
a
-2Pi
exists
a
hold:
T ● Y> – la and
Y •;2~.
to that
for Theorem
4.3.7.
We construct
y
REQUEST_
CREATE(T),
CREATE(T),
v),
COMMIT(T),
ABORT(T),
REPORT _ABORT(T
) events for all transac-
co U ace, U ace,,,.
Clearly,
the
two
conditions
hold.
It
re-
mains to show that y is a schedule
of .W to do this, it suffices to show that y
projects to yield a schedule of each component
of M. It is easy to see that, for
each nonaccess
transaction
automaton
T, and for each object
for
name
ule of the object automaton
CREATE(T)
and REQUEST
T of s?’, y IT is a schedule
name
of the transaction
Y = 7A – {X, Z},
y IY is a sched-
for Y. The sequence y IZ is exactly the sequence of
Since
_COMMIT(T,
z)) actions in y for T = la,,,.
reconfigure-LA’s
preserve
transaction
well-formedness
and always
return
value u = “OK”,
y IZ is a schedule
of the dummy
object automaton.
Also, as
before, y is a schedule
of the serial scheduler,
It remains
to show that y IX is a schedule
of S’x. The proof is by induction
on the length
of /3. The basis,
when
~ is of length
O, is trivial.
Suppose
that
6 = B ‘n, where m is a single event and the result true for @‘. The onlY case
that
is not immediate
is the case where
n is a REQUEST_
COMMIT
la, U laU,,
so
suppose
that
n =
event
for
a transaction
name
in
REQUEST_
COMMIT(T,
u), where T G la, U /aW. The proof that the preconditions
of w in S4Y are satisfied
as in Theorem
4.3.7, but this time using
❑
Lemma
5.3.1 in place of Lemma
4.3.1. Therefore,
y is a schedule of W.
ACM
TransactIons
on
Database
Systems,
Vol
19,
No.
1, December
1994
Quorum Consensus
5.4 System ~’: The Reconfigurable
with Replicated
Configurations.
It
is possible
implementing
interesting
avoid
point
to
manage
in Nested Transaction
Quorum
Consensus
configurations
in
Systems
Algorithm
a centralized
the algorithm
described
by system ~
to consider
replicating
the configuration
573
.
fashion,
directly
above. However,
it is also
information.
This may
the risk of the configuration
storage becoming
a bottleneck
or a single
One
way
of doing this is using
the
of vulnerability
in the system.
fixed-configuration
ration
that
quorum
describes
configuration.
A
configuration
consensus
algorithm:
define
quorums
and
quorums
read
write
read-con figuration-coordinator
replicas,
and returns
ation
number,
which
configuration-coordinator
both
a fixed
reads
metaconfigu-
of replicas
a read
of the
quorum
of
the latest configuration
and a generto
a version
number.
A
write-
is analogous
writes the new
configuration
to a write
quorum
of
configuration
replicas;
it does this using
the generation
number
obtained
from a read-configuration-coordinator.
Except
for the need to manage
the
generation-number
information,
the LAs are the same in this algorithm
as in
@. It
is also
possible
to manage
metaconfigurations,
similar
All
issues
of this
avoid
arise
could
an infinite
tion that
be either
the
configuration
i.e., to reconfigure
the
for the implementation
continue
regress,
for
any
we must
replicas
configuration
using
number
of steps.
stop at some point
However,
in order
does not require
further
configuration.
Such a stopping
point
a centralized
implementation
or a fixed-quorum
algorithm.
same
configurations
are
used
the same configurations
for
then
a special
both
stages.
are used to manage
to
and use an implementa-
case
of
There
is another
interesting
alternative,
however,
in which
one-to-one
correspondence
between
the replicas
at the last two
stages,
But
of metaconfigurations.
that
a centralized
implementation
is just
quorums,
where there is only one replica.)
the
changing
replicas!
That
using
fixed
there
stages,
is, at the
the data
might
(Note
replicas
last
is a
and
two
and the
copies of the configurations
themselves.
This means that the read-configuration-coordinators
will
need to read a set of replicas
of the configuration
objects
that
form
a read-quorum,
without
first
knowing
what
the
current
read-quorum
is! It turns out that they can simply start reading
configuration
replicas and use the generation
numbers
and configurations
stored in them to
determine
In this
when a read-quorum
for the current
configuration
has been read.
subsection,
we will describe
this strategy.
For simplicity,
we con-
sider the special case of which
objects and configurations—and
there are only two kinds of replicas—of
logical
both are managed
using the same configura-
tions.
The
system
we define
is called
give the new automata
that
the configuration
replica
write-configuration-coordinator
slightly
modified
from
objects
those
%“. We first
appear
define
the type
of %, and then
in K but not in .@. The new automata
are
and the read-con figuration-coordinator
transaction
automata.
Also,
the LA’s
and
are
of W.
5.4.1 System Type.
The type of system 8’ is defined to be the same as that
of .$?, with the following
modifications:
The object name ~ is now replaced by
ACM
Transactions
on Database
Systems,
Vol
19, No
4, December
1994.
574
K. J. Goldman
.
Fig.
13.
and N Lynch
Reconfiguration
coordinators
a new set of object names,
~,
assume that there is a bijection
are new
accesses
U acc,C
but
transaction
names,
in %“ they
are not;
changes to the
those indicated
representing
the configuration
replicas.
We
i, from the names in ~ to those in j?’. There
ace,,
and
acc,,, C, which
each
transaction
name
are the
name
in
CO,C now
in COU,
C has children
read
and
has
write
coordinators
with
In ~, we keep all such information
from
in
are the
names other
The naming
and their
some
~
children
in acc,,)C. These
system type—for
example,
no transaction
are parents
of the new transaction
names.
structure
for the read and write configuration
are shown in Figure
13.
In defining
~, we associated
information
names.
children
to the configuration
replicas,
respectively.
We let acc = ace, u accU,
U accUC. The transaction
names in CO,C and COW,Care accesses in J%’,
acc,C, and each transaction
only
than
and their
children
of the
transaction
and add more
information
as follows. We now assume that every transaction
name T ● COW~ has associated values old-con fig(T)
G C and generation-number(T)
G N, in addition
to
data(T)
as before. Also, we assume that every transaction
name T = acc,C
has hind(T)
= write
= read,
and also
and that
data(T)
every
transaction
name
T = accU,C has kind(T)
G N x C.7
5.4.2 Configuration
Replica Automata.
We define a configuration
replica
automaton
for each object name Y E ~. ‘I’his is a read/write
serial object for
Y with
domain
N x C and initial
value (O, co). For v ● N x C, we write
u .generation-number
5.4.3
and
Coordinators.
v con fig
In this
to refer
section,
to the components
we define
of v.
the coordinator
automata
%’. The read- and write-coordinators
are as in S, so we need only
read-con figuration-coordinators
and write-con figuration-coordinators.
define
read-configuration-coordinators.
coordinator
is to determine
The
the
latest
purpose
generation
in
define the
We first
of a read-configuration-
number
and configuration,
on the basis of the data returned
by the read accesses it invokes on con!@uration replica objects.
A read-configuration-coordinator
for transaction
name T G Co,c has state
components
awake, config, generation-number,
requested,
reported,
and read,
CO;
where
awake is a Boolean
variable,
initially
false; config ● C, initially
● N, initially
O; requested
and reported
are subsets
of
generation
number
7As with
the transaction
read/write
objects,
number
ACM
names
agreement
and configuration
TransactIons
on
into
Database
in accu,,
with
a single
Systems,
since
that
the transaction
definition
data
Vol.
compels
names
the
attribute.
19,
No
4, December
in acc W, name
combination
1994
of the
accesses
generation
to
Quorum Consensus
in Nested Transaction
REPORT.ABORT(T)
Effect: a.mported
CREATE(T)
Effect: s. awake = true
g.awake
= true
s.reported
REQUEST.CREATE(T’)
Precondition:
Systems
= s’.reportrd
= s’.mported
REQUEST.COMMIT(
Precondition:
sf. awake = true
s’. awake = true
s’. requested
REPORT.
Eflect:
COMMIT(T’,U)
s. reported= s’.reported
s.read
if
= s’. requested U {T’)
u {T’)
u {T’}
= .9’. reported
(3 R)(t(R)
s.requested
575
T,U)
T’ $ s’. requested
Effect:
.
E s’.config.rcad
A?? ~ s’. read)
v = (s’. generation-number,
s’. config)
Effect:
U {T’}
s.awake
= jalse
= s’. read U {objwt(T’)}
v.gene-ntion-number
>
a’. generation-number
then
s.generation-number
= rr.generrrtion-number
s. config = v. config
s.reported
= s’. reportedu
{T’]
s.r-ead = s’. read u {object(T’)}
if v.generation-number
> a’. generation-number
then
9.generation-number
= v.generrrtion- number
s.config = v.mnfig
Fig.
children(T),
values
The
Figure
Transition
initially
relation
empty;
and
for Read-Configuration-Coordinators.
read
g ~,
initially
empty.
The set of return
VT is equal to N X C.
transition
relation
for read-configuration-coordinators
14. A read-con figuration-coordinator
invokes accesses
replicas.
On receiving
returned
own
14.
a REPORT_
generation
state.
updates
If
its
the
own
returned
number
with
returned
COMMIT,
the
values.
The
number
and
interesting
replica
and ~)
commit,
associated
names
that
corresponds
is
part
of
larger,
a
the
the
of its
coordinator
components
to the
read-configuration-
its REQUEST_
s‘ read contains
(using
compares
component
configuration
coordinator
is the set of preconditions
for
the coordinator
reaches a state s‘ in which
ration
coordinator
generation-number
generation
generation-number
the
is shown
in
to configuration
COMMIT.
When
a set of configu-
the correspondence
i between
~
to a read quorum,
according
to s‘ .config,
then it may request
to
returning
the highest
generation
number
it has seen, along with the
configuration.
uration-coordinators
One
and
the
should
note
the
read-coordinators
similarity
of the
in the way
the
read-configmost
current
configuration
and generation
number
are obtained,
and also the seemingly
“circular”
use of replicated
configuration
data in order keep track
of the
current
configuration
of the configuration
replicas
themselves.
Of course, it remains
to show that this method
of determining
the current
configuration
is correct.
The key intuition
is that
the write-configuration
coordinators,
defined
next and as in system g, write the new configuration
and generation
number
to a write-quorum
of the old configuration,
which of
ACM
Transactions
on Database
Systems,
Vol.
19, No.
4, December
1994.
576
K J, Goldman
.
and N. Lynch
CREATE(T)
Effect: s.awake = true
9 awake = true
REPORT_ABORT(T”)
Effect: .s.reported
REQUEST.CREATE(T’)
Precondition:
REQUEST_
s reported
s’. awake = true
COMMIT(T,W)
st. ou,ake = true
= (genemtion-nwnker(7’),
data(l’))
s’. requested
= 9’. reported
(3 W)(1(W)
G old-conjig(T).
A~
Effect:
9.reque.9ted
u {T’)
U {~}
Precondition:
T’ @ g’. requested
dotu(T’)
= s’. report.d
= s’. repOrted
= s’. rquesfed
u =
U {T’}
~
write
s’. written)
“OK”
Effect:
REPORT-COMMIT(T’,
Effect:
s.written
= 9’. repOrted
s.wr:tten
=
Fig.
course will
configuration
3,
= s’. reported
U
awoke = false
{T’}
= s’. wrvtten U {object(T’)}
s.reported
quorum
occurred
ZJ)
s.r-eported
s’. written
15.
U {T’)
{object(T’)}
U
Transition
relatlon
for Write-C’onfiguration-Coordinators.
intersect
all read-quorums
of the old configuration.
replica’s
generation
number
is the largest
found
according
to that
replica’s
and the given configuration
configuration,
is the current
then
one.
no
Hence, if a
in reading
a
such
write
has
As promised,
we next define write-con figuration-coordinators.
A write-configuration-coordinator
for transaction
name T ● co.,, has state components
awake,
initially
requested,
reported,
and written,
false;
requested
and reported
empty;
and
VT is equal
The
Figure
written
is a subset
of ~,
where awake is a Boolean
are subsets of children(T),
initially
empty.
variable,
initially
The set of return
values
to {“OK”}.
transition
15. Recall
relation
for write-con figuration-coordinators
is shown
in
that a write-configuration-coordinator
T has an associated
old configuration,
old-config(
T ), and a new configuration,
data(T),
as well as
a generation
number,
genera tion-number(T
). The purpose
of a write-configuration-coordinator
is to write its generation
number
and new configuration
to a write
quorum
in its old configuration.
It does this by invoking
write
accesses
receiving
to the configuration
REPORT_
COMMIT
replicas,
actions
cas in some set corresponding
and can only request
to commit
for accesses to all configuration
to a write
quorum,
according
after
repli-
to old-con fig(T
).
A read-LA
in W is identical
to a read-LA
5.4.4 Logical
Access Automata.
in W, except for the REPORT_
COMMIT
input
action for children
that are
read-configuration-co
ordinators.
The only difference
is that the read-configuration-coordinator
returns
not only a configuration,
but a pair consisting
of
a generation
number
and a configuration.
The read-LA
simply
ignores
the
generation
number.
Formally,
we have:
REPORT_
COMMIT(T
‘ , u),
T ‘ E CO,,
Effect: s.reported
= s‘ report
u
if s‘ .config-read
= false
then
s.config = v.config
s.config-read = true
ACM
TransactIons
on
Database
Systems,
{T’}
Vol.
19,
No
4, December
1994
Quorum Consensus
s.i-eported
Systems
577
.
U {T’}
= s‘ reported
if s‘ .config-read
in Nested Transaction
then
= false
s.confi,g = u.config
s.config-read
A write-LA
described
= true
in %’ is identical
to a write-LA
in W, except
for the same change
above for read-LA’s.
Unlike
the read- and write-LA’s,
the reconfigure-LA’s
of % make
generation
numbers
returned
by the read-configuration-coordinator,
age the configuration
from
those
of 3
replicas.
than
do
configuration-coordinator
and a configuration,
used to determine,
LA, the allowable
@, except
more
read-
number
These are
name
for the
T G la,,,
addition
has the same state
of the
component
compon-
generation-
and has initial
value nil. The
which takes on values in N U {nil}
is the same, except that the type of the return
values for read-con-
figuration-coordinators
is now N x L’. The actions
REPORT– COMMIT(
Effect: s. reported
if s‘. config-read
s.config
T ‘ , u ), T ‘ G
= s‘ reported
COrc
U {T’}
then
= false
= u.con fig
s.generation
-number
= v.generation-n
s.config-read
umber
= true
= s ‘reported
U {T’1
ifs’ .conflg-read = false then
s.config = v.config.
s.generation -nwnber
— Lj.generation-number
s con fig-read = true
s. reported
5.5 Correctness
In this
of % differ
Here,
a
a pair consisting
of a generation
are saved by the reconfigure-LA.
for transaction
as it does in
number,
signature
returns
of which
reconfigure-LA’s
and
write-LA’s.
for the write-con figuration-coordinators
T‘ invoked
by the
values for old-con fig(T’)
and generation-number(
T’).
A reconfigure-LA
ents
both
Thus, the
the
read-
use of the
to man-
section,
that
change
are:
REQUEST– CREATE(T ‘), T ‘ G COU,L
Precondition:
s’. awake = true
s’. value-read = true
data(T’)
= config(T )
old-con fig(T’ ) = s‘ .confzg
genera tion-nuntber(T’
)
— s ‘generation-number
+ I
T‘ g s‘ requested
Effect:
s.requested = s ‘requested u {T’]
Proof
we prove
the
correctness
of system
~. Once
again,
we begin
with definitions,
this time for a sequence
/3 of actions of %’. Specifically,
we
extend the definitions
of logical-sequence(
P ) and logical-config(
D ) to apply
the sequences of actions of %’. We also require
one new definition,
analogous
to current-un(
~). Namely,
if ~ is finite,
then current-gn(
P ) is defined to be
the highest
generation-number
among the states of all configuration
replica
automata
in any global state led to by ~.
We prove correctness
of C by means of a correspondence
between
%’ and
,%. If ~ is a sequence of actions of %, then we define f( ~ ) to be the sequence
of actions of .’@ that results from
(1) removing
all REQUEST_
MIT(T,
u), COMMIT(T),
ACM
CREATE(T),
CREATE(T),
REQUEST_
COMABORT(T),
REPORT_
COMMIT(T,
u), and RETransactions
on Database
Systems,
VO1. 19. No.
4, December
1994.
578
K, J. Goldman
.
PORT _ABORT(T)
and N. Lynch
events
for
all
transaction
names
T = czcc,C u accWC,
and
(2) replacing
all REQUEST_
COMMIT(T,
u ) and REPORT_
REQUEST
_COMMIT(T,
where
T G CO,C, with
REPORT_
COMMIT(T,
u.config ), respectively.
The
following
lemma
is the
key
to the
proof
COMMIT(T,
u.con~ig)
of Theorem
5.6,
u)
and
the
main
correctness
theorem
for %. As in Lemma
5.3.1, condition
(1) is only needed
the inductive
argument.
Part
(a) says that
any configuration
replica
either
has the
current
generation
number,
or has a configuration
all configuration
replicas
in some write
quorum
of that
generation
numbers
higher than the one held by S’y. Part
configuration
replica
logical
configuration.
holding
the current
generation
Condition
(2), the important
that
yields
our construction
LEMMA
active
Let
5.5.1.
a schedule
~ be a finite
such
configuration
(b) says that
for
SF
that
hold
every
number
also holds the
part of the lemma,
says
of Q?.
schedule
of I% such that
no logical
access is
in ~.
(1) The following
(a)
properties
hold
in any global
For any configuration
replica
nent of Sy and v.generation-nu
ists a write
quorum
state
led to by B:
automaton
SF, if v is the data compomber < current-gn(
~), then there ex-
Y’ E v.config.write
with
the following
property:
any configuration
replica SF for an object name ~ such that i(~)
component
of SF then
v‘ .generation-num
if v’ is the data
u.generation-nu
(b)
mber.
For any configuration
v.generation
-number
config(
for
E %‘,
ber >
replica SF, if v is the data
= current-gn(
~ ) then
component
v.config
of SF and
= logical-
B ).
(2) The sequence
f(~)
is a schedule
of W.
PROOF.
We proceed
by induction
on the number
of pairs
of the form
CREATE(T)
REQUEST_
COMMIT(T,
v) in logical-sequence(
/3 ). For the basis, suppose
logical-sequence(
~ ) contains
no such pairs.
Then
logical-
config( /3 ) = co. Since all configuration
replicas
initially
have generationnumber
== O and config = co, the same is true in any global state led to by ~.
Thus, current-gn(
~ ) = O. This implies
that condition
(l), and condition
(2)
hold because the two systems behave identically
if no LAs are invoked.
For the inductive
step, let ~ = /3 ‘y, where logical-sequence(y)
begins with
the last
lemma
CREATE
event
holds
for
P’.
in logical-sequence(~),
Then
logical-sequence(y)
and
assume
that
= CREATE(Tf)
the
REQUEST_
COMMIT(Tf,
Vf ) for some T~ 6 la and Vf of the appropriate
type.
As before, we can restrict
attention
to T~ and its descendants.
For any LA to issue a REQUEST_
COMMIT,
it must
first
receive
a
REPORT_
COMMIT
for a read-configuration-coordinator
and a read-coordinator. Let T‘ be the read-coordinator
in co, with the first REPORT_
COMACM
TransactIons
on
Database
Systems,
Vol.
19,
No
4, December
1994
Quorum Consensus
MIT in ~, and let
REPORT_
COMMIT
in Nested TransactIon
Systems
.
579
y‘ be the portion
of y up to and including
the
event. Also, let T“ be the read-configuration-coordinator
given
in corC with the first REPORT_
COMMIT
in y, and let y“ be the portion
up to and including
this REPORT_
COMMIT
event.
Claim
5.2.2.
(current-gn(
lar
proof
contain
COMMIT(T”,
of the automaton
its REQUEST_
of Claim
4.3.3,
the highest
v)
occurs
in
y,
then
with
T“
when
u =
B ‘)).
s be the state
issues
to the
s.config
REQUEST_
B ‘), logiccd-config(
Let
PROOF.
automaton
If
of y
COMMIT
associated
event.
we can show
generation-number
Using
that
an argument
that
simi-
s.generation-number
and associated
and
config
among
the configuration
replicas
whose names are in s.read. Suppose (for contradiction) that this highest
generation-number
is not equal to cument-gn(
P ‘);
then part (a) of the inductive
hypothesis
implies
that there is some set ‘7/”
iCZ~) ● s.config.
with
write
such
that
all
names in 7?7-have generation-number
tion of a read-configuration-coordinator,
i(%)
● s.config.
and
onto,
i(~)
read.
Since
and
s.config
i(7Y”) have
configuration
replicas
> s.generation-number.
s.read must contain
is a configuration,
a nonempty
and
intersection;
for
object
By the definia set W with
i is one-to-one
therefore,
some
configuration
replica for an object name in ~, and hence in s.read, lies in 77
and hence has its generation-number
component
strictly
greater
than s.generation-number.
This is a contradiction
to the claim that s.generation-number
contains
automata
the highest
generation
number
among
the configuration
whose names are in s. read. It follows that s.generation-nz~
current-gn(
P ‘). Then it follows
s.config = logical-config(
B ‘).
Now
w-e consider
three
by part
(b) of the
inductive
replica
mber =
hypothesis
cases.
It is easy to see that
condition
(1)
(1) Tf G la,.
descendants
of Tf are write
accesses to configuration
is preserved,
since no
replicas.
Claim
5.5.2
implies
that the value returned
by T“ in y has as its configuration
nent the value logical-config(
/3’). By the definition
of logica l-config,
the same value that
ASO, the automaton
that
would be returned
associated
with
by the configuration
Tf in & behaves
compothis is
automaton
in <Z.
identically
to that
associated
with
Tf in B except that,
in %’, it receives
and ignores
the
generation-number
component
of the return
value
of T“,
exactly
as the
correspondence
f requires.
Since ~ is a schedule of %, it follows that f( /3) is
a schedule of &?, which shows condition
(2).
(2) Tf G law.
(3) Tf G la,,,.
information
The argument
We first
associated
with
is similar
give
several
to the one for the previous
claims
the descendants
about
the
state
case.
of Tf and
Claim
5.5.3.
If s is a state of Tf in a global state reachable
from
B, s.generation-number
= current-gn(
~‘) and s.config
= logical-config(
PROOF.
of a read
By Claim
configuration
5.5.2, since y“
coordinator.
ACM
Transactions
ends
the
it invokes.
with
the
first
REPORT_
D ‘Y” in
P ‘).
COMMIT
❑
on Database
Systems,
Vol
19, No
4, December
1994.
580
K. J, Goldman
.
Claim
then
5.5.4.
data(T)
If T is a write-con
= config(T~
old-con fig(T)
Claim
5.5.5.
tion-coordinator
invoked
by Tt in y
(3’) + 1, and
= current-gn(
P‘ ).
5.5.3 and the definition
❑
of T~.
If T is a write access in acc&,C invoked
by a write-configurain y then data(T)
= ( current-gn(
@‘) + 1,config(T+ )).
By Claim
PROOF.
figuration-coordinator
), generation-number(T)
= logical-config(
By Claim
PROOF.
and N, Lynch
5.5.4
and
the
definition
of a write-configuration
coordi-
❑
nator.
By definition,
T~ cannot
configuration-coordinators
request
to commit
has committed;
until
at least
one of its write-
let Tl,) be any write-configuration-
coordinator
child of T~ that has a COMMIT
event in y. Moreover,
at least one
write access to a configuration
replica
must commit
before a write-cont3guration-coordinator
can request to commit.
From Claim 5.5.5, we know that any
write-configuration
access T that
is a descendant
of T~ has data(T)
=
current-gn(
~ ) = cz~rrent-gn( ~‘)
(current-gn(
B‘) + 1,config(T~ )). Therefore,
+ 1. Furthermore,
by the
config( B ). Therefore,
invoked
by a write-con
gn( ~), logical-config(
To show that part
three
definition
of
logical-config,
we can conclude
that
figuration-coordinator
~ )).
(a) of the inductive
cases for each configuration
led to by ~. (By the preceding
replica
arguments,
co~zfig( Tt ) = logLcal -
all write
accesses T in acct~,C
in y have data(T ) = ( current-
hypothesis
~ with
these
holds
data
for /3, we consider
z), in any global
state
cases are exhaustive.)
(1) v.generation-number
< current-gn(
~ ‘). ~hen ~ is not updated
by any
descendant
of Tt in ~. That is, v is the data in Y in any global state led to by
~‘. Therefore,
that
there
configuration
it follows
immediately
exists
a set ~“ with
replica
S~, for an
component
of S’y, in any global
> u.generation-number.
During
held
in ST, is unchanged,
greater
(a) holds
than
u.generation-n
from part (a) of the inductive
hypothesis
i(7#’) G z).con fig. zurite such that
for any
object
name
~‘
E w“,
if
v‘
is
the
data
state led to by ~‘, then u‘ .generation-nurnber
y, either the genera tion-n urn ber in the data
or else it is set to current-generation(9)
umber)
by a descendant
(which
of T~. In either
is
case, part
for D.
(2) u.generation-num
ber = current-gn(
~ ‘).
Then by part (b) of the inductive hypothesis,
we know that u.confzg = Zogical-config(
/3 ‘). By definition
of a
write-configuration-coordinator,
just prior
to the REQUEST_
COMMIT
for
).zurite such that
T,,, in y, there must be a set W“ with
i(~”’) ● old-config(T,,,
write
accesses to all configuration
replicas
for object names
in w have
committed,
so by our arguments
above, all of these configuration
replicas
have generation-number
= current-gn(
~ ) in any global state led to by ~. By
Claim
5.5.4, old-config(TU)
) = logical-config(
~‘ ). Therefore,
7’ satisfies
i(~’)
E u.config.zorite,
and all configuration
replicas
for object names in 77” have
generation-number
= currenf-gn(
state led to by /3. Therefore,
part
ACM
TransactIons
on
Database
Systems,
Vol.
~ ) > v.generation-number
(a) holds.
19,
No
4, December
1994
in
any
global
Quorum Consensus
(3) u.generation-number
in Nested Transaction
= current-grz(
/3 ).
Then
Systems
part
(a) is trivial.
So in all three cases, part (a) holds for ~.
We now show part (b). By the definition
of current-gn,
the
data
state
any
component
led
to by
of the
/?’ then
configuration
state
of a configuration
u‘ .generation-nurn
replica
that
has,
ber
in
any
we know
replica
<
that
if
u‘
Therefore,
/? r ).
state
led
to
by
/3, a
generation
number
equal to current-gn(
~ ) (and so larger than current-gn(
must be written
after ~‘. The writer
of such a configuration
replica must
descendent
coordinator
which
of Tf, and therefore
must be a child of some write
T,,, ~ invoked by Tt. Since all such TU,~ have data(TU
by definition
Condition
~), part
The lemma
above immediately
THEOREM
yields
the transaction
Let
5.5.6.
~
of y of @ such
and object
schedule
the following
(1) ylT
= ~lT
(2) ylY
= (31Y for each object
for each transaction
name
name
THEOREM
(1) y\T
Let
5.5,7.
y
such
of.&
~
that
be a
(2) YIY = D 1~ for each object
finite
T G y;
6. CONCURRENT
REPLICATED
.%. Let
P’. Then
there
exists
a
hold;
U COZ,,C) and
– (co,,
and 5.5.6 to prove a relationship
the transaction
names and object
schedule
of
two conditions
%“.
Then
there
exists
a
hold:
T ● y< – la and
name
name
of
F’ and
of .%’, respectively.
Y G :zj.
the following
= BIT for each transaction
between
names
two conditions
Now we can combine
Theorems
5.3.11
between
% and .<. Let ti~ and YA denote
names of <w, respectively.
schedule
configuration) = config( TI ),
(b) holds.
a relationship
names
be a finite
that
B‘ ))
be a
❑
(2) is straightforward.
Y= and :Z~ denote
schedule
is logical-config(
is
SY in any global
current-gn(
global
581
.
Y = 2A.
SYSTEMS
So far, this paper has dealt exclusively
with serial systems. However,
a useful
nested
transaction
system
must
allow
concurrency
and the possibility
of
aborting
a transaction
after it has begun running.
In order to simplify
the
programming
effort,
it is best for a system not to permit
arbitrary
concurrency,
serial.
but rather
to make it seem (to the programs)
as if the
Because we have been discussing
several different
serial
extend
the
which
serial
system,
concept
system
we say that
of serial
is
correctness
considered
a sequence
to
given
define
~ of actions
earlier
to mention
correctness:
is serially
system were
systems, we
if
correct
Y
with
explicitly
is
a serial
respect
to
& for transaction
name T, provided
that there is some behavior
y of=
such
that ~lT = ylT.
Many concurrency
control mechanisms
are known that enable a concurrent
a
system to appear
like a serial one. For example,
in Fekete
et al. [1990]
generic system is defined
as a system containing
transaction
automata
(just
as in the serial system),
generic objects that accept concurrent
accesses and
ACM
Transactions
on Database
Systems,
Vol
19, No.
4, December
1994
582
.
also
receive
K. J, Goldman
and N Lynch
information
controller
that
algorithms
about
passes
are
the
commit
information
given
for
or abort
between
constructing
the
of transactions,
them.
In
generic
that
objects
and
paper
from
a
several
the
serial
objects of 5-, in ways that ensure that every execution
of the generic system
is serially
correct with respect to S for each nonorphan
transaction.
We use J& and %’ to denote the systems defined
in Section 5. A concurrent
system may be constructed
by applying
the methods
discussed in Fekete et al.
[1990] to the system %’. This can be described
as applying
concurrency
control
to each copy separately.
Thus,
for example,
we consider
a transactionprocessing
system
~
1981] to give a generic
show that
nonorphan
names
all behaviors
transaction
of ~ are serially
correct
names,
in particular,
for
in YZ — la. That
the goal of building
and
that uses Moss’ read-update
locking
algorithm
[Moss
object for each object of %. The results of Fekete et al.
replication
is, 9
should
the results
of this paper.
To be precise, we have
COROLLARY
with
Let
6.1.
PROOF.
a serial
replicated
system
That
is that
is, one wants
correct with respect to .&, the
case for g, a fact that follows
the following
general
~ be a sequence
to %, for transaction
respect
like
be transparent.
behaviors
that are serially
system. This is in fact the
respect
looks
a transaction-processing
with respect
all nonorphan
name
to %’, for all
transaction
system.
both
However,
concurrency
a system
serial
from
to have
unreplicated
the above by
result:
of actions
that
T G Y< – la. Then
is serially
correct
/3 is also serially
with
correct
to .& for T.
By Theorem
❑
5.5.7.
Corollary
6.1 implies
that
all behaviors
of Q are serially
correct
with
respect to w for all nonorphan
transaction
names in Y; — la. This says that
system
~, which combines
concurrency
control
techniques
based on locking
with the replication
strategies
of this paper, looks the same as the serial,
unreplicated
system .@, to the nonorphan
transaction
names in Y; — la (in
particular,
to TO).
Corollary
processing
consensus
rency
6.1
also
system
protocol
control
shows
that
if
is constructed
to manage
the
algorithms
verified
a concurrent,
replicated
transaction-
by (1) using
the reconfigurable
quorum
copies and (2) applying
any of the concurin Fekete
et al. [1990]
to each copy individu-
ally, then the concurrent
replicated
system appears
serial and unreplicated.
Similar
conclusions
can be drawn
for a system
using
the multiversion
timestamp
algorithms
of Reed [ 1983] and Herlihy
[ 1987], as modeled
in
Aspnes
et al. [1988].
Also, similar
conclusions
can be drawn
when
% is
replaced
by system
@’ of Section
5, or by @ of Section
4. In general,
any
concurrency
control
algorithm
that provides
serial correctness
at the level of
the replicas
may be combined
with
any of our replication
algorithms
to
produce a correct system.
Finally,
we note that our techniques
allow combination
of algorithms
for
orphan
management
(as described
in Herlihy
et al. [1987]) with algorithms
for concurrency
control and for replication.
For example,
consider the concurACM
TransactIons
on
Database
Systems,
Vol.
19,
No
4, December
1994
Quorum Consensus
rent
system
~
described
constructed
from
The
of Herlihy
results
respect
~
to $3, for all
all transaction
correct
with
%, which
respect
techniques
transaction
above,
and
let
one of the
et al. [1987]
transaction
names
combines
just
by adding
m Nested Transaction
% be another
orphan
imply
names
Systems
that
system
management
orphans),
that
is
algorithms.
% is serially
(including
583
o
correct
with
in particular,
for
in Y;
– la. Then Corollary
6.1 implies
that % is serially
to .@ for all transaction
names in 9A – la. Thus, system
concurrency
control
techniques
with the replication
strategies
of this
names in 7A – la, and in particular,
and orphan
paper, looks
to TO.
management
like&to
all the
7. CONCLUSION
We have
presented
a precise
description
and
rigorous
correctness
proof
for
Gifford’s
data replication
algorithm
in the context of nested transactions
and
transaction
failures.
The algorithm
was decomposed
into simple modules that
were arranged
naturally
in a tree structure.
This use of nesting
as a modeling
tool enabled us to use standard
assertional
techniques
to prove properties
of
transactions
Each
based
module
upon
use of nondeterminism.
nondeterministic,
the
proof.
That
further
the properties
was described
is,
restricts
the
of their
in terms
Although
an
nondeterminism
correctness
actual
adds
proof
of the proof
accomplished
simulated
strategy
lated
hierarchically,
an intermediate
an obviously
we presented
rency control
system
correct
a theorem
algorithm
made
extensive
for
any
not be
to our
implementation
that
choices.
permitted
of replication
from those of concurrency
exclusively
with serial systems in order
was
that
implementation
would
a degree of generality
holds
the nondeterministic
The modularity
children.
of an automaton
showing
and
unreplicated
us to separate
the concerns
control
and recovery.
We could deal
to simplify
our reasoning.
The proof
that
that
the
the
system.
fully
replicated
intermediate
Then,
stating
that the combination
with the replication
algorithm
system
system
to complete
simu-
the
proof,
of any correct concuryields a correct sys-
tem.
This
work
has identified
a general
framework
for proving
data replication
algorithms
in nested
transaction
constructing
a formal
description
of the algorithm
the correctness
of
systems.
One begins
by
in terms
of a nested
transaction
system built from 1/0 automata.
Then, one uses the appropriate
definitions
to show that each logical
read access returns
the proper
value.
Next, one constructs
a corresponding
serial system without
replication,
and
proves that the user transactions
in that system have the same executions
as
the
user
the
correctness
automata
in the
of the
replicated
concurrency
system.
control
analogous
to Corollary
6.1 to show that
It may be possible to use this general
Finally,
algorithm,
one proves
and
applies
separately
a result
the combined
system is correct.
technique
to add transaction
nesting
to other,
more
complicated,
data replication
schemes,
and to prove
the
resulting
algorithms
correct. Such algorithms
include
the “Virtual
Partition”
approach
of Abbadi
and Toueg [ 1989], and Herlihy’s
“General
Quorum
Consensus” [1984]. An interesting
question
is whether
the techniques
presented
ACM
Transactions
on Database
Systems,
Vol
19, No
4. December
1994.
584
.
K. J. Goldman
and N. Lynch
here can be extended
to accommodate
these algorithms
when transaction
nesting
is added. Several of these algorithms
do not provide
atomicity
as we
define
it. Rather,
order,
between
they
allow
the
transactions
Therefore,
additional
algorithms
can be verified
serialization
that
run
definitions
and
using
order
in separate
theory
to differ
from
partitions
will
the
of the
be required
true
network.
before
these
our techniques.
ACKNOWLEDGMENTS
We thank Alan Fekete, Michael
Merritt,
William
Weihl,
and the anonymous
referees
for their
useful
comments
on this material
and for their
help in
improving
its presentation.
REFERENCES
ASPNES,
J.,
based
FEKETE,
A
Conference
B~R&u?A,
, LYNCH,
EA~ER,
on Very Large
Data
Bases
VLDB
S,yst
Trans
ARB.4DI,
K.
Database
Systems
A , L~NcH,
nested
transactions.
Diego,
CA,
D
HERLIHY,
tion
M.
HERLIHY,
MIT
1987.
M.
M.,
Sc,ence,
tency
L~-NrH,
of replicated
Sept. ) 27 S–305.
Expanded
of
York,
137–151
Computer
ACM
6th
TransactIons
8–12
on
MIT
May.
systems.
ACM
replicated
In
Trans.
databases.
York,
protocol
Svmposium
for
of
on Pn nclples
215-229
Commutativlty-based
Nested
locking
Expanded
Science,
for
New
available
Cambridge,
data,
In
time-stampmg
Systems
(San
as Tech.
Mass
Proceedings
York,
and read/write
of Database
version
MIT,
replicated
ACM,
transactions
on Prlnclples
of
Memo
, April
the
7th
ACM
150-162.
protocols
1986.
types.
, May,
to exploit
in
type
informa-
voting
Introduction
to
in International
algorithms
Syst.
version
Cambridge,
the
theory
Systems,
Mass.,
Vol
Laboratory-
for
as Tech
No.
for maintaining
Com Computer
of
the
consis-
230-280,
nested
on Database
proofs
of Dzstrzbuted
Rep,
transactions.
Theory
4, December
for distributed
Computation.
MIT/LCS/TR-387,
Apr
19,
of orphan
Theoret.
(Rome,
Italy,
July.
correctness
on Prznczples
avadable
correctness
on Fault-Tolerant
15 (June),
Conference
m MIT/LCS/TR-367
urn
the
Svmposwm
MIT
Database
Hierarchical
On
Rep. MIT/LCS/TR-319,
J. ACM.
Dynamic
Trans.
Tech.
1987.
IEEE
MIT/LCS,ZTM-329,
To appear
ACM
data
Mass
W,
of the 17th
1990
1987.
Database
Recocery
fault-tolerant
ACM
1990
for abstract
Also,
Syrnposl
Expanded
Science,
efficient
1987.
M , ANII WEIHL,
version
ACM
and
65–156.
Cambridge,
Proceedings
Also
LYNCH, N., AND TUTTLE, M,
Proceedings
methods
databases
M.
for
Science,
D.
Set. 62, 123-185.
New
W.
Prznczples.
MERRITT,
Mass.,
N , AND MERRITT,
Conlput.
sys-
N.J
Control
in partitioned
4th
W.
97– 111
multlversion
Rephcat]on
Cambridge,
distributed
Princeton,
database
An
Synlposlum
for Computer
for Computer
S., AND MUTCHLER,
International
C“-36, 4 (Apr.).
elimination
algorithms
In
,Dutzng
IEEE,
New
York,
JAJODIA,
ACM,
Scl. (Aug.)
York,
System
N,,
distributed
of the
M., AND WEIHL,
Extending
LYNCH,
of tlmestamp-
14th
m partitioned
Umv.,
Concurrency
1985
AND WEIHL,
Syst
voting
Comput
1984.
Laboratory
HERLIHY,
Welghtcd
Trans
F.
Mar.).
M.,
New
on Operating’
IEEE
excluslon
avadabihty
proceedings
Oregon,
Laboratory
1979.
Symposusm
In
of the 6th ACM
ACM.
MIT/LCS/TM-324,
GJWWRD,
theory
264-290.
AND CRISTMN,
N,, MERRITT,
Mm.).
A
of the
43-444.
1987
in
Mamtammg
Comput.
Proceedings
1988.
Mass.
Robustness
MERRITT,
J
W.
Proceedings
Princeton
N.
Reading,
14, 2 (June)
(Portland,
A., L~mcH,
In
D.,
N,,
Mutual
Science,
GOODMAN,
1989.
management
FEKIZTE,
In
354-381.
Syst.
SKREN,
data
Iockmg.
S,
Database
A.,
AiiD
1983
8, 3 (Sept.)
replicated
FEKETE,
V.,
WEIHL,
Endowment,
1985,
Computer
Addison-Wesley,
EL ARADDI, A , TOUEG,
ACM
Dept.
HAIMILAC(OS,
AND SEvcIIi,
D.,
LINA, H.
Rep. TR-346,
Systems
Database
AND
transactions.
P.,
Dataha.w
M.,
nested
Tech.
BERNSTEIN,
, MERRITT,
for
D , ANI) GARCIA-M•
tems.
EL
N
control
concurrency
1994.
algorithms,
In
ACM,
New
Laboratory
for
Quorum Consensus
Moss,
,J. E.
Ph.D.
Science,
REED,
THOMAS,
1981.
MIT,
MIT,
D.
Syst.
R.
databases.
transactions:
Also,
Mass.
pubhshed
Implementing
1,
1979.
ACM
August
Nested
Cambridge,
Apr.
1983.
Comput.
Received
B.
thesis,
1 (Feb.)
1992;
Tech.
by MIT
atomic
An
approach
to
reliable
Rep. MIT/LCS/TR-260,
Press,
actions
Mar.
on
Systems
distributed
Laboratory
.
585
computing.
for Computer
1985.
decentralized
data.
1983.
ACM
!i!’rans.
3-23.
A majority
Trans.
in Nested Transaction
consensus
Database
revised
Syst.
March
ACM
approach
4, 2
1993
Transactions
and
to concurrency
control
for
multiple
COPY
(June) 180-209.
September
on Database
1993;
Systems,
accepted
Vol.
February
19, No
1994
4, December
1994