Effective Specification of Fault Tolerant Distributed
Transcription
Effective Specification of Fault Tolerant Distributed
Effective Specification of Fault Tolerant Distributed Software ∗ Andi Bejleri Tzu-Chun Chen Mohammad Qudeisat TU Darmstadt TU Darmstadt Purdue University [email protected]@dsp.tu- [email protected] darmstadt.de darmstadt.de Lukasz Ziarek Patrick Eugster SUNY Buffalo Purdue University, TU Darmstadt [email protected] [email protected] ABSTRACT 1. Distributed systems are plagued by partial failures, meaning that certain components or interactions may fail, while others remain unaffected. If such failures are improperly handled in message-passing distributed systems, software components may get stuck waiting for messages that will never arrive, or enter an inconsistent state after receiving inappropriate messages. Handling of partial failures is an intrinsic part of many protocols, yet very involving as it depends on many factors including their timing and nature, the affected component(s), and of course application semantics. In this paper, we present a novel specification technique and verification framework for fault tolerant distributed software based on the theory of session types. More precisely we contribute with a specification technique that models structured interactions and fine grained failure handling, a software framework that can be used to implement fault tolerant distributed systems in Java, and a static analysis that checks conformance of corresponding software to their specification. Finally, a detailed empirical study, including a code quality analysis comparing with straightforward approaches attempting to achieve the same based on predating techniques without explicit support for failures, demonstrates the usefulness, simplicity, and efficiency of our technique and tools. In message-passing distributed systems, software components communicate with each other through messages to accomplish a common task. The interaction structure (conversation) between components is not only defined by the values of the messages exchanged, but also by their ordering, branching, and looping that model sequencing, different, and repetitive behaviour respectively. This structure is commonly referred to as a protocol. Distributed systems are prone to partial failures, which means that some components or interactions may fail, while other components must still respect certain invariants. Since not all failures can be masked [11], most protocols in a typical “middleware stack” must explicitly deal with some failures. How to handle a partial failure depends on a variety of parameters including the affected component(s), the timing and nature of the failure, the application semantics, etc. This makes the design of correct and efficient protocols complex and error-prone. For example, notifying all parties of a failure and conservatively aborting them tends to hamper performance of distributed software by incurring spurious communication, repeating abandoned interactions (which could have continued despite certain failures), and over-synchronising. Inversely, omitting the notification of a single relevant component to a failure can hamper progress (e.g., a component waits on a message that will never arrive), or lead to inconsistencies (e.g., subsets of interacting parties engage in different protocol branches that are structurally compliant, but where messages containing different values are expected). To ease the burden of reasoning about distributed software, session types [25, 16] have been introduced. Session types are a well-established type-theory to ensure deadlockfreedom and communication-safety in the context of process calculi. They were proposed (1) to capture the interaction of protocol participants in the presence of ordering, branching and looping, and (2) to verify implementations of these participants. Bi-party session types [25] are the fore-runners of multiparty session types [16]. The former capture the communication structure between two participants, ensuring reciprocity of actions (for a “send” on one side, there is a “receive” on the other). The latter captures the structure of many participants from a global point of view, ensuring conformance of processes to a global type by projection. Categories and Subject Descriptors D.2.1 [Software Engineering]: Requirements/Specifications; D.2.4 [Software Engineering]: Software/Program Verification; D.2.2 [Software Engineering]: Design Tools and Techniques; D.3.1 [Theory of Computation]: Specifying and Verifying and Reasoning about Programs ∗Financially supported by ERC Consolidator Award “Lightweight Verification of Software”. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. INTRODUCTION Subsequent works include theoretical studies of session types in object-oriented core calculi [9, 13]. SJ [18] is the most prominent system that has integrated the theory of session types for distributed, object-oriented software. It consists of a mature 1 specification technique, a static analysis, and a software system for client-server distributed systems. Subsequent work [3, 17] has demonstrated the usefulness of integrating session types for client-server distributed software in an object-oriented programming language like Java for parallel software systems. Despite this strong work, SJ is centered on bi-party interactions which makes real-life multi-party protocols hard to reason about and in many cases impossible to express at all; this is particularly valid for reasoning about handling of (partial) failures, which SJ does not support natively. In this paper we introduce protocol specification and verification for distributed software in the presence of partial failures. Our technique, based on session types, consists of an effective specification technique, static analysis and software system. The specification language describes the protocol between components in a distributed middleware or end application software, hence coining the term of our specification technique protocol types. The static analysis checks the conformance of components with respect to the specification, and the software system allows for the implementation of efficient distributed components capable of handling partial failure. We build on SJ’s approach, extending it with bi-directional asynchronous communication channels per participant pair as advocated in the theory of Bettini et al. [5] and used in many real-life protocols building on communication abstractions like TCP, and intuitive abstractions for branching, looping, and failure handling among multiple parties. Specifically, the contributions of this paper are: • A specification technique that reflects simplicity and efficiency for modelling structured interactions and finegrained handling of both environment-induced failures and application-level failures (exceptions). • A system for implementing fault-tolerant distributed software featuring asynchronous message passing, primitives for structured interactions and partial failure, and a static analysis for checking the conformance of that software to protocol specification. • A specification and software quality analysis showing the benefits of our specification design and software system compared to a variant of SJ. The results show that our approach greatly overcomes the complexity of specifications in that variant in metrics such as lines of code, levels of nesting and number of messages. • A detailed empirical study that demonstrates the usefulness of our specification technique, showing that a fault-tolerant software conforming a protocol type specification retains asynchrony; i.e., it imposes less unnecessary coordination among protocol components and spurious communication than attempts to encode failures with existing session type models. In the remainder of the paper, Section 2 motivates our protocol type specification technique through a real-world fault1 The SJ framework has been maintained for more than five years. Service Provider Discovery Service [2] [6] [5] [3] [4] [1] Identity Provider [7] [9] Client [8] Figure 1: Shibboleth discovery service protocol. tolerant distributed program. The design of our protocol type technique is introduced in Section 3, including a transformation tool of specifications. Section 4 and 5 present respectively the software system and the static analysis. Report on empirical studies and analysis of code quality is given in Section 6. Section 7 surveys related work and Section 8 concludes and outlines future work. 2. PROTOCOL SPECIFICATION This section provides a motivating example and gives an informal introduction to our protocol specification technique. 2.1 Shibboleth For illustration we choose the discovery service protocol (DSP) of Shibboleth [1]— a derivative of the Kerberos [21] protocol. Shibboleth is among the world’s most widely deployed federated identity solutions, connecting users to applications both within and between organisations. Shibboleth includes components acting in the roles of Identity Provider (IdP), Service Provider (SP), and Discovery Service (DS) in addition to clients. IdPs provide user information, SPs access to secure content based on that information, and DS a mechanism to seamlessly add or remove services and SPs. The discovery service protocol (DSP) between a client and DS involves communication between all components of the Shibboleth system, as described in Figure 1. We focus on one step of the protocol (arrow [1]) to demonstrate the benefits of our technique. 2.2 Shibboleth resource request The resource request interactions consist of the client authenticating with the single sign-on service and redirected to an SP to access a resource. This is modelled as the client issuing a request to the SP in the form of an HttpRequest and waiting for a corresponding HttpResponse. Consider the specification of this part of the protocol in SJ protocol ShibbolethResourceRequest { p a r t i c i p a n t s : C l i e n t , SP C l i e n t = !< H t t p R e q u e s t > . ? ( H t t p R e s p o n s e ) SP = ? ( H t t p R e q u e s t ) . ! < H t t p R e s p o n s e > } where !<T> specifies the sending of a value of type T and ?(T) the receiving of a value of type T. An overview of SJ’s specification constructs is given in Table 1. Table 1: Constructs of SJ’s specification language Construct Purpose !<T> Send a message of type T Receive a message of type T ![...]* Loop and send termination condition ?[...]* Loop and receive termination condition !{l1:...,...,ln:...} Select and send one of the labels l1 - ln ?{l1 :...,..., ln :...} Branch and receive one of the labels l1 - ln ?(T) Notice that this specification uses implicit channels, i.e., the sending construct does not specify the receiver or the receiving the sender. Specifications of this technique are safe. This means that the interactive structure of each component’s specification is reciprocal. In addition, the specification of a component expresses sequencing between its actions, sufficient in simple client-server protocols. Unfortunately, they cannot express sequencing of communications, branching, and looping between several components (see [16] for a formal argumentation) as required in many real-life distributed systems. It is important to note that loops and branches are always controlled by one component (evaluating the corresponding condition); so those involving three (or more) participants can not be modelled by three (or more) pair-wise loops and branches respectively. Not all loops or branches will be controlled by the same party. Pair-wise channels such as TCP are the abstractions most commonly found real-world distributed software. Thus, our protocol type specification supports pair-wise channels, borrowing ideas from multiparty session types, as A−>B: <T> specifying A sends to B a message of type T, where A and B denote identities of components. Below, we give the specification of the above protocol in our protocol type language. protocol ShibbolethResourceRequest { p a r t i c i p a n t s : C l i e n t , SP // 1 . c l i e n t r e q u e s t s r e s o u r c e C l i e n t −> SP : <H t t p R e q u e s t > // 2 . r e s p o n s e i f a u t h o r i z e d SP −> C l i e n t : <H t t p R e s p o n s e > } This specification expresses explicitly the order of communication between the two components and provides a clear and readable protocol. Furthermore, as we will see later in this section, our protocol type specifications support multi-party looping: A [...]∗ , where ... constitutes the loop body which is controlled by participant A, and a multi-party branching construct A:{l1 :...,..., ln :...} where A controls which path li of the conversation to follow. 2.3 Failure handling In the simple Shibboleth resource request protocol above, two inherent failure scenarios can occur: 1. Although the client is authenticated, it might not be authorised to access the requested resource. 2. The client’s authentication ticket may have expired, and thus the client must request a new ticket in order to gain access to resources at the SP. In the following we show the difficulties of modelling such scenarios even in a variant of the SJ specification language including the multi-party looping and branching constructs mentioned above – referred to as SJ? in the following – calling for specific failure-handling abstractions. The first scenario can be modelled as a special value in the HttpResponse, however hiding the important distinction between success and failure. Another alternative is to have a branch guarded by SP (SP: {...} ), specifying the three different scenarios explicitly: (a) the regular scenario (NoFailure branch) sends the response; (b) the AuthorisationFailure aborts the attempt (see 1 above); (c) the ExpiredFailure specifies the second failure scenario (see 2). In the third branch, the client communicates with IdP to obtain a new ticket. Finally, the client indicates to SP whether it wants to retry the request by controlling the loop ( Client : [...]∗ ) around the protocol. Retrying also applies to (b), where the client might want to request a different resource instead. Below, we give the specification in SJ? . 1 protocol RobustShibbolethResourceRequest { 2 p a r t i c i p a n t s : C l i e n t , SP , IdP 3 // l o o p f o r r e t r y i n g , i f f a i l e d 4 Client : [ 5 // c l i e n t r e q u e s t s r e s o u r c e 6 C l i e n t −> SP : <H t t p R e q u e s t > 7 // b r a n c h g u a r d e d by SP 8 SP : { 9 // d e s i r e d 10 N o F a i l u r e : SP −> C l i e n t : <H t t p R e s p o n s e >, 11 // u n a u t h o r i z e d 12 AuthorizationFailure : , 13 // t i c k e t e x p i r e d 14 ExpiredFailure : 15 // r e n e w r e q u e s t 16 C l i e n t −> IdP : <H t t p R e q u e s t > 17 // new t i c k e t 18 IdP −> C l i e n t : <H t t p R e s p o n s e > 19 } 20 ]∗ 21 } This simple example illustrates several shortcomings of specifying failure handling with conventional branching and looping constructs, a technique henceforth referred to as FSJ? (f ailure handling with SJ ?). First, adding failure branches adds substantial complexity and can lead to invalid paths. For instance, retrying a failed communication is specified as a loop around the entire session, allowing the failure-free path (NoFailure) to be repeated despite success. This breaks the original protocol, which ends upon success. An implementation that has such a spurious path will unfortunately pass verification. Second, efficiency has to be considered for a specification, since the resulting software contains corresponding communication actions. For example, in the failure-free run, which can be assumed to be the common case, there exists duplicated and redundant communication. Once the session is initiated, the client controlling the loop sends a message to all components involved (Line 4), notifying the termination condition. In addition, there are two messages — the branch label NoFailure (Line 9) as well as the actual response (Line 9) — which have to be transmitted between the SP and the client. This can be addressed by having the SP send a HttpResponse, expressing also the failure notification as a special value, and the client guarding the loop. We give this specification below: C l i e n t −> SP : <H t t p R e q u e s t > SP −> C l i e n t : <H t t p R e s p o n s e > Client : { // b r a n c h g u a r d e d by c l i e n t NoFailure : // no more m e s s a g e s n e e d e d AuthorizationFailure : ... } This replace Lines 5-19 in specification in Listing 2.3. Besides losing clarity, efficiency is similarly hampered because the SP waits for an additional branch label NoFailure in the failure-free run to be sent by the client to proceed. Our protocol type specification addresses failures that can arise during message exchanges via explicit failure handlers. Branches and loops do not have associated failures as those are captured by communication within them. We use a notation that is similar to exception handling features in mainstream programming languages, like Java and C++. try ... handle specifies a scope of failure handling, and several handle clauses can be attached to the same try to describe different types of failures.2 Below, we give the protocol in our language: protocol RobustShibbolethResourceRequest { p a r t i c i p a n t s : C l i e n t , SP , IdP try { // c l i e n t r e q u e s t s r e s o u r c e C l i e n t −> SP : <H t t p R e q u e s t > SP −> C l i e n t : <H t t p R e s p o n s e > | AuthorizationFailure | ExpiredFailure } handle ( A u t h o r i z a t i o n F a i l u r e ) { // m i g h t r e q u e s t d i f f e r e n t r e s o u r c e Client : retry } handle ( E x p i r e d F a i l u r e ) { // t i c k e t e x p i r e d C l i e n t −> IdP : <H t t p R e q u e s t > // r e n e w r e q u e s t IdP −> C l i e n t : <H t t p R e s p o n s e > // new t i c k e t Client : retry } } The SP can now immediately indicate a failure of either type AuthorisationFailure or ExpiredFailure — corresponding to the two failure scenarios — as alternatives ( | ) to sending a response. As illustrated, the same message can yield different types of failures. In the case of a failure the execution moves forward to the corresponding handler (handle (...) ). In contrast to the two FSJ? models, the intent is clear and there is no further need for communication between the client and the SP in the failure-free path. Similarly, the block terminates as it should and is not artificially constrained to be within a loop with the hidden invariant that the loop does not repeat upon success. In many distributed protocols, retrying the protocol, or relevant sub-protocols, is a necessary recovery mechanism. In protocol types, retrying is a first-class construct. The notation A: retry denotes the repetition of the next enclosing try {...} body, where A makes the decision. Making repetition a choice rather than forcing it allows protocol type specifications to avoid endless repetition. Assigning the duty of choosing the “retry path” to a particular participant is important for the static analysis to check software’s conformance to protocol specification. 2 We avoid the keyword catch to denote handlers to avoid confusion with the stronger (synchronous) semantics for exceptions in languages like C++ or Java. Table 2: Constructs or protocol type specification language A−>B: <T> | F1 | ... | Fn Message exchange or failure A: [...]∗ Looping A: {l1 :..., ..., ln :...} Branching try {...} handle(F1) {...} Failure handling ... handle(Fn) {...} A: retry 3. Retry PROTOCOL TYPE DESIGN This section describes in more details our specification language, framework, and semantics. 3.1 Challenges Our approach provides a specification language to express distributed protocols in the presence of failure: “environmentinduced” failures (e.g., host or communication failures) as well as application-defined failures (“exceptions”). The key challenges in the design of our technique are: Simplicity. We are interested in few intuitive features that capture a broad range of scenarios, and simplify global reasoning for the programmer in the presence of partial failures. Efficiency. Our technique should not over-constrain and hamper performance of software. In particular, the software must retain the potential for asynchronous execution rather than coupling components by coordination behind the scenes. The biggest challenge comes from the seeming conflict between these two requirements. For instance, a technique which allows to reason in terms of atomic protocol blocks and performs automatic rollbacks upon failures in a transactional style would probably be easier to use for a programmer. However, such features are hard to implement efficiently at large scale and thus can hamper the asynchrony underlying many distributed systems. We break our discussion on the protocol type design into (a) the presentation of the constructs, (b) the design of a transformation tool that deals with two important patterns: failure notification, also in the presence of nesting, and (c) the semantics of retrying in case of failure. 3.2 Constructs Table 2 provides an overview of the constructs for our protocol type specification language. A, B range over identities of components, T over types of messages exchanged and F (as well as with subscript) over failure types. The construct A−>B: <T> | F1 | ... | Fn specifies a message exchange between A and B of type T or that A may throw an exception of type F1, ..., Fn. As mentioned, looping A: [...]∗ specifies a repetitive behaviour, where the termination condition is controlled by A, and A: {l1 :..., ..., ln :...} describes branching of a conversation where A selects a label li and conversation follows that path. The construct try {...} handle(F1) {...} ... handle(Fn) {...} specifies an interaction within a try block that can raise a failure F1, ..., Fn, handled respectively in one of the handle blocks. The retry construct describes a repetition of a surrounding try block. 3.3 Transformation To not hamper asynchrony and efficiency of software, the user-defined protocol type specifications are transformed into specifications having failure notifications following the flow of communication. 3.3.1 Failure notification One way to handle failure of a communication within a try {...} is to abort all subsequent ones, meaning that all such communications would have to be guarded by the presence or not of failure. This would imply strong synchronisation at runtime between recipients of failure-prone communications, senders, and receivers of any follow-up communication. This hampers asynchrony and efficiency in the software. The transformation tool injects failure notifications messages only to components that are causally depending on the failed communication in a given try ... handle block. Causal dependence follows the usual definition of causal ordering of events [19]: (a) a sending action from a participant causally precedes any subsequent send by that participant, (b) a sending action causally precedes its receiving action, and (c) if we have send or receive actions e1 , e2 , e3 such that e1 precedes e2 and e2 precedes e3 , then e1 precedes e3 . From a component P’s perspective, this means that if P is on the receiving end of a causal chain of messages (m1 , ..., mn s.t. ∀i ∈ [1..n]mi =Pi −>Pi+1 : ... , ∀i < j ∈ [1..n] mi occurs before mj , and Pn+1 =P) within a try ... handle block that can yield a failure (i.e., m1 =P1 −>P2 : ... | F) then P is subject to being rolled forward to the corresponding handle (F) handler. From the user perspective, this does not add communication, as the failure notification or its absence, inherently, is propagated along the usual communication path by the tool; i.e., in our approach, the failure notification is sent directly to all targets. Other components present in the given try ... handle block can proceed with their remaining communication inside the try {...} . Indeed, since their communications are not causally depending on failure-prone interactions, neither their success nor failure depends on those. So, they can proceed up to the point of synchronisation at the end of the try {...} body. If no such synchronisation is desired, then, given the independence of the said communication from the failureprone parts, the programmer can also move them outside of the block. For illustration consider the following protocol 1 try { 2 A −> B : <T1> | F 3 C −> D : <T2> 4 } handle (F) { 5 ... E ... 6 } where the second message exchange from C to D at Line 3 does not depend on the first at Line 2. In the case of a failure F occurring at Line 2, the second exchange can thus proceed. Also, the programmer can move it out of this try ... handle block either before or after. There is a difference between environment induced and application defined failures. The sender of a corresponding communication in the former is not explicitly raising a failure and may not be aware of it. As such, it is not considered for determining the set of causally depending participants and communications; only the receiver is considered, respecting the asynchronous nature of communication. There is one additional important feature to this basic semantics to mention. From the example above, any participant E which does not appear in a try {...} but appears in a given handle(F), upon a F failure, will be informed like the other dependent participants in the body; E has to be informed in any case to participate in recovery actions. 3.3.2 Nesting Our protocol type also supports nesting of failures in the sense that any try {...} block or handle ...{...} clause can contain another try ... handle. The rules for nested handling and propagation of failures is analogous to those for exception handling within method bodies in languages like Java or C++ (and not like rules for propagation through nested method calls). That is, any failure, which does not have a corresponding handle clause attached to its immediate surrounding try , is propagated one scope outwards etc. If a given failure is not addressed by a corresponding handler within any enclosing scope, the protocol type is ill-formed and will be rejected by the static analysis. Nesting is orthogonal to our synchronisation semantics, or put differently, composes straightforwardly with it. Components that must be notified for a failure are found in the corresponding try {...} body, includes any component in nested try {...} blocks. Inversely, for any failure F not handled by any handle {...} clause of such a nested try , we consider the components in the causal extension of the raising of F for determining the set of components depending on the F after that nested try . Consider the following example: 1 2 3 4 5 6 7 8 9 10 11 12 13 try { A −> B : <T1> | F1 try { B −> C : <T2> | F2 B −> D : <T3> | F3 } h a n d l e ( F2 ) { ... } } h a n d l e ( F1 ) { ... } h a n d l e ( F3 ) { ... } Upon a failure of type F1 at Line 2 the remainder of the protocol trivially gets aborted as all the other sends causally depend on it. Upon a failure F2 at Line 4, Line 5 is skipped, and the failure F2 is handled “locally”, which presumes that all invariants required for subsequent actions — as is common with exception handling — can be restored there. Lastly, if a failure of type F3 is raised at Line 5, then it will be handled one level of nesting outwards. 3.4 Retrying One possibility to continue the protocol above after a failure of type F2 at Line 4 is to retry the nested (sub)protocol delineated by the inner try {...} . This can be achieved by specifying the handle(F2) clause with B: retry to trigger a retry guarded by B, meaning that the entire inner try {...} block is re-attempted. The following replaces Lines 3-8 in the example: try { B −> C : <T2> | F2 B −> D : <T3> | F3 } h a n d l e ( F2 ) { B: retry } A retry can fail again and/or the participant guarding it can choose to not retry. The programmer is still responsible for establishing any invariants necessary for any continuation of the protocol. There is no implicit propagation of the handled failure in the case of a declined retry. Thus retrying is not a panacea, it’s a feature that relieves the programmer from the burden of explicitly describing loops presented in the FSJ? model of the Shibboleth example in Section 2, and ensures that certain internal paths reflecting success do not lead to retrying the loop. 4. PROTOCOL TYPE SOFTWARE SYSTEM In this section we introduce our protocol type system for implementing fault-tolerant distributed software, featuring asynchronous message-passing, branching, looping, failure notification and failure handling. We extend Java with syntax for structured interactions operations, borrowing ideas from SJ. Table 3 gives an overview of the constructs in our protocol type software system. The most interesting ones are looping, branching, failure handling and retry of a try block. Looping and branching are constructs defined for two or more components. The ! indicates the party that controls looping or branching, while the other parties denoted by ? follow this decision. The other constructs read the same as in the specification language. At the beginning of the protocol execution, the software system automatically connects all distributed components to each other by deciding which components expect connection requests from which components in a way that avoids deadlocks. After all components are successfully connected, the protocol sockets belonging to each component are used to perform the operations according to the protocol specification. Whenever a failure occurs, the software system automatically sends synchronisation (failure) messages to the participants that need to be notified of the failure. These synchronisation messages carry the type of failure. On the receiving end, whenever the software system receives a failure message, it automatically raises an exception to move execution to the appropriate failure handler. Moreover, if execution flow proceeds normally without a failure at a synchronisation point, the software system automatically sends a “no failure” control message to any component that must be notified of a failure if one occurs. If a component receives a “no failure” message, it simply continues its execution, representing the failure-free path of the protocol. To illustrate, we provide the implementation of a simplified version of the SP component for the Shibboleth protocol in our approach. The first step is to create a protocol socket with the name of the protocol and identity of component. These values are used by the static analysis to check the conformance of the implementation to the protocol specification. Following, on the socket are performed protocol operation such as receive a HttpRequest from the client, sending back a HttpResponse, enclosing operations that may fail invariants, and so throwing exceptions, within a try-handle as described in Section 2. PInstance p = Protocol . i n i t ( R o b u s t S h i b b o l e t h R e s o u r c e R e q u e s t , SP ) ; i n t validMs = 864000000; p . try { H t t p R e q u e s t r q = p . r e c v ( rq , C l i e n t ) ; i f ( ! r e c o g n i z e d ( r q . g e t ( ”t o k e n ”) ) ) { p . throw ( new A u t h o r i z a t i o n F a i l u r e ( . . . ) ) ; } Table 3: Constructs for implementing software components. Construct Action PInstance p = Protocol. init (P, A) Instantiate protocol P p.send(m, A) Send message m to A m = p.recv(A) Receive message m from A p.outwhile(cond ){...} Looping: A sends termination condition Looping: A receive termination condition p.outbranch(Li ){...} Branching: A selects and sends a label p.inbranch{L1 :...;... Ln :...;} Branching: A receives a label p. try {...}{ F1 f1 :...;... Fn fn:...;} Try block p.throw(f) Raising failure f p. retry (cond) Retry controlled by A with p. inwhile {...} cond p. retry Retry controlled by A i f ( c u r r e n t T i m e M i l l i s ( ) − r q . g e t ( ”t o k e n ”) > validMs ) { p . throw ( new E x p i r e d F a i l u r e ( . . . ) ) ; } e l s e p . send ( new H t t p R e s p o n s e ( . . . ) ) ; }{ A u t h o r i z a t i o n F a i l u r e a : p . r e t r y ; ExpiredFailure e :p. retry ; } else where RobustShibbolethResourceRequest and SP are the arguments passed to protocol socket; validMs represents the validity duration of tokens (1 day). 5. A STATIC ANALYSER This section presents the static analysis that checks the conformance of components to the protocol type specification. The analysis is implemented on top of the SJ 3 static analyser. The SJ’s analyser is written using the Polyglot 4 framework. It takes as input a session type specification as well as source files, and checks the input program against the session type specification. In addition, the analyser instruments the code with calls to the SJ software system. The result is passed to the standard Java compiler. To implement our static analyser, we first extended the SJ analyser to support pairwise channels, branching, looping and the failure handler features as described in Section 3. The input to our extended version of SJ, called PJ, is a nonstandard Java file that contains a protocol type specification and a component implementation in our software system. Just like SJ, PJ’s output is a standard Java file that can be compiled by a standard Java compiler. Figure 2 gives the stages of the static analyser. In the parsing stage, PJ processes the input file and protocol type specification, constructing the AST for the input file. PJ synthesises component specs from the (global) protocol type by projection. The checking phase of the analyser verifies that the implementation of each component conforms to its spec. This includes checking that messages of the correct types are communicated with the correct component at all protocol points. Moreover, the checker uses 3 4 http://code.google.com/p/sessionj/. http://www.cs.cornell.edu/projects/polyglot/. Input File Parser Checker PJ Software System Executable Java Compiler Transformation & Translation Output File Figure 2: Work flow of PJ’s static analyser type information to verify that all failures are handled at the points specified by the global protocol type and their handlers implement a recovery protocol as specified. The analyser uses the specification returned from the transformation tool to instrument failure notification messages in the software components. This ensures that components are correctly notified of failures and execute appropriate handlers. Failure notification messages are inserted with one message send for each component that needs to be notified of the failure or its absence. After this, the analyser translates the AST into a standard Java file. 6. EVALUATION We illustrate the benefits of our technique empirically both in terms of specifications and software quality, and performance to gauge simplicity and efficiency (see Section 3.1). We consider six benchmarks programs: Shibboleth introduced earlier, 1PC [27], 2PC [27], 3PC [24], and, as well as Currency Broker and Buyer-Seller-Shipper examples inspired from previous work on session types with exceptions [7, 8]. In the following we first assess the benefits of protocol type in terms of specification and software quality by comparing them to FSJ? versions. Then, we also show that software based on protocol type system provide better performance than those of FSJ? in that their execution is faster. For both specification/software quality and performance comparisons we also include software versions obtained with SJ ? protocols that are agnostic to failures (SJA? ), i.e., that do not deal with failures. This gives us a utopian baseline reflecting a world in which we need not worry about failures. The versions obtained with our protocol type approach are in the following referred to as PT for brevity. Before we give the empirical studies along code quality analysis, we briefly provide an informal description of the 2PC protocol. 6.1 Two Phase Commit The popular Two Phase Commit (2PC) protocol [27] is used to decide on the outcome of distributed transactions executing across several servers. A TwoPC instance kicks off by having the coordinator send the identifier of a transaction (long) whose outcome (abort or commit) is to be voted upon by all participants. The coordinator waits for votes from each participant (true for commit, false for abort), based on which the coordinator sends the final decision to all participants. (Given the independence of the two sends from and to the coordinator respectively these can proceed in parallel.) Any other voting outcome than a unanimous commit (commit votes from all participants) must lead to aborting the transaction. If the coordinator times out on any of the responses then the protocol proceeds with the corresponding handle clause, leading to abort. In contrast to the previous examples, TimeoutFailure is raised by the “environment”, which means that the runtime raises it. This is no different than a RemoteException in Java’s remote method invocations which needs to be added to every remotely invocable method to convey errors like ConnectionExceptions, or SOAPExceptions in Web Services. With protocol type, this is supported by declaring the corresponding failure as a subtype of a built-in InfrastructFailure . Figure 6 outlines a simplified version of such a protocol in our language. For simplicity we focus on the case of two participants — which may fail — as that is enough to illustrate what we need, as argued by Skeen and Stonebraker [24]. Table 4: Code metrics Protocol Approach Spec Soft Nesting Msgs States Inv. Dupl. lines lines levels max. paths lines SJA? PT FSJ? SJA? 2PC PT FSJ? SJA? 3PC PT FSJ? Currency SJA? Broker PT FSJ? Buyer Seller SJA? Shipper PT FSJ? SJA? Shibboleth PT FSJ? 1PC 6.2 8 14 16 10 14 19 14 45 49 13 24 22 14 19 21 26 43 42 19 36 59 17 30 59 21 139 184 34 67 68 38 55 80 138 245 272 0 1 4 0 1 4 0 3 9 1 3 4 1 2 3 2 3 3 4 7 12 6 6 9 10 12 18 7 11 14 8 8 12 20 21 43 1 4 6 1 4 5 1 16 26 3 8 8 3 8 10 15 16 20 0 1 0 1 0 1 0 0 0 0 0 2 0 8 0 14 0 23 0 0 0 0 0 0 Specification and software quality To demonstrate the simplicity of devising fault-tolerant protocols in protocol type, we compare the quality of specifications and their corresponding software to those of FSJ? and SJA? . We gauge programmer effort by considering a number of statically determined code characteristics (1) lines of code (LoC) for protocol descriptions, (2) LoC for corresponding software, (3) nesting levels in protocols, (4) “maximum number” of messages (i.e., number of messages in the longest failure-free communication path), (5) number of distinct states (state here refers to a group of protocol operations that form a basic block in the protocol’s control-flow graph), (6) invalid paths between states in the protocol descriptions (i.e., paths permitted but not valid according to the protocol), and finally, (7) duplicated LoC due to explicit failure encoding (values for protocol type specifications are always 0). The last two are moot for SJA? . Table 4 summaries the outcome of our static code quality evaluation. While the number of lines (1) is close between PT and FSJ? , where the latter, however, clearly increases the numbers of LoC (2), and nesting levels (3). The number of messages in the longest failure-free path is also clearly increased in FSJ? (4); this is a hint to performance overhead which will be validated shortly. Another symptom is the protocol complexity, which is demonstrated through an increasing number of different states (5). Figures 3a and 3b graphically illustrate this difference via protocol state transition diagrams for the discovery protocol of Shibboleth with PT and FSJ? . Notice that NoIdPFailure does not increase the Start noDiscovery 1 Discovery Passive 2a, 2b DisplayPageF notPassive H 3 4 5a 5b, 6a 6b, 7,8 H 9 AuthenticationF 10 6.3 H H 11 AuthorizationF ExpiredF End (a) With PT Start 1 noDiscovery Discovery 2a, 2b DisplayPageF H Passive notPassive E1 NoFailure 3 NoIdPF H E2 NoFailure 4 5a 5b, 6a E3 H 6b, 7,8 AuthenticationF NoFailure 9,10 ExpiredF AuthorizationF H number of states with PT as it is handled locally at the discovery service end. Moreover, we see that states 9 and 10 in Figure 3a are separate, but they appear in a single state in Figure 3b. The reason is that with protocol type, the AuthenticationFailure is tied with the HttpRedirect message, which means that this message can potentially change the protocol execution path. In Figure 3b, however, the HttpRedirect message can be sent only after the execution path has been decided by the Authentication - Failure branch path, and therefore message 9 is placed in the failure-free path along with message 10. The most substantial increase in states with FSJ? occurs in 3PC. This is largely due to disjoint paths through the protocol and corresponds directly to the large increase in nesting level and code duplication. Shibboleth in FSJ? does not suffer from duplication but instead from invalid paths like 1PC, 2PC and 3PC (6). All XPC protocols exhibit significant code duplication (7) with FSJ? . The Currency Broker shows the least benefits for PT. We believe that this is largely due to its simplicity and focus on (few) applicationlevel failures. E4 H NoFailure 11 End (b) With FSJ? Figure 3: State diagrams of Shibboleth discovery protocol Performance characteristics As mentioned, simplicity of modelling can be easily achieved by proposing features which impose strong synchronisation. We show that the less simple FSJ? inversely adds much more overhead than PT. For our performance evaluation we ran successive rounds of the various protocols. All components were executing on distinct machines as well as in distinct networks within campus with 1 - 3 hubs connecting each pair of networks. We varied the percent probability that any given communication that could result in a failure would actually raise a failure notification. Thus as the percent probability of failures being raised increases, so does the number of times that portions of the protocol must be re-executed to achieve full completion. The exponential trend of the graphs of Figure 4 is due to the fact that these protocols have more than one possible failure point. Thus as the failure probability increases for each of the failure points, the probability of a successful protocol execution exponentially decreases. All executions were repeated 10000 times and averaged. As the figure shows, in all reasonable ranges of failure probability PT clearly outperforms FSJ? . For instance with Shibboleth, when the percent probability of a failure being raised at a communication point is 20% or below, PT takes only about 60-70% of the time of FSJ? . These benefits are due to the absence of redundant messages which occur with manual implementation of failures. We note that this measurement approximates an upper bound on our performance gains as the protocols primarily perform communication. Figure 5a normalises the improvements of PT over FSJ? . For instance with 2PC PT shaves off 9-22% of the time used by FSJ? . For 2PC, Currency Broker, and Buyer-SellerShipper performance improvements of PT are consistent, saving between 9% and 42% over FSJ? . In failure-free runs, PT only takes around 20% and 35% of the time FSJ? takes for Shibboleth and 1PC respectively. These performance improvements come from a number of factors depending on how protocols are implemented. For example, in the case of the FSJ? software of 2PC, all protocol paths incur extra branch decision control messages that are used to notify the components of whether or not a failure has occurred at each communication that can fail in addition to the actual sent FSJ* FSJ* (a) 1PC (b) 2PC (d) Currency Broker (c) 3PC FSJ* FSJ* FSJ* (e) Buyer-Seller-Shipper FSJ* (f) Shibboleth Figure 4: Average time to complete an iteration of a protocol vs failure probability (a) PT Normalized vs FSJ? FSJ* (b) Overhead of PT and FSJ? vs SJA? Figure 5: Overhead comparison message. A second source of spurious messages that appears in the FSJ? software is the way retries are implemented. Retries are implemented in FSJ? using session loops. These add 2 × (n − 1) extra messages to the failure free path where n is the number of components in the protocol. The first set of n − 1 messages are the control messages sent by the loop guard to all components. These messages signal components to enter the first iteration of the loop, in order to proceed with the execution for the first time. Then, at the end of executing the subprotocol within the loop, an extra n − 1 control messages are sent by the loop guard in order to signal whether to re-execute the loop body or not. This type of overhead occurs in FSJ? for Shibboleth, BuyerSeller-Shipper, 1PC, 2PC and Currency Broker examples. With PT, the components proceed to execute the subprotocol within the try block without any exchange of messages. Moreover, retry control signals that are sent only when a failure occurs and the handler offers the possibility of retrying the failure-prone subprotocol. The third source of spurious messages is nesting, as best demonstrated by the FSJ? version of 3PC. Due to the deep nesting implemented using branching, the performance of the failure-free path is hampered by n − 1 extra control messages for each nesting level: a branch guard has to send a control message to each of the n − 1 components which wait on it to assess whether a failure occurred. The 1PC and 3PC protocols show interesting performance trends. Because of the simplicity of the 1PC protocol we see little performance improvement with PT over FSJ? . This is because the number of spurious messages that are sent are very few. Moreover, as the failure probability increases the two softwares begin to converge and show similar performance results. For 3PC, on the other hand, PT shows consistent but humble improvements when the exception probability is below 60%. However, when the exception probability increases beyond the 60% point we see that PT performance improvements significantly increase. This is because the number of spurious messages exponentially increases as the exception probability increases. Last but not least, Figure 5b focuses on failure-free runs, showing the overheads of PT and FSJ? over SJA? . PT shows no overhead except for Shibboleth (∼55%) and 3PC (∼22%). In contrast, FSJ? invariably incurs between ∼22% (2PC) and ∼152% (Shibboleth) overhead. 6.4 Threats to Validity and Discussion We have shown that protocol type technique and tools have advantages over SJ both in terms of (1) simplicity and (2) efficiency on a range of different protocols. Although the examples are relatively small, they span a number of important protocol examples and families, and involve different failure models. We have implemented all of the examples in Java; however, we observe that most languages will contain primitives for loops and branches, while the protocol type specification language itself is language-independent, and complexity improvements for PT over FSJ? can be generalised: with FSJ? , a sequence of n failure-prone sends m1 , ..., mn can translate to n nested branches; mi+1 , ..., mn are duplicated at branch i at a degree proportionate to the number of different failures possibly occurring upon mi ; and every branch requires the sending of a label which is redundant with the subsequent message or failure. Furthermore, every loop introduced for retrying requires an additional multi-sending upon first execution. All of this translates to increased latency. Although the benchmarks show good percent improvement in the runtime of the software, we expect large, real-world systems to not always exhibit such performance improvements due to a lower ratio of communication to computation performed. However, we note that the benefits afforded by protocol type techniques and tools will be exhibited by reduced latency– a metric often of equal importance to raw throughput. Last but not least, even if protocol type yield high LoC savings, we believe their main benefit lies in the reduction of invalid paths compared to implementing retrying via loops – these must be ruled out manually by the programmer which defies the purpose of the static analysis approach. 7. RELATED WORK Session types. Following the pioneering work of Takeuchi, Honda, and Kubo [25], several type theories to guarantee deadlock-freedom and communication-safety for process calculi — so-called session types [12, 26, 6] — have been proposed. Honda et al. [16] and Bonelli et al. [6] extended the original bi-party session types to multi-party interaction (MPSTs). Bettini et al. [5] introduced implicit participantpairwise channels. Multi-channels have been used in a variant of SJ [23] for parallel software, describing protocols of an arbitrary number of components, but without addressing failure handling. Bejleri and Yoshida’s work [4] extends that of Honda et al. for synchronous communication among multiple interacting peers. Operational semantics for asynchronous session types were first studied by Neubauer et al. [20]. Session types have been applied to functional [12] and object-oriented settings [10], as well as others. Gay et al. [13] focus on modular typetheory of sessions via objects. A systems-level object-oriented system [10] integrate the theory of bi-party session-types and ownership-types into a variant of C#. It is used to describe the interfaces and verify the components of the Singularity OS. Components communicate via message-passing, designed over shared memory and implemented as pointer rewriting. The system is suitable for implementing concurrent, system-level software. The technique is similar to bi-party session types and is adaptable only for concurrency software but not for distributed. Scribble [15, 29] is an ongoing project on a session type based language and tool chain for large scale distributed applications. Pabble [22] is a recent dialect of Scribble to describe protocols of an arbitrary number of components and check their conformance to software written in MPI. The current version of it offers only a limited form of failure handling, namely interruption; thus, given the growing interest in it, we believe it would be worth investigating our failure handling technique into it. Exception handling. Carbone et al. [7, 8] propose structured interactional exceptions for session types based asynchronous communication. When a process throws an exception, execution is interrupted at all participants involved in the conversation and they move to another dialogue. Exceptions can be nested through nested try blocks but raising an exception is not permitted to occur within an exception handler. The model supports only one kind of exception. Hanazumi and Vieira de Melo [14] and Alexandar et al. [28] describe a method for exception handling based on Coordinated Atomic Actions (CAAs). Coordinated exception handling is achieved by satisfying a number of CAA properties on transactions, namely rollback and exception handling properties. When an exception is raised within a CAA or signaled to it, the participants handle the exception by executing the exception handling code for that CAA. If the exception is not handled within the CAA, it is propagated to other parts of the system. This method requires substantial programming effort. Also, transactional guarantees are not always needed or possible. 8. CONCLUSIONS AND OUTLOOK To help the programmer combat the challenges of design and development of distributed software in the presence of partial failures, we have proposed and presented protocol type specification technique, including static analysis, and a software system. We are exploring several extensions to our work, e.g., a failure propagation semantics with different root failure and different kind of communication channels instead of the hardwired “−>”. We plan to investigate our failure handling technique into emerging, promising specification approaches such as Scribble into a complete specification technique, that describes the structure of data, for distributed programming. 9. REFERENCES [1] Shibboleth. http://www.shibboleth.net. [2] A. Bejleri, T.-C. Chen, M. Qudeisat, L. Ziarek, and P. Eugster. Effective specification of fault tolerant distributed software. Available at: www.andibejleri.net/papers/protocol.pdf, 2015. [3] A. Bejleri, R. Hu, and N. Yoshida. Session-based programming for parallel algorithms: Expressiveness and performance. In Proceedings Second International Workshop on Programming Language Approaches to [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] Concurrency and Communication-cEntric Software, PLACES 2009., pages 17–29, 2009. A. Bejleri and N. Yoshida. Synchronous Multiparty Session Types. Electronic Notes in Theoretical Computer Science (ENTCS), 241:pp. 3–33, 2009. L. Bettini, M. D. Luca, M. Dezani-ciancaglini, and N. Yoshida. Global progress in dynamically interleaved multiparty sessions. In CONCUR 2008 Concurrency Theory, 19th International Conference, Proceedings, pages 418–433, 2008. E. Bonelli and A. Compagnoni. Multipoint session types for a distributed calculus. In Trustworthy Global Computing, Third Symposium, TGC 2007, Revised Selected Papers, pages 240–256, 2007. M. Carbone, K. Honda, and N. Yoshida. Structured interactional exceptions in session types. In CONCUR 2008 - Concurrency Theory, 19th International Conference, Proceedings, pages 402–417. 2008. M. Carbone, N. Yoshida, and K. Honda. Asynchronous session types: Exceptions and multiparty interactions. In Formal Methods for Web Services, pages 187–212. 2009. M. Dezani-Ciancaglini, D. Mostrous, N. Yoshida, and S. Drossopoulou. Session types for object-oriented languages. In ECOOP 2006 - Object-Oriented Programming, 20th European Conference, Proceedings, pages 328–352, 2006. M. F¨ ahndrich, M. Aiken, C. Hawblitzel, O. Hodson, G. Hunt, J. R. Larus, and S. Levi. Language support for fast and reliable message-based communication in singularity os. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys ’06, pages 177–190, 2006. F. C. G¨ artner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Comput. Surv., 31(1):1–26, 1999. S. Gay, V. Vasconcelos, and A. Ravara. Session types for inter-process communication. Technical report, 2003. S. J. Gay, V. T. Vasconcelos, A. Ravara, N. Gesbert, and A. Z. Caldeira. Modular Session Types for Distributed Object-Oriented Programming. In Proceedings of the 37th ACM Symposium on Principles of Programming Languages (POPL ’10), pages 299–312, 2010. S. Hanazumi and A. C. V. de Melo. Coordinating exceptions of java systems: Implementation and formal verification. In Proceedings of the 2012 Eighth International Conference on the Quality of Information and Communications Technology, QUATIC ’12, pages 108–113. IEEE Computer Society, 2012. K. Honda, A. Mukhamedov, G. Brown, T.-C. Chen, and N. Yoshida. Scribbling interactions with a formal foundation. In Proceedings of the 7th International Conference on Distributed Computing and Internet Technologies (ICDCIT 2011), pages 55–75, 2011. K. Honda, N. Yoshida, and M. Carbone. Multiparty Asynchronous Session Types. In Proceedings of the 35th ACM Symposium on Principles of Programming Languages (POPL ’08), pages 273–284, 2008. R. Hu, D. Kouzapas, O. Pernet, N. Yoshida, and [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] K. Honda. Type-safe eventful sessions in java. In ECOOP 2010 - Object-Oriented Programming, 24th European Conference, Proceedings, pages 329–353, 2010. R. Hu, N. Yoshida, and K. Honda. Session-based distributed programming in java. In ECOOP 2008 Object-Oriented Programming, 22nd European Conference, Proceedings, pages 516–541, 2008. L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558–565, July 1978. M. Neubauer and P. Thiemann. An Implementation of Session Types. In Proceedings of the 6th ACM International Symposium on Practical Aspects of Declarative Languages (PADL ’04), pages 56–70, 2004. B. C. Neuman and T. Ts’o. Kerberos: An Authentication Service for Computer Networks. IEEE Communications, (9):33–38, Sept. 1994. N. Ng and N. Yoshida. Pabble: Parameterised scribble for parallel programming. In 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pages 707–714. IEEE Computer Society, 2014. N. Ng, N. Yoshida, O. Pernet, R. Hu, and Y. Kryftis. Safe parallel programming with session java. In Proceedings of the 13th International Conference on Coordination Models and Languages, COORDINATION’11, pages 110–126, 2011. D. Skeen and M. Stonebraker. A Formal Model of Crash Recovery in a Distributed System. IEEE Transactions on Software Engineering, 9(3):219–228, May 1983. K. Takeuchi, K. Honda, and M. Kubo. An Interaction-based Language and its Typing System. In Proceedings of the 6th International PARLE Conference on Parallel Architectures and Languages Europe (PARLE ’94), pages 398–413, 1994. A. Vallecillo, V. T. Vasconcelos, and A. Ravara. Typing the Behavior of Software Components using Session Types. Fundamenta Informaticae, 73(4):pp. 583–598, 2006. G. Weikum and G. Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, 2002. J. Xu, A. B. Romanovsky, and B. Randell. Concurrent exception handling and resolution in distributed object systems. IEEE Trans. Parallel Distrib. Syst., 11(10):1019–1032, 2000. N. Yoshida, R. Hu, R. Neykova, and N. Ng. The scribble protocol language. In Trustworthy Global Computing - 8th International Symposium, TGC 2013, Revised Selected Papers, pages 22–41, 2013. APPENDIX A. TWO PHASE COMMIT Figure 6 outlines a simplified version of the Two Phase Commit (2PC) protocol in our protocol-type. For simplicity we focus on the case of two participants — which may fail — as that is enough to illustrate what we need, as argued by Skeen and Stonebraker [24]. 1 p r o t o c o l TwoPC { 2 p a r t i c i p a n t s : Coord , P a r t 1 , P a r t 2 3 Coord −> P a r t 1 : <l o n g > 4 Coord −> P a r t 2 : <l o n g > 5 try { 6 P a r t 1 −> Coord : <b o o l > | T i m e o u t F a i l u r e 7 P a r t 2 −> Coord : <b o o l > | T i m e o u t F a i l u r e 8 Coord −> P a r t 1 : <b o o l > 9 Coord −> P a r t 2 : <b o o l > 10 } h a n d l e ( T i m e o u t F a i l u r e ) { // a b o r t t o a l l 11 Coord −> P a r t 1 : <b o o l > 12 Coord −> P a r t 2 : <b o o l > 13 } 14 } Figure 6: 2PC protocol with failure handling in protocol-type In an asynchronous distributed system a TimeoutFailure does not necessarily imply a participant crash, and so we asynchronously notify both participants of the abort regardless of failures. The 2PC example points to the importance of choosing the semantics for failure handling. The coordinator always performs the same two sends of decisions to both participants, regardless of failures. Thus from the perspective of these participants there is no difference to replacing the entire try ... handle block on Lines 5-13 simply with Part1 Part2 Coord Coord −> −> −> −> Coord : Coord : Part1 : Part2 : <b o o l > <b o o l > <b o o l > <b o o l > This would also avoid sending a failure message and an abort decision to participants in case of failure. The net difference, however, is that the coordinator can get stuck waiting for a vote from a participant which indeed failed. Based on the semantics expressed inherently with the example of Figure 6, a timeout on either participant constrains the coordinator to proceed to the handle clause. It also implies that any non-faulty participant knows to not expect two messages. In other terms they too proceed to the handler. Figure 7 illustrates the 2PC in FSJ? . 1 protocol TwoPCExplicit { 2 p a r t i c i p a n t s : Coord , P a r t 1 , P a r t 2 3 Coord −> P a r t 1 : <l o n g > 4 Coord −> P a r t 2 : <l o n g > 5 Part1 : { 6 T i m e o u t F a i l u r e : // n e x t 2 l i n e s d u p l i c a t e d 7 Coord −> P a r t 1 : <B o o l e a n > 8 Coord −> P a r t 2 : <B o o l e a n > 9 NoFailure : 10 P a r t 1 −> Coord : <B o o l e a n > 11 Part2 : { 12 T i m e o u t F a i l u r e : // n e x t l i n e s d u p l . 7−8 13 Coord −> P a r t 1 : <B o o l e a n > 14 Coord −> P a r t 2 : <B o o l e a n > 15 NoFailure : 16 P a r t 2 −> Coord : <B o o l e a n > 17 } 18 } 19 } Figure 7: 2PC with FSJ?