Tel Aviv University
Raymond and Beverly Sackler Faculty of Exact Sciences
School of Computer Science

AUTOMATIC FINE-GRAINED SYNCHRONIZATION

by Guy Golan Gueta

under the supervision of Prof. Mooly Sagiv and Prof. Eran Yahav
and the consultation of Dr. G. Ramalingam

A thesis submitted for the degree of Doctor of Philosophy
Submitted to the Senate of Tel Aviv University
April 2015

To my loved ones, Sophie, Yasmin, Ortal and Ariel.

Abstract

Automatic Fine-Grained Synchronization
Guy Golan Gueta
School of Computer Science
Tel Aviv University

A key challenge in writing concurrent programs is synchronization: ensuring that concurrent accesses and modifications to shared mutable state do not interfere with each other in undesirable ways. An important correctness criterion for synchronization is atomicity, i.e., the synchronization should ensure that a code section (a transaction) appears to execute atomically. Realizing efficient and scalable synchronization that correctly ensures atomicity is considered a challenging task.

In this thesis, we address the problem of achieving correct and efficient atomicity by developing and enforcing certain synchronization protocols. We present three novel synchronization approaches that utilize program-specific information using compile-time and run-time techniques.

The first approach leverages the shape of shared memory in order to transform a sequential library into an atomic library (i.e., into a library in which each operation appears to execute atomically). The approach is based on domination locking, a novel fine-grained locking protocol designed specifically for concurrency control in object-oriented software with dynamic pointer updates. We present a static algorithm that automatically enforces domination locking in a sequential library that is implemented using a dynamic forest.
We show that our algorithm can successfully add effective fine-grained locking to libraries where manually performing the locking is challenging.

The second approach transforms atomic libraries into transactional libraries, which ensure atomicity of sequences of operations. The central idea is to create a library that exploits information (foresight) provided by its clients. The foresight restricts the cases that must be considered by the library, thereby permitting more efficient synchronization. This approach is based on a novel synchronization protocol built around a notion of dynamic right-movers. We present a static analysis to infer the foresight information required by the approach, allowing a compiler to automatically insert the foresight information into the client. This relieves the client programmer of the burden and simplifies writing client code. We present a generic implementation technique for realizing the approach in a given library. We show that this approach enables enforcing atomicity of a wide selection of real-life Java composite operations. Our experiments indicate that the approach enables efficient and scalable synchronization for real-life composite operations.

Finally, we present an approach that enables using multiple transactional libraries together. This approach is applicable to a special case of transactional libraries in which the synchronization is based on locking that exploits semantic properties of the library operations. It realizes semantics-based fine-grained locking built on the commutativity properties of the library operations and on the program's dynamic pointer updates. We show that this approach leads to effective synchronization; in some cases, it improves the applicability and the performance of our second approach.

We formalize the above approaches and prove that they guarantee atomicity and deadlock freedom.
We show that our approaches provide a variety of tools for effectively dealing with common cases in concurrent programs.

Acknowledgements

First and foremost, I would like to express my deep gratitude and appreciation to my advisors, Mooly Sagiv and Eran Yahav. Their guidance, inspiration, knowledge, and optimism were crucial for the completion of this thesis.

I would like to thank G. Ramalingam for his guidance, help, and support throughout the work on this thesis. This thesis would have been impossible without his guidance and help.

I would like to thank Alex Aiken and Nathan Bronson for interesting discussions, joint work, and for the enjoyable visits at Stanford University.

I would like to thank Mooly's group for many fruitful discussions and for being such a wonderful combination of research colleagues and friends: Ohad Shacham, Shachar Itzhaky, Omer Tripp, Ofri Ziv, Oren Zomer, Ghila Castelnuovo, Ariel Jarovsky, Hila Peleg, and Or Tamir.

Contents

1 Introduction
  1.1 Automatic Fine-Grained Locking
  1.2 Transactional Libraries
  1.3 Composition of Transactional Libraries via Semantic Locking
2 Fine-Grained Locking using Shape Properties
  2.1 Overview
  2.2 Preliminaries
  2.3 Domination Locking
  2.4 Enforcing DL in Forest-Based Libraries
    2.4.1 Eager Forest-Locking
    2.4.2 Enforcing EFL
    2.4.3 Example for Dynamically Changing Forest
  2.5 Performance Evaluation
    2.5.1 General Purpose Data Structures
    2.5.2 Specialized Implementations
3 Transactional Libraries with Foresight
  3.1 Overview
    3.1.1 Serializable and Serializably-Completable Executions
    3.1.2 Serializably-Completable Execution: A Characterization
    3.1.3 Synchronization Using Foresight
    3.1.4 Realizing Foresight Based Synchronization
  3.2 Preliminaries
    3.2.1 Libraries
    3.2.2 Clients
  3.3 Foresight-Based Synchronization
    3.3.1 The Problem
    3.3.2 The Client Protocol
    3.3.3 Dynamic Right Movers
    3.3.4 Serializability
    3.3.5 B-Serializable-Completability
    3.3.6 E-Completability
    3.3.7 Special Cases
  3.4 Automatic Foresight for Clients
    3.4.1 Annotation Language
    3.4.2 Inferring Calls to mayUse Procedures
    3.4.3 Implementation for Java Programs
  3.5 Implementing Libraries with Foresight
    3.5.1 The Basic Approach
    3.5.2 Using Dynamic Information
    3.5.3 Optimistic Locking
    3.5.4 Further Extensions
    3.5.5 Java Threads and Transactions
  3.6 Experimental Evaluation
    3.6.1 Applicability and Precision Of The Static Analysis
    3.6.2 Comparison To Hand-Crafted Implementations
    3.6.3 Evaluating The Approach On Realistic Software
  3.7 Java Implementation of the Transactional Maps Library
    3.7.1 Base Library
    3.7.2 Extended Library
    3.7.3 Utilizing Dynamic Information by Handcrafted Optimization
    3.7.4 API Adapter
4 Composition of Transactional Libraries via Semantic Locking
  4.1 Semantic Locking
    4.1.1 Basics
    4.1.2 ADTs With Semantic Locking
    4.1.3 Automatic Atomicity
  4.2 Automatic Atomicity Enforcement
    4.2.1 Enforcing S2PL
    4.2.2 Lock Ordering Constraints
    4.2.3 Enforcing OS2PL on Acyclic Graphs
    4.2.4 Optimizations
    4.2.5 Handling Cycles via Coarse-Grained Locking
  4.3 Using Specialized Locking Operations
  4.4 Implementing ADTs with Semantic Locking
  4.5 Performance Evaluation
    4.5.1 Benchmarks
    4.5.2 Performance
5 Related Work
  5.1 Synchronization Protocols
  5.2 Concurrent Data Structures
  5.3 Automatic Synchronization
6 Conclusions and Future Work
Bibliography

Chapter 1

Introduction

Concurrency is widely used in software systems because it helps reduce latency, increase throughput, and provide better utilization of multi-core machines [35, 69]. However, writing concurrent programs is considered a difficult and error-prone process, due to the need to consider all possible interleavings of code fragments that execute in parallel.

Atomicity

Atomicity is a fundamental correctness property of code sections in concurrent programs.
Intuitively, a code section is said to be atomic if for every (arbitrarily interleaved) program execution there is an equivalent execution with the same overall behavior in which the atomic code section is not interleaved with other parts of the program. In other words, an atomic code section can be seen as a code section that always executes in isolation. Atomic code sections simplify reasoning about concurrent programs, since one can assume that each atomic code section is never interleaved with other parts of the program.

Several variants of the atomicity property have been defined and used in the literature on databases and shared-memory systems, where each variant has its own semantic properties and is aimed at a specific set of scenarios and considerations (e.g., [45, 55, 75, 86]). For example, linearizability [55] is a variant of atomicity that is commonly used to describe implementations of shared libraries (and shared objects): an operation of a linearizable library can be seen as taking effect instantaneously at some point between its invocation and its response. Linearizability ignores the actual implementation details of the shared library; instead, it considers only the library's behavior from the point of view of its clients.

The Problem

In this thesis we address the problem of automatically ensuring atomicity of code sections by realizing efficient and scalable synchronization. One of the main challenges is to guarantee atomicity in a scalable way, restricting parallelism only where necessary. The synchronization should not impose a run-time overhead so high that it becomes worthless; i.e., it should perform better than simple alternatives (such as a single global lock [89]). Solutions for enforcing atomicity that are implemented in practice are predominantly handcrafted and tend to lead to concurrency bugs (e.g., see [33, 58, 79]).
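The linearizability notion above can be made concrete with a minimal sketch (a hypothetical example built on java.util.concurrent, not code from the thesis): a counter whose Get and Inc each take effect at a single instant, so concurrent increments are never lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// A minimal linearizable counter: each operation appears to take effect
// instantaneously at some point between its invocation and its response.
// (Hypothetical illustration; class and method names are our own.)
class LinearizableCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    int get()  { return value.get(); }        // linearizes at the volatile read
    void inc() { value.incrementAndGet(); }   // linearizes at the successful CAS

    // Concurrent increments are never lost: `threads` threads each calling
    // inc() `perThread` times always yield threads * perThread.
    static int stress(int threads, int perThread) {
        LinearizableCounter c = new LinearizableCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < perThread; j++) c.inc(); });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return c.get();
    }
}
```

Note that linearizability of the individual operations says nothing about sequences of operations; that gap is exactly what the later chapters address.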
Automatic approaches (see Section 5.3) allow a programmer to declaratively specify the atomic code sections, leaving it to a compiler and run-time system to implement the necessary synchronization. However, existing automatic approaches have not been widely adopted, due to various concerns [27, 34, 36, 69, 84, 89], including high run-time overhead, poor performance, and a limited ability to handle irreversible operations (such as I/O operations).

Specialized Synchronization

In this thesis we present several approaches for automatic atomicity enforcement, where each approach is designed to handle a restricted class of programs (and scenarios). The idea is to produce synchronization that enforces atomicity by exploiting the restricted properties of the programs. For each approach we describe a specialized synchronization protocol and realize the protocol by using a combination of compile-time and run-time techniques. The synchronization protocols are designed to ensure efficient and scalable atomicity, without leading to deadlocks and without using any rollback mechanism.

The presented approaches deal with two different aspects of the synchronization problem for concurrent programs. The first approach deals with the code inside libraries (i.e., the library implementation), whereas the other two approaches deal with code that uses a library's API.

1.1 Automatic Fine-Grained Locking

In Chapter 2, we present an approach that leverages the shape of shared memory (i.e., the shape of the heap's object graph) to transform a sequential library into a linearizable library [55]. This approach is based on the paper "Automatic Fine-Grain Locking using Shape Properties", which was presented at OOPSLA 2011 [40]. A library encapsulates shared data with a set of procedures, which may be invoked by concurrently executing threads. Given the code of a library, our goal is to add correct fine-grained locking that ensures linearizability and permits a high degree of parallelism.
Specifically, we are interested in locking in which each shared object has its own lock, and locks may be released before the end of the computation. The main insight of this approach is to use the shape of the pointer data structures to simplify reasoning about fine-grained locking and to automatically infer efficient and correct fine-grained locking.

Domination Locking

We define a new fine-grained locking protocol called Domination Locking. Domination Locking is a set of conditions that guarantee atomicity and deadlock freedom. It is designed to handle dynamically-manipulated recursive data structures by leveraging natural domination properties of paths in dynamically-changing data structures. This protocol is a strict generalization of several related fine-grained locking protocols, such as dynamic tree locking and dynamic DAG locking [17, 19, 28].

Automatic Fine-Grained Locking

We then present an automatic technique to enforce the conditions of Domination Locking. The technique is applicable to libraries where the shape of the shared heap is a forest. The technique allows the shape of the heap to change dynamically, as long as the shape is a forest between invocations of library operations. We show that our technique adds efficient and scalable fine-grained locking in several practical data structures where it is hard to produce similar locking manually. We demonstrate the applicability of the method on balanced search trees [16, 46], a self-adjusting heap [81], and specialized data structure implementations [18, 72].

1.2 Transactional Libraries

Linearizable libraries provide operations that appear to execute atomically. However, clients often need to perform a sequence of library operations that appears to execute atomically, referred to hereafter as an atomic composite operation.
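Why linearizable operations are not enough can be seen in a small, deterministically replayed interleaving (a hypothetical example; class names are our own). Each library call below is atomic on its own, yet the two-call composite "read, then increment if zero" is not:

```java
// Each call to get()/inc() is atomic, but the two-call sequence is not:
// in the interleaving shown, both clients observe 0 and both increment.
// (The interleaving is replayed deterministically in a single thread.)
class NonAtomicComposite {
    static class Counter {
        private int value = 0;
        synchronized int get() { return value; }
        synchronized void inc() { value++; }
    }

    static int interleavedRun() {
        Counter c = new Counter();
        int seenByA = c.get();      // client A: reads 0
        int seenByB = c.get();      // client B: reads 0, between A's two calls
        if (seenByA == 0) c.inc();  // client A: increments
        if (seenByB == 0) c.inc();  // client B: increments as well
        return c.get();             // 2, although an atomic composite
    }                               // would have permitted only one inc
}
```

Making such composite sequences atomic, without a global lock, is the problem the transactional-library approach addresses.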
In Chapter 3, we consider the problem of extending a linearizable library to support arbitrary atomic composite operations by clients. We introduce a novel approach in which the library ensures atomicity of composite operations by exploiting information provided by its clients. We refer to such libraries as transactional libraries. Our basic methodology requires the client code to demarcate the sequence of operations for which atomicity is desired and to provide declarative information to the library (foresight) about the library operations that the composite operation may invoke. It is the library's responsibility to ensure the desired atomicity, exploiting the foresight information for effective synchronization.

Example 1.2.1 The idea is demonstrated in the code fragment shown in Figure 1.1. This code uses a shared Counter (a shared library) by invoking its Get and Inc operations. The code provides information (foresight) about the possible future Counter operations: at line 2 it indicates that any operation may be invoked (after line 2); at line 4 it indicates that only Inc may be invoked (after line 4); finally, at line 9 it indicates that no more operations will be invoked (after line 9). This information is utilized by the Counter implementation in order to efficiently ensure that this code fragment always executes atomically. A detailed version of this example is described in Chapter 3.

Our approach is based on the paper "Concurrent Libraries with Foresight", which was presented at PLDI 2013 [41].

1  /* @atomic */ {
2    @mayUseAll()
3    c = Get();
4    @mayUseInc()
5    while (c > 0) {
6      c = c-1;
7      Inc();
8    }
9    @mayUseNone()
10 }

Figure 1.1: Code that provides information (foresight) about the possible future operations.

Foresight-Based Synchronization

We first present a formalization of this approach. We formalize the desired goals and present a sufficient correctness condition.
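To make the division of labor between client and library concrete, here is a hypothetical sketch of a counter that exploits foresight of the kind shown in Figure 1.1 (an illustration only, not the thesis's protocol; all names are our own): mayUseAll takes an exclusive lock, since Get does not commute with Inc, while mayUseInc downgrades to a shared mode, because Inc operations commute with one another and may therefore overlap across transactions; mayUseNone releases whatever is held.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A hypothetical foresight-aware counter. Exclusive mode covers phases that
// may call any operation; shared mode covers inc-only phases, which are safe
// to overlap because concurrent incs on an AtomicInteger commute.
class ForesightCounter {
    private final AtomicInteger value = new AtomicInteger(0);
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

    void mayUseAll() { rw.writeLock().lock(); }

    void mayUseInc() {            // downgrade: acquire read, release write
        rw.readLock().lock();
        rw.writeLock().unlock();
    }

    void mayUseNone() {           // release whichever mode is currently held
        if (rw.isWriteLockedByCurrentThread()) rw.writeLock().unlock();
        else rw.readLock().unlock();
    }

    int get()  { return value.get(); }
    void inc() { value.incrementAndGet(); }
}
```

A client transcribing Figure 1.1 would call mayUseAll(), then get(), then mayUseInc(), run the inc() loop, and finish with mayUseNone(); the library, not the client, decides how to turn those hints into locking.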
As long as the clients and the library extension satisfy the correctness condition, all composite operations are guaranteed atomicity without deadlocks. Our sufficient condition is broad and permits a range of implementation options and fine-grained synchronization. It is based on a notion of dynamic right-movers (Section 3.3.3), which generalizes the traditional notions of static right-movers and commutativity [61, 66]. Our approach decouples the implementation of the library from the client. Thus, the correctness of the client does not depend on the way the foresight information is used by the library implementation. The client only needs to ensure the correctness of the foresight information.

Automatic Foresight for Clients

We then present a static analysis to infer the foresight information required by our approach, allowing a compiler to automatically insert the foresight information into the client code. This relieves the client programmer of the burden and simplifies writing atomic composite operations.

Library Extension Realization

Our approach permits the use of customized, hand-crafted implementations of the library extension. However, we also present a generic technique for extending a linearizable library with foresight. The technique is based on a novel variant of the tree locking protocol in which the tree is designed according to semantic properties of the library's operations. We used our generic technique to implement a single general-purpose Java library for Map data structures. Our library permits composite operations to simultaneously work with multiple instances of Map data structures. (We focus on Maps because Shacham [78] observed that Maps are heavily used for implementing composite operations in real-life concurrent programs.) We use our library and the static analysis to enforce atomicity of a selection of real-life Java composite operations, including composite operations that manipulate multiple instances of Map data structures.
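The static commutativity notion that dynamic right-movers generalize can be checked directly on a toy model of the counter (an illustrative check only; Section 3.3.3 gives the dynamic definition): two operations commute when running them in either order from the same state yields the same final state and the same observations.

```java
// Toy state-machine check of commutativity for a counter with "inc"/"get".
// (Hypothetical helper; class and method names are our own.)
class Commutativity {
    // Runs a sequence of ops from `state`; returns {finalState, lastObserved},
    // where lastObserved is -1 if no "get" occurred.
    static int[] run(int state, String... ops) {
        int observed = -1;
        for (String op : ops) {
            if (op.equals("inc")) state++;
            else observed = state; // "get"
        }
        return new int[]{state, observed};
    }
}
```

Running "get" before "inc" observes a different value than "inc" before "get", so Get does not commute with Inc; two "inc" operations reach the same state in either order, which is why inc-only phases admit more parallelism.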
Our experiments indicate that our approach enables efficient and scalable synchronization for real-life composite operations.

1.3 Composition of Transactional Libraries via Semantic Locking

In Chapter 4, we present an approach for handling composite operations that use multiple transactional libraries. Our approach is described in the short paper "Automatic Semantic Locking", which was presented at PPoPP 2014 [42]. This approach is also used in the paper "Automatic scalable atomicity via semantic locking", which was presented at PPoPP 2015 [43]. Our approach can be seen as a combination of Chapter 3 and approaches for automatic lock inference (e.g., [68]).

In this approach, we restrict the synchronization that can be implemented in the libraries to synchronization that resembles locking; this synchronization is similar to the semantics-aware locking from the database literature. We refer to such libraries as libraries with semantic locking. We describe a static algorithm that enforces atomicity of code sections that use multiple libraries with semantic locking. We implemented this static algorithm and show that it produces efficient and scalable synchronization.

Chapter 2

Fine-Grained Locking using Shape Properties

In this chapter, we consider the problem of turning a sequential library into a linearizable library [55]. Our goal is to provide a synchronization method that guarantees atomicity of the library operations in a scalable way, restricting parallelism only where necessary. We are interested in a systematic method that is applicable to a large family of libraries, rather than a method specific to a single library.

Fine-Grained Locking

One way to achieve scalable multi-threading is to use fine-grained locking (e.g., [19]).
In fine-grained locking, one associates, e.g., each shared object with its own lock, permitting multiple operations to simultaneously operate on different parts of the shared state. Reasoning about fine-grained locking is challenging and error-prone. As a result, programmers often resort to coarse-grained locking, leading to limited scalability.

The Problem

We would like to automatically add fine-grained locking to a library. A library encapsulates shared data with a set of procedures, which may be invoked by concurrently executing threads. Given the code of a library, our goal is to add correct locking that ensures atomicity and permits a high degree of parallelism. Specifically, we are interested in locking in which each shared object has its own lock, and locks may be released before the end of the computation. Our main insight is that we can use the restricted shape of pointer data structures to simplify reasoning about fine-grained locking and to automatically infer efficient and correct fine-grained locking.

Domination Locking

We define a new fine-grained locking protocol called Domination Locking. Domination Locking is a set of conditions that guarantees atomicity and deadlock freedom. It is designed to handle dynamically-manipulated recursive data structures by leveraging natural domination properties of dynamic data structures.

Figure 2.1: An example of a Treap data structure. (The figure shows a tree whose root has key 10 and priority 99, with nodes (key 5, priority 20), (key 15, priority 72), (key 12, priority 30), and (key 18, priority 50) below it.)

Automatic Fine-Grained Locking

We present an automatic technique to enforce the conditions of Domination Locking. The technique is applicable to libraries where the shape of the shared memory is a forest. The technique allows the shape of the heap to change dynamically, as long as the shape is a forest between invocations of library operations.
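As background for the kind of locking discussed above, the classic hand-over-hand (lock-coupling) traversal of a sorted linked list illustrates per-object locks with early release (an illustrative sketch only, not the thesis's technique or code):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hand-over-hand search in a sorted linked list: every node carries its own
// lock, a child is locked before its parent is released, and earlier nodes
// become available to other threads while the search is still in progress.
class FineGrainedList {
    static class Node {
        final int key;
        Node next;
        final ReentrantLock lock = new ReentrantLock();
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    private final Node head = new Node(Integer.MIN_VALUE, null); // sentinel

    void add(int key) { // sequential helper used to build the list
        Node n = head;
        while (n.next != null && n.next.key < key) n = n.next;
        n.next = new Node(key, n.next);
    }

    boolean contains(int key) {
        Node prev = head;
        prev.lock.lock();
        Node cur = prev.next;
        while (cur != null) {
            cur.lock.lock();    // lock the child before...
            prev.lock.unlock(); // ...releasing the parent (early release)
            if (cur.key == key) { cur.lock.unlock(); return true; }
            if (cur.key > key)  { cur.lock.unlock(); return false; }
            prev = cur;
            cur = cur.next;
        }
        prev.lock.unlock();
        return false;
    }
}
```

Even this simple scheme already requires a careful argument about lock order and release points, which is exactly the reasoning burden the automatic technique removes.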
In contrast to existing lock inference techniques, which are based on two-phase locking (see Section 5.3), our technique is able to release locks at early points of the computation. Finally, as we demonstrate in Section 2.4 and Section 2.5, our technique adds effective and scalable fine-grained locking in several practical data structures where it is extremely hard to manually produce similar locking. Our examples include balanced search trees [16, 46], a self-adjusting heap [81], and specialized data structure implementations [18, 72].

Motivating Example

Consider a library that implements the Treap data structure [16]. A Treap is a search tree that is simultaneously a binary search tree (on the key field) and a heap (on the priority field). An example is shown in Figure 2.1. If priorities are assigned randomly, the resulting structure is equivalent to a random binary search tree, providing good asymptotic bounds for all operations. The Treap implementation consists of three procedures: insert, remove, and lookup.

Manually adding fine-grained locking to the Treap's code is challenging, since it requires considering many subtle details of the code. In contrast, our technique can add fine-grained locking to the Treap's code without considering its exact implementation details. (In other words, our technique does not need to understand the actual code of the Treap.)

For example, consider the Treap's remove operation shown in Figure 2.2. To achieve concurrent execution of its operations, we must release the lock on the root, while an operation is still in progress, once it is safe to do so. Either of the loops (starting at Lines 4 or 12) can move the current context to a subtree, after which the root (and, similarly, other nodes) should be unlocked. Several parts of this procedure implement tree rotations that change the order among the Treap's nodes, complicating any correctness reasoning that depends on the order among nodes.
Figure 2.3 shows an example of manual fine-grained locking of the Treap's remove operation. Manually adding fine-grained locking to the code took an expert several hours and was an extremely error-prone process. In several cases, the expert released a lock too early, resulting in an incorrect concurrent algorithm (e.g., the release operation in Line 28). Our technique is able to automatically produce fine-grained concurrency in the Treap's code by relying on its tree shape. This is in contrast to existing alternatives, such as manually enforcing hand-over-hand locking, that require a deep understanding of code details.

Note that the dynamic tree locking protocol [17] is sufficient to ensure atomicity and deadlock freedom for the Treap example. In fact, the locking shown in Figure 2.3 satisfies the conditions of the dynamic tree locking protocol. But in contrast to the domination locking protocol, which can be automatically enforced in the Treap's code, none of the existing synchronization techniques (see Section 5.3) can automatically enforce the dynamic tree locking protocol for the Treap (even though the Treap is a single tree).

2.1 Overview

In this section, we present a brief informal description of our approach.

Domination Locking

We define a new locking protocol, called Domination Locking (abbreviated DL). DL is a set of conditions that are designed to guarantee atomicity and deadlock freedom for the operations of a well-encapsulated library. DL differentiates between a library's exposed and hidden objects: exposed objects (e.g., the Treap's root) act as the intermediary between the library and its clients, with pointers to such objects being passed back and forth between the library and its clients, while the clients are completely unaware of hidden objects (e.g., the Treap's internal nodes). The protocol exploits the fact that all operations must begin with one or more exposed objects and traverse the heap graph to reach hidden objects.
The protocol requires the exposed objects passed as parameters to an operation to be locked in a fashion similar to two-phase locking. However, hidden objects are handled differently. A thread is allowed to acquire a lock on a hidden object if the locks it holds dominate the hidden object. (A set S of objects is said to dominate an object u if every path (in the heap graph) from an exposed object to u contains some object in S.) In particular, hidden objects can be locked even after other locks have been released, thus enabling early release of other locked objects (hidden as well as exposed). This simple protocol generalizes several fine-grained locking protocols defined for dynamically changing graphs [17, 19, 28] and is applicable in more cases (i.e., the conditions of DL are weaker). We use the DL conditions as the basis for our automatic technique.

1  boolean remove(Node par, int key) {
2    Node n = null;
3    n = par.right; // right child has root
4    while (n != null && key != n.key) {
5      par = n;
6      n = (key < n.key) ? n.left : n.right;
7    }
8    if (n == null)
9      return false; // search failed, no change
10   Node nL = n.left;
11   Node nR = n.right;
12   while (true) { // n is the node to be removed
13     Node bestChild = (nL == null ||
14         (nR != null && nR.prio > nL.prio)) ? nR : nL;
15     if (n == par.left)
16       par.left = bestChild;
17     else
18       par.right = bestChild;
19     if (bestChild == null)
20       break; // n was a leaf
21     if (bestChild == nL) {
22       n.left = nL.right; // rotate nL to n's spot
23       nL.right = n;
24       nL = n.left;
25     } else {
26       n.right = nR.left; // rotate nR to n's spot
27       nR.left = n;
28       nR = n.right;
29     }
30     par = bestChild;
31   }
32   return true;
33 }

Figure 2.2: Removing an element from a Treap by locating it and then rotating it into a leaf position. (Our technique can add fine-grained locking to this code without understanding its details.)
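The parenthetical definition of domination above can be checked directly on a heap graph: S dominates u exactly when u is unreachable from the exposed objects once the members of S are removed. A small sketch over a toy adjacency-list graph (illustrative only; not the thesis's code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// dominates(heap, exposed, s, u) returns true iff every path in the heap
// graph from an exposed object to u passes through some member of s.
// Implemented as reachability with the nodes of s deleted.
class Domination {
    static boolean dominates(Map<Integer, List<Integer>> heap,
                             Set<Integer> exposed, Set<Integer> s, int u) {
        if (s.contains(u)) return true;
        Deque<Integer> work = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        for (int e : exposed)
            if (!s.contains(e)) { work.push(e); seen.add(e); }
        while (!work.isEmpty()) {
            int n = work.pop();
            if (n == u) return false; // found a path that avoids s
            for (int m : heap.getOrDefault(n, List.of()))
                if (!s.contains(m) && seen.add(m)) work.push(m);
        }
        return true;
    }
}
```

In the protocol, a thread holding locks on a dominating set for u may safely lock u, because no other operation can reach u without first blocking on one of those locks.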
1  boolean remove(Node par, int key) {
2    Node n = null;
3    acquire(par);
4    n = par.right;
5    if (n != null) acquire(n);
6    while (n != null && key != n.key) {
7      release(par);
8      par = n;
9      n = (key < n.key) ? n.left : n.right;
10     if (n != null) acquire(n);
11   }
12   if (n == null) { release(par); return false; }
13   Node nL = n.left; if (nL != null) acquire(nL);
14   Node nR = n.right; if (nR != null) acquire(nR);
15   while (true) {
16     Node bestChild = (nL == null ||
17         (nR != null && nR.prio > nL.prio)) ? nR : nL;
18     if (n == par.left)
19       par.left = bestChild;
20     else
21       par.right = bestChild;
22     release(par);
23     if (bestChild == null)
24       break;
25     if (bestChild == nL) {
26       n.left = nL.right;
27       nL.right = n;
28       // release(nL); // an erroneous release statement
29       nL = n.left;
30       if (nL != null) acquire(nL);
31     } else {
32       n.right = nR.left;
33       nR.left = n;
34       nR = n.right;
35       if (nR != null) acquire(nR);
36     }
37     par = bestChild;
38   }
39   return true;
40 }

Figure 2.3: The Treap's remove code with manual fine-grained locking.

Figure 2.4: An execution of the Treap's remove (Figure 2.2) in which the tree shape is violated: the node pointed to by nR and bestChild has two predecessors. (The figure shows the tree of Figure 2.1 with the variables par, n, nL, nR, and bestChild marked on its nodes.)

Automatic Locking of Forest-Based Libraries

Our technique is able to automatically enforce DL in a way that releases locks at early points of the computation. Specifically, the technique is applicable to libraries whose heap graphs form a forest at the end of any complete sequential execution (of any sequence of operations). Note that existing shape analyses for sequential programs can be used to automatically verify whether a library satisfies this precondition (e.g., [76, 88]). In particular, we avoid the need to explicitly reason about concurrent executions.
For example, the treap is a tree at the end of any of its operations, when executed sequentially. Note that during some of its operations (insert and remove) its tree shape is violated by a node with multiple predecessors (caused by the tree rotations). An example of such a violation (caused by the rotations in remove) is shown in Figure 2.4.

Our technique uses the following locking scheme: a procedure invocation maintains a lock on the set of objects directly pointed to by its local variables (called the immediate scope). When an object goes out of the immediate scope of the invocation (i.e., when the last variable pointing to that object is assigned some other value), the object is unlocked if it has (at most) one predecessor in the heap graph (i.e., if it does not violate the forest shape). If a locked object has multiple predecessors when it goes out of the immediate scope of the invocation, then it is unlocked eventually, when the object again has at most one predecessor. The forest-condition guarantees that every lock is eventually released. To realize this scheme, we use a pair of reference counts to track incoming references from the heap and from the local variables of the current procedure. All the updates to the reference counts can be done easily by instrumenting every assignment statement, allowing a relatively simple compile-time transformation. While we defer the details of the transformation to Section 2.4, Figure 2.5 shows the transformed implementation of remove (from Figure 2.2). ASNL and ASNF are macros that perform assignment to a local variable and a field, respectively, update reference counts, and conditionally acquire or release locks according to the above locking scheme.

    boolean remove(Node par, int key) {
      Node n = null;
      Take(par);
      ASNL(n, par.right);
      while (n != null && key != n.key) {
        ASNL(par, n);
        ASNL(n, (key < n.key) ? n.left : n.right);
      }
      if (n == null) {
        ASNL(par, null);
        ASNL(n, null);
        return false;
      }
      Node nL = null; ASNL(nL, n.left);
      Node nR = null; ASNL(nR, n.right);
      while (true) {
        Node bestCh = null; ASNL(bestCh, (nL == null ||
            (nR != null && nR.prio > nL.prio)) ? nR : nL);
        if (n == par.left)
          ASNF(par.left, bestCh);
        else
          ASNF(par.right, bestCh);
        if (bestCh == null) {
          ASNL(bestCh, null);
          break;
        }
        if (bestCh == nL) {
          ASNF(n.left, nL.right);
          ASNF(nL.right, n);
          ASNL(nL, n.left);
        } else {
          ASNF(n.right, nR.left);
          ASNF(nR.left, n);
          ASNL(nR, n.right);
        }
        ASNL(par, bestCh);
        ASNL(bestCh, null);
      }
      ASNL(par, null); ASNL(n, null); ASNL(nL, null);
      ASNL(nR, null);
      return true;
    }

Figure 2.5: Augmenting remove with macros to dynamically enforce domination locking.

Main Contributions

The main contributions of this chapter can be summarized as follows:

• We introduce a new locking protocol, called Domination Locking. We show that domination locking can be enforced and verified by considering only sequential executions [17]: if domination locking is satisfied by all sequential executions, then atomicity and deadlock freedom are guaranteed in all executions, including non-sequential ones.

• We present an automatic technique to generate fine-grained locking by enforcing the domination locking protocol for libraries where the heap graph is guaranteed to be a forest in between operations. Our technique can handle any temporary violation of the forest shape constraint, including temporary cycles.

• We present a performance evaluation of our technique on several examples, including balanced search trees [16, 46], a self-adjusting heap [81] and specialized data structure implementations [18, 72].
The evaluation shows that our automatic locking provides good scalability and performance comparable to hand-crafted locking (for the examples where hand-crafted locking solutions were available).

• We discuss extensions and additional applications of our techniques.

2.2 Preliminaries

Our goal is to augment a library with concurrency control that guarantees strict conflict-serializability [75] and linearizability [55]. In this section we formally define what a library is, and the notion of strict conflict-serializability for libraries.

Syntax and Informal Semantics

A library defines a set of types and a set of procedures that may be invoked by clients of the library, potentially concurrently. A type consists of a set of fields of type boolean, integer, or pointer to a user-defined type. The types are private to the library: an object of a type T defined by a library M can be allocated or dereferenced only by procedures of library M. However, pointers to objects of type T can be passed back and forth between the clients of library M and the procedures of library M. Dually, types defined by clients are private to the client. Pointers to client-defined types may be passed back and forth between the clients and the library, but the library cannot dereference such pointers (or allocate objects of such types).

Procedures have parameters and local variables, which are private to the invocation of the procedure. (Thus, these are thread-local variables.) There are no static or global variables shared by different invocations of procedures. (However, our results can be generalized to support them.)

    stms = skip | x = e(y1,...,yk) | assume(b) | x = new R()
         | x = y.f | x.f = y | acquire(x) | release(x) | return(x)

Figure 2.6: Primitive instructions. b stands for a local boolean variable; e(y1,...,yk) stands for an expression over local variables.

We assume that the body of a procedure is represented by a control-flow graph.
We refer to the vertices of a control-flow graph as program points. The edges of a control-flow graph are annotated with primitive instructions, shown in Figure 2.6. Conditionals are encoded by annotating control-flow edges with assume statements. Without loss of generality, we assume that a heap object can be dereferenced only in a load (“x = y.f”) or store (“x.f = y”) instruction. Operations to acquire or release a lock refer to a thread-local variable (that points to the heap object to be locked or unlocked). The other primitive instructions reference only thread-local variables.

We present a semantics for a library independent of any specific client. We define a notion of execution that covers all possible executions of the library that can arise with any possible client, while restricting attention to the part of the program state “owned” by the library. (In effect, our semantics models what is usually referred to as the “most-general client” of the library.) For simplicity, we assume that each procedure invocation is executed by a different thread, which allows us to identify procedure invocations using a thread-id. We refer to each invocation of a procedure as a transaction. We model a procedure invocation as the creation of a new thread with an appropriate thread-local state.

We describe the behavior of a library by the relation −→. A transition σ −→ σ′ represents the fact that a state σ can be transformed into a state σ′ by executing a single instruction. Transactions share a heap consisting of an (unbounded) set of heap objects. Any object allocated during the execution of a library procedure is said to be a library(-owned) object. In fact, our semantics models only library-owned objects. Any library object that is returned by a library procedure is said to be an exposed object. Other library objects are hidden objects. Note that an exposed object remains exposed forever.
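The exposure status described above is monotone: an object starts hidden when allocated and can only flip to exposed, never back. A sketch (with hypothetical names, not the thesis's representation) of the map r used by the semantics:

```java
import java.util.IdentityHashMap;
import java.util.Map;

// A sketch (hypothetical names) of the exposure map r from the semantics:
// every library-allocated object starts hidden, and returning it to a client
// (the return(x) instruction) makes it exposed, permanently -- no instruction
// hides an object again.
class ExposureMap {
    private final Map<Object, Boolean> r = new IdentityHashMap<>();

    void allocated(Object o) { r.put(o, false); }      // new objects are hidden

    void returnedToClient(Object o) {                  // models return(x)
        if (r.containsKey(o)) r.put(o, true);          // exposed forever
    }

    boolean isExposed(Object o) { return r.getOrDefault(o, false); }
}
```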
A key idea encoded in the semantics is that at any point during execution a new procedure invocation may occur. The only assumption made is that any library object passed as a procedure argument is exposed; i.e., the object was returned by some earlier procedure invocation.

Each heap-allocated object serves as a lock for itself. Locks are exclusive (i.e., a lock can be held by at most one transaction at a time). The execution of a transaction trying to acquire a lock (by an acquire statement) which is held by another transaction is blocked until a time when the lock is available (i.e., is not held by any transaction).

    v ∈ Val = Loc ⊎ Z ⊎ {true, false, null}
    ρ ∈ E   = V ⇀ Val
    h ∈ H   = Loc ⇀ (F ⇀ Val)
    l ∈ L   = Loc
    s ∈ S   = K × E × 2^L
    σ ∈ Σ   = H × (Loc ⇀ {true, false}) × (T ⇀ S)

Figure 2.7: Semantic domains

    Instruction        | Transition                                                                     | Side Condition
    skip               | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ, L⟩]⟩                                            |
    x = e(y1,...,yk)   | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ[x ↦ [[e]](ρ(y1),...,ρ(yk))], L⟩]⟩               |
    assume(b)          | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ, L⟩]⟩                                            | ρ(b) = true
    x = new R()        | σ −(t,e)→ ⟨h[a ↦ o], r, ϱ[t ↦ ⟨k′, ρ[x ↦ a], L⟩]⟩                              | a ∉ dom(h) ∧ ι(R)() = o
    x = y.f            | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ[x ↦ h(ρ(y))(f)], L⟩]⟩                           | ρ(y) ∈ dom(h)
    x.f = y            | σ −(t,e)→ ⟨h[ρ(x) ↦ (h(ρ(x))[f ↦ ρ(y)])], r, ϱ[t ↦ ⟨k′, ρ, L⟩]⟩               | ρ(x) ∈ dom(h)
    acquire(x)         | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ, L ∪ {ρ(x)}⟩]⟩                                   | ρ(x) ∈ L ∨ ∀⟨k″, ρ′, L′⟩ ∈ range(ϱ) : ρ(x) ∉ L′
    release(x)         | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ, L \ {ρ(x)}⟩]⟩                                   | ρ(x) ∈ L
    return(x)          | σ −(t,e)→ ⟨h, r[ρ(x) ↦ true], ϱ[t ↦ ⟨k′, ρ, L⟩]⟩                               | ρ(x) ∈ dom(h)
    return(x)          | σ −(t,e)→ ⟨h, r, ϱ[t ↦ ⟨k′, ρ, L⟩]⟩                                            | ρ(x) ∉ dom(h)

Table 2.1: The semantics of primitive instructions. For brevity, we use the shorthands σ = ⟨h, r, ϱ⟩ and ϱ(t) = ⟨k, ρ, L⟩, and omit (k, k′) = e ∈ CFGt from all side conditions.
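The blocking, exclusive, per-object locks used by the semantics (with the idempotent acquire that the text later notes is permitted by the Lock interface from java.util.concurrent.locks) can be sketched on top of ReentrantLock; LockableObject and heldByMe are hypothetical names, not the thesis's code:

```java
import java.util.concurrent.locks.ReentrantLock;

// A sketch (not the thesis's code) of per-object exclusive locks with an
// idempotent acquire: acquire is a no-op when the current thread already
// holds the lock, and otherwise blocks until the lock becomes available.
class LockableObject {
    private final ReentrantLock lock = new ReentrantLock();

    void acquire() {
        if (!lock.isHeldByCurrentThread()) // idempotent: no nested hold count
            lock.lock();                   // blocks while another thread holds it
    }

    void release() {
        lock.unlock(); // IllegalMonitorStateException if the caller does not hold it
    }

    boolean heldByMe() {
        return lock.isHeldByCurrentThread();
    }
}
```

The isHeldByCurrentThread guard is what makes acquire idempotent rather than reentrant-counted: after any number of acquires, a single release frees the lock.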
Locks are reentrant: an acquire statement has no impact when it refers to a lock that is already held by the current transaction. A transaction cannot release a lock that it does not hold. Whenever a new object is allocated, its boolean fields are initialized to false, its integer fields are initialized to 0, and its pointer fields are initialized to null. Local variables are initialized in the same manner.

Semantics

Figure 2.7 defines the semantic domains of a state of a library, and meta-variables ranging over them. Let t ∈ T be the domain of transaction identifiers. A state σ = ⟨h, r, ϱ⟩ ∈ Σ of a library is a triple: h assigns values to the fields of dynamically allocated objects, where a value v ∈ Val can be a location, an integer, a boolean value, or null; r maps exposed objects to true and hidden objects to false; and ϱ associates a transaction t with its transaction-local state ϱ(t). A transaction-local state s = ⟨k, ρ, L⟩ ∈ S consists of: k, the value of the transaction's program counter; ρ, which records the values of its local variables; and L, the transaction's lock set, which records the locks that the transaction holds.

The behavior of a library is described by the relations −→ and ⇒. The relation −→ is a subset of Σ × (T × (K × K)) × Σ, and is defined in Table 2.1. A transition σ −(t,e)→ σ′ represents the fact that σ can be transformed into σ′ via transaction t executing the instruction annotating control-flow edge e. Invocation of a new transaction is modeled by the relation ⇒ ⊆ Σ × T × Σ; we say that ⟨h, r, ϱ⟩ ⇒t σ′ if σ′ = ⟨h, r, ϱ[t ↦ s]⟩ where t ∉ dom(ϱ) and s is any valid initial local state: i.e., s = ⟨entry, ρ, {}⟩, where entry is the entry vertex, and ρ maps local variables and parameters to appropriate initial values (based on their type). In particular, ρ must map any pointer parameter of a type defined by the library to an exposed object (i.e., an object u in h such that r(u) = true).
We write σ −→ σ′ if there exists t such that σ ⇒t σ′, or there exists ⟨t, e⟩ such that σ −(t,e)→ σ′.

Running Transactions

Each control-flow graph of a procedure has two distinguished control points: an entry site from which the transaction starts, and an exit site at which the transaction ends (if a CFG edge is annotated with a return statement, then this edge points to the exit site of the procedure). We say that a transaction t is running in a state σ if t is not in its entry site or exit site. An idle state is a state in which no transaction is running.

Executions

The initial state σI has an empty heap and no transactions. A sequence of states π = σ0, ..., σk is an execution if the following hold: (i) σ0 is the initial state; (ii) for 0 ≤ i < k, σi −→ σi+1. An execution π = σ0, ..., σk is a complete execution if σk is idle. An execution π = σ0, ..., σk is a sequential execution if for each 0 ≤ i ≤ k at most one transaction in σi is running. An execution is non-interleaved if transitions of different transactions are not interleaved (i.e., for every pair of transactions ti ≠ tj, either all the transitions executed by ti come before any transition executed by tj, or vice versa). Note that a sequential execution is a special case of a non-interleaved execution. In a sequential execution, a new transaction starts executing only after all previous transactions have completed execution. In a non-interleaved execution, a new transaction can start executing before a previous transaction completes execution, but the execution is not permitted to include transitions by the previous transaction once the new transaction starts executing. We say that a sequential execution is completable if it is a prefix of a complete sequential execution.

Schedules

The schedule of an execution π = σ0, ..., σk is a sequence ⟨t0, e0⟩, ..., ⟨tk−1, ek−1⟩ such that for 0 ≤ i < k: σi −(ti,ei)→ σi+1, or σi ⇒ti σi+1 and ei = einit (where einit is distinct from all edges in the CFG). We say that a sequence ξ = ⟨t0, e0⟩, ..., ⟨tk−1, ek−1⟩ is a feasible schedule if ξ is a schedule of an execution. The schedule of a transaction t in an execution is the (possibly non-contiguous) subsequence of the execution's schedule consisting only of t's transitions. Notice that each feasible schedule uniquely defines a single execution because: (i) we assume that there exists a single initial state; and (ii) each instruction defined in Table 2.1 is a partial function (in our semantics, nondeterminism is modeled by permitting CFG nodes with several outgoing edges).

¹For simplicity of presentation, we use an idempotent variant of acquire (i.e., acquire has no impact when the lock is already owned by the current transaction). We note that this variant is permitted by the Lock interface from the java.util.concurrent.locks package, and can easily be implemented in languages such as Java and C++.

Graph Representation

The heap (shared memory) of a state identifies an edge-labelled multidigraph (a directed graph in which multiple edges are allowed between the same pair of vertices), which we call the heap graph. Each heap-allocated object is represented by a vertex in the graph. A pointer field f in an object u that points to an object v is represented by an edge (u, v) labelled f. (Note that the heap graph represents only objects owned by the library. Objects owned by the client are not represented in the heap graph.)

We define the allocation id of an object in an execution to be the pair (t, i) if the object was allocated by the i-th transition executed by a transaction t. An object o1 in an execution π1 corresponds to an object o2 in an execution π2 iff their allocation ids are the same.
We compare states and objects belonging to different executions modulo this correspondence relation.

Strict Conflict-Serializability and Linearizability

Given an execution, we say that two transitions conflict if: (i) they are executed by two different transactions; and (ii) they access some common object (i.e., they read or write fields of the same object). Executions π and π′ are said to be conflict-equivalent if they consist of the same set of transactions, the schedule of every transaction t is the same in both executions, and the executions agree on the order between conflicting transitions (i.e., the i-th transition of a transaction t precedes and conflicts with the j-th transition of a transaction t′ in π iff the former precedes and conflicts with the latter in π′). Conflict-equivalent executions produce the same state [86]. An execution is conflict-serializable if it is conflict-equivalent to a non-interleaved execution. We say that an execution π is strict conflict-serializable if it is conflict-equivalent to a non-interleaved execution π′ where a transaction t1 completes execution before a transaction t2 in π′ if t1 completes execution before t2 in π.

Assume that all sequential executions of a library satisfy a given specification Φ. In this case, a strict conflict-serializable execution is also linearizable [56] with respect to specification Φ.² Thus, correctness in sequential executions combined with strict conflict-serializability is sufficient to ensure linearizability.

²Strict conflict-serializability guarantees the atomicity and the run-time order required by the linearizability property. Moreover, note that according to the linearizability property (as defined in [56]) the execution may contain transactions that will never be able to complete.

The above definitions can also be used for feasible schedules because (as explained earlier) a feasible schedule uniquely defines a single execution.
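Conflict-serializability of a finished schedule can be checked with the classic serialization-graph test (a standard technique, not the mechanism used in this thesis): add an edge t → t′ whenever a transition of t conflicts with and precedes a transition of t′, and check that the resulting precedence graph is acyclic. A minimal sketch, with hypothetical names:

```java
import java.util.*;

// A sketch of serialization-graph testing: the schedule is
// conflict-serializable iff the precedence graph over transactions is acyclic.
class SerializationGraph {
    // One transition: which transaction executed it and which heap object it
    // accessed (null for transitions that touch no heap object).
    static final class Step {
        final int txn; final Object accessed;
        Step(int txn, Object accessed) { this.txn = txn; this.accessed = accessed; }
    }

    static boolean isConflictSerializable(List<Step> schedule) {
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (int i = 0; i < schedule.size(); i++)
            for (int j = i + 1; j < schedule.size(); j++) {
                Step a = schedule.get(i), b = schedule.get(j);
                if (a.txn != b.txn && a.accessed != null && a.accessed == b.accessed)
                    edges.computeIfAbsent(a.txn, k -> new HashSet<>()).add(b.txn);
            }
        // Detect a cycle with a depth-first search over the precedence graph.
        Map<Integer, Integer> color = new HashMap<>(); // absent=white, 1=grey, 2=black
        for (Integer v : edges.keySet())
            if (color.getOrDefault(v, 0) == 0 && hasCycle(v, edges, color))
                return false;
        return true;
    }

    private static boolean hasCycle(int v, Map<Integer, Set<Integer>> edges,
                                    Map<Integer, Integer> color) {
        color.put(v, 1);
        for (int w : edges.getOrDefault(v, Set.of())) {
            int c = color.getOrDefault(w, 0);
            if (c == 1 || (c == 0 && hasCycle(w, edges, color))) return true;
        }
        color.put(v, 2);
        return false;
    }
}
```

This sketch conservatively treats every access to a common object as a conflict, matching the definition above (which does not distinguish reads from writes).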
2.3 Domination Locking

In this section we present the Domination Locking protocol (abbreviated DL). We show that if every sequential execution of a library satisfies DL and is completable, then every concurrent execution of the library is strict conflict-serializable and is a prefix of a complete execution (i.e., atomicity and deadlock freedom are guaranteed).

The locking protocol is parameterized by a total order ≤ on all heap objects, which remains fixed over the whole execution.

Definition 2.1 Let ≤ be a total order of heap objects. We say that an execution satisfies the Domination Locking protocol, with respect to ≤, if it satisfies the following conditions:

1. A transaction t can access a field of an object u only if u is currently locked by t.
2. A transaction t can acquire an exposed object u only if t has never acquired an exposed object v such that u ≤ v.
3. A transaction t can acquire an exposed object only if t has never released a lock.
4. A transaction t can acquire a hidden object u only if every path from an exposed object to u includes an object which is locked by t.

Intuitively, the protocol works as follows. Condition (1) prevents race conditions where two transactions try to update an object neither has locked. Conditions (2) and (3) deal with exposed objects. Very little can be assumed about an object that has been exposed: references to it may reside anywhere and may be used at any time by other transactions that know nothing about the invariants t is maintaining. Thus, as is standard, conditions (2) and (3) ensure that all transactions acquire locks on exposed objects in a consistent order, preventing deadlocks. The situation with hidden objects is different, and we know more: other threads can only gain access to t's hidden objects through some chain of references starting at an exposed object, and so it suffices for t to guard each such potential access path with a lock.
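Conditions (2) and (3) amount to acquiring locks on exposed objects in one globally consistent order. A minimal sketch of such an order, using the unique-identifier mechanism the text mentions as one way to obtain it (the class and method names are hypothetical, not the thesis's implementation):

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// A sketch (hypothetical names) of the total order that conditions (2) and
// (3) rely on: each object is stamped with a unique id at creation, and a
// transaction locks its exposed arguments in ascending id order before
// touching any of them, two-phase-locking style.
class Exposed {
    private static final AtomicLong NEXT = new AtomicLong();
    final long id = NEXT.getAndIncrement();   // fixed total order over objects
    final ReentrantLock lock = new ReentrantLock();
}

class OrderedAcquire {
    // Lock all exposed parameters of an operation in ascending id order.
    static void lockAll(Collection<Exposed> args) {
        List<Exposed> sorted = new ArrayList<>(args);
        sorted.sort(Comparator.comparingLong(o -> o.id));
        for (Exposed o : sorted)
            o.lock.lock();
    }
}
```

Because every transaction sorts by the same ids, two transactions can never each hold one of the other's wanted exposed locks, which is the deadlock-freedom argument behind conditions (2) and (3).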
Another way of understanding the protocol is that previous proposals (e.g., [28, 59, 63, 80]) treat all objects as exposed, whereas domination locking also takes advantage of the information hiding of abstract data types to impose a different, and weaker, requirement on encapsulated data. In particular, no explicit order is imposed on the acquisition or release of locks on hidden objects, provided condition (4) is maintained.

Theorem 2.2 Let ≤ be a total order of heap objects. If every sequential execution of the library is completable and satisfies Domination Locking with respect to ≤, then every execution of the library is strict conflict-serializable, and is a prefix of a complete execution.

This theorem implies that a concurrent execution cannot deadlock, since it is guaranteed to be a prefix of a complete execution.

Domination Locking generalizes previously proposed protocols such as the Dynamic Tree Locking (DTL) protocol and the Dynamic DAG Locking (DDL) protocol [17], which themselves subsume idioms such as hand-over-hand locking. The DTL and DDL protocols were inspired by database protocols for trees and DAGs ([28, 59, 63, 80]), but customized for use in programs where shape invariants may be temporarily violated. In particular, any execution that satisfies DTL or DDL can be shown to satisfy DL. In comparing these protocols, it should be noted that DTL and DDL were described in a restricted setting where the exposed objects took the form of a statically fixed set of global variables. DL generalizes this by permitting a dynamic set of exposed objects (which can grow over time). More importantly, DL is a strict generalization of DTL and DDL: executions that satisfy DL might not satisfy either DTL or DDL. Among other things, DL does not require the heap graph to satisfy any shape invariants. Thus, the above theorem generalizes a similar theorem established for DDL and DTL in [17].
The above theorem, like those in [17], is important because it permits the use of sequential reasoning, e.g., to verify that a library guarantees strict conflict-serializability via DL. More interestingly, this reduction theorem also simplifies the job of automatically guaranteeing strict conflict-serializability via DL, as we illustrate in this chapter.

The requirement for a total order on exposed objects does not restrict the protocol's applicability, since in any conventional programming environment such an order can be obtained (e.g., by using the memory addresses of objects, or by using a simple mechanism that assigns unique identifiers to objects). Furthermore, no order is needed when each transaction accesses a single exposed object.

Proof for Theorem 2.2

We now present the proof for Theorem 2.2. We start by discussing several basic properties of the programming model and domination locking. We then show that domination locking permits ordering the transactions in a way that allows all transactions to complete one after the other (assuming that all sequential executions are completable), such that the resultant complete execution is equivalent to a non-interleaved execution.

An execution is said to be well-locked if every transaction in the execution accesses a field of an object only when it holds a lock on that object. We say that a transaction t is in phase-1 if it is still running and has never released a lock. Otherwise, we say that t is in phase-2 (i.e., t is in phase-2 if it has already completed, or if it has released at least one lock).

Lemma 2.3 Let ξ = ξp ξt ξs be any feasible well-locked schedule, where ξt is the schedule of a transaction t. If t is in phase-1 (after ξ), then there is no conflict between ξt and ξs (in ξ).

Proof Transaction t has never released a lock in ξt, hence the transactions in ξs are not able to acquire locks acquired by t in ξt.
Since ξp ξt ξs is well-locked, the transactions in ξs do not access objects that are accessed by t in ξt. Hence there is no conflict between ξt and ξs.

We say that an ni-execution (non-interleaved execution) is phase-ordered if all phase-2 transactions precede all phase-1 transactions.

Lemma 2.4 Any feasible well-locked ni-schedule ξ1 ξ2 · · · ξn (where each ξi is executed by ti) is conflict-equivalent to a well-locked phase-ordered ni-schedule ξi1 · · · ξin.

Proof For any 1 ≤ i ≤ n such that ti is in phase-1, the ni-schedule ξ1 ξ2 · · · ξn is conflict-equivalent to the feasible ni-schedule ξ1 · · · ξi−1 ξi+1 · · · ξn ξi (because of Lemma 2.3). By repeatedly using this property we can move all phase-1 transactions to the end of the schedule. Hence there exists a feasible well-locked phase-ordered ni-schedule ξi1 · · · ξin that is conflict-equivalent to ξ1 ξ2 · · · ξn.

In the following we assume that ≤h is a total order of all heap objects. We assume that ≤h has a minimal value ⊥ (i.e., if u is an object then ⊥ ≤h u). We say that u <h v if u ≠ v and u ≤h v. We say that max(σ, t) = u if u is the maximal exposed object that is locked by transaction t in state σ (i.e., u is locked by t in σ, and every exposed object v that is locked by t in σ satisfies v ≤h u). If no exposed object is locked by t in σ, then max(σ, t) = ⊥.

Let π = α1 · · · αk be a phase-ordered execution.³ Let s be the last state of π. We say that π is fully-ordered if for every αi and αj that are in phase-1 the following holds: if i < j then max(s, tj) ≤h max(s, ti).

Lemma 2.5 Any feasible well-locked ni-schedule ξ = ξ1 ξ2 · · · ξn (where each ξi is executed by ti) is conflict-equivalent to a well-locked fully-ordered ni-schedule ξi1 · · · ξin.

Proof Identical to the proof of Lemma 2.4.
Here we reorder the phase-1 transactions according to ≤h: let i ≠ i′ be such that ti and ti′ are in phase-1; if max(s, ti′) ≤h max(s, ti) (where s is the state after ξ), then we move ξi to the end of the schedule before moving ξi′.

We say that a set S of objects dominates an object u (in a given state) if every path from an exposed object to u contains some object from S. We say that a transaction t blocks an object u (in a given state) if the set of objects locked by t dominates u.

³When we write α1 · · · αk or β1 · · · βk or α1 β1 · · · αk βk, we mean that each αi (and βi) is executed by transaction ti.

Lemma 2.6 Let π = π1 π2 be a sequential execution that follows domination locking. Let σ be the state at the end of π1. Let t be a transaction in phase-2 at the end of π1. For any object o in state σ, t can access o during the (remaining) execution π2 only if t blocks o in σ.

Proof Let u be an object in σ which is not blocked by t in σ. Hence σ contains a path P from an exposed object e to u such that none of the objects in P are locked by t. We inductively show that none of the objects in P are locked by t during π2, and hence (since the execution is well-locked) u is not accessed or modified by t during π2. Let v be an object in P. We write L(v) to denote the length of the shortest path (in state σ) from e to v. We prove by induction on L(v) that t does not lock v in π2. If L(v) = 0, then v = e (v is an exposed object). Because of condition 3, transaction t does not lock v in π2. If L(v) > 0, then from the induction hypothesis v is not blocked by t in π2. Because of condition 4, transaction t does not lock v in π2.

A conflict-predecessor of a step (t, e) in an execution π is a step that precedes (t, e) in π, uses (i.e., accesses or locks) the same object, and is executed by some t′ ≠ t. Let π1 and π2 be two executions such that for every transaction t the schedule of t in π1 is a prefix of the schedule of t in π2.
π2 is said to be a conflict-equivalent extension of π1 if every step (t, e) in π1 has the same conflict-predecessors as the corresponding step in π2. π2 is said to be an equivalent completion of π1 if it is a complete execution and is a conflict-equivalent extension of π1. Note that if an execution α1 β1 · · · αn βn is a conflict-equivalent extension of π = α1 · · · αn, then the execution α1 · · · αn β1 · · · βn is also a conflict-equivalent extension of π.

Lemma 2.7 Let πni be a well-locked ni-execution with a schedule α1 · · · αk. Let πe be a conflict-equivalent extension of πni with a schedule α1 β1 · · · αk βk. Assume that ti blocks an object u at the end of αi in πe. Then, the execution of αi+1 · · · αk in πni does not access u. (Note that in this case ti might not actually block object u at the end of αi in πni.)

Proof Let σ denote the state at the end of αi in πni. For any object x in σ accessed by the execution of αi+1 · · · αk in πni we define the path Px inductively as follows. If x is an exposed object in σ, then Px is defined to be the sequence [x]. If x is a hidden object in σ, then the execution of αi+1 · · · αk must have dereferenced some field of some object that pointed to x. Consider the first field y.f dereferenced by αi+1 · · · αk that pointed to x, where y represents an object. We define Px to consist of the sequence Py followed by x.

Assume that u is accessed during the execution of αi+1 · · · αk in πni. Hence Pu exists at the end of αi in πni. By the definition of a conflict-equivalent extension, Pu also exists at the end of the execution of αi in πe (in particular, for 1 ≤ j ≤ i the execution of βj in πe does not access any object in Pu). Hence, ti must hold a lock on some object y in this path (at the end of αi in both πni and πe). Since πni is well-locked, the execution of αi+1 · · · αk in πni could not have locked y, which is a contradiction.
Hence u is not accessed during the execution of αi+1 · · · αk in πni.

Lemma 2.8 Let π = α1 · · · αn be a well-locked fully-ordered execution with at least one incomplete transaction. Let tk be the first incomplete transaction in π (i.e., k is the minimal number such that tk is incomplete). If every sequential execution of the library follows domination locking and is completable, then π has an equivalent extension α1 · · · αk βk αk+1 · · · αn in which transaction tk is completed.

Proof Since α1 · · · αk represents a sequential execution, it has a completion α1 · · · αk βk that follows domination locking. We consider the following cases.

Case 1: After π, transaction tk is in phase-2. Let σ represent the state produced by the execution of α1 · · · αk. From Lemma 2.6, all objects in σ accessed during the execution of βk (in α1 · · · αk βk) must be blocked by tk in σ. From Lemma 2.7, the execution of αk+1 · · · αn (in α1 · · · αn) cannot access any object blocked by tk in σ. Hence the schedule α1 · · · αk βk αk+1 · · · αn is feasible and is a conflict-equivalent extension of α1 · · · αn.

Case 2: k = n. Here α1 · · · αk βk is the equivalent extension.

Case 3: k < n, and after π transaction tk is in phase-1. Let σ represent the state produced by the execution of α1 · · · αk−1. Let k < m ≤ n. Because of Lemma 2.3, no conflict-dependence can exist between the running transactions (because they are all in phase-1); hence α1 · · · αk−1 αm represents a feasible sequential execution that follows domination locking. Let u be an exposed object in σ that is accessed by tk in α1 · · · αk−1 αk βk; we will show that u is not accessed by tm in α1 · · · αk−1 αm. If u is accessed or locked by αk in α1 · · · αk−1 αk βk, then u is not accessed or locked by αm in α1 · · · αk−1 αm (because tk and tm have no conflict in π). Otherwise, u is locked by βk in α1 · · · αk−1 αk βk. Let σ′ denote the state produced by the execution of π.
max(σ′, tm) <h max(σ′, tk) (because after the fully-ordered execution π, tk and tm are in phase-1 and tk precedes tm). Also, max(σ′, tk) <h u (because of condition 2). Hence max(σ′, tm) <h u, and therefore u is not accessed or locked by tm in α1 · · · αk−1 αm.

Let v be a hidden object in σ that is accessed by tk in α1 · · · αk−1 αk βk. We will show that v is not accessed by tm in α1 · · · αk−1 αm. v is necessarily reachable from exposed objects in σ, hence there exists a path P (in σ) from an exposed object w to v such that w is the only exposed object in P. tk accesses w in α1 · · · αk−1 αk βk (because of conditions 4 and 1). Assume that v is accessed by tm in α1 · · · αk−1 αm; then tm accesses w in α1 · · · αk−1 αm (conditions 4 and 1). But we have shown that this is not possible for exposed objects. Therefore v is not accessed by tm in α1 · · · αk−1 αm.

We have shown that, for every k < m ≤ n, tk does not access (in α1 · · · αk−1 αk βk) any object that is accessed by tm (in α1 · · · αk−1 αm). Hence, α1 · · · αk βk αk+1 · · · αn is an equivalent extension of α1 · · · αn.

Lemma 2.9 Let π = α1 · · · αn be a well-locked fully-ordered execution. If every sequential execution of the library follows domination locking and is completable, then π has an equivalent completion α1 β1 · · · αn βn.

Proof If π is not a complete execution, we construct an equivalent completion α1 β1 · · · αn βn by repeatedly applying Lemma 2.8 (i.e., if π contains k > 0 incomplete transactions, Lemma 2.8 is used to produce an equivalent extension of π that contains k − 1 incomplete transactions).

Lemma 2.10 Let ξ = ξp ξt ξs be any feasible well-locked ni-schedule, where ξt is the schedule of a transaction t. If ξ · (t, e) is feasible, then ξp ξt · (t, e) is also feasible.

Proof Assume that ξ · (t, e) is feasible. We show that ξp ξt · (t, e) is feasible.
The only sources of infeasibility are when the step (t, e) involves a conditional branch (i.e., an assume statement) or an attempt to acquire a lock. We make the simplifying assumption that an assume statement refers only to thread-local variables. (Note that there is no loss of generality here, since any statement "assume e" can be rewritten as "x = e; assume x" where x is a thread-local variable.) As a result, ξp ξt · (t, e) must be feasible if (t, e) involves a conditional branch. Now, consider the case where (t, e) involves an "acquire x" instruction where x is a thread-local variable. If the object x points to is unlocked at the end of ξp ξt ξs , it must be unlocked at the end of ξp ξt as well. Hence, feasibility follows in this case as well.

Lemma 2.11 If every sequential execution of a library follows domination locking and is completable, then every ni-execution is well-locked.

Proof We prove this by induction on the length of the executions. Let ξ be a schedule of a well-locked ni-execution. We will prove that if ξ · (t, e) is feasible, then it is a schedule of a well-locked execution. Assume that after ξ, the step (t, e) accesses an object u. From Lemma 2.5, ξ is conflict-equivalent to a fully-ordered ni-schedule ξ′ = α1 · · · αn . We consider the following cases.

Case 1: there exists i such that t = ti and 1 ≤ i < n. From Lemma 2.10, α1 · · · αi · (ti , e) is a feasible schedule. From the induction hypothesis, ti holds a lock on u after α1 · · · αi . Hence, ti holds a lock on u after ξ′ = α1 · · · αn . Hence, ti holds a lock on u after ξ.

Case 2: t = tn . From Lemma 2.9, ξ′ has an equivalent completion with the schedule α1 β1 · · · αn βn . We define ξ″ = α1 β1 · · · αn−1 βn−1 αn (this is a prefix of α1 β1 · · · αn βn ). The step (tn , e) accesses u after ξ″ (because tn has the same local state after ξ′ and ξ″ ). Since ξ″ · (tn , e) represents a sequential execution, u is locked by tn after ξ″ .
Hence, tn holds a lock on u after α1 · · · αn . Hence, tn holds a lock on u after ξ.

Case 3: t does not appear in ξ. According to the definition of a schedule, the first step of a transaction does not access an object.

Lemma 2.12 If every sequential execution of a library follows domination locking and is completable, then every execution π is conflict-equivalent to a fully-ordered execution π′ such that a transaction t completes before a transaction t′ begins in π′ if t completes before t′ begins in π.

Proof We prove this by induction on the length of the execution. Consider any execution with a schedule ξ · (ti , e). By the inductive hypothesis, the execution of ξ is conflict-equivalent to a fully-ordered execution with the schedule ξ′ = α1 · · · αk (footnote 4) such that a transaction t completes before a transaction t′ begins in ξ′ if t completes before t′ begins in ξ. From Lemma 2.11, ξ′ is well-locked. We consider the following cases:

Case 1: After α1 · · · αi , transaction ti is in phase-2, and (ti , e) does not access a heap object. In this case, α1 · · · αk · (ti , e) is conflict-equivalent to α1 · · · αi · (ti , e) · αi+1 · · · αk .

Case 2: After α1 · · · αi , transaction ti is in phase-2, and (ti , e) accesses a heap object u. According to Lemma 2.9, ξ′ = α1 · · · αk has an equivalent completion ξ″ = α1 β1 · · · αk βk . Let ξ‴ = α1 β1 · · · αi−1 βi−1 αi (ξ‴ is a prefix of ξ″ ). ti has the same local state after α1 · · · αi and ξ‴ (according to the definition of conflict-equivalent extension). According to Lemma 2.10, α1 · · · αi · (ti , e) is a feasible schedule, so ξ‴ · (ti , e) is also a feasible schedule. Also, ξ‴ · (ti , e) represents a sequential execution (which follows domination locking). Hence, according to Lemma 2.6, ti blocks u after ξ‴ . Hence, ti blocks u after αi in ξ″ . Hence, according to Lemma 2.7, αi+1 · · · αk does not access u in ξ′ = α1 · · · αk .
Therefore, α1 · · · αk · (ti , e) is conflict-equivalent to α1 · · · αi · (ti , e) · αi+1 · · · αk .

Case 3: Transaction ti is in phase-1 after α1 · · · αi . Because of Lemma 2.3, we can reorder all phase-1 transactions and ti (even if ti is in phase-2 after α1 · · · αk · (ti , e)). If ti is in phase-2 after α1 · · · αk · (ti , e), then we can construct the fully-ordered equivalent execution by moving αi · (ti , e) just before all the phase-1 transactions. Otherwise (ti is still in phase-1 after α1 · · · αk · (ti , e)), we can construct the fully-ordered equivalent execution by moving αi · (ti , e) to the right place (between the phase-1 transactions) according to the max values.

Footnote 4: Note that 1 ≤ i ≤ k and αi may be empty.

Proof [for Theorem 2.2] From Lemma 2.12, we know that every execution of the library is strict conflict-serializable. We now want to show that every execution is also a prefix of a complete execution. Consider any execution π. According to Lemma 2.12, there exists a fully-ordered execution π′ = α1 · · · αn which is conflict-equivalent to π. According to Lemma 2.9, π′ has an equivalent completion α1 β1 · · · αn βn . According to the definition of conflict-equivalent extension, there exists an execution α1 · · · αn β1 · · · βn . Hence, π′ is a prefix of a complete execution. Since π and π′ end with the same state, π is also a prefix of a complete execution.

2.4 Enforcing DL in Forest-Based Libraries

In this section, we describe our technique for automatically adding fine-grained locking to a library when the library operates on heaps of restricted shape. Specifically, the technique is applicable to libraries that manipulate data structures with a forest shape, even with intra-transaction violations of forestness.
For example, the Treap data structure (mentioned at the beginning of this chapter) has a tree shape which is temporarily violated by tree-rotations (during tree-rotations, a node may have two parents). Our technique places no limit on the number of violations or their effect on the data structure's shape, as long as they are eliminated before the end of the transaction.

In Section 2.4.1, we describe the shape restrictions required by our technique, and present dynamic conditions that are enforced by our source transformation. We refer to these conditions as the Eager Forest-Locking protocol (EFL). In Section 2.4.2, we show how to automatically enforce EFL by a source-to-source transformation of the original library code.

2.4.1 Eager Forest-Locking

When the shape of the heap manipulated by the library is known to be a forest (possibly with temporary violations), we can enforce domination locking by dynamically enforcing the conditions outlined below. First, we define what it means for a library to be forest-based. We say that a hidden object u is consistent in a state σ, if u has at most one incoming edge in σ (footnote 5). We say that an exposed object u is consistent in a state σ, if it does not have any incoming edges in σ.

Definition 2.13 A library M is a forest-based library, if in every sequential execution, all objects in idle states are consistent.

For a forest-based library, we define the following Eager Forest-Locking conditions, and show that they guarantee that the library satisfies the domination locking conditions.

Eager Forest-Locking Requirements

Given a transaction t, we define t's immediate scope as the set of objects which are directly pointed to by local variables of t.
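Before stating the conditions, it may help to see Definition 2.13 concretely. The following is an illustrative sketch of our own (not code from the thesis) that checks consistency over an explicit heap graph; the `Obj` class and method names are assumptions made for the example:

```java
import java.util.*;

// Illustrative sketch (not the thesis's code): checking Definition 2.13's
// consistency over an explicit heap graph. A hidden object is consistent
// iff it has at most one incoming edge; an exposed object iff it has none.
class ForestCheck {
    static final class Obj {
        final boolean exposed;
        final List<Obj> fields = new ArrayList<>(); // outgoing pointer fields
        Obj(boolean exposed) { this.exposed = exposed; }
    }

    // Count the incoming edges of every object in the given heap snapshot.
    static Map<Obj, Integer> inDegrees(Collection<Obj> heap) {
        Map<Obj, Integer> in = new HashMap<>();
        for (Obj o : heap) in.put(o, 0);
        for (Obj o : heap)
            for (Obj succ : o.fields)
                in.merge(succ, 1, Integer::sum);
        return in;
    }

    static boolean isConsistent(Obj o, Map<Obj, Integer> in) {
        int d = in.getOrDefault(o, 0);
        return o.exposed ? d == 0 : d <= 1;
    }

    // A forest-based library keeps every object consistent in idle states.
    static boolean allConsistent(Collection<Obj> heap) {
        Map<Obj, Integer> in = inDegrees(heap);
        for (Obj o : heap)
            if (!isConsistent(o, in)) return false;
        return true;
    }

    public static void main(String[] args) {
        Obj root = new Obj(true);               // exposed: no incoming edges allowed
        Obj a = new Obj(false), b = new Obj(false);
        root.fields.add(a);
        a.fields.add(b);
        List<Obj> heap = List.of(root, a, b);
        System.out.println(allConsistent(heap)); // a forest: true
        root.fields.add(b);                      // b now has two parents
        System.out.println(allConsistent(heap)); // forestness violated: false
    }
}
```

The second print illustrates exactly the kind of temporary violation (a node with two parents, as in a tree-rotation) that EFL tolerates during a transaction but not in idle states.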
Intuitively, eager forest-locking is a simple protocol: a transaction t should acquire a lock on an object whenever the object enters t's immediate scope, and t should release the lock on an object whenever the object is out of t's immediate scope and is consistent. The protocol description below is a bit more complicated, because the abovementioned invariant may be temporarily violated while an object is being locked or unlocked. (In particular, conditions 1, 2, and 4 restrict the extent to which the invariant can be violated.)

Definition 2.14 Let ≤ be a total order on heap objects. We say that an execution satisfies Eager Forest-Locking (EFL) with respect to ≤, if it satisfies the following conditions:

1. A transaction t can access a field of an object, only if all objects in t's immediate scope are locked by t (footnote 6).
2. A transaction t can release an object, only if all objects in t's immediate scope are locked by t.
3. A transaction t can release a lock of an object u, only if u is consistent.
4. Immediately after a transaction t releases a lock of an object u, t removes u from its immediate scope (i.e., the next instruction of t removes u from the immediate scope by writing to a local variable that points to u).
5. A transaction t can acquire an exposed object u, only if t has never acquired an exposed object v such that u ≤ v.

In contrast to the DL conditions, the EFL conditions can directly be enforced by instrumenting the code of a given library, because all its dynamic conditions can be seen as conditions on its immediate scope and local memory. Such code instrumentation is allowed to consider only sequential executions, as stated by the following theorems:

Footnote 5: In the graph representation of the heap. Recall that the heap-graph contains only library-owned objects. In particular, this definition does not consider pointers to exposed objects that may be stored in client objects.

Footnote 6: Notice that t can access a field of an object o only by executing x.f=y or y=x.f when x points to o.
Furthermore, an unlocked object may be inside t's immediate scope (because the programming model permits pointing to an object before locking it).

Theorem 2.15 Let ≤ be a total order on heap objects. Let π be a sequential execution of a forest-based library. If π satisfies EFL with respect to ≤, then π satisfies DL with respect to ≤.

From Theorem 2.2 and Theorem 2.15 we conclude the following.

Theorem 2.16 Let ≤ be a total order on heap objects. If every sequential execution of a forest-based library is completable and satisfies EFL with respect to ≤, then every execution of this library is strict conflict-serializable, and is a prefix of a complete execution.

Proof for Theorem 2.15

We now present the proof for Theorem 2.15. We write DLi to denote condition (i) of the Domination Locking protocol (DL). We write EFLi to denote condition (i) of the Eager Forest-Locking protocol (EFL).

Lemma 2.17 Let M be a forest-based library. Let π = σ0 , . . . , σk be a sequential execution of M that satisfies the EFL protocol. If a hidden object u is not consistent at σk , then u is locked at σk .

Proof Let ξ = ⟨t0 , e0 ⟩, . . . , ⟨tk−1 , ek−1 ⟩ be the schedule of π. Let t be the last transaction in π. At the beginning of t, the object u is consistent (because M is a forest-based library). Let i be the maximal number such that t = ti , ei is annotated with x.f=y, and y points to u at σi (such an i exists because u is not consistent at σk ). Because of EFL1 , u is locked in σi . Because of EFL3 , u is not released after σi . Hence, u is locked at σk .

Lemma 2.18 Let π be a sequential execution of a forest-based library. If π satisfies EFL, then π satisfies DL3 .

Proof Let t be a transaction in π. Because of EFL1 and EFL2 , all exposed objects (which can be reached by t) are locked before t accesses or releases objects. Let s be the first state in which all exposed objects are locked by t.
Because of EFL5 , no exposed object can be locked by t after s. Hence, π satisfies DL3 .

Lemma 2.19 Let π = σ0 , . . . , σk be a sequential execution of a forest-based library. If π satisfies EFL, then π satisfies DL4 .

Proof We assume that π satisfies EFL. Let ξ = ⟨t0 , e0 ⟩, . . . , ⟨tk−1 , ek−1 ⟩ be the schedule of π. Let i be a number such that ei is annotated with acquire(x), x points to a hidden object u at σi , and u is not locked in σi . We consider the following cases.

Case 1: At σi , u has no predecessors. In this case, there is no path from an exposed object to u.

Case 2: At σi , u has at least two predecessors. Because of Lemma 2.17, u is locked in σi . Hence, this case can never happen.

Case 3: At σi , u has one predecessor. Let p be the predecessor of u at σi . Let j be the maximal number such that j ≤ i, u is in ti 's immediate scope in state σj , and u is not in ti 's immediate scope in state σj−1 . Because of EFL1 , ti cannot use x.f=y between σj and σi . Hence, ej is annotated with x=y.f and y points to p at σj . Because of EFL2 , p is not released by t between σj and σi . Hence, p is locked by t at σi .

Proof [for Theorem 2.15] Because of EFL1 , π satisfies DL1 . Because of EFL5 , π satisfies DL2 . Because of Lemma 2.18, π satisfies DL3 . Because of Lemma 2.19, π satisfies DL4 .

2.4.2 Enforcing EFL

In this section, we present a source-to-source transformation that enforces EFL in a forest-based library. The idea is to instrument the library such that it counts stack and heap references to objects, and to use these reference counts to determine when to acquire and release locks. Since the EFL conditions are defined over sequential executions, reasoning about the required instrumentation is fairly simple.
Run-Time Information

The instrumented library tracks objects in the immediate scope of the current transaction (footnote 7) by using stack-reference counters; the stack-reference counter of an object u tracks the number of references from local variables to u; hence u is in the immediate scope of the current transaction whenever its stack-reference counter is greater than 0. To determine the consistency of objects, it uses a heap-reference counter; the heap-reference counter of an object u tracks the number of references in heap objects that point to u; a hidden object is consistent whenever its heap-counter equals 0 or 1, and an exposed object is consistent whenever its heap-counter equals 0. To determine whether an object has been exposed, it uses a boolean field; whenever an object is exposed (returned) by the library, this field is set to true (in that object).

Locking Strategy

The instrumented code uses a strategy that follows the EFL conditions. At the beginning of the procedure, the instrumented library acquires all objects that are pointed to by parameters (and are thus exposed objects). The order in which these objects are locked is determined by using a special function, unique, that returns a unique identifier for each object (footnote 8). After locking all exposed objects,

Footnote 7: Note that we consider sequential executions, so we can assume a single current transaction.

Footnote 8: Note that only exposed objects are pointed to by the procedure parameters. According to Definition 2.13, these are the only exposed objects the transaction will see.

Take(x):
    if(x!=null) {
        acquire(x);
        x.stackRef++;
    }

Drop(x):
    if(x!=null) {
        x.stackRef--;
        if(x.stackRef==0 && IsConsistent(x))
            release(x);
    }

IsConsistent(x):
    if(x.isExposed)
        return (x.heapRef == 0);
    else
        return (x.heapRef <= 1);

MarkExposed(x):
    if(x!=null) x.isExposed=true;

Table 2.2: Primitive operations used in the EFL transformation.
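The primitives of Table 2.2 can be rendered as ordinary Java. The sketch below is our own illustration, under two stated assumptions: acquire/release denote a per-object mutex where re-acquisition by the holder is a no-op (modeled with `ReentrantLock`), and the `Node` layout is an example rather than the thesis's actual class:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative Java rendering of Table 2.2 (assumptions: acquire/release are
// a per-object mutex whose re-acquisition by the holder is a no-op, modeled
// via ReentrantLock; the Node class is an example, not the thesis's code).
class Efl {
    static final class Node {
        final ReentrantLock lock = new ReentrantLock();
        int stackRef;      // references from local variables to this object
        int heapRef;       // references from heap objects to this object
        boolean isExposed; // set once the object is returned to the client
        Node next;         // example pointer field
    }

    static void take(Node x) {                       // Take(x)
        if (x != null) {
            if (!x.lock.isHeldByCurrentThread()) x.lock.lock();
            x.stackRef++;
        }
    }

    static void drop(Node x) {                       // Drop(x)
        if (x != null) {
            x.stackRef--;
            if (x.stackRef == 0 && isConsistent(x))
                x.lock.unlock();                     // safe to release (EFL3)
        }
    }

    static boolean isConsistent(Node x) {            // IsConsistent(x)
        return x.isExposed ? x.heapRef == 0 : x.heapRef <= 1;
    }

    static void markExposed(Node x) {                // MarkExposed(x)
        if (x != null) x.isExposed = true;
    }

    public static void main(String[] args) {
        Node n = new Node();
        take(n);   // n enters the immediate scope: locked
        System.out.println(n.lock.isHeldByCurrentThread()); // true
        drop(n);   // last stack reference dropped, n consistent: released
        System.out.println(n.lock.isHeldByCurrentThread()); // false
        markExposed(n);
        n.heapRef = 1;
        System.out.println(isConsistent(n)); // false: exposed with an incoming edge
    }
}
```

Note that the lock is released only when both conditions hold: no remaining stack references and consistency; an inconsistent object stays locked, which is exactly what Lemma 2.17 relies on.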
TakeArgs2(x,y) {
    if(unique(x) < unique(y))
        { Take(x); Take(y); }
    else
        { Take(y); Take(x); }
}

Figure 2.8: Acquiring two procedure arguments in a unique locking order.

the instrumented library acts as follows: (i) it acquires an object u whenever its stack-reference counter becomes 1; (ii) it releases an object u whenever u is consistent and its stack-reference counter becomes 0. This strategy releases all locks before the completion of a transaction (since every object becomes consistent before that point), so it cannot create incompletable sequential executions.

Source-to-Source Transformation

Our transformation instruments each object with three additional fields: stackRef and heapRef to maintain the stack and heap reference counts (respectively), and isExposed to indicate whether the object has been exposed. The transformation is based on the primitive operations of Table 2.2. The procedures Take and Drop maintain the stack reference counters and perform the actual locking. Take(x) locks the object referenced by x and increments the value of its stack reference counter. Drop(x) decreases the stack reference count of the object referenced by x, and releases its lock if it is safe to do so according to the EFL protocol, i.e., if the reference from x was the only reference to the object, and the object is consistent. Drop uses the function IsConsistent, which indicates whether an object is consistent or not (according to its heap-counter and the isExposed field).

ASNL(x,ptrExp)   // replaces x = ptrExp
{
    temp=ptrExp;
    Take(temp);
    Drop(x);
    x=temp;
}

ASNF(x.f,ptrExp)   // replaces x.f = ptrExp
{
    temp=x.f;
    Take(temp);
    if(temp!=null) temp.heapRef--;
    Drop(temp);
    temp=ptrExp;
    Take(temp);
    if(temp!=null) temp.heapRef++;
    x.f = temp;
    Drop(temp);
}

Table 2.3: The macros ASNL and ASNF for pointer assignments enforcing EFL.

For each procedure of the library, our transformation is performed as follows:

1.
At the beginning of the procedure, add code that acquires all objects pointed to by arguments according to a fixed order; in the case of a single pointer argument l, this can be done by adding Take(l) (as in line 3 of Figure 2.5); the code of Figure 2.8 demonstrates the case of 2 pointer arguments; in the general case, objects are sorted to obtain the proper order.

2. Replace every assignment of a pointer expression with the corresponding code macro in Table 2.3. The macro ASNL(x,ptrExp) replaces an assignment of a pointer expression ptrExp to a local pointer x; it performs the assignment while maintaining the stack-counters and following the required locking strategy. The macro ASNF(x.f,ptrExp) replaces an assignment of a pointer expression to a field of an object; it maintains the heap-counters in objects (its implementation follows the required locking strategy).

3. Whenever a local variable l reaches the end of its scope, add ASNL(l,null); this releases the object pointed to by l. If this is the end of the procedure and l is about to be returned (i.e., by the statement return(l)), then instead of adding ASNL(l,null), add the block {MarkExposed(l);Drop(l);}.

void AddValues(Node x, Node y) {
    while(x!=null && y!=null) {
        x.value+=y.value;
        x=x.next;
        y=y.next;
    }
}

Figure 2.9: Example procedure adding the values from one linked-list into another.

void AddValues(Node x, Node y) {
    TakeArgs2(x,y);
    while(x!=null && y!=null) {
        x.value+=y.value;
        ASNL(x,x.next);
        ASNL(y,y.next);
    }
    ASNL(x,null); ASNL(y,null);
}

Figure 2.10: Transformed code enforcing EFL for the procedure AddValues of Figure 2.9.

Example

The procedure of Figure 2.9 takes a pair of pointers to singly-linked lists, and adds the values of one list to the values of the other. Figure 2.10 shows the code transformed to enforce EFL.
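The transformation of Figures 2.9/2.10 can also be exercised end-to-end in plain Java. The sketch below is our own rendering under simplifying assumptions: the per-object mutex is a `ReentrantLock` with no-op re-acquisition, `unique()` is approximated by `System.identityHashCode`, and the consistency test omits the isExposed case (the list heads here have no incoming edges anyway):

```java
import java.util.concurrent.locks.ReentrantLock;

// End-to-end sketch of Figures 2.9/2.10 (our own illustrative code, not the
// thesis's implementation). Assumptions: ReentrantLock models acquire/release,
// identityHashCode stands in for unique(), isExposed is omitted for brevity.
class AddValuesDemo {
    static final class Node {
        final ReentrantLock lock = new ReentrantLock();
        int stackRef, heapRef, value;
        Node next;
    }

    static void take(Node x) {
        if (x != null) { if (!x.lock.isHeldByCurrentThread()) x.lock.lock(); x.stackRef++; }
    }
    static void drop(Node x) {
        // Simplified IsConsistent: every node here is hidden or has heapRef 0.
        if (x != null) { x.stackRef--; if (x.stackRef == 0 && x.heapRef <= 1) x.lock.unlock(); }
    }
    // ASNL(x, e): assign pointer expression e to local x, maintaining counters.
    static Node asnl(Node x, Node e) { take(e); drop(x); return e; }

    // Transformed AddValues (Figure 2.10): adds y's values into x's list.
    static void addValues(Node x, Node y) {
        // TakeArgs2: lock both arguments in a fixed global order.
        if (System.identityHashCode(x) < System.identityHashCode(y)) { take(x); take(y); }
        else { take(y); take(x); }
        while (x != null && y != null) {
            x.value += y.value;
            x = asnl(x, x.next);   // hand-over-hand step: lock next, release current
            y = asnl(y, y.next);
        }
        x = asnl(x, null); y = asnl(y, null);  // end of scope: release remaining locks
    }

    static Node list(int... vals) {            // build a singly-linked list
        Node head = null;
        for (int i = vals.length - 1; i >= 0; i--) {
            Node n = new Node(); n.value = vals[i]; n.next = head;
            if (head != null) head.heapRef = 1;
            head = n;
        }
        return head;
    }

    public static void main(String[] args) {
        Node a = list(1, 2, 3), b = list(10, 20, 30);
        addValues(a, b);
        for (Node n = a; n != null; n = n.next) System.out.print(n.value + " "); // 11 22 33
        System.out.println();
    }
}
```

Running the demo shows the hand-over-hand behavior the transformation produces automatically: each node's lock is released as soon as its stack-reference count drops to zero, so no lock outlives its use.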
The transformed procedure starts with an invocation of TakeArgs2 (shown in Figure 2.8) to lock the exposed objects in a fixed order. In the body of AddValues, the assignment x=x.next is replaced by the macro ASNL(x,x.next), which assigns x.next to x while maintaining the EFL requirements. The assignment y=y.next is handled in a similar way. At the end of AddValues, the local variables go out of scope and the locks are released by adding ASNL(x,null) and ASNL(y,null).

Practical Considerations

In some cases, some of our instrumentation code can be avoided. For example, instead of replacing x=null with ASNL(x,null), we could just add Drop(x) before the assignment. Or, whenever it is known that a variable will never have a null value, we could avoid the if statements in Take and Drop. In libraries where the forestness condition is not violated even temporarily, the heap reference counter is not needed (since all objects remain consistent during a sequential execution). In many cases, exposed objects can be identified by the types of the objects (e.g., List is a type of exposed objects, and Node is a type of hidden objects); in such cases, type information can be used instead of the isExposed field.

void move(SkewHeap src, SkewHeap dest) {
    Node t1, t3, t2;
    t1=dest.root;
    t2=src.root;
    if(t1.key > t2.key) { // assume both heaps are not empty
        t3=t1; t1=t2; t2=t3;
    }
    dest.root=t1;
    src.root=null;
    t3=t1.right;
    while(t3 != null && t2 != null) {
        t1.right=t1.left;
        if(t3.key < t2.key) {
            t1.left=t3; t1=t3; t3=t3.right;
        }
        else {
            t1.left=t2; t1=t2; t2=t2.right;
        }
    }
    if(t3 == null) t1.right=t2;
    else t1.right=t3;
}

Figure 2.11: Moving the content of one Skew-Heap to another Skew-Heap.

Using Static Analysis

The instrumented code shown above can be optimized by using various static techniques. It is sufficient for such static techniques to consider only sequential executions of the library.
A live-variables analysis [15] can detect local pointers with unused values. Assigning null to such pointers will eliminate unused pointers, and as a result will release locks earlier. Some static tools (e.g., [64]) can help avoid some of the instrumentation code. For example, if a tool can detect that a local variable l is always null at some point of the CFG, our instrumentation code can avoid calling Take(l) at this point.

2.4.3 Example of a Dynamically Changing Forest

As an example of a dynamically changing forest, consider the procedure shown in Figure 2.11. This procedure operates on two Skew-Heaps [81] (a self-adjusting minimum-heap implemented as a binary tree). The procedure moves the content of one Skew-Heap (pointed to by src) to another one (pointed to by dest) by simultaneously traversing the heaps; during its operation, nodes are dynamically moved from one data structure to the other. Figure 2.12 shows its code after the source transformation.

void move(SkewHeap src, SkewHeap dest) {
    Node t1, t3, t2;
    TakeArgs2(src,dest);
    ASNL(t1, dest.root);
    ASNL(t2, src.root);
    if(t1.key > t2.key) {
        ASNL(t3,t1); ASNL(t1,t2); ASNL(t2,t3);
    }
    ASNF(dest.root, t1);
    ASNL(dest, null); // dest becomes dead
    ASNF(src.root, null);
    ASNL(src, null); // src becomes dead
    ASNL(t3, t1.right);
    while(t3 != null && t2 != null) {
        ASNF(t1.right, t1.left);
        if(t3.key < t2.key) {
            ASNF(t1.left, t3); ASNL(t1, t3); ASNL(t3, t3.right);
        }
        else {
            ASNF(t1.left, t2); ASNL(t1, t2); ASNL(t2, t2.right);
        }
    }
    if(t3 == null) ASNF(t1.right, t2);
    else ASNF(t1.right, t3);
    ASNL(t1, null); ASNL(t2, null); ASNL(t3, null);
}

Figure 2.12: Moving Skew-Heaps with automatic fine-grained locking.

2.5 Performance Evaluation

We evaluate the performance of our technique on several benchmarks.
For each benchmark, we compare the performance of the benchmark using fine-grained locking automatically generated by our technique to its performance using a single coarse-grained lock. We also compare some of the benchmarks to versions with hand-crafted fine-grained locking. For some benchmarks, manually adding fine-grained locking turned out to be too difficult even for concurrency experts.

In our experiments, we consider 5 different benchmarks: two balanced search-tree data structures, a self-adjusting heap data structure, and two specialized tree structures (which are tailored to their application). Two different machines have been used for our experiments. The first machine is an Intel i7 machine with 8 hardware threads (one quad-core i7 CPU, each core with two hardware threads). The second is a Sun SPARC Enterprise T5140 machine with 64 hardware threads (two eight-core CPUs, each core with four hardware threads).

2.5.1 General Purpose Data Structures

Balanced Search-Trees

We consider two Java implementations of balanced search trees: a Treap [16], and a Red-Black Tree with top-down balancing [23, 46]. For both balanced trees, we consider the common operations insert, remove and lookup.

Methodology

We follow the evaluation methodology of Herlihy et al. [50], and consider the data structures under a workload of 20% inserts, 10% removes, and 70% lookups. The keys are generated from a uniform random distribution between 1 and 2 × 106 . To ensure consistent and accurate results, each experiment consists of five passes; the first pass warms up the VM (footnote 9) and the four other passes are timed. Each experiment was run four times, and the arithmetic average of the throughput is reported as the final result. Every pass of the test program consists of each thread performing one million randomly chosen operations on a shared data structure; a new data structure is used for each pass.
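The methodology above can be captured in a few lines of harness code. The sketch below is our own illustration (not the thesis's harness), with the parameters scaled down and `ConcurrentSkipListMap` standing in for the benchmarked data structures:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the evaluation methodology (our own code, not the
// thesis's harness): each pass runs opsPerThread randomly chosen operations
// per thread (20% insert, 10% remove, 70% lookup) on a shared map; the first
// pass is an untimed warm-up, and throughput is averaged over the rest.
class Harness {
    static double measure(int threads, int passes, int opsPerThread, int keyRange)
            throws InterruptedException {
        ConcurrentSkipListMap<Integer, Integer> map = new ConcurrentSkipListMap<>();
        double total = 0; int timed = 0;
        for (int pass = 0; pass < passes; pass++) {
            map.clear();                       // a fresh structure per pass
            AtomicLong ops = new AtomicLong();
            Thread[] ts = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                ts[i] = new Thread(() -> {
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    for (int j = 0; j < opsPerThread; j++) {
                        int key = rnd.nextInt(keyRange) + 1;
                        int p = rnd.nextInt(100);
                        if (p < 20) map.put(key, key);     // 20% inserts
                        else if (p < 30) map.remove(key);  // 10% removes
                        else map.get(key);                 // 70% lookups
                        ops.incrementAndGet();
                    }
                });
                ts[i].start();
            }
            for (Thread t : ts) t.join();
            double ms = (System.nanoTime() - start) / 1e6;
            if (pass > 0) { total += ops.get() / ms; timed++; } // skip warm-up
        }
        return timed == 0 ? 0 : total / timed; // mean throughput in ops/msec
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("throughput: %.0f ops/msec%n", measure(4, 3, 10_000, 1_000));
    }
}
```

The thesis's actual runs use one million operations per thread, keys up to 2 × 10^6, five passes, and four repetitions; the tiny parameters here exist only to keep the sketch quick to run.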
Evaluation

For both search trees, we compare the results of our automatic locking to a coarse-grained global lock. For the Treap, we also consider a version with manual hand-over-hand locking. Enforcing hand-over-hand locking for the Treap is challenging, because after a rotation, the next thread to traverse a path will acquire a different sequence of locks. Assuring the absence of deadlock under different acquisition orders is challenging.

Footnote 9: Java virtual machine.

Figure 2.13: Throughput for a Treap on the Intel machine with 70% lookups, 20% inserts and 10% removes.

Figure 2.14: Throughput for a Treap on the SPARC machine with 70% lookups, 20% inserts and 10% removes.

For the Red-Black Tree, the task of manually adding fine-grained locks proved to be too challenging and error prone. Rotations and deletions are much more complicated than in a Treap. Previous work on fine-grained locking for these trees alters the tree invariants and algorithm, as in [74].

Figure 2.13 and Figure 2.14 show the results for the Treap. On the Intel machine (Figure 2.13), our automatic locking scales as well as the manual hand-over-hand locking. On the SPARC machine (Figure 2.14), the manual hand-over-hand locking is more efficient than our locking; they both scale up to 32 threads. The degradation in the performance of the SPARC machine for 64 threads can be explained by cross-chip latency and cache invalidations, since only the 64-thread experiment spans more than one chip. On both machines, starting from 2 threads, the fine-grained approaches outperform the single-lock synchronization. Figure 2.15 and Figure 2.16 show the results for the Red-Black Tree.
On the Intel machine (Figure 2.15), our automatic locking scales up to 8 threads. On the SPARC machine (Figure 2.16), it scales up to 16 threads. On both machines, starting from 4 threads, our automatic locking outperforms the single-lock synchronization.

Figure 2.15: Throughput for a Red-Black Tree on the Intel machine with 70% lookups, 20% inserts and 10% removes.

Figure 2.16: Throughput for a Red-Black Tree on the SPARC machine with 70% lookups, 20% inserts and 10% removes.

Figure 2.17: Throughput for a Skew Heap on the Intel machine with 50% inserts and 50% removeMin.

Figure 2.18: Throughput for a Skew Heap on the SPARC machine with 50% inserts and 50% removeMin.

Self-Adjusting Heap

We consider a Java implementation of a Skew Heap [23, 81], which is a self-adjusting heap data structure. We consider the operations insert and removeMin. We use the same evaluation methodology we used for the search trees. Here we consider a workload of 50% inserts and 50% removes on a heap initialized with one million elements. We compare the results of our automatic locking to a coarse-grained global lock. The results are shown in Figure 2.17 and Figure 2.18. On the Intel machine (Figure 2.17), our automatic locking scales up to 6 threads. On the SPARC machine (Figure 2.18), it scales up to 16 threads. Here, on both machines, starting from 4 threads, our automatic locking is faster than the single-lock approach.
2.5.2 Specialized Implementations

To illustrate the applicability of our technique to specialized data structures (which are tailored to their application), we consider a Java implementation of the Barnes-Hut algorithm [18], and a C++ implementation of the Apriori data-mining algorithm [14] from [72].

Figure 2.19: Apriori (on the Intel machine): normalized time of hash-tree construction.

Apriori

In this application, a number of threads concurrently build a Hash-Tree data structure (a tree data structure in which each node is either a linked-list or a hash-table). The original application uses customized hand-over-hand locking tailored to this application. We evaluate the performance of our locking relative to this specialized manual locking and to a single global lock. We show that our locking performs as well as the specialized manual locking scheme of the original application. In the experiments, we measured the time required for the threads to build the Hash-Tree. Figure 2.19 shows the speedup of the original hand-crafted locking and of our locking over a single lock (footnote 10). For 2 and 4 threads, the speedup of our locking is almost as good as that of the original manual locking. In the case of 8 threads, it performs better than the original locking (around 30% faster). Both have a small overhead in the case of a single thread (around 4% slower).

Barnes-Hut

The Barnes-Hut algorithm simulates the interaction of a system of bodies (such as galaxies or particles) and is built from several phases. Its main data structure is an OCT-Tree. We parallelized the Construction Phase, in which the OCT-Tree is built, and used our technique for synchronization. We measured the benefit gained by our locking. In the experiments, we measured the time required for the threads to build the OCT-Tree (i.e., the Construction Phase).
Figure 2.20 and Figure 2.21 show the results. On both machines, beyond 4 threads both fine-grained locking approaches are fully scalable. However, as in the previous results, for small numbers of threads the fine-grained approaches are still slower than the sequential version (probably because of their synchronization overhead).

Footnote 10: This benchmark has only been performed on the Intel machine because of compatibility problems with the environment installed on the SPARC machine.

Figure 2.20: Barnes-Hut (on the Intel machine): normalized time of OCT-Tree construction.

Figure 2.21: Barnes-Hut (on the SPARC machine): normalized time of OCT-Tree construction.

Chapter 3

Transactional Libraries with Foresight

Linearizable libraries (such as the ones produced by the approach in Chapter 2) shield programmers from the complexity of concurrency. Indeed, modern programming languages such as Java, Scala, and C# provide a large collection of linearizable libraries — these libraries provide operations that are guaranteed to be atomic, while hiding the complexity of the implementation from clients. Unfortunately, clients often need to perform a sequence of library operations that appears to execute atomically. In the sequel, we refer to such a sequence as an atomic composite operation. The problem of realizing atomic composite operations is an important and widespread one [24, 78]. Atomic composite operations are a restricted form of software transactions [48]. However, general-purpose software transaction implementations have not gained acceptance [27, 34, 36, 69, 84, 89] due to high runtime overhead, poor performance, and limited ability to handle irreversible operations (such as I/O operations).
Programmers typically realize such composite operations using ad-hoc synchronization, leading to many concurrency bugs in practice (e.g., [78]).

Transactional Libraries with Foresight

In this chapter, we address the problem of extending a linearizable library [55] so that clients can execute an arbitrary composite operation atomically. Our basic methodology requires the client code to demarcate the sequence of operations for which atomicity is desired and to provide declarative information to the library (foresight) about the library operations that the composite operation may invoke (as illustrated later). It is the library's responsibility to ensure the desired atomicity, exploiting the foresight information for effective synchronization.

We first present a formalization of this approach. We formalize the desired goals and present a sufficient correctness condition. As long as the clients and the library extension satisfy the correctness condition, all composite operations are guaranteed atomicity without deadlocks. Furthermore, our condition does not require the use of rollbacks. Our sufficient condition is broad and permits a range of implementation options and fine-grained synchronization. It is based on a notion of dynamic right-movers, which generalizes the traditional notions of static right-movers and commutativity [61, 66].

Our formulation decouples the implementation of the library from the client. Thus, the correctness of the client does not depend on the way the foresight information is used by the library implementation. The client only needs to ensure the correctness of the foresight information.

Automatic Foresight for Clients

We then present a simple static analysis to infer calls (in the client code) to the API used to pass the foresight information. Given a description of a library's API, our algorithm conservatively infers the required calls.
This relieves the client programmer of this burden and simplifies the writing of atomic composite operations.

Library Extension Realization

Our approach permits the use of customized, hand-crafted implementations of the library extension. However, we also present a generic technique for extending a linearizable library with foresight. The technique is based on a novel variant of the tree locking protocol in which the tree is designed according to semantic properties of the library's operations. We used our generic technique to implement a general-purpose Java library for Map data structures. Our library permits composite operations to simultaneously work with multiple instances of Map data structures.

Experimental Evaluation

We use the Map library and the static analysis to enforce atomicity of a selection of real-life Java composite operations, including composite operations that manipulate multiple instances of Map data structures. Our experiments indicate that our approach enables efficient and scalable synchronization for real-life composite operations.

Main Contributions of This Chapter

We develop the concept of transactional libraries with foresight along several dimensions, providing the theoretical foundations, an implementation methodology, and an empirical evaluation. Our main contributions are:

• We introduce the concept of transactional libraries with foresight, in which the library ensures atomicity of composite operations by exploiting information (foresight) provided by its clients. The main idea is to shift the responsibility for synchronizing composite operations from the clients to the library, and have the client provide useful foresight information to make efficient library-side synchronization possible.

• We define a sufficient correctness condition for clients and the library extension. Satisfying this condition guarantees atomicity and deadlock-freedom of composite operations (Section 3.3).
• We show how to realize both the client side (Section 3.4) and the library side (Section 3.5) of foresight. Specifically, we present a static analysis algorithm that provides foresight information to the library (Section 3.4), and a generic technique for implementing a family of transactional libraries with foresight (Section 3.5).

• We realized our approach and evaluated it on a number of real-world composite operations. We show that our approach provides efficient and scalable synchronization (Section 3.6).

3.1 Overview

We now present an informal overview of our approach for extending a linearizable library into a transactional library with foresight, using a toy example. Figure 3.1 presents the specification of a single Counter (library). The counter can be incremented (via the Inc() operation), decremented (via the Dec() operation), or read (via the Get() operation). The counter's value is always nonnegative: the execution of Dec() has an effect only when the counter's value is positive. All the counter's procedures are atomic.

    int value = I;
    void Inc() { atomic { value = value + 1; } }
    void Dec() { atomic { if (value > 0) then value = value - 1; } }
    int Get() { atomic { return value; } }

Figure 3.1: Specification of the Counter library. I denotes the initial value of the counter.

Figure 3.2 shows an example of two threads, each executing a composite operation: a code fragment consisting of multiple counter operations. (The mayUse annotations will be explained later.) Our goal is to execute these composite operations atomically: a serializable execution of these two threads is one that is equivalent to either thread T1 executing completely before T2, or vice versa.

    /* Thread T1 */        /* Thread T2 */
    /* @atomic */ {        /* @atomic */ {
      @mayUseInc()           @mayUseDec()
      Inc();                 Dec();
      Inc();                 Dec();
      @mayUseNone()          @mayUseNone()
    }                      }

Figure 3.2: Simple compositions of counter operations.
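As a concrete companion to Figure 3.1, the Counter specification can be sketched in Python, with a lock standing in for the atomic blocks (a sketch of the specification, not of the synchronization developed in this chapter):

```python
import threading

class Counter:
    """Figure 3.1's Counter: the value is always nonnegative, and Dec() on a
    zero counter has no effect. A lock plays the role of the atomic blocks."""
    def __init__(self, initial=0):
        assert initial >= 0
        self._value = initial
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:
            self._value += 1

    def dec(self):
        with self._lock:
            if self._value > 0:
                self._value -= 1

    def get(self):
        with self._lock:
            return self._value
```

Replaying the two serial orders of Figure 3.2 on this sketch (for I = 0) yields final values 2 (T2 before T1) and 0 (T1 before T2), matching the serializable outcomes discussed next.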
Assume that the counter value is initially zero. If T2 executes first, then neither decrement operation will change the counter value, and the subsequent execution of T1 will produce a counter value of 2. If T1 executes first and then T2 executes, the final value of the counter will be 0. Figure 3.3 shows a slightly more complex example.

    /* Thread T1 */        /* Thread T2 */
    /* @atomic */ {        /* @atomic */ {
      @mayUseAll()           @mayUseDec()
      c = Get();             Dec();
      @mayUseInc()           Dec();
      while (c > 0) {        @mayUseNone()
        c = c - 1;         }
        Inc();
      }
      @mayUseNone()
    }

Figure 3.3: Compositions of dependent counter operations.

3.1.1 Serializable and Serializably-Completable Executions

Figure 3.4 shows prefixes of various interleaved executions of the code shown in Figure 3.2 for an initial counter value of 0. Nodes are annotated with the values of the counter.

[Figure: a tree of execution prefixes. Each node is labeled with the counter value it produces; each edge is labeled T1 or T2, indicating which thread takes the next step. Bold double circles mark non-serializable nodes, bold single circles mark doomed nodes, and dashed circles mark safe nodes.]

Figure 3.4: Execution prefixes of the code shown in Figure 3.2, for a counter with I = 0. Each node represents a prefix of an execution; a leaf node represents a complete execution.

Bold double circles depict non-serializable nodes: these nodes denote execution prefixes that are not serializable (and thus need to be avoided by proper synchronization). E.g., node #18 is a non-serializable node, since it represents the non-serializable execution T2.Dec(); T1.Inc(); T2.Dec(); T1.Inc() (which produces a final counter value of 1). Bold single circles depict doomed nodes: once we reach a doomed node, there is no way to order the remaining operations in a way that achieves serializability.
E.g., node #6 is a doomed node, since it leads only to non-serializable complete executions (represented by nodes #17 and #18). Finally, dashed circles depict safe nodes, which represent serializably-completable executions. We formalize this notion later, but safe nodes guarantee that the execution can make progress while ensuring serializability. Our goal is to ensure that execution stays within safe nodes.

Even in this simple example, the set of safe nodes, and hence the potential for parallelism, depends on the initial value I of the counter. For I ≥ 1, all nodes are safe and thus no further synchronization is necessary. Our approach enables realizing all available parallelism in this example (for every I ≥ 0), while avoiding the need for any backtracking (i.e., rollbacks).

3.1.2 Serializably-Completable Execution: A Characterization

We now present a characterization of serializably-completable executions based on a generalization of the notion of static right movers [66]. We restrict ourselves to executions of two threads here, but our later formalization considers the general case.

We define an operation o by a thread T to be a dynamic right-mover with respect to thread T′ after an execution p iff for any sequence of operations s executed by T′, if p; T.o; s is feasible, then p; s; T.o is feasible and equivalent to the first execution. Given an execution ξ, we define a relation ≺ on the threads as follows: T ≺ T′ if ξ contains a prefix p; T.o such that o is not a dynamic right-mover with respect to T′ after p. As shown later, if ≺ is acyclic, then ξ is a serializably-completable execution (as long as every sequential execution of the threads terminates).

In Figure 3.4, node #2 represents the execution prefix T1.Inc(), for which T1 ≺ T2. This is because T2.Dec() is a possible suffix executed by T2, and T1.Inc(); T2.Dec() is not equivalent to T2.Dec(); T1.Inc().
On the other hand, node #5 represents the execution prefix T1.Inc(); T2.Dec(), for which T2 ⊀ T1. This execution has one possible suffix executed by T1 (i.e., T1.Inc()), and the execution T1.Inc(); T2.Dec(); T1.Inc() is equivalent to the execution T1.Inc(); T1.Inc(); T2.Dec(). Observe that the relation ≺ corresponding to any non-serializable or doomed node has a cycle, while it is acyclic for all safe nodes.

Note that the use of a dynamic (i.e., state-dependent) right-mover relation is critical to the precision of the above characterization. E.g., Inc and Dec are not static right-movers with respect to each other.

3.1.3 Synchronization Using Foresight

We now show how we exploit the above characterization to ensure that an interleaved execution stays within safe nodes.

Foresight. A key aspect of our approach is to exploit knowledge about the possible future behavior of composite operations for more effective concurrency control. We enrich the interface of the library with operations that allow the composite operations to assert temporal properties of their future behavior. In the Counter example, assume that we add the following operations:

• mayUseAll(): indicates that the transaction may execute arbitrary operations in the future.
• mayUseNone(): indicates that the transaction will execute no further operations.
• mayUseDec(): indicates that the transaction will invoke only Dec operations in the future.
• mayUseInc(): indicates that the transaction will invoke only Inc operations in the future.

The code in Figure 3.2 is annotated with calls to these operations in a straightforward manner. The code shown in Figure 3.3 is conservatively annotated with a call to mayUseAll(), since the interface does not provide a way to indicate that the transaction will only invoke Get and Inc operations.

Utilizing Foresight.
We utilize a suitably modified definition of the dynamic right-mover relation, where we check the right-mover condition only with respect to the set of all sequences of operations that the other threads are allowed to invoke (as per their foresight assertions). To utilize foresight information, a library implementation maintains a conservative over-approximation ≺′ of the ≺ relation. The implementation permits an operation to proceed iff it will not cause the relation ≺′ to become cyclic (and otherwise blocks the operation until it is safe to execute it). This is sufficient to guarantee that the composite operations appear to execute atomically, without any deadlocks.

We have created an ad-hoc implementation of the counter that (implicitly) maintains a conservative over-approximation ≺′ (see Figure 3.5). Our implementation permits all serializably-completable execution prefixes for the example shown in Figure 3.2 (for every I ≥ 0). Our implementation also provides a high degree of parallelism for the example shown in Figure 3.3 — for this example, the loop of T1 can be executed in parallel with the execution of T2.

Fine-grained Foresight. We define a library operation to be a tuple that identifies a procedure as well as the values of the procedure's arguments. For example, removeKey(1) and removeKey(2) are two different operations of a library with the procedure removeKey(int k). In order to distinguish between different operations that are invoked using the same procedure, a mayUse procedure (which is used to pass the foresight information) can have parameters. For example, a library that represents a single Map data structure can have a mayUse procedure mayUseKey(int k), where mayUseKey(1) is defined to refer to all operations on key 1 (including, for example, removeKey(1)), and mayUseKey(2) is defined to refer to all operations on key 2 (including, for example, removeKey(2)).

Special Cases. Our approach generalizes several ideas that have been proposed before.
    Counter c; // an internal counter
    ReadWriteLock zeroLock; ReadWriteLock getLock; // two read-write locks

    void mayUseAll() {
      if (!holdLock(getLock) && !holdLock(zeroLock)) {
        acquireWrite(getLock); acquireWrite(zeroLock);
      }
    }

    void mayUseInc() {
      if (!holdLock(getLock) && !holdLock(zeroLock)) {
        acquireRead(getLock); acquireRead(zeroLock);
      } else if (holdLockInWriteMode(getLock) && holdLockInWriteMode(zeroLock)) {
        downgrade(getLock); downgrade(zeroLock);
      }
    }

    void mayUseDec() {
      if (!holdLock(getLock) && !holdLock(zeroLock))
        acquireRead(getLock);
    }

    void mayUseNone() {
      if (holdLock(getLock)) releaseLock(getLock);
      if (holdLock(zeroLock)) releaseLock(zeroLock);
    }

    void Inc() {
      assert(holdLock(getLock) && holdLock(zeroLock));
      c.Inc();
    }

    void Dec() {
      assert(holdLock(getLock) && !holdLockInReadMode(zeroLock));
      if (c.Get() < numOfDecs && !holdLockInWriteMode(zeroLock))
        acquireWrite(zeroLock);
      c.Dec();
    }

    int Get() {
      assert(holdLockInWriteMode(getLock) && holdLockInWriteMode(zeroLock));
      return c.Get();
    }

Figure 3.5: Pseudo-code of an ad-hoc implementation of a transactional Counter with foresight. This implementation ensures acyclicity of the ≺ relation as long as the mayUse (foresight) information is correct. It is based on an internal atomic Counter and read-write locks. In the code, acquireWrite(l) acquires l in write mode, and acquireRead(l) acquires l in read mode; if lock l is already held by the current thread in write mode, then downgrade(l) downgrades l from write mode to read mode. holdLockInWriteMode(l) returns true iff l is held by the current thread in write mode. holdLockInReadMode(l) returns true iff l is held by the current thread in read mode. holdLock(l) returns true iff either holdLockInWriteMode(l) or holdLockInReadMode(l) returns true. The value of numOfDecs is equal to the number of threads that are currently executing the procedure Dec() (for brevity, we omit the code that updates numOfDecs).

One example, from databases, is locking based on operation commutativity (e.g., see [20, Chapter 3.8]). Such locking provides several lock modes, where each mode corresponds to a set of operations; two threads are allowed to execute in parallel as long as they do not hold lock modes that correspond to non-commutative operations. A simple common instance is a read-write lock [30], in which threads are allowed to simultaneously hold locks in read mode (which corresponds to read-only operations, which are commutative with each other). Interestingly, the common lock-acquire and lock-release operations used for locking can be seen as special cases of the procedures used to pass the foresight information.

Another example is shared-ordered locking [13]. This locking allows threads to simultaneously hold lock modes that correspond to non-commutative operations. Its implementation ensures atomicity by guaranteeing that the actual execution is equivalent to a non-interleaved execution in which the same threads acquire the locks in the same order.

3.1.4 Realizing Foresight-Based Synchronization

What we have described so far is a methodology for foresight-based concurrency control. It prescribes the conditions that must be satisfied by the clients and the library implementations to ensure atomicity of composite operations.

Automating Foresight for Clients. One can argue that adding calls to mayUse operations is an error-prone process. Therefore, in Section 3.4 we show a simple static analysis algorithm that conservatively infers calls to mayUse operations (given a description of the mayUse operations supported by the library). Our experience indicates that this simple algorithm can handle real-life programs.

Library Implementation. We permit creating customized, hand-crafted implementations of the library extension (e.g., Figure 3.5).
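The operation-commutativity locking described above can be sketched as a compatibility check between lock modes; the mode names and table follow the simple read-write instance and are illustrative, not the thesis's implementation:

```python
# Lock modes for commutativity-based locking (read-write-lock instance):
# two threads may hold modes simultaneously iff every operation covered by
# one mode commutes with every operation covered by the other.
COMPATIBLE = {
    ("read", "read"): True,    # read-only operations commute with each other
    ("read", "write"): False,  # a write does not commute with reads
    ("write", "read"): False,
    ("write", "write"): False,
}

def may_run_in_parallel(mode_a, mode_b):
    """True iff threads holding these two modes may execute concurrently."""
    return COMPATIBLE[(mode_a, mode_b)]
```

Richer instances of the same idea add one mode per semantically-defined operation group (e.g., an "increment" mode whose operations all commute with each other) rather than just read and write.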
However, in order to simplify the creation of such libraries, we present a generic technique for implementing a family of libraries with foresight (Section 3.5). The technique is based on a novel variant of the tree locking protocol in which the tree is designed according to semantic properties of the library's operations. We have utilized the technique to implement a general-purpose Java library for Map data structures.

[Figure: the library designer provides a library specification; the programmer writes composite operations; a static analysis inserts mayUse operations into the composite operations, yielding atomic composite operations.]

Figure 3.6: Overview of our approach for foresight-based synchronization.

3.2 Preliminaries

3.2.1 Libraries

A library A exposes a set of procedures PROCS_A. We define a library operation to be a tuple (p, v1, ..., vk) consisting of a procedure name p and a sequence of values (representing the actual values of the procedure's arguments). The set of operations of a library A is denoted by OP_A. Library operations are invoked by client threads (defined later). Let T denote the set of all thread identifiers. An event is a tuple (t, m, r), where t is a thread identifier, m is a library operation, and r is a return value. An event captures both an operation invocation and its return value. A history is defined to be a finite sequence of events. The semantics of a library A is captured by a set of histories H_A — if h ∈ H_A, then we say that h is feasible for A. Histories capture the interaction between a library and its client (a set of threads). Though multiple threads may concurrently invoke operations, this simple formalism suffices in our setting, since we assume the library to be linearizable. An empty history is an empty sequence of events. Let h ∘ h′ denote the concatenation of history h′ to the end of history h. Note that the set H_A captures multiple aspects of the library's specification.
If h is feasible but h ∘ (t, m, r) is not, this could mean one of three different things: r may not be a valid return value in this context; t is not allowed to invoke m in this context; or t is allowed to invoke m in this context, but the library will block and not return until some other event has happened.

A library A is said to be total if for any thread t, operation m ∈ OP_A, and h ∈ H_A, there exists r such that h ∘ (t, m, r) ∈ H_A. A library A is said to be deterministic if the following is satisfied:

    ∀h, t, m, r, r′ : h ∘ (t, m, r) ∈ H_A ∧ h ∘ (t, m, r′) ∈ H_A ⟹ r = r′

3.2.2 Clients

Syntax and Informal Semantics

A client t1 || t2 || · · · || tn consists of the parallel composition of a set of sequential client programs ti (also referred to as threads). Each thread ti is represented by a control-flow graph CFG_ti. The edges of a control-flow graph are annotated with client instructions, shown in Figure 3.7. Conditionals are encoded by annotating control-flow edges with assume statements. A library operation is used by invoking a procedure (via the client instruction "x = p(x1,...,xk)").

    stm ::= skip | x = exp | assume(x) | x = p(x1,...,xk)

Figure 3.7: Client instructions. x, x1, ..., xk stand for local variables, exp stands for an expression over local variables, and p stands for a procedure name.

    Client Instruction      Transition                                          Side Condition
    skip                    ⟨k, v⟩ ⇒_i ⟨k′, v⟩
    x = exp(x1, ..., xn)    ⟨k, v⟩ ⇒_i ⟨k′, v[x ↦ [[exp]](v(x1), ..., v(xn))]⟩
    assume(x)               ⟨k, v⟩ ⇒_i ⟨k′, v⟩                                  v(x) = true
    x = p(x1, ..., xn)      ⟨k, v⟩ ⇒_i^(ti,m,r) ⟨k′, v[x ↦ r]⟩                  m = (p, v(x1), ..., v(xn))

Table 3.1: The relation ⇒_i. exp(x1, ..., xn) stands for an expression over local variables x1, ..., xn. In all cases, we assume that the edge (k, k′) ∈ CFG_ti is annotated with the client instruction.

All variables referenced in a thread ti are private to ti (i.e., they are thread-local variables).
Thus, the threads have no shared state except the (internal) state of the library, which is accessed or modified only via library operations. A CFG node may have several outgoing edges (this enables writing programs with conditional branches and programs with nondeterministic choices). For simplicity, we assume that if a node u has two (or more) outgoing edges, then each of these edges is annotated with either a skip or an assume instruction. Each control-flow graph has two distinguished nodes: an entry site from which the thread starts, and an exit site at which the thread ends. The entry site has no incoming edges, and the exit site has no outgoing edges.

Semantics

The semantics [[ti]] of a single thread ti is defined to be a labelled transition system (Σ_i, ⇒_i) over a set of thread-local states Σ_i. A local state s = ⟨k, v⟩ ∈ Σ_i of a thread ti is a pair: k is the value of ti's program counter (a control-flow graph node), and v is a function from local variables to values. Let EV_i be the set of all events executed by thread ti (i.e., EV_i = {(t, m, r) | t = ti}). The behavior of ti is described by the relation ⇒_i ⊆ Σ_i × (EV_i ∪ {ε}) × Σ_i, defined in Table 3.1. The execution of any instruction other than a library operation invocation is represented by a (thread-local) transition σ ⇒_i σ′. The execution of a library operation invocation is represented by a transition of the form σ ⇒_i^e σ′, where the event e captures both the invocation and the return value. Note that this semantics captures the semantics of the "open" program ti. When ti is "closed" by combining it with a library A, the semantics of the resulting closed program is obtained by combining [[ti]] with the semantics of A, as illustrated later. An initial state is a state in which the thread location is at its entry site. A final state is a state in which the thread location is at its exit site.
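The transition rules of Table 3.1 can be prototyped directly. In the sketch below, the CFG encoding (a dict from node to a list of (instruction, successor) pairs), the instruction tuples, and the library callback are our own illustrative choices; the interpreter deterministically takes the first enabled edge rather than exploring all nondeterministic choices.

```python
def run_thread(tid, edges, entry, exit_node, env, library):
    """Execute one thread of the client language of Figure 3.7 under the
    rules of Table 3.1. Returns (final_env, events), where events is the
    list of (tid, operation, return_value) tuples produced by library calls."""
    k, v, events = entry, dict(env), []
    while k != exit_node:
        for instr, k2 in edges[k]:
            kind = instr[0]
            if kind == "skip":
                k = k2
                break
            elif kind == "assign":               # x = exp(x1, ..., xn)
                _, x, f, args = instr
                v[x] = f(*(v[a] for a in args))
                k = k2
                break
            elif kind == "assume":               # enabled only when v(x) holds
                _, x = instr
                if v[x]:
                    k = k2
                    break
            elif kind == "call":                 # x = p(x1, ..., xn)
                _, x, p, args = instr
                op = (p,) + tuple(v[a] for a in args)
                r = library(tid, op)             # the event is (tid, op, r)
                events.append((tid, op, r))
                v[x] = r
                k = k2
                break
        else:
            raise RuntimeError("thread blocked: no enabled edge at node %r" % k)
    return v, events
```

Running a two-edge CFG that invokes Inc and then Get against a toy counter library yields the expected event sequence and final environment.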
Let s0 be an initial state of ti; a ti-execution is defined to be a sequence of ti-transitions s0 ⇒_i^a1 s1, s1 ⇒_i^a2 s2, · · · , s(k−1) ⇒_i^ak sk such that every aj is either ε or an event. Such an execution is said to be complete if sk is a final state of ti.

The semantics of a client C = t1 || · · · || tn is obtained by simply composing the semantics of the individual threads, permitting any arbitrary interleaving of the executions of the threads. We define the set of transitions of C to be the disjoint union of the sets of transitions of the individual threads. A C-execution is defined to be a sequence ξ of C-transitions such that each ξ | ti is a ti-execution, where ξ | ti is the subsequence of ξ consisting of all ti-transitions.

We now define the semantics of the composition of a client C with a library A. Given a C-execution ξ, we define φ(ξ) to be the sequence of event labels in ξ. The set of (C, A)-executions is defined to be the set of all C-executions ξ such that φ(ξ) ∈ H_A. We abbreviate "(C, A)-execution" to execution if no confusion is likely.

Threads as Transactions

Our goal is to enable threads to execute code fragments containing multiple library operations as atomic transactions (i.e., in isolation). For notational simplicity, we assume that we wish to execute each thread as a single transaction. (Our results can be generalized to the case where each thread may wish to perform a sequence of transactions.) In the sequel, we may think of threads and transactions interchangeably. This motivates the following definitions.

Non-Interleaved and Sequential Executions

An execution ξ is said to be a non-interleaved execution if for every thread t, all t-transitions in ξ appear contiguously. Thus, a non-interleaved execution ξ is of the form ξ1, · · · , ξk, where each ξi represents a different thread's (possibly incomplete) execution.
Such a non-interleaved execution is said to be a sequential execution if for each 1 ≤ i < k, ξi represents a complete thread execution.

Serializability

Two executions ξ and ξ′ are said to be equivalent iff for every thread t, ξ | t = ξ′ | t. An execution ξ is said to be serializable iff it is equivalent to some non-interleaved execution.

Serializably Completable Executions

For any execution ξ, let W(ξ) denote the set of all threads that have at least one transition in ξ. An execution ξ is said to be a complete execution iff ξ | t is complete for every thread t ∈ W(ξ). A client execution ξ is completable if ξ is a prefix of a complete execution ξc such that W(ξ) = W(ξc). An execution ξ is said to be serializably completable iff ξ is a prefix of a complete serializable execution ξc such that W(ξ) = W(ξc). Otherwise, we say that ξ is a doomed execution. An execution may be incompletable due to problems in a client thread (e.g., a non-terminating loop) or due to problems in the library (e.g., blocking by a library procedure leading to deadlocks).

3.3 Foresight-Based Synchronization

We now formalize our goal of extending a base library B into a foresight-based library E that permits clients to execute arbitrary composite operations atomically.

3.3.1 The Problem

Let B be a given library. (Note that B can also be considered to be a specification.) We say that a library E is an extension of B if (i) PROCS_E ⊃ PROCS_B; (ii) {h ↓ B | h ∈ H_E} ⊆ H_B, where h ↓ B is the subsequence of events in h that represent calls of operations in OP_B; and (iii) the procedures in PROCS_E \ PROCS_B do not have a return value.¹ We are interested in extensions where the extension procedures (PROCS_E \ PROCS_B) are used for synchronization to ensure that each thread appears to execute in isolation.
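The projection h ↓ B used in condition (ii) above can be sketched directly: given a history over the extended library, keep only the events that call base-library operations. The procedure names below come from the Counter example and are illustrative.

```python
BASE_PROCS = {"Inc", "Dec", "Get"}            # PROCS_B (Counter example)
MAYUSE_PROCS = {"mayUseAll", "mayUseNone",    # PROCS_E \ PROCS_B
                "mayUseInc", "mayUseDec"}

def project_to_base(history):
    """h ↓ B: the subsequence of events that invoke base-library operations.
    An event is a tuple (thread, (proc, *args), return_value)."""
    return [e for e in history if e[1][0] in BASE_PROCS]
```

Condition (ii) then says: projecting any feasible history of E onto the base operations must yield a feasible history of B.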
Given a client C of the extended library E, let C ↓ B denote the program obtained by replacing every extension-procedure invocation in C by the skip statement. Similarly, for any execution ξ of (C, E), we define ξ ↓ B to be the sequence obtained from ξ by omitting the transitions that represent extension procedures. We say that an execution ξ of (C, E) is B-serializable if ξ ↓ B is a serializable execution of (C ↓ B, B). We say that ξ is B-serializably-completable if ξ ↓ B is a serializably completable execution of (C ↓ B, B). We say that E is a transactional extension of B if for any (correct) client C of E, every (C, E)-execution is B-serializably-completable. Our goal is to build transactional extensions of a given library.

¹ It means that they always return the value "void"; the value "void" is never used in expressions and never passed as a procedure argument.

3.3.2 The Client Protocol

In our approach, the extension procedures are used by transactions (threads) to provide information to the library about the future operations they may perform. We refer to the procedures in PROCS_E \ PROCS_B as mayUse procedures, and to the operations in MU_E = OP_E \ OP_B as mayUse operations. We now formalize the client protocol, which captures the preconditions the client must satisfy — namely, that the foresight information provided via the mayUse operations must be correct.

The semantics of mayUse operations is specified by a function may_E : MU_E → P(OP_B) that maps every mayUse operation to a set of base-library operations. In Section 3.4 we show simple procedure annotations that can be used to define the set MU_E and the function may_E.

The mayUse operations define an intention function I_E : H_E × T → P(OP_B), where I_E(h, t) represents the set of (base-library) operations that thread t is allowed to invoke after the execution of h. For every thread t ∈ T and history h ∈ H_E, the value of I_E(h, t) is defined as follows.
Let M denote the set of all mayUse operations invoked by t in h. (i) If M is empty, then I_E(h, t) = OP_B. (ii) If M is non-empty, then I_E(h, t) = ⋂_{m∈M} may_E(m). We extend the notation and define I_E(h, T), for any set of threads T, to be ⋃_{t∈T} I_E(h, t).

Note that the intention set I_E(h, t) can only shrink as the execution proceeds. The mayUse operations cannot be used to increase the intention set of a thread.

Definition 3.1 (Client Protocol) Let h be a history of library E. We say that h follows the client protocol if for any prefix h′ ∘ (t, m, r) of h, we have m ∈ I_E(h′, t) ∪ MU_E. We say that an execution ξ follows the client protocol if φ(ξ) follows the client protocol.

3.3.3 Dynamic Right Movers

We now consider how the library extension can exploit the foresight information provided by the client to ensure that the interleaved execution of multiple threads is restricted to safe nodes (as described in Section 3.1). First, we formalize the notion of a dynamic right mover.

Given a history h of a library A, we define the set E_A[h] to be {h′ | h ∘ h′ ∈ H_A}. (Note that if h is not feasible for A, then E_A[h] = ∅.) Note that if E_A[h1] = E_A[h2], then the concrete library states produced by h1 and h2 cannot be distinguished by any client (using any sequence of operations). Dually, if the concrete states produced by histories h1 and h2 are equal, then E_A[h1] = E_A[h2].

Definition 3.2 (Dynamic Right Movers) Given a library A, a history h1 is said to be a dynamic right mover with respect to a history h2 in the context of a history h, denoted h : h1 ▷_A h2, iff E_A[h ∘ h1 ∘ h2] ⊆ E_A[h ∘ h2 ∘ h1]. An operation m is said to be a dynamic right mover with respect to a set of operations Ms in the context of a history h, denoted h : m ▷_A Ms, iff for any event (t, m, r) and any history hs consisting of operations in Ms, we have h : (t, m, r) ▷_A hs.
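For a small deterministic library such as the Counter, Definition 3.2 can be checked by brute force. Because the Counter is deterministic, a history is summarized by the counter value it produces (as noted above, equal concrete states imply equal suffix sets), so the sketch below compares m;s against s;m for all suffixes s over Ms up to a bound. The function names and the bound are our own illustrative choices.

```python
from itertools import product

def apply_op(value, op):
    """Counter transition function: returns (new_value, return_value)."""
    if op == "Inc":
        return value + 1, None
    if op == "Dec":
        return (value - 1 if value > 0 else value), None
    if op == "Get":
        return value, value

def run(value, ops):
    """Run a sequence of operations; returns (final_value, return_values)."""
    rets = []
    for op in ops:
        value, r = apply_op(value, op)
        rets.append(r)
    return value, rets

def is_dynamic_right_mover(value, m, ms, bound=4):
    """Checks h : m ▷ Ms for the Counter, with the context history h
    summarized by the counter value it produces. m;s and s;m must produce
    the same return values and reach the same state, for every suffix s
    over ms up to the given length bound."""
    for n in range(bound + 1):
        for s in product(ms, repeat=n):
            v1, r1 = run(value, (m,) + s)   # m first, then the suffix
            v2, r2 = run(value, s + (m,))   # suffix first, then m
            # equivalence: m's return value, the suffix's return values,
            # and the resulting state must all agree
            if v1 != v2 or r1[0] != r2[-1] or r1[1:] != r2[:-1]:
                return False
    return True
```

This reproduces Examples 3.3.1 and 3.3.2 below: Dec is a dynamic right mover with respect to {Inc} exactly when the counter value is positive, while Inc is never a dynamic right mover with respect to {Dec}.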
Properties of Dynamic Right Movers

The following example shows that an operation m can be a dynamic right mover with respect to a set M after some histories, but not after others.

Example 3.3.1 Consider the Counter described in Section 3.1. Let h_p be a history that ends with a counter value of p > 0. The operation Dec is a dynamic right mover with respect to the set {Inc} in the context of h_p, since for every n the histories h_p ∘ (t, Dec, r) ∘ (t1, Inc, r1), . . . , (tn, Inc, rn) and h_p ∘ (t1, Inc, r1), . . . , (tn, Inc, rn) ∘ (t, Dec, r) have the same set of suffixes (since the counter value is p − 1 + n after both histories).

Let h_0 be a history that ends with a counter value of 0. The operation Dec is not a dynamic right mover with respect to the set {Inc} in the context of h_0, since after the history h_0 ∘ (t, Dec, r) ∘ (t′, Inc, r′) the counter's value is 1, whereas after h_0 ∘ (t′, Inc, r′) ∘ (t, Dec, r) the counter's value is 0. Thus, (t, Get, 1) is a feasible suffix after the first history but not after the second.

The following example shows that the dynamic right-mover property is not symmetric.

Example 3.3.2 Let h_i be a history that ends with a counter value of i > 0. The operation Inc is not a dynamic right mover with respect to the set {Dec} in the context of h_i, since after the history h_i ∘ (t, Inc, r) ∘ (t1, Dec, r1), . . . , (t(i+1), Dec, r(i+1)) the Counter's value is 0, whereas after h_i ∘ (t1, Dec, r1), . . . , (t(i+1), Dec, r(i+1)) ∘ (t, Inc, r) the Counter's value is 1.

One important aspect of the definition of dynamic right movers is the following: it is possible to have h : m ▷_A {m1} and h : m ▷_A {m2}, but not h : m ▷_A {m1, m2}.

Static Right Movers and Commutativity

The notion of dynamic right mover can be used to define static right movers and commutativity.
We say that an operation m is a static right mover with respect to an operation m′ if every feasible history h satisfies h : m ▷A {m′}. We say that m and m′ are statically commutative if m is a static right mover with respect to m′ and vice versa.

3.3.4 Serializability

It follows from the preceding discussion that an incomplete history h may already reflect some execution-order constraints among the threads that must be satisfied by any other history that is equivalent to h. These execution-order constraints can be captured as a partial ordering on thread-ids.

Definition 3.3 (Safe Ordering) Let h be a history of E, and let Th be the set of threads that appear in h. A partial ordering ⊑ ⊆ Th × Th is said to be safe for h iff for any prefix h′ ◦ (t, m, r) of h, where m ∈ OPB, we have h′ ↓ B : m ▷B IE(h′, P), where P = {t′ ∈ Th | t ⋢ t′}.

A safe ordering represents a conservative over-approximation of the execution-order constraints among thread-ids (required for serializability). Note that in the above definition, the right-mover property is checked only with respect to the base library B.

Example 3.3.3 Assume that the Counter is initialized with a value I > 0. Consider the history (return values omitted for brevity): h = (t, mayUseDec), (t′, mayUseInc), (t, Dec), (t′, Inc). If ⊑ is a safe partial order for h, then t′ ⊑ t, because after the third event Inc is not a dynamic right mover with respect to the operations allowed for t (i.e., {Dec}). Dually, the total order defined by t′ ⊑′ t is safe for h, since after the second event the operation Dec is a dynamic right mover with respect to the operations allowed for t′ (i.e., {Inc}), because the Counter's value is larger than 0.

Definition 3.4 (Safe Extension) We say that library E is a safe extension of B if for every h ∈ HE that follows the client protocol there exists a partial ordering ⊑h on threads that is safe for h.
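The Counter reasoning above can be replayed concretely. The sketch below is our own illustration (Dec is modeled as a no-op at zero, matching the behavior implied by Examples 3.3.1 and 3.3.2): it compares the final states of the two interleavings, and brute-forces static commutativity over a bounded range of counter values.

```java
import java.util.function.IntUnaryOperator;

// Illustration of Examples 3.3.1-3.3.2 (our own sketch): the Counter's Dec
// is modeled as a no-op at zero, so Dec followed by Inc can differ from
// Inc followed by Dec when the counter is 0.
public class CounterMovers {
    static final IntUnaryOperator INC = v -> v + 1;
    static final IntUnaryOperator DEC = v -> Math.max(v - 1, 0);

    static int run(int start, IntUnaryOperator... ops) {
        int v = start;
        for (IntUnaryOperator op : ops) v = op.applyAsInt(v);
        return v;
    }

    // Brute-force check of static commutativity over counter values 0..bound.
    // (A sound check would quantify over all feasible histories; the bounded
    // scan is enough to expose the counter-example at value 0.)
    static boolean commuteUpTo(IntUnaryOperator a, IntUnaryOperator b, int bound) {
        for (int v = 0; v <= bound; v++)
            if (a.applyAsInt(b.applyAsInt(v)) != b.applyAsInt(a.applyAsInt(v)))
                return false;
        return true;
    }

    public static void main(String[] args) {
        // p = 2 > 0: Dec right-moves across Inc; both orders end at p - 1 + n.
        System.out.println(run(2, DEC, INC, INC) == run(2, INC, INC, DEC)); // true
        // p = 0: the two orders disagree (1 vs. 0), so Dec is not a right mover here.
        System.out.println(run(0, DEC, INC) + " vs " + run(0, INC, DEC));
        // Inc and Dec are not statically commutative (they disagree at 0).
        System.out.println(commuteUpTo(INC, DEC, 10)); // false
    }
}
```

The failed `commuteUpTo(INC, DEC, …)` check is exactly why a semantic-locking library (Definition 3.15) cannot run the Section 3.1 threads concurrently.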
The above definition prescribes the synchronization (specifically, blocking) that a safe extension must enforce. In particular, assume that h is a feasible history allowed by the library. If the history h ◦ (t, m, r) has no safe partial ordering, then the library must block the call to m by t rather than return the value r.

Theorem 3.5 (Serializability) Let E be a safe extension of a library B. Let C be a client of E. Any execution ξ of (C, E) that follows the client protocol is B-serializable.

Proof Let ⊑ be a safe partial ordering for φ(ξ). Let tz1, tz2, · · · , tzp denote any total ordering of the threads that execute in ξ that is consistent with ⊑, i.e., tzi ⊑ tzj ⇒ i ≤ j. Let ξb be ξ ↓ B. Let ξi denote ξb | tzi. We can inductively show that ξni = ξ1, ξ2, · · · , ξp is a valid non-interleaved execution of (C ↓ B, B), using the right-mover property.

3.3.5 B-Serializable-Completability

We saw in Section 3.1 and Figure 3.4 that some serializable (incomplete) executions may be doomed, i.e., there may be no way of completing the execution in a serializable way. Safe extensions, however, ensure that all executions avoid doomed nodes and are serializably completable. Still, we cannot guarantee completability if a client thread contains a non-terminating loop or violates the client protocol. This leads us to the following conditional theorem.

Theorem 3.6 (B-Serializable-Completability) Let B be a total and deterministic library. Let E be a safe extension of B. Let C be a client of E. If every sequential execution of (C, E) follows the client protocol and is completable, then:
1. every execution of (C, E) is B-serializably-completable;
2. every execution of (C, E) follows the client protocol.

The precondition in Theorem 3.6 is worth noting. We require client threads to follow the client protocol and terminate. However, it is sufficient to check that clients satisfy these requirements in sequential executions.
This simplifies reasoning about the clients.

Proof for Theorem 3.6 We prove the theorem by using the following lemmas.

Lemma 3.7 Let B be a deterministic library. Let E be an extension of B. Let C be a client of E such that every sequential execution of (C, E) is completable. If π is a sequential execution of C such that π ↓ B is an execution of (C ↓ B, B), then π is an execution of (C, E) (i.e., π is an execution of the composition of client C with library E).

Proof We use induction on the length of the executions. Assume that π = π′ ◦ α where α is a single transition. From the induction hypothesis, π′ is an execution of (C, E). Hence, φ(π′) ∈ HE. If α does not execute an event, then φ(π) ∈ HE, and therefore π is an execution of (C, E). Otherwise, assume that α executes the event (t, m, r). There exists a sequential execution of (C, E) in which, after π′, thread t invokes operation m (this operation is not blocked, since all sequential executions of (C, E) are completable). Hence, there exists r′ such that φ(π′) ◦ (t, m, r′) ∈ HE. We consider the following cases.

Case 1: m is a mayUse operation. All mayUse operations always return "void", hence r = r′. Therefore π is an execution of (C, E).

Case 2: m is not a mayUse operation. In this case, m ∈ OPB. We have φ(π′ ↓ B) ◦ (t, m, r) ∈ HB (because π ↓ B is an execution of (C ↓ B, B)). Since B is deterministic, r = r′. Therefore π is an execution of (C, E).

Lemma 3.8 Let B be a total and deterministic library. Let E be a safe extension of B. Let C be a client of E such that every sequential execution of (C, E) follows the client protocol and is completable. Let π be an incomplete execution such that π ↓ B is an execution of (C ↓ B, B), and let ⊑ be a safe total order for φ(π).
Then there exists a sequence of transitions π′ such that ππ′ ↓ B is a complete execution of (C ↓ B, B), ⊑ is a safe total order for φ(ππ′), and each thread that appears in π′ also appears in π.

Proof Intuitively, we wish to show that π ↓ B can be completed (in (C ↓ B, B)) such that ⊑ remains a safe order for the execution. We write tz1, tz2, · · · , tzn to denote the threads in π, where i ≤ j ⇒ tzi ⊑ tzj. Let πi denote π | tzi. From Definition 3.3, we know that π1, · · · , πn ↓ B is an execution of (C ↓ B, B) such that EB[φ(π ↓ B)] ⊆ EB[φ(π1, · · · , πn ↓ B)]. Consider the maximal k such that π1, · · · , πk is a sequential execution. We show below how we can let tzk complete its execution; applying the same argument inductively gives us the lemma.

From Lemma 3.7, we know that π1, · · · , πk is a sequential execution of (C, E). All sequential executions of (C, E) are completable, hence there exists α such that α is a transition of tzk and π1, · · · , πk, α is a sequential execution of (C, E). We want to show that πα ↓ B is an execution of (C ↓ B, B) and that ⊑ is safe for φ(πα). We assume that α invokes an event (tzk, m, r) where m ∈ OPB (otherwise, the proof is trivial). Thread tzk invokes operation m after π1, · · · , πk, therefore it invokes operation m after π (it has the same local state after both executions). m is in the intention set of tzk after π1, · · · , πk, therefore m is in the intention set of tzk after π. Since B is total, φ(π ↓ B) ◦ (tzk, m, r′) ∈ HB. The order ⊑ is safe for φ(π) ◦ (tzk, m, r′), therefore φ(π1, · · · , πk ↓ B) ◦ (tzk, m, r′) ◦ φ(πk+1, · · · , πn ↓ B) ∈ HB. Since π1, · · · , πk, α ↓ B is an execution of (C ↓ B, B), we know that φ(π1, · · · , πk ↓ B) ◦ (tzk, m, r) ∈ HB. Since B is deterministic, r = r′. Therefore, πα ↓ B is an execution of (C ↓ B, B) and ⊑ is safe for φ(πα).

Lemma 3.9 Let B be a total and deterministic library. Let E be a safe extension of B.
Let C be a client of E such that every sequential execution of (C, E) follows the client protocol and is completable. If π is an execution of (C, E) that follows the client protocol, then π is B-serializably-completable.

Proof Since E is a safe extension and φ(π) follows the client protocol, there exists a total order ⊑ that is safe for φ(π). From Lemma 3.8, there exists a sequence of transitions π′ such that ππ′ ↓ B is a complete execution of (C ↓ B, B), ⊑ is a safe order for φ(ππ′), and each thread that appears in π′ also appears in π. Hence ππ′ ↓ B is a complete serializable execution of (C ↓ B, B).

Lemma 3.10 Let B be a total and deterministic library. Let E be a safe extension of B. Let C be a client of E such that every sequential execution of (C, E) follows the client protocol and is completable. Then every execution of (C, E) follows the client protocol.

Proof We use induction on the length of the executions. Assume that π = π′ ◦ α where α is a single transition. From the induction hypothesis, π′ follows the client protocol. If α does not invoke a base operation, then π follows the client protocol. Otherwise, assume that α invokes (t, m, r). From Lemma 3.9, there exists π″ such that π′π″ ↓ B is a complete serializable execution. Let πs be a sequential execution such that πs ↓ B is equivalent to π′π″ ↓ B. From Lemma 3.7, πs is an execution of (C, E). Hence, πs follows the client protocol. Hence, m is in the intention set of t after π′ | t, and therefore m is in the intention set of t after π′. Therefore, π follows the client protocol.

Theorem 3.6 follows from Lemma 3.9 and Lemma 3.10.

3.3.6 E-Completability

The preceding theorem about B-Serializable-Completability has a subtle point: it shows that it is possible to complete any execution of (C, E) in a serializable fashion in B. The extended library E, however, could choose to block operations unnecessarily and prevent progress.
This is undesirable. We now formulate a progress condition that the extended library must satisfy. In the sequel we assume that every terminated thread has an empty intention set, i.e., if thread t has terminated in execution ξ, then IE(φ(ξ), t) = ∅. This can be realized (for example) by ensuring that every thread always executes a mayUse operation m such that mayE(m) = {} before it terminates (such an operation can be seen as an "end-transaction" operation).

Given a history h and a thread t, we say that t is incomplete after h iff IE(h, t) ≠ ∅. We say that a history h is incomplete if there exists some incomplete thread after h. We say that a thread t is enabled after history h if for all events (t, m, r) such that h ◦ (t, m, r) satisfies the client protocol and h ◦ (t, m, r) ↓ B ∈ HB, we have h ◦ (t, m, r) ∈ HE. Note that this essentially means that E will not block t from performing any legal operation.

Definition 3.11 (Progress Condition) We say that a library E satisfies the progress condition iff for every history h ∈ HE that follows the client protocol the following conditions hold:
• If h is incomplete, then at least one of the incomplete threads is enabled after h.
• If h is complete, then every thread that does not appear in h is enabled after h.

Theorem 3.12 (E-Completability) Let B be a total and deterministic library. Let E be a safe extension of B that satisfies the progress condition. Let C be a client of E. If every sequential execution of (C, E) follows the client protocol and is completable, then every execution of (C, E) is completable and serializable.

Proof Follows from Theorem 3.6 and the progress condition.

3.3.7 Special Cases

In this subsection we describe two special cases of safe extensions.

Eager-Ordering Library

Our notion of safe ordering permits ⊑ to be a partial order.
In effect, this allows the system to determine the execution order between transactions lazily, only when forced to do so (e.g., when one of the transactions executes a non-right-mover operation). One special case of this approach is to use a total order on threads, eagerly ordering threads in the order in which they execute their first operations. The idea of shared-ordered locking [13] in databases is similar. This approach guarantees strict serializability [75], which preserves the runtime order of the threads.

Definition 3.13 Given a history h, we define an order ≤h of the threads in h such that t ≤h t′ iff t = t′ or the first event of t precedes the first event of t′ (in h).

Definition 3.14 (Eager-Ordering Library) We say that library E is eager-ordering if for every h ∈ HE that follows the client protocol, ≤h is safe for h.

Library with Semantic-Locking

A special case of an eager-ordering library is a library with semantic locking. (This special case appears in the database literature; e.g., see [20, chapter 3.8] and [86, chapters 6–7].) The idea here is to ensure that two threads are allowed to execute concurrently only if any operations they can invoke commute with each other. This is achieved by treating each mayUse operation as a lock acquisition (on the set of operations it denotes). A mayUse operation m by a thread t, after a history h, will be blocked if h contains a thread t′ ≠ t such that some operation in mayE(m) does not statically commute with some operation in IE(h, t′).

Definition 3.15 (Library with Semantic-Locking) We say that E is a library with semantic locking if for every h ∈ HE that follows the client protocol: if t, t′ are two different threads that appear in h, and m ∈ IE(h, t) and m′ ∈ IE(h, t′), then m and m′ are statically commutative.

Note that, for the examples shown in Section 3.1, such a library will not allow the threads to run concurrently.
This is because the operations Inc and Dec are not statically commutative.

3.4 Automatic Foresight for Clients

In this section, we present our static analysis to infer calls (in the client code) to the API used to pass the foresight information. The static analysis works for the general case covered by our formalism, and does not depend on the specific implementation of the extended library. We assume that we are given the interface of a library E that extends a base library B, along with a specification of the semantic function mayE using a simple annotation language. We use a static algorithm for analyzing a client C of B and instrumenting it by inserting calls to mayUse operations that guarantee that (all sequential executions of) C correctly follow the client protocol.

Example Library. In this section, we use a library of Maps as an example. The base procedures of the library are shown in Figure 3.8 (their semantics will be described later). The mayUse procedures are shown in Figure 3.9; their semantic function is specified using the annotations shown in this figure (the annotation language is described in Section 3.4.1). Figure 3.10 shows an example of a code section with calls to the base library procedures. The calls to mayUse procedures shown in bold are inferred by our algorithm (described in Section 3.4.2).

int createNewMap();
int put(int mapId, int k, int v);
int get(int mapId, int k);
int remove(int mapId, int k);
bool isEmpty(int mapId);
int size(int mapId);

Figure 3.8: Base procedures of the example Maps library.

void mayUseAll(); @{(createNewMap),(put,*,*,*),(get,*,*),(remove,*,*),(isEmpty,*),(size,*)}
void mayUseMap(int m); @{(put,m,*,*),(get,m,*),(remove,m,*),(isEmpty,m),(size,m)}
void mayUseKey(int m, int k); @{(put,m,k,*),(get,m,k),(remove,m,k)}
void mayUseNone(); @{}

Figure 3.9: Annotated mayUse procedures of the example library.
mayUseMap(m);
if (get(m,x) == get(m,y)) {
  mayUseKey(m,x);
  remove(m,x);
  mayUseNone();
} else {
  remove(m,x);
  mayUseKey(m,y);
  remove(m,y);
  mayUseNone();
}

Figure 3.10: Code section with inferred calls to mayUse procedures.

3.4.1 Annotation Language

The semantic function mayE is specified using annotations. These annotations are described by symbolic operations and symbolic sets. Let PVar be a set of variables, and let ∗ be a symbol such that ∗ ∉ PVar. A symbolic operation (over PVar) is a tuple of the form (p, a1, · · · , an), where p is a base library procedure name and each ai ∈ PVar ∪ {∗}. A symbolic set is a set of symbolic operations.

Example 3.4.1 Here are four symbolic sets for the example library (we assume that m, k ∈ PVar):
SY1 = {(createNewMap), (put, ∗, ∗, ∗), (get, ∗, ∗), (remove, ∗, ∗), (isEmpty, ∗), (size, ∗)}
SY2 = {(put, m, ∗, ∗), (get, m, ∗), (remove, m, ∗), (isEmpty, m), (size, m)}
SY3 = {(put, m, k, ∗), (get, m, k), (remove, m, k)}
SY4 = {}

Let Value be the set of possible values (of parameters of base library procedures). Given a function asn : PVar → Value and a symbolic set SY, we define the set of operations SY(asn) to be
∪_{(p,a1,...,an)∈SY} {(p, v1, . . . , vn) | ∀i. (ai ≠ ∗) ⇒ (vi = asn(ai))}.

Example 3.4.2 Consider the symbolic sets from Example 3.4.1. The set SY3(asn) contains all operations with the procedures put, get, and remove in which the first parameter is equal to asn(m) and the second parameter is equal to asn(k). The sets SY1(asn) and SY4(asn) do not depend on asn: SY1(asn) contains all operations with the procedures createNewMap, put, get, remove, isEmpty and size, and SY4(asn) is empty.

The Annotations. Every mayUse procedure p is annotated with a symbolic set over the set of formal parameters of p.
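Because the wildcard positions of a symbolic set range over all of Value, the instantiation SY(asn) is easiest to realize as a membership test rather than an enumeration: a concrete operation (p, v1, . . . , vn) is in SY(asn) iff some symbolic operation in SY has the same procedure name and agrees with it, under asn, on every non-∗ argument. A minimal sketch (our own encoding; procedure names as strings, values as integers):

```java
import java.util.List;
import java.util.Map;

// Sketch of the instantiation SY(asn) from the definition above (our own
// encoding, not the thesis implementation). A symbolic operation is a
// procedure name plus arguments that are either variable names or "*".
public class SymbolicSets {
    record SymOp(String proc, List<String> args) {}

    // Does the concrete operation (proc, vals) match one symbolic operation?
    static boolean matches(SymOp sym, String proc, List<Integer> vals, Map<String, Integer> asn) {
        if (!sym.proc().equals(proc) || sym.args().size() != vals.size()) return false;
        for (int i = 0; i < vals.size(); i++) {
            String a = sym.args().get(i);
            if (!a.equals("*") && !asn.get(a).equals(vals.get(i))) return false;
        }
        return true;
    }

    // Membership test: (proc, vals) ∈ SY(asn).
    static boolean inSY(List<SymOp> sy, String proc, List<Integer> vals, Map<String, Integer> asn) {
        return sy.stream().anyMatch(s -> matches(s, proc, vals, asn));
    }

    public static void main(String[] args) {
        // SY3 = {(put,m,k,*),(get,m,k),(remove,m,k)} with asn = [m=0, k=7],
        // i.e., the annotation of mayUseKey(0,7) from Example 3.4.3.
        List<SymOp> sy3 = List.of(
            new SymOp("put", List.of("m", "k", "*")),
            new SymOp("get", List.of("m", "k")),
            new SymOp("remove", List.of("m", "k")));
        Map<String, Integer> asn = Map.of("m", 0, "k", 7);
        System.out.println(inSY(sy3, "put", List.of(0, 7, 42), asn)); // wildcard value: true
        System.out.println(inSY(sy3, "get", List.of(0, 8), asn));     // wrong key: false
    }
}
```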
For example, in Figure 3.9, the procedure mayUseAll is annotated with SY1, mayUseMap is annotated with SY2, mayUseKey is annotated with SY3, and mayUseNone is annotated with SY4. Let p be a mayUse procedure with parameters x1, . . . , xn which is annotated with SYp. An invocation of p with the values v1, . . . , vn is a mayUse operation that refers to the set defined by SYp and the function that maps xi to vi (for every 1 ≤ i ≤ n).

Example 3.4.3 In Figure 3.9, the procedure mayUseAll() is annotated with SY1, hence its invocation is a mayUse operation that refers to all the base library operations. The procedure mayUseKey(int m, int k) is annotated with SY3, hence mayUseKey(0,7) refers to all operations with the procedures put, get, and remove in which the first parameter is 0 and the second parameter is 7.

3.4.2 Inferring Calls to mayUse Procedures

We use a simple abstract interpretation algorithm ([73]) to infer calls to mayUse procedures. Given a client C of B and annotated mayUse procedures, our algorithm conservatively infers calls to the mayUse procedures such that the client protocol is satisfied in all sequential executions of C.

Assumptions. The algorithm assumes that there exists a mayUse procedure (with no parameters) that refers to the set of all base library operations (the client protocol can always be enforced by adding a call to this procedure at the beginning of each code section). It also assumes that there exists a mayUse procedure (with no parameters) that refers to an empty set; the algorithm adds a call to this procedure at the end of each code section.

Correct Symbolic Sets. Every thread-local state σ defines a function asnσ : PVar → Value such that asnσ(x) = v iff v is the value of x in state σ.
We say that a symbolic set SY is correct for a program point ℓ if every sequential execution ξ, thread t, and local state σ in which thread t is at location ℓ satisfy the following: if σ is a state of ξ | t, then all operations invoked by thread t after σ are in SY(asnσ).

The Relation ⊇. We say that a symbolic set SY is a superset of a symbolic set SY′, denoted SY ⊇ SY′, if every total function asn : PVar → Value satisfies SY(asn) ⊇ SY′(asn). For example, the symbolic sets defined in Example 3.4.1 satisfy SY1 ⊇ SY2 ⊇ SY3 ⊇ SY4.

Computing Correct Symbolic Sets. Our algorithm uses a simple static analysis that computes a correct symbolic set for every program point. We phrase this analysis as a simple backwards abstract interpretation ([73]). Our analysis computes symbolic sets in which each procedure has at most 2 symbolic operations (i.e., in every symbolic set there are no 3 symbolic operations with the same procedure name).² We write US to denote the set of such symbolic sets. We use the set US and the relation ⊇ to define the lattice L = ⟨US, ⊇⟩. Note that its minimum element is the empty set; its maximum element is the set in which every procedure p has the symbolic operation of the form (p, ∗, · · · , ∗) (where the number of ∗ instances equals the number of p's arguments).

We use a function R : P(US) × PVar → US which is defined as follows: for every S ⊆ US and x ∈ PVar, the value of R(S, x) is obtained from S by replacing every instance of x with ∗. For example, the value of R({{(get, ∗, x), (put, ∗, x, y)}, {(contains, ∗, z)}}, x) is {{(get, ∗, ∗), (put, ∗, ∗, y)}, {(contains, ∗, z)}}.

² Any constant number can be used. In our implementation, we have used 2.

[[skip]](S) = S
[[x = exp]](S) = R(S, x)
[[assume(x)]](S) = S
[[x = p(x1, . . . , xn)]](S) = R(S ∪ {p(x1, . . . , xn)}, x)

Figure 3.11: Abstract transformers for computing correct symbolic sets.
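The transformers of Figure 3.11 can be run backwards over a straight-line code section. The following sketch is our own simplified encoding (no joins at control-flow merges and no bound on the number of symbolic operations per procedure, unlike the thesis analysis): starting from the empty set at the exit, each call `x = p(x1,...,xn)` adds the symbolic operation (p, x1, . . . , xn) and then applies R to erase the assigned variable x.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Backward computation of correct symbolic sets over straight-line code
// (our own sketch of Figure 3.11). A symbolic operation is a procedure
// name plus an argument list whose entries are variable names or "*".
public class ForesightInference {
    record SymOp(String proc, List<String> args) {}

    // R(S, x): replace every occurrence of variable x with "*".
    static Set<SymOp> R(Set<SymOp> s, String x) {
        Set<SymOp> out = new LinkedHashSet<>();
        for (SymOp op : s) {
            List<String> args = new ArrayList<>();
            for (String a : op.args()) args.add(a.equals(x) ? "*" : a);
            out.add(new SymOp(op.proc(), args));
        }
        return out;
    }

    // Transformer for "x = p(x1,...,xn)": R(S ∪ {(p, x1,...,xn)}, x).
    static Set<SymOp> call(Set<SymOp> s, String x, String p, List<String> args) {
        Set<SymOp> u = new LinkedHashSet<>(s);
        u.add(new SymOp(p, args));
        return R(u, x);
    }

    public static void main(String[] args) {
        // Analyze backwards:  r = get(m, x);  s = put(m, x, y);
        Set<SymOp> atExit = new LinkedHashSet<>();                      // exit: {}
        Set<SymOp> beforePut = call(atExit, "s", "put", List.of("m", "x", "y"));
        Set<SymOp> beforeGet = call(beforePut, "r", "get", List.of("m", "x"));
        System.out.println(beforeGet); // {(put,m,x,y), (get,m,x)}
    }
}
```

At the entry point the computed set {(put,m,x,y), (get,m,x)} would then be rounded up, via ⊇, to the smallest symbolic set that corresponds to an available mayUse procedure.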
For each program point we compute an element of L by using the abstract transformers shown in Figure 3.11.

mayUse Invocations. Every possible client instruction (as defined in Figure 3.7) that invokes a mayUse procedure corresponds to a symbolic set (as described in Section 3.4.1). For example, according to Figure 3.9, the instruction mayUseKey(x,y) corresponds to the symbolic set {(put, x, y, ∗), (get, x, y), (remove, x, y)}. For every program label l with a computed symbolic set SYl, we find a minimal symbolic set SY′l ⊇ SYl that corresponds to a client instruction that invokes a mayUse procedure, and add this instruction to the code at l. The assumption that the library has a mayUse procedure that corresponds to all operations ensures that such an instruction exists for every program label. The added instructions guarantee that the transformed code sections follow the client protocol (in all possible sequential executions). The assumption that the library has a mayUse procedure that corresponds to an empty set ensures that the algorithm adds a call to this procedure at the end (i.e., exit site) of each code section.

Identifying Redundant mayUse Operations. The algorithm identifies and removes redundant mayUse operations by inspecting the CFG of the code sections. The algorithm repeatedly applies the following heuristics:
• If the CFG has two nodes n1, n2 with mayUse operations, every path from the entry site to n2 contains n1, and no path from n1 to n2 contains a call to a procedure of the base library, then the mayUse operation in n1 is redundant.
• If the CFG has two nodes n1, n2 with an identical call to a mayUse operation p(x1, . . . , xn), every path from the entry site to n2 contains n1, and the variables x1, . . . , xn are not assigned between n1 and n2, then the mayUse operation in n2 is redundant.
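On straight-line code the first heuristic degenerates to a simple scan: a mayUse call is redundant if the very next instruction is another mayUse call, since then no base-library call can execute between them. A minimal sketch (our own encoding; instructions are tagged strings rather than a real CFG, and the thesis algorithm additionally uses dominance checks):

```java
import java.util.ArrayList;
import java.util.List;

// Straight-line version of the first redundancy heuristic (our own sketch).
// An instruction prefixed "mayUse:" is a mayUse call; any other instruction
// is treated as a base-library call. A mayUse call immediately followed by
// another mayUse call is overwritten before any base call can use it.
public class RedundantMayUse {
    static List<String> removeRedundant(List<String> code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            boolean isMayUse = code.get(i).startsWith("mayUse:");
            boolean nextIsMayUse = i + 1 < code.size() && code.get(i + 1).startsWith("mayUse:");
            if (isMayUse && nextIsMayUse) continue; // redundant: drop it
            out.add(code.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of(
            "mayUse:All", "mayUse:Map(m)",   // mayUse:All is immediately superseded
            "get(m,x)", "mayUse:Key(m,x)", "remove(m,x)", "mayUse:None");
        System.out.println(removeRedundant(code));
    }
}
```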
3.4.3 Implementation for Java Programs

We have implemented the algorithm for Java programs in which the relevant code sections (those that should appear to execute atomically) are annotated as atomic sections (an example is shown in Figure 3.12).

Unprotected Accesses and Invocations. Our implementation may fail to enforce atomicity of the Java code sections because: (i) Java code can access shared memory which is not part of the extended library (e.g., by accessing a global variable); (ii) our simple implementation does not analyze the procedures which are invoked by the annotated code sections. The implementation reports warnings about suspected accesses (to shared memory) and about invocations of procedures that do not belong to the extended library. These reports should be handled by a programmer or by a static algorithm (e.g., purity analysis [82]) that verifies that they will not be used for inter-thread communication (in our formal model, they can be seen as thread-local operations). In order to avoid superfluous warnings, the implementation uses a list of common pure procedures from the Java standard libraries (part of its configuration); these procedures are not reported. For example, the procedure invoked in Figure 3.12 at line 7 (the constructor of Integer) will not be reported, because it is known to be pure.

Java Exceptions. The implementation of our algorithm has a mode in which it considers Java exceptions. This is realized by considering a CFG that contains edges which are used when exceptions are thrown (our implementation obtains such a CFG by using utilities from [1]). For simplicity, in most of this work we ignore exceptions.
In the experiments described in Section 3.6, we have considered exceptions, because we cannot assume an absence of exceptions and we do not know the intended impact of exceptions in the mentioned applications (e.g., the NullPointerException thrown at line 5 of Figure 3.12 may have a meaning which is used by the application code). When considering exceptions, the mayUse operations that should be added to the exit site are added using a Java try-finally statement (to make sure that these operations will be executed). This is demonstrated in Figure 3.13.

3.5 Implementing Libraries with Foresight

In this section we present a generic technique for realizing an eager-ordering safe extension (see Definition 3.14) of a given base library B. Our technique exploits a variant of the tree locking protocol over a tree that is designed according to semantic properties of the library's operations. The approach can be used by a library designer for the implementation of an extended (transactional) library.

1  public int getStateIndex(String state, boolean add) { @atomic {
2    Integer index = stepStateIndex.get(state);
3    if ((index == null) && add) {
4      if (state == null)
5        throw new NullPointerException("String state");
6      int t = stepStateIndex.size();
7      index = new Integer(t);
8      Integer check = stepStateIndex.putIfAbsent(state,index);
9      if (check != null) {
10       index = check;
11     }
12   }
13   return index != null ? index : -1;
14 }}

Figure 3.12: Composite operation from Tammi [2].
1  public int getStateIndex(String state, boolean add) { @atomic {
2    try {
3      TLibrary.mayUseAll();
4      Integer index = stepStateIndex.get(state);
5      if ((index == null) && add) {
6        if (state == null)
7          throw new NullPointerException("String state");
8        int t = stepStateIndex.size();
9        index = new Integer(t);
10       TLibrary.mayUseKey(state);
11       Integer check = stepStateIndex.putIfAbsent(state,index);
12       TLibrary.mayUseNone();
13       if (check != null) {
14         index = check;
15       }
16     }
17     return index != null ? index : -1;
18   } finally { TLibrary.mayUseNone(); }
19 }}

Figure 3.13: Composite operation from Figure 3.12 with added mayUse operations. (In this figure, the mayUse operations are invoked by calling static methods of the class "TLibrary"; see details in Section 3.7.4.)

Assumptions. In this section we assume that: (i) the first operation invoked by a thread is a mayUse operation; (ii) the last operation invoked by a thread is a mayUse operation that refers to an empty set. Note that both assumptions are satisfied by the calls inferred by the algorithm of Section 3.4.

3.5.1 The Basic Approach

Example Library. Here, we use the example from Section 3.4. The procedures of the base library are shown in Figure 3.8. The procedure createNewMap creates a new Map and returns a unique identifier corresponding to this Map. The other procedures have the standard meaning (e.g., as in java.util.Map), and identify the Map to be operated on using the unique mapId identifier. In all procedures, k is a key and v is a value.
We now describe the mayUse procedures we use to extend the library interface (formally defined by the annotations in Figure 3.9):
(1) mayUseAll(): indicates that the transaction may invoke any library operation;
(2) mayUseMap(int mapId): indicates that the transaction will invoke operations only on Map mapId;
(3) mayUseKey(int mapId, int k): indicates that the transaction will invoke operations only on Map mapId and key k;
(4) mayUseNone(): indicates the end of the transaction (it will invoke no more operations).
In the following, we write mayE(m) to denote the set of operations associated with the mayUse operation m.

Implementation Parameters. Our technique is parameterized and permits the creation of different instantiations offering tradeoffs between concurrency granularity and overheads. The parameters of our extension are a locking-tree and a lock-mapping. A locking-tree is a directed static tree where each node n represents a (potentially unbounded) set of library operations On, and satisfies the following requirements: (i) the root of the locking-tree represents all operations of the base library; (ii) if n′ is a child of n, then On′ ⊆ On; (iii) if n and n′ are roots of disjoint sub-trees, then every m ∈ On and m′ ∈ On′ are statically commutative.

Example 3.5.1 Figure 3.14 shows a possible locking-tree for the Map library. The root A represents all (library) operations. Each M^i (i = 0, 1) represents all operations with an argument mapId that satisfies³ i = mapId % 2. Each K^i_j (i = 0, 1 and j = 0, 1, 2) represents all operations with arguments mapId and k that satisfy i = mapId % 2 ∧ j = k % 3.

³ We write % to denote the modulus operator. Note that we can use a hash function (before applying the modulus operator).

A
├── M^0
│   ├── K^0_0
│   ├── K^0_1
│   └── K^0_2
└── M^1
    ├── K^1_0
    ├── K^1_1
    └── K^1_2

Figure 3.14: The locking-tree used in the example.

The lock-mapping is a function P from mayUse operations to tree nodes and a special value ⊥.
For a mayUse operation m, P(m) is the node associated with m. For each mayUse operation m, P should satisfy: if mayE(m) ≠ ∅, then mayE(m) ⊆ O_{P(m)}; otherwise P(m) = ⊥.

Example 3.5.2 Here is a possible lock-mapping for our example: mayUseAll() is mapped to the root A; mayUseMap(mapId) is mapped to M^i where i = mapId % 2; mayUseKey(mapId,k) is mapped to K^i_j where i = mapId % 2 ∧ j = k % 3; mayUseNone() is mapped to ⊥.

Implementation. We associate a lock with each node of the locking-tree. The mayUse operations are implemented as follows:
• The first invocation of a mayUse operation m by a thread (that has not previously invoked any mayUse operation) acquires the lock on P(m) as follows: the thread follows the path in the tree from the root to P(m), locking each node n in the path before accessing n's child. Once P(m) has been locked, the locks on all nodes except P(m) are released.⁴
• An invocation of a mayUse operation m′ by a thread that holds the lock on P(m) locks all nodes in the path from P(m) to P(m′) (in the same tree order), and then releases all locks except P(m′). If P(m′) = P(m) or P(m′) is not reachable from P(m),⁵ then the execution of m′ has no impact.
• If a mayUse operation m is invoked by t and P(m) = ⊥, then t releases all the locks it owns.

⁴ This is a simplified version. Other variants, such as hand-over-hand locking, will work as well.
⁵ This may happen, for example, when O_{P(m′)} ⊃ O_{P(m)}.

Furthermore, our extension adds a wrapper around every base library procedure, which works as follows. When a non-mayUse operation m is invoked, the current thread t must hold a lock on some node n (otherwise, the client protocol is violated). Conceptually, this operation performs the following steps: (1) wait until all the nodes reachable from n are unlocked; (2) invoke m of the base library and return its return value. Here is a possible pseudo-code for isEmpty:
bool isEmpty(int mapId) {
  n := the node locked by the current thread
  wait until all nodes reachable from n are unlocked
  return baseLibrary.isEmpty(mapId);
}

Correctness. The implementation satisfies the progress condition because, if there exist threads that hold locks, at least one of them will never wait for other threads (because of the tree structure, and because we assume that the base library is total). We say that t is smaller than t′ if the lock held by t is reachable from the lock held by t′. The following property is guaranteed: if t ≤h t′ (see Definition 3.13), then either t is smaller than t′ or all operations allowed for t and t′ are statically commutative. In the implementation, a non-mayUse operation waits until all smaller threads have completed; hence the extended library is a safe extension.

Note that we have ignored cases in which the first operation invoked by a thread is a non-mayUse operation (because of our assumption, and because such cases never occur with the algorithm described in Section 3.4). Nevertheless, if needed, a library implementation can dynamically ensure that the first operation is a mayUse operation by automatically invoking such an operation just before executing the first non-mayUse operation. This can be implemented, for example, by adding the following code to the beginning of each base procedure:

if( no node is locked by the current thread )
  mayUseAll();
... // the remaining code of the procedure

3.5.2 Using Dynamic Information

The dynamic information utilized by the basic approach is limited. In this section we show two ways that enable us (in some cases) to avoid blocking by utilizing dynamic information.
Utilizing the State of the Locks In the basic approach, a non-mayUse operation invoked by thread t waits until all the reachable nodes (i.e., reachable from the node locked by t) are unlocked — this ensures that the operation is a right-mover with respect to the preceding threads. In some cases this is too conservative; for example:

Example 3.5.3 Consider the example from Section 3.5.1, and a case in which thread t holds a lock on M^0 (assume t is allowed to use all operations of a single Map). If t invokes remove(0,6), it will wait until K^0_0, K^0_1 and K^0_2 are unlocked. But waiting for K^0_1 and K^0_2 is not needed, because threads that hold locks on these nodes are only allowed to invoke operations that are commutative with remove(0,6). In this case it is sufficient to wait until K^0_0 is unlocked.

So, if a non-mayUse operation m is invoked, it is sufficient to wait until all reachable nodes in the following set are unlocked:

  { n | ∃ m′ ∈ O_n : m is not a static-right-mover with m′ }

Utilizing the State of the Base Library In some cases, the state of the base library can be used to avoid waiting. For example:

Example 3.5.4 Consider the example from Section 3.5.1, and a case in which thread t holds a lock on M^0 (assume t is allowed to use all operations of a single Map), and other threads hold locks on K^0_0, K^0_1 and K^0_2. If t invokes isEmpty, it will have to wait until all the other threads unlock K^0_0, K^0_1 and K^0_2. This is not always needed: for example, if the Map manipulated by t has 4 elements, then the other threads will never be able to make the Map empty (according to the Map semantics, they can only affect 3 keys, so they cannot remove more than 3 elements). Hence, the execution of isEmpty by t is a dynamic-right-mover.
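The reduced wait-set from "Utilizing the State of the Locks" can be sketched as a simple tree walk. This is our own illustration (names OpNode, waitSet are not from the thesis); it computes the set of reachable nodes a non-mayUse operation m actually has to wait for, given a static-right-mover predicate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Sketch of the reduced wait-set: before running a non-mayUse operation m,
// a thread holding node `start` only waits for descendants whose allowed
// operations O_n contain some m' such that m is not a static-right-mover
// with m'.
class WaitSetSketch {
    static class OpNode {
        final List<String> allowedOps = new ArrayList<>(); // O_n
        final List<OpNode> children = new ArrayList<>();
    }

    // Collect every node reachable from start (excluding start itself)
    // that m must wait for.
    static List<OpNode> waitSet(OpNode start, String m,
                                BiPredicate<String, String> staticRightMover) {
        List<OpNode> result = new ArrayList<>();
        collect(start, m, staticRightMover, result, true);
        return result;
    }

    private static void collect(OpNode n, String m,
                                BiPredicate<String, String> srm,
                                List<OpNode> out, boolean isStart) {
        if (!isStart) {
            for (String op : n.allowedOps)
                if (!srm.test(m, op)) { out.add(n); break; }
        }
        for (OpNode c : n.children) collect(c, m, srm, out, false);
    }
}
```

With operations abstracted as the key they touch, remove(0,6) only waits for the node guarding key 6 % 3 = 0, as in Example 3.5.3.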
A library designer can add code that observes the library's state and checks that the operation is a dynamic-right-mover; in such a case, it executes the operation of the base library (without waiting). For example, the following code lines can be added to the beginning of isEmpty(int mapId):

bool c1 = M^0 or M^1 is held by the current thread;
bool c2 = baseLibrary.size(mapId) > 3;
if(c1 and c2)
  return baseLibrary.isEmpty(mapId);
... // the remaining code of isEmpty

This code verifies that the actual Map cannot be made empty by the preceding threads; in such a case we know that the operation is a dynamic-right-mover. Note that writing code that dynamically verifies right-moverness may be challenging, because it may observe an inconsistent state of the library (i.e., the library may be concurrently changed by the other threads).

3.5.3 Optimistic Locking

According to the above description, the threads are required to lock the root of the locking-tree. This may create contention (several threads trying to lock the root at the same time) and potentially degrade performance [54].

To avoid this contention, we use the following technique. For each lock we add a counter — the counter is incremented whenever the lock is acquired. When a mayUse operation m is invoked (by a thread that has not invoked a mayUse operation), it performs the following steps: (1) go over all nodes from the root to P(m) and read the counter values; (2) lock P(m); (3) go over all nodes from the root to P(m) again; if some node is locked or its counter has been modified, then unlock P(m) and restart (i.e., go to step 1). The idea is to simulate hand-over-hand locking while avoiding writes to shared memory: only the node P(m) is actually locked, and the state of the other nodes is merely read.
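Steps (1)–(3) of this optimistic descent can be sketched as follows. This is a minimal illustration under our own naming; it covers only the first-acquisition case, with per-lock acquire counters used for validation:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the optimistic descent: only P(m) is ever written; ancestor
// nodes are merely read and revalidated via per-lock counters.
class OptimisticDescentSketch {
    static class Node {
        final ReentrantLock lock = new ReentrantLock();
        final AtomicLong acquireCount = new AtomicLong(); // bumped on acquire
        void acquire() { lock.lock(); acquireCount.incrementAndGet(); }
    }

    // path = nodes from the root down to P(m), inclusive. Returns the number
    // of attempts once P(m) is held and the whole path has validated.
    static int lockTarget(List<Node> path) {
        Node target = path.get(path.size() - 1);
        int attempts = 0;
        while (true) {
            attempts++;
            // (1) read the counters along the path
            long[] seen = new long[path.size() - 1];
            for (int i = 0; i < path.size() - 1; i++)
                seen[i] = path.get(i).acquireCount.get();
            // (2) lock P(m) only
            target.acquire();
            // (3) revalidate: ancestors unlocked and counters unchanged
            boolean ok = true;
            for (int i = 0; i < path.size() - 1; i++) {
                Node n = path.get(i);
                if (n.lock.isLocked() || n.acquireCount.get() != seen[i]) {
                    ok = false;
                    break;
                }
            }
            if (ok) return attempts; // equivalent to hand-over-hand from the root
            target.lock.unlock();    // restart
        }
    }
}
```

On an uncontended path the descent succeeds on the first attempt without touching any shared ancestor state.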
When we do not restart in step 3, we know that the execution is equivalent to one in which the thread performs hand-over-hand locking from the root to P(m).

3.5.4 Further Extensions

In this section, we describe additional extensions of the basic approach. We first describe how to utilize read-write locks — this enables situations in which several threads hold the same lock (node). We then describe how to associate several nodes with the same mayUse operation — this enables, for example, a mayUse operation that is associated with operations on several different keys.

Utilizing Read-Write Locks The basic approach does not permit situations in which several threads simultaneously hold a lock on the same node. This prevents several threads from being simultaneously allowed to invoke commutative operations represented by the same node (which is sometimes desirable; for example, one may want to allow several threads to simultaneously invoke all read-only operations, which are represented by the root).

In order to extend the basic approach, we represent an implementation by a locking-tree, a lock-mapping P (as defined in Section 3.5.1), and a set of operations R. The set R should satisfy the following: every m, m′ ∈ R are statically-commutative.

Example 3.5.5 For the example presented in Section 3.5.1, we can use a set R with the following operations: all read-only operations (i.e., invocations of isEmpty, size and get) and invocations of createNewMap. Any pair of operations from R are statically-commutative.

Here the implementation is similar to Section 3.5.1 with the following differences:
• For each mayUse operation m (such that mayE(m) ≠ ∅) we create a mayUse operation m_R such that mayE(m_R) = mayE(m) ∩ R. The lock-mapping P maps m and m_R to the same node (i.e., P(m) = P(m_R)). m is called a W-mayUse operation; m_R is called an R-mayUse operation.
For the example from Section 3.5.1, such operations can be added by adding the following mayUse procedures: mayUseAllReadOnly(), mayUseMapReadOnly(int mapId) and mayUseKeyReadOnly(int mapId, int k).
• The locks are read-write locks — they can be acquired in read mode or in write mode, and write mode can be downgraded to read mode (e.g., see Java's ReentrantReadWriteLock).
• The R-mayUse operations always acquire nodes in read mode; the W-mayUse operations always acquire them in write mode.
• If an R-mayUse operation m_R is invoked, and P(m_R) is currently held by the current thread in write mode, then the locking mode of P(m_R) is downgraded from write mode to read mode.
• If a W-mayUse operation is invoked after an R-mayUse operation (by the same thread), the invocation has no impact.
• When a non-mayUse operation is invoked: if the thread currently holds its lock in write mode, it waits until all reachable nodes are unlocked; otherwise (it holds the lock in read mode), it waits until all reachable nodes are either unlocked or locked in read mode. (Note that the technique from Section 3.5.2 can still be used in order to avoid considering all reachable nodes.)

Associating Several Nodes with One mayUse Operation In some cases it is desirable to associate several nodes with one mayUse operation. This enables, for example, a mayUse operation that is associated with the nodes K^0_0 and K^0_1 from Section 3.5.1 (these nodes represent operations on different keys).

In order to handle this, the lock-mapping P becomes a function from mayUse operations to (non-empty) sets of nodes and ⊥. For each mayUse operation m, P should satisfy: if mayE(m) ≠ ∅, then mayE(m) ⊆ ⋃_{n ∈ P(m)} O_n; otherwise P(m) = ⊥. Here, a mayUse operation m needs to consider all the paths to the nodes in P(m) — i.e., the approach remains the same, but instead of considering one node, we consider several nodes.
• An invocation of a mayUse operation m (by a thread that has not invoked a mayUse operation) locks all nodes on the paths from the root to the nodes in P(m). The nodes are locked in tree order — if n is a parent of n′, then n is locked before n′. After all nodes on the paths are locked, all the nodes except those in P(m) are released.
• An invocation of a mayUse operation m′ by a thread that holds locks on the nodes in a set S locks all nodes on the paths from S to P(m′) (in the same tree order), and then releases all nodes except those in P(m′). Note that nodes which are not reachable from S are ignored.

3.5.5 Java Threads and Transactions

In the Java environment (and in other modern programming environments), the same thread may need to execute several transactions, one after the other. This does not restrict our technique: whenever a thread t invokes a mayUse operation of an empty set, the library forgets the identity of t (by releasing all locks owned by t), and therefore thread t can subsequently be seen as a brand new transaction.

3.6 Experimental Evaluation

In this section we present an experimental evaluation of our approach. The goals of the evaluation are: (i) to measure the precision and applicability of the static algorithm presented in Section 3.4; (ii) to compare the performance of our approach to synchronization implemented by experts; and (iii) to determine whether our approach can be used to perform synchronization in realistic software with reasonable performance.

Towards these goals, we implemented a general-purpose Java library for Map data structures using the technique presented in Section 3.5 (see also Section 3.7). In all cases in which our library is used, the calls to the mayUse operations have been automatically inferred by our static algorithm.
Methodology For all performance benchmarks (except the GossipRouter), we follow the performance evaluation methodology of Herlihy et al. [50], and consider the clients under different workloads. To ensure consistent and accurate results, each experiment consists of eight passes: the first three passes warm up the VM and the other five passes are timed. Each experiment was run four times, and the arithmetic average of the throughput is reported as the final result. Every pass of the test program consists of each thread performing one million randomly chosen operations of the client code. (Note that in this section a thread is a Java thread. This is in contrast to our formal model, in which a thread is a composite operation; see Section 3.5.5.) We used a Sun SPARC Enterprise T5140 server machine running Solaris 10 — this is a 2-chip Niagara system in which each chip has 8 cores (the machine's hyperthreading was disabled).

3.6.1 Applicability and Precision of the Static Analysis

We applied our static analysis (from Section 3.4) to 58 Java code sections (composite operations) from [79] that manipulate Maps (taken from open-source projects). Each composite operation has at least two calls to Map procedures. For 18 composite operations, our implementation reported warnings about procedure invocations that do not belong to the library (see Section 3.4.3) — we manually verified that these invocations are pure (the purity of the invoked procedures is obvious, and can be verified by existing static algorithms such as [82]), so they can be seen as thread-local operations. A summary of this experiment is shown in
Figure 3.15.

 #  Source Application           Nodes  mayUse |  #  Source Application  Nodes  mayUse
 1  Adobe BlazeDS                 30     2     | 30  Hudson               27     2
 2  Adobe BlazeDS                 30     2     | 31  ifw2                 29     2
 3  Apache Cassandra*             45     2     | 32  ifw2                 29     2
 4  Apache MyFaces Trinidad       20     2     | 33  Jack4J*              61     2
 5  Apache ServiceMix             27     2     | 34  JBoss                29     2
 6  Apache Tomcat*                66     6     | 35  Jetty                29     2
 7  Apache Tomcat*                63     6     | 36  Jetty                26     2
 8  Apache Tomcat                 17     4     | 37  Jetty                26     2
 9  Apache Tomcat                 22     2     | 38  Jexin                26     2
10  Apache Tomcat                 21     2     | 39  Keyczar              30     2
11  autoandroid*                  56     2     | 40  memcache-client      22     4
12  Beanlib*                      62     3     | 41  OpenEJB*             44     2
13  Clojure                       30     2     | 42  OpenJDK              30     2
14  Cometdim                      26     2     | 43  Tammi*               30     2
15  DWR                           27     2     | 44  Tammi                33     2
16  dyuproject                    29     4     | 45  Tammi                46     4
17  ehcache-spring-annotation*    27     2     | 46  Tammi                32     5
18  Ektorp                        43     5     | 47  ProjectTrack         26     2
19  Flexive*                      23     2     | 48  ProjectTrack         28     2
20  Flexive                       29     4     | 49  RestEasy             27     4
21  Flexive*                      19     2     | 50  RestEasy*            29     2
22  GlassFish*                    30     2     | 51  RestEasy             27     2
23  GlassFish*                    34     2     | 52  RestEasy             25     2
24  Granite*                      48     4     | 53  RestEasy             26     2
25  Gridkit*                      37     2     | 54  retrotranslator*     34     4
26  Gridkit                       37     2     | 55  Torque-spring        23     2
27  GWTEventService               14     2     | 56  Yasca                27     2
28  GWTEventService*              19     2     | 57  OpenJDK              26     4
29  Hazelcast                     21     2     | 58  OpenJDK              48     4

Figure 3.15: Composite operations from [79]. For each composite operation we list its source application, the number of CFG nodes, and the number of mayUse operations inferred by our algorithm. We use * to denote cases in which we manually verified that the non-library operations are pure.

Surprisingly, in spite of its simplicity, the algorithm inferred "optimal" mayUse operations in the following sense: in no case were we able to correctly add mayUse operations such that, for some program location, the set of allowed operations would be smaller (during an execution).

3.6.2 Comparison to Hand-Crafted Implementations

We selected several composite operations over a single Map: the computeIfAbsent pattern [3], and a few other common composite Map operations (that are supported by ConcurrentHashMap [4]).
For these composite operations, we compare the performance of our approach to synchronization implemented by experts in the field.

Figure 3.16: ComputeIfAbsent: throughput (operations/millisecond) as a function of the number of threads (1-16), comparing Global Lock, Ours, Manual, and ConcurrentHashMapV8.

ComputeIfAbsent The ComputeIfAbsent pattern appears in many Java applications. Many bugs in Java programs are caused by non-atomic realizations of this simple pattern (see [79]). It can be described with the following pseudo-code:

if(!map.containsKey(key)) {
  value = ... // pure computation
  map.put(key, value);
}

The idea is to compute a value and store it in a Map if and only if the given key is not already present in the Map. We chose this benchmark because there exists a new version of the Java Map, called ConcurrentHashMapV8, with a procedure that gets the computation as a parameter (i.e., a function is passed as a parameter) and atomically executes the pattern's code [3].

We compare four implementations of this pattern: (i) an implementation based on a global lock; (ii) an implementation based on our approach; (iii) an implementation based on ConcurrentHashMapV8; (iv) an implementation based on hand-crafted fine-grained locking. The computation was emulated by allocating a relatively large Java object (~128 bytes).

The results are shown in Figure 3.16. As expected, the global lock implementation does not scale with the number of threads. Note that ConcurrentHashMapV8 performs well for 1 and 2 threads; this can be explained by the fact that the pattern is directly implemented inside the data structure (and it only scans the data structure once). We are encouraged by the fact that our approach provides better performance than ConcurrentHashMapV8 for at least 8 threads; moreover, it is at most 25% slower than the hand-crafted fine-grained locking.
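The pattern and its atomic counterpart can be illustrated in standard Java. Note that the ConcurrentHashMapV8 functionality was folded into java.util.concurrent.ConcurrentHashMap as of Java 8; the helper names below (racyComputeIfAbsent, demo) are ours:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// The computeIfAbsent pattern: the check-then-put version is atomic only if
// external synchronization covers both calls, whereas
// ConcurrentHashMap.computeIfAbsent executes the whole pattern atomically
// and runs the computation at most once per absent key.
class ComputeIfAbsentSketch {
    // Non-atomic composite: under concurrency, two threads can both observe
    // the key as absent and both run the (possibly expensive) computation.
    static <K, V> V racyComputeIfAbsent(Map<K, V> map, K key,
                                        java.util.function.Function<K, V> f) {
        if (!map.containsKey(key)) {
            map.put(key, f.apply(key)); // pure computation + put, not atomic
        }
        return map.get(key);
    }

    static void demo(AtomicInteger computations) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.computeIfAbsent("a", k -> { computations.incrementAndGet(); return 1; });
        // Key already present: the function below is not evaluated.
        map.computeIfAbsent("a", k -> { computations.incrementAndGet(); return 2; });
    }
}
```
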
Common Composite Map Operations We study common composite operations over a single Map. The composite operations we consider here are implemented in Java's ConcurrentHashMap [4]. (In the previous comparison, the hand-crafted fine-grained locking used lock striping, similar to [49], with 32 locks; this is an attempt to estimate the benefits of manual hand-crafted synchronization without changing the underlying library.)

Figure 3.17: Composite Map Operations: throughput (operations/millisecond) as a function of the number of threads (1-16), comparing Global Lock, Ours, and ConcurrentHashMap on three workloads: 40% putIfAbsent, 30% remove, 30% replace; 70% contains, 10% putIfAbsent, 10% remove, 10% replace; and 50% contains, 50% putIfAbsent.

Java's ConcurrentHashMap [4] implementation provides the following composite operations:
• putIfAbsent(K key, V value) — adds a (key,value) pair to the map only if the key is not already present. Implemented using the basic operations contains and put.
• replace(K key, V value) — replaces the value for a key only if it is currently mapped to some value. Implemented using the basic operations contains and put.
• remove(K key, V value) — removes the entry for key from the map only if it is currently mapped to the given value. Implemented using the basic operations get and remove(K key).

The ConcurrentHashMap documentation for each of the above procedures provides pseudo-code in which the procedure is expressed in terms of basic Map operations. We based our implementation of the composite operations directly on these pseudo-code fragments. We compare three implementations of these operations: (i) an implementation based on a global lock; (ii) an implementation based on our approach; (iii) the actual implementation of ConcurrentHashMap [4]. The results, for three workloads, are shown in Figure 3.17.
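The three composite operations, written exactly in terms of the basic Map operations and made atomic with a single global lock, can be sketched as follows. This corresponds to the coarse-grained baseline of the comparison (the class name is ours); the foresight-based version replaces the synchronized blocks with inferred mayUse calls:

```java
import java.util.HashMap;
import java.util.Map;

// Global-lock baseline: each composite operation is a sequence of basic Map
// operations guarded by one lock, so the composite appears atomic.
class CompositeMapOpsSketch<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final Object globalLock = new Object();

    V putIfAbsent(K key, V value) {        // add only if the key is absent
        synchronized (globalLock) {
            if (!map.containsKey(key)) return map.put(key, value);
            return map.get(key);
        }
    }

    V replace(K key, V value) {            // replace only if the key is present
        synchronized (globalLock) {
            if (map.containsKey(key)) return map.put(key, value);
            return null;
        }
    }

    boolean remove(K key, V value) {       // remove only if mapped to value
        synchronized (globalLock) {
            if (value.equals(map.get(key))) { map.remove(key); return true; }
            return false;
        }
    }
}
```
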
3.6.3 Evaluating the Approach on Realistic Software

We applied our approach to three benchmarks with multiple Maps — in these benchmarks, several Maps are simultaneously manipulated by the composite operations. We used the Graph benchmark [49], Tomcat's cache [5], and the multi-threaded application GossipRouter [10]. In these benchmarks, we compare the performance to coarse-grained locking.

Graph This benchmark is based on a Java implementation of the Graph that was used for the evaluation of [49]. The Graph consists of four composite operations: find successors, find predecessors, insert edge, and remove edge. Its implementation uses two Map data structures in which several different values can be associated with the same key (this type of Map is supported by our library; [8] also contains an example of such a Map type).

Figure 3.18: Graph and Cache: throughput (operations/millisecond) as a function of the number of threads (1-16). We compare a synchronization based on a global lock and a synchronization based on our approach (for the Graph, an "Ours (w/o optimization)" variant is also shown). We use the workloads from [49]. The Graph panels cover the workloads: 70% find successors, 20% insert, 10% remove; 45% find successors, 45% find predecessors, 9% insert, 1% remove; 50% insert, 50% remove; and 35% find successors, 35% find predecessors, 20% insert, 10% remove. The Cache panels cover 90% Get, 10% Put with Size=50K and with Size=5000K.

The results are shown in Figure 3.18(a)-(d). For some of the workloads, we see a drop in performance between 8 and 16 threads.
This can be explained by the fact that each chip of the machine has 8 cores, so using 16 threads requires using both chips (which creates more overhead).

Tomcat's Cache This benchmark is based on a Java implementation of Tomcat's cache [5]. This cache uses two types of Maps which are supported by our library: a standard Map and a weak Map (see [6]). The cache consists of two composite operations, Put and Get, which manipulate the internal Maps. In this cache, Get is not a read-only operation (in some cases, it copies an element from one Map to another). The cache takes a parameter (size) which is used by its algorithm. Figure 3.18(e) and Figure 3.18(f) show the results for two workloads.

GossipRouter The GossipRouter is a Java multi-threaded routing service from [10]. Its main state is a routing table which consists of several Map data structures. (The exact number of Maps is determined dynamically.) We use a version of the router ("3.1.0.Alpha3") with several bugs that are caused by inadequate synchronization in the code that accesses the routing table. We have manually identified all code sections that access the routing table as atomic sections, and verified (manually) that whenever these code sections are executed atomically, the known bugs do not occur.

We compare two ways to enforce atomicity of the code sections: a synchronization based on a global lock, and a synchronization based on our approach.

Figure 3.19: GossipRouter: throughput (messages/second) as a function of the number of active cores, comparing Global Lock and Ours (16 clients, 5000 messages per client).

We used a performance tester from [10] (called MPerf) to simulate 16 clients, where each client sends 5000 messages. In this experiment the number of threads cannot be controlled from the outside (because the threads are autonomously managed by the router).
Therefore, instead of changing the number of threads, we changed the number of active cores. The results are shown in Figure 3.19.

For the router's code, our static analysis reported warnings about invocations of procedures that do not belong to the Maps library. Interestingly, these procedures perform I/O and do not violate the atomicity of the code sections. Specifically, they perform two types of I/O operations: logging operations (used to print debug messages) and operations that send network messages (to the router's clients). They do not violate the atomicity of the atomic sections because they are not used to communicate between the threads (from the perspective of our formal model, they can be seen as thread-local operations).

3.7 Java Implementation of the Transactional Maps Library

For our evaluation, we have implemented a single Java library of Maps using the technique described in Section 3.5.

Compatible API. In order to simplify the evaluation (on existing Java code), we have created an API which is similar to that of the existing Java Maps (see Section 3.7.4). The existing code sections were converted to use our library by changing the Java types of the Maps. (For example, "ConcurrentHashMap<K,V>" was changed to "CLIBConcurrentHashMap<K,V>".)

3.7.1 Base Library

We have implemented the base library (this library does not have mayUse procedures) by wrapping several existing implementations of concurrent Maps. All procedures of the base library are linearizable. We have used the following Map types:
• Standard Map (similar to [4]; the implementation was taken from [8]).
• Weak Map (this Map is an efficient variant of [6]; the implementation was also taken from [8]).
• Multi Map — in this Map each key may be associated with multiple values. We have manually created a linearizable implementation of this Map by using Java locks and a standard Map.
For each Map type T, we have created a procedure that allocates Maps of type T.

3.7.2 Extended Library

MayUse procedures The extended library has the following mayUse procedures:

void mayUseAll();
void mayUseKey(Object key);
void mayUseKey(Object key1, Object key2);
void mayUseMap(Map map);
void mayUseNone();
void mayUseAllRO();
void mayUseKeyRO(Object key);
void mayUseKeyRO(Object key1, Object key2);
void mayUseMapRO(Map map);

An invocation of mayUseAll() is associated with all the library operations. An invocation of mayUseKey(k) is associated with all operations on key k (e.g., an operation that puts a value in an element with key = k). An invocation of mayUseKey(k1,k2) is associated with all operations on key k1 or key k2. An invocation of mayUseMap(m) is associated with all operations on map m. An invocation of mayUseNone() is associated with an empty set of operations. The other mayUse operations are called read-only (their procedure names end with "RO") — for each mayUse operation m (except mayUseNone()) we have a read-only mayUse operation m_R which is associated with m's operations that are read-only. (If the name of m is p(...), then the name of m_R is pRO(...).)

Parameters We have used a locking-tree which is similar to the example shown in Section 3.5.1 — in our implementation the structure of the locking-tree is configurable. In our experiments we have used a locking-tree in which the root has 2 children (denoted by M^0 and M^1), and each M^i has 32 children. The lock-mapping P is defined according to the above description of the mayUse operations. The set R (discussed in Section 3.5.4) is the set of all read-only operations.

Hash-Function In contrast to Section 3.5.1, our Maps are represented by Java objects (and not by integers), so we use a hash-function that maps Java objects to integers (we apply the hash-function before applying the modulus operator).
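The node selection can be sketched as follows. The fan-outs (2 map-level nodes, 32 key-level children per node) follow the configuration above; the thesis does not spell out how negative hash codes are handled, so the use of Math.floorMod (which keeps indices non-negative) and the names below are our own assumptions:

```java
// Sketch of node selection in the locking-tree: Java objects are mapped to
// nodes by hashing before taking the modulus.
class NodeIndexSketch {
    static final int MAP_FANOUT = 2;   // children of the root: M^0, M^1
    static final int KEY_FANOUT = 32;  // children of each M^i

    static int mapNode(Object map) {   // which M^i guards this Map
        return Math.floorMod(map.hashCode(), MAP_FANOUT);
    }

    static int keyNode(Object map, Object key) { // which K^i_j, flattened index
        int i = mapNode(map);
        int j = Math.floorMod(key.hashCode(), KEY_FANOUT);
        return i * KEY_FANOUT + j;
    }
}
```

floorMod rather than % matters because Object.hashCode may be negative, and a negative index would fall outside the child array.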
(Two notes on the above: the Multi Map is similar to com.google.common.collect.Multimap from [8], and the per-key mayUse operations rely on the fact that operations on different keys are always commutative.)

3.7.3 Utilizing Dynamic Information by Handcrafted Optimization

In order to use the optimization "Utilizing the State of the Base Library" (Section 3.5.2), we have manually changed the following procedures:
• boolean isEmpty(Map m): returns true iff Map m is empty.
• boolean sizeLargerThan(Map m, int n): returns true iff Map m contains more than n elements.
• boolean sizeSmallerThan(Map m, int n): returns true iff Map m contains fewer than n elements.

The procedure isEmpty was changed as demonstrated in Section 3.5.2. The following pseudo-code describes the optimization for sizeLargerThan (sizeSmallerThan is similar):

bool sizeLargerThan(Map m, int n) {
  wait until M^i is not held by another thread
  int s = baseLibrary.size(m);
  if(n < s-C) return false; // C is equal to the number of nodes reachable from M^i
  if(n > s+C) return true;  // at most C threads may precede the current thread
  ... // the remaining code of sizeLargerThan
}

In our performance evaluation (Section 3.6), this optimization affects the performance of only two benchmarks: the Cache and the GossipRouter. For the Cache benchmark, the average throughput improvement is 27% and the maximal throughput improvement is 60%. For the GossipRouter benchmark, the average throughput improvement is 3% and the maximal throughput improvement is 9%.

3.7.4 API Adapter

Our library is realized as a Java class ("TLibrary") — its "procedures" are static methods of this class. In order to create an API which is similar to Java's API, we added a way to represent Maps as Java objects. We use the Adapter design pattern [39].
For each Map type T, we created a Java class T such that: (1) the constructor of class T creates a Map (using a procedure of our library) and stores a reference to this Map; (2) the other methods of T call the corresponding procedures of the library. For example, a constructor of class Map is implemented by:

{ this.myMap = TLibrary.createMap(); }

and the method isEmpty() is implemented by:

{ return TLibrary.isEmpty(this.myMap); }

Chapter 4
Composition of Transactional Libraries via Semantic Locking

The approach presented in Chapter 3 is designed to handle programs in which the composite operations use a single transactional library; the transactional library has to support all abstract data types (ADTs) that may be concurrently used by the program's threads. In this chapter, we present an approach for composite operations that simultaneously use multiple transactional libraries. We focus on a special case of transactional libraries: libraries with semantic-locking (see Section 3.3.7). In this chapter, the mayUse operations of these libraries are called locking operations, since their meaning resembles the meaning of the common locking operations (lock-acquire and lock-release). In order to handle multiple libraries, we assume that each ADT instance is a single independent library (with semantic-locking). The locking operations of the ADTs are used to enforce atomicity of the composite operations while exploiting semantic properties of the ADTs.

The Problem We consider a Java multi-threaded program (also referred to as a client) which makes use of several ADTs (libraries) with semantic-locking. We assume that the only mutable state shared by multiple threads consists of instances of these ADTs. We permit atomic sections as a language construct: a block of code may be marked as an atomic section (as demonstrated in Figure 4.1). An execution of an atomic section is called a transaction.
Our goal is to ensure that transactions appear to execute atomically and make progress (avoiding deadlocks), while exploiting the semantic-locking provided by the ADTs.

Automatic Atomicity We present a compiler that automatically infers calls to semantic-locking operations. Given a client program (without calls to semantic-locking operations) and a specification of the semantic-locking operations of the ADTs used by the client, the compiler inserts invocations of semantic-locking operations into the atomic sections of the client program to guarantee atomicity and deadlock-freedom of these atomic sections. The synchronization generated by our compiler follows a semantics-aware variant of the two-phase locking protocol (see [20, chapter 3.8] and [86, chapters 6-7]).

atomic {
  set = map.get(id);
  if(set == null) {
    set = new Set();
    map.put(id, set);
  }
  set.add(x);
  if(flag) {
    queue.enqueue(set);
    map.remove(id);
  }
}

Figure 4.1: An atomic section that manipulates several ADTs (a Map, a Set, and a Queue). This example is inspired by the code of Intruder [26], further discussed in Section 4.5.

atomic {
  map.lockKey(id);
  set = map.get(id);
  if(set == null) {
    set = new Set();
    map.put(id, set);
  }
  set.lockAdd();
  set.add(x);
  if(flag) {
    queue.lockAll();
    queue.enqueue(set);
    queue.unlockAll();
    map.remove(id);
  }
  map.unlockAll();
  set.unlockAll();
}

Figure 4.2: The atomic section of Figure 4.1 with calls to semantic locking operations automatically inserted by our compiler.

Example 4.0.1 Figure 4.1 shows an example of an atomic section given to our compiler. This code section manipulates several ADTs (a Map, a Set, and a Queue) and does not contain calls to their semantic-locking operations. Figure 4.2 shows the result of applying our compiler to the atomic section of Figure 4.1.
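A Map ADT exposing locking operations of the kind the compiler targets (lockKey / lockAll / unlockAll, as in Figure 4.2) can be sketched with per-key lock stripes: two transactions that lock different keys proceed in parallel because their operations commute, while lockAll is the coarse fallback. This is a minimal illustration under our own naming, not the thesis implementation; the two-phase discipline (no lock acquired after the first release) is the inserted code's responsibility, not the ADT's:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a Map ADT with semantic-locking operations.
class SemanticLockingMap<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReentrantLock[] stripes = new ReentrantLock[16];

    SemanticLockingMap() {
        for (int i = 0; i < stripes.length; i++) stripes[i] = new ReentrantLock();
    }

    private ReentrantLock stripe(K key) {
        return stripes[Math.floorMod(key.hashCode(), stripes.length)];
    }

    // Semantic-locking operations: a key-lock grants only operations on that
    // key; lockAll grants every operation. Stripes are taken in a fixed
    // order, so two lockAll calls cannot deadlock against each other.
    void lockKey(K key) { stripe(key).lock(); }
    void lockAll() { for (ReentrantLock l : stripes) l.lock(); }
    void unlockAll() {
        for (ReentrantLock l : stripes)
            if (l.isHeldByCurrentThread()) l.unlock();
    }

    // Basic operations; callers must first acquire a covering semantic lock.
    V get(K key) { return map.get(key); }
    V put(K key, V value) { return map.put(key, value); }
    V remove(K key) { return map.remove(key); }
}
```
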
Pointers and Limitations Our compiler handles programs in which pointers to ADT instances are dynamically manipulated; these programs are allowed to work with an unbounded number of ADT objects. For some of these programs, our compiler is unable to ensure deadlock-freedom by using only the semantic-locking operations of the ADTs. These programs are handled using an additional specialized coarse-grained synchronization. However, our experimental evaluation (Section 4.5) shows that our compiler creates effective synchronization that benefits from semantic locking even in a program in which coarse-grained synchronization is used.

The main differences between this chapter and Chapter 3 can be described as follows.
• Chapter 3 focuses on a more restrictive version of the problem addressed here: namely, atomic sections in which the shared state consists of ADTs belonging to a single ADT library (with a single centralized foresight-based synchronization). In contrast, in this chapter we permit the ADTs to belong to different libraries; this enables using libraries that have been implemented independently.
• The protocol in Chapter 3 is incomparable to the protocol used in this chapter. In some ways the protocol in Chapter 3 is more permissive, because it is able to exploit properties like dynamic-commutativity and dynamic-right-movers (in Section 3.1 we show an example in which dynamic-right-movers are essential to achieve parallelism). In other ways the protocol in Chapter 3 is less permissive, because it requires all the semantic locks to be obtained at the beginning of the atomic section; as a result, the implementation technique presented in Section 3.5 is not able to provide parallelism for transactions that may use operations which are not right-movers with each other (the limited performance of such transactions is demonstrated in Section 4.5 by the Intruder example).
Moreover, the approach of Chapter 3 is implemented as a semantic-based variant of the tree locking protocol (see Section 3.5), whereas in this chapter we utilize a semantic-based variant of the two-phase locking protocol.

• We believe that in this chapter the required internal library synchronization is simpler than the one used in Chapter 3, mainly because it is based on commutativity rather than on right-movers. [43] shows that realizing the synchronization discussed in this chapter is relatively simple (by using a semi-automatic static algorithm).

• In this chapter the compiler aims to statically avoid synchronization between transactions that access unrelated libraries/ADTs (by using points-to analysis). This is in contrast to Chapter 3, in which the single shared library is responsible for dynamically handling all transactions.

4.1 Semantic Locking

In this section we describe the terminology that is used in this chapter, present our methodology for realizing atomic sections using semantic locking, and formalize the problem addressed in the subsequent sections.

4.1.1 Basics

Clients  A client is a concurrent program that satisfies the following restrictions. All state shared by multiple threads is encapsulated as a collection of ADT instances. (The notion of an ADT is formalized later.) The shared mutable state is accessed only via ADT operations. The language provides support for atomic sections: an atomic section is simply a block of sequential code prefixed by the keyword atomic. Shared state can be accessed or modified only within an atomic section. We use the term transaction to refer to an execution of an atomic section. In the simpler setting, a client is a whole program (excluding the ADT implementations). More generally, a client can be a module or simply a set of atomic sections.
However, we assume that all atomic sections accessing the shared state are available.

ADTs  An abstract data type (ADT) encapsulates state accessed and modified via a set of methods. Statically, it consists of an interface (also referred to as its API) and a class that implements the interface. The implementation is assumed to be linearizable [55] with respect to the sequential specification of the ADT. We also assume that its object constructor is a pure method. We will use the term ADT instance to refer to a runtime object that is an instance of the ADT class. We will abbreviate "ADT instance" to just ADT if no confusion is likely. Two different ADT instances share no state. Every ADT instance is assumed to have a unique identifier (such as its memory address).

As in Chapter 3, we use the term operation to denote a tuple consisting of an ADT method name m and runtime values v1, ..., vn for m's arguments (not including the ADT instance on which the operation is performed), written as (m, v1, ..., vn). An operation represents an invocation of a method on an ADT instance (at runtime). For example, we write (add,7) to denote the operation that represents an invocation of the method add with the argument 7.

Methodology  Our goal is to realize an implementation of the atomic sections in clients. Our approach decomposes the responsibility for this task into two parts: one to be realized by the ADT (library) implementations and one to be realized by the compiler (on behalf of the client). This lets us exploit ADT-specific semantic locking that can take advantage of the semantics of ADT operations (to achieve more parallelism and fine-grained concurrency).

4.1.2 ADTs With Semantic Locking

Synchronization API  The basic idea is to utilize locks on operations (as opposed to locks on data).
Specifically, every ADT must provide a synchronization API, in addition to its standard API, that allows a transaction to acquire (and release) operations (on ADT instances). We refer to an operation (m, v1, ..., vn) as a locking operation if method m belongs to the synchronization API. We refer to it as a standard operation if method m belongs to the standard API. A locking operation l is meant to be used (by a client transaction) to acquire (permission to invoke) certain standard operations. Thus, we may think of l as corresponding to a lock on a set of standard operations (on the corresponding ADT instance). (Notice that a locking operation is essentially a mayUse operation as defined in Chapter 3.)

Example 4.1.1  Figure 4.3 presents the API for a Set-of-integers ADT. The figure presents both the standard API and the synchronization API. The semantics of the standard API operations is the usual one. The locking operations are used to lock (and unlock) standard operations. Consider the following four sets of standard operations (where we write N to denote the set of integers):

  La = {(add, v), (remove, v), (contains, v), (size), (clear) | v ∈ N}
  Lb = {(size), (contains, v) | v ∈ N}
  Lc = {(add, v) | v ∈ N}
  Ld = {(add, 7), (remove, 7), (contains, 7)}

The set La can be locked by calling lockAll(); the set Lb can be locked by calling lockReadOnly(); the set Lc can be locked by calling lockAdd(); and the set Ld can be locked by calling lockValue(7). Note that the methods lockAll(), lockReadOnly() and lockAdd() do not have arguments; each of them is used to lock a constant set of operations. The lockValue method has an argument which is used to determine the set locked by its invocation. For any integer v, the set {(add, v), (remove, v), (contains, v)} can be locked by calling lockValue(v).

The locking operations are not meant to be called directly by the client code. Instead, our compiler will insert calls to these operations while compiling atomic sections.
To enable this, we require the ADT interface to declare, for each locking method, the set of operations it corresponds to, using the annotation language from Section 3.4.1. We also assume that each ADT has a method (without arguments) that locks all its standard operations (and this method is called "lockAll"); and a method (without arguments) that unlocks all the ADT operations that are held by the current transaction (and this method is called "unlockAll").

// Standard API             // Synchronization API
void add(int i);            void lockAll();
void remove(int i);         void lockReadOnly();
boolean contains(int i);    void lockAdd();
int size();                 void lockValue(int i);
void clear();               void unlockAll();

Figure 4.3: API of a Set with semantic locking.

Requirements From ADTs  We now describe the semantic guarantees that the ADT (implementation) is required to satisfy, specifically with regards to the synchronization API. We first formalize the notion of commutativity of operations. Two operations are said to be commutative if applying them to the same ADT instance in either order leads to the same final ADT state (and returns the same response). For example, the operations (add,7) and (remove,7) are not commutative; in contrast, the operations (add,7) and (remove,10) are commutative. (Since there is no shared state between different ADT instances, operations on different ADT instances always commute.)

Every ADT implementation is required to satisfy the following guarantee: no two threads are allowed to concurrently hold locks on non-commuting operations (on the same ADT instance). Specifically, if a thread t holds locks on the operations in the set Ot for an ADT instance A and, at the same time, a different thread t' holds locks on the operations in the set Ot' for the same ADT instance A, then every operation in Ot must commute with every operation in Ot'.
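The commutativity requirement can be made concrete for the Set ADT of Figure 4.3. The following sketch is illustrative only; the names Op, commutes and compatible are not part of the thesis artifact. It encodes the commutativity relation of the Set's standard operations and checks that two lock sets may be held concurrently:

```java
import java.util.List;
import java.util.Objects;

// An operation: a method name plus an optional integer argument
// (null for size() and clear()).
class Op {
    final String m;
    final Integer arg;
    Op(String m, Integer arg) { this.m = m; this.arg = arg; }
    static boolean readOnly(Op o) {
        return o.m.equals("size") || o.m.equals("contains");
    }
}

class SetCommutativity {
    // Commutativity of two Set operations, following the Set semantics:
    // two read-only operations always commute; size() conflicts with any
    // mutator; clear() commutes only with clear(); add/remove/contains
    // on different values commute; on the same value, only identical
    // mutators (add/add, remove/remove) commute.
    static boolean commutes(Op a, Op b) {
        if (Op.readOnly(a) && Op.readOnly(b)) return true;
        if (a.m.equals("size") || b.m.equals("size")) return false;
        if (a.m.equals("clear") || b.m.equals("clear"))
            return a.m.equals("clear") && b.m.equals("clear");
        if (!Objects.equals(a.arg, b.arg)) return true;
        return a.m.equals(b.m);
    }

    // Two lock sets are compatible (may be held by different threads at
    // the same time) iff all their operations pairwise commute.
    static boolean compatible(List<Op> la, List<Op> lb) {
        for (Op a : la)
            for (Op b : lb)
                if (!commutes(a, b)) return false;
        return true;
    }
}
```

For instance, compatible fails for (finite samples of) Lb and Lc of Example 4.1.1, because size does not commute with add, but succeeds for two copies of Lc.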
This means that the implementation (of the ADT's synchronization methods) must block, whenever necessary, to ensure the above requirement. That is, if a thread t holds locks on the operations in Ot, and a thread t' tries to lock the operations in Ot' where some operation in Ot does not commute with some operation in Ot', then t' waits (blocked) until it is legal (for t') to hold locks on all operations in Ot'. Furthermore, the only role of the locking operations is to enforce concurrency control as described above. In particular, the locking operations are required to not have any effect on the standard ADT state (i.e., the specification of the standard operations). Finally, in order to ensure progress we require that: (i) locking operations on an ADT A are never blocked when no thread holds locks on A's operations; and (ii) standard operations and unlockAll() are never blocked.

Example 4.1.2  Consider Example 4.1.1. Here, a thread should not be allowed to hold a lock on the set Lb while another thread holds a lock on Lc, because (for example) size does not commute with the add operations. However, it is legal to permit a thread to hold a lock on Lc while another thread holds a lock on Lc, because add operations commute with each other. Similarly, it is legal to permit Ld and {(add, 1), (remove, 1), (contains, 1)} to be simultaneously locked by different threads, because, according to the Set semantics, operations on different values commute.

Notice that an ADT instance that satisfies the above requirements also satisfies Definition 3.15 (Section 3.3.7) and the progress condition (Section 3.3.6). Therefore, each ADT instance can be seen as a single transactional library.

void f(Set x, Set y) {
  atomic {
    x.lockReadOnly();
    int i = x.size();
    y.lockAdd();
    x.unlockAll();
    y.add(i);
    y.unlockAll();
  }
}

Figure 4.4: Code that follows the S2PL protocol.
4.1.3 Automatic Atomicity

We now describe how atomic sections in a client can be automatically realized by a compiler using the semantic-locking capabilities provided by the underlying ADTs.

The S2PL Protocol  Our synchronization is based on a semantics-aware two-phase locking protocol [20] (S2PL). We say that an execution π follows S2PL if the following conditions are satisfied by each transaction t in π:

(C1) t invokes a standard operation p of an ADT instance A only if t currently holds a lock on operation p of A.

(C2) t locks operations only if t has never unlocked operations.

An execution that satisfies S2PL is a serializable execution [20] — i.e., it is equivalent to an execution in which no two transactions are interleaved. Therefore, a serializable execution in which all transactions are completed can be seen as an execution in which all transactions are executed atomically.

Example 4.1.3  Consider a transaction t that executes "f(s1, s2)" where f is the function shown in Figure 4.4, and s1, s2 are two different Sets. This transaction follows the S2PL rules.

The S2PL protocol enables substantial parallelism. Consider two transactions t and t' that execute "f(s1, s2)". In this case, all operations locked by t commute with all operations locked by t' (even though they work on the same ADT instances). Hence, it is possible for the two transactions to run in parallel, while guaranteeing serializability.

Notice that the first condition of the S2PL protocol is equivalent to the client protocol from Chapter 3 (when considering a single ADT instance).

The OS2PL Protocol  The S2PL protocol does not guarantee deadlock-freedom — in order to avoid deadlocks we use the Ordered S2PL Protocol (OS2PL).
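Before turning to the ordered variant, the two S2PL conditions can be made concrete with a small per-transaction runtime monitor. This is an illustrative sketch (not part of the compiler); it tracks the locks a transaction holds and flags a violation of C1 (invoking an operation that is not locked) or C2 (locking after an unlock):

```java
import java.util.HashSet;
import java.util.Set;

// Monitors a single transaction for the two S2PL conditions.
class S2PLMonitor {
    private final Set<String> held = new HashSet<>(); // "adtId:op" locks held
    private boolean hasUnlocked = false;

    // Returns false if locking now would violate C2.
    boolean lock(String adtId, String op) {
        if (hasUnlocked) return false;       // C2: no lock after an unlock
        held.add(adtId + ":" + op);
        return true;
    }

    // Returns false if invoking (adtId, op) now would violate C1.
    boolean invoke(String adtId, String op) {
        return held.contains(adtId + ":" + op);
    }

    // Models unlockAll() on one ADT instance.
    void unlockAll(String adtId) {
        held.removeIf(k -> k.startsWith(adtId + ":"));
        hasUnlocked = true;
    }
}
```

The transaction of Example 4.1.3 passes this monitor: it locks, invokes, and only then starts unlocking.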
We say that an execution follows the OS2PL protocol if the execution follows S2PL and satisfies the following additional condition:

(C3) There exists an irreflexive and transitive relation ⊏ on ADT instances such that if a transaction t locks operations of ADT instance A after it locks operations of ADT instance A', then A' ⊏ A.

The rule requires that ADT operations be locked according to a consistent order on the ADT instances. Notice that A and A' may represent the same ADT instance in the above rule. Hence, the rule implies that a transaction should not invoke multiple locking operations on the same ADT instance. Following this rule ensures that an execution cannot reach a deadlock caused by the locking provided by the ADTs.

4.2 Automatic Atomicity Enforcement

In this section, we present an algorithm for compiling atomic sections. The algorithm inserts semantic-locking operations into the atomic section to ensure that every transaction follows Ordered S2PL. This algorithm uses only the locking operations lockAll() and unlockAll() of each ADT. Figure 4.15 shows the code produced by our algorithm for the atomic section presented in Figure 4.1. In essence, this algorithm uses a locking granularity at the ADT-instance level: two transactions cannot concurrently invoke operations on the same ADT. In Section 4.3, we present a refined algorithm that exploits the specialized locking operations of the ADTs (such as lockAdd() and lockValue(7)), permitting more fine-grained concurrency.

The algorithm we describe here is a simple one, whose correctness is easy to establish. We improve the results produced by this simple algorithm using a series of optimizations. In the sequel, we say that an ADT instance A is locked by transaction t if the operations of A are locked by t.

4.2.1 Enforcing S2PL

Ensuring that all transactions follow S2PL is relatively straightforward. For every statement x.f(...)
in the atomic section that invokes an ADT method, we insert code, just before the statement, to lock the ADT instance that x points to, unless it has already been locked by the current transaction. At the end of the atomic section, we insert code to unlock all ADT instances that have been locked by the transaction. We achieve this as follows.

Locked Objects  Each transaction uses a private set, denoted LOCAL_SET, to keep track of all ADT instances that it has currently locked. This set is used to avoid locking the same ADT multiple times and to make sure that all ADTs are eventually unlocked.

Prologue and Epilogue  At the beginning of each atomic section, we add code that initializes LOCAL_SET to be empty. At the end of each atomic section, we add code that iterates over all ADTs in the LOCAL_SET and invokes their unlockAll operations.

Locking Code  We utilize the macro LV(x) shown in Figure 4.5 to lock the ADT instance pointed to by a variable x. The macro locks the object pointed to by x and adds it to LOCAL_SET. It has no effect when x is null or points to an object that has already been locked by the current transaction. Figure 4.6 shows an example of an atomic section with inserted locking code that ensures the S2PL protocol.

1 LV(x) {
2   if(x!=null && !LOCAL_SET.contains(x)) {
3     x.lockAll();
4     LOCAL_SET.add(x);
5 }}

Figure 4.5: Code macro with the locking code.

void f(Set x, Set y) {
  atomic {
    LOCAL_SET.init();                      // prologue
    LV(x); int i = x.size();
    LV(y); y.add(i);
    foreach(t : LOCAL_SET) t.unlockAll();  // epilogue
  }
}

Figure 4.6: Atomic section that follows the S2PL protocol.

4.2.2 Lock Ordering Constraints

The basic idea sketched above does not ensure that all transactions lock ADTs in a consistent order. Hence, it is possible for the transactions to deadlock.
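The LV(x) macro of Figure 4.5 and the per-transaction LOCAL_SET translate directly into a thread-local helper. In this sketch, Lockable is a hypothetical interface standing in for the lockAll()/unlockAll() synchronization API of the ADTs, and the method names (prologue, lv, epilogue) are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

interface Lockable {
    void lockAll();
    void unlockAll();
}

class TxnLocks {
    // One lock set per thread, i.e., per running transaction.
    private static final ThreadLocal<Set<Lockable>> LOCAL_SET =
            ThreadLocal.withInitial(HashSet::new);

    // The prologue: LOCAL_SET.init().
    static void prologue() { LOCAL_SET.get().clear(); }

    // The LV(x) macro of Figure 4.5: lock x at most once per transaction;
    // null pointers and already-locked objects are ignored.
    static void lv(Lockable x) {
        Set<Lockable> s = LOCAL_SET.get();
        if (x != null && !s.contains(x)) {
            x.lockAll();
            s.add(x);
        }
    }

    // The epilogue: unlock everything the transaction locked.
    static void epilogue() {
        for (Lockable t : LOCAL_SET.get()) t.unlockAll();
        LOCAL_SET.get().clear();
    }
}

// A counting stub used to observe the lock/unlock calls.
class CountingAdt implements Lockable {
    int locks, unlocks;
    public void lockAll() { locks++; }
    public void unlockAll() { unlocks++; }
}
```

Calling lv twice on the same object produces a single lockAll() call, matching the guard in Figure 4.5.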
We now describe an extension of the algorithm that statically identifies a suitable ordering on ADT instances and then inserts locking code to ensure that ADT instances are locked in this order. We first describe the restrictions-graph, a data structure that captures constraints on the order in which ADT instances can be locked. We utilize this graph to determine the order in which the locking operations are invoked.

1 void g(Map m, int key1, int key2, Queue q) {
2   atomic {
3     Set s1 = m.get(key1);
4     Set s2 = m.get(key2);
5     if(s1!=null && s2!=null) {
6       s1.add(1);
7       s2.add(2);
8       q.enqueue(s1);
9     }
10 }}

Figure 4.7: Atomic section that manipulates a Map, a Queue, and two Sets.

A Static Finite Abstraction  At runtime, an execution of the client program can create an unbounded number of ADT instances. Our algorithm is parameterized by a static finite abstraction of the set of ADT instances that the client program can create at runtime. Let PVar denote the set of pointer variables that appear in the atomic code sections. The abstraction consists of an equivalence relation on PVar. For any x ∈ PVar, let [x] denote the equivalence class that x belongs to. The semantic guarantees provided by the abstraction are as follows. Any ADT instance created by any execution corresponds to exactly one of the equivalence classes. Furthermore, at any point during an execution, any variable x is guaranteed to be either null or point to an object represented by the equivalence class [x].

Note that the abstraction required above can be computed using any pointer analysis (e.g., see [65]) or simply using the static types of pointer variables. In our compiler, we utilize the points-to analysis of [12] to compute this information. Note that even though various pointer analyses give more precise information than that captured by the above abstraction, our implementation requires only this information.
Example 4.2.1  The atomic section in Figure 4.7 has four pointer variables (m, q, s1 and s2). The equivalence relation consisting of the three equivalence classes {m}, {q} and {s1, s2} is a correct abstraction for this atomic section. This abstraction can be produced using static type information, without the need for a whole-program analysis.

The Restrictions-Graph  Each node of the restrictions-graph represents an equivalence class in PVar (and, hence, is a static representation of a set of ADT instances that may be created at runtime). An edge u → v in the restrictions-graph conservatively indicates the possibility of an execution path along which an ADT instance belonging to u must be locked before an ADT instance belonging to v (within the same transaction). We identify these constraint edges as follows.

Figure 4.8: Restrictions-graph for the atomic section in Figure 4.7. [The graph has the nodes {m}, {q} and {s1,s2}, with a single edge {m} → {s1,s2}.]

1 atomic {
2   sum=0;
3   for(int i=0;i<n;i++) {
4     set = map.get(i);
5     if(set!=null) sum += set.size();
6 }}

Figure 4.9: Atomic section for which the restrictions-graph has a cycle.

We write l: x.f(...) to denote a call to a method f via the variable x ∈ PVar at the program location l. Consider an atomic section with two calls l: x.f(...) and l': x'.f'(...) such that location l' is reachable from location l (in the CFG of the atomic section). Obviously, we need to lock the object pointed to by x before location l and to lock the object pointed to by x' before location l'. Clearly, we can lock the object pointed to by x before we lock the object pointed to by x'. However, when can we lock these two objects the other way around? If the value of x' is never changed along the path between l and l', then the object pointed to by x' can be locked before l.
However, if x' is assigned a value along the path between l and l', then, in general, we may not be able to lock the object pointed to by x' (at location l') before the object pointed to by x, as we may not know the identity of the object to be locked. In such a case, we conservatively add an edge [x] → [x'] to the restrictions-graph.

Example 4.2.2  In Figure 4.7, the object pointed to by m should be locked before the object pointed to by s1, because the call s1.add(1) can only be executed after the call m.get(key1), and the value of s1 is changed between these calls.

Example 4.2.3  Figure 4.8 shows a restrictions-graph for the atomic section in Figure 4.7. According to this graph, the objects pointed to by m should be locked before objects pointed to by s1 or s2. This is the only restriction in the graph. For example, the graph does not restrict the order between the objects pointed to by m and the objects pointed to by q. Moreover, it does not restrict the order between objects pointed to by s1 and the objects pointed to by s2 (even though they are represented by the same node).

The calls in l and l' can be the same call (i.e., l = l'). This is demonstrated in the atomic section of Figure 4.9: the call set.size() is reachable from itself (because of the loop), and set can be changed between two invocations of this call. A possible restrictions-graph is shown in Figure 4.10.

Figure 4.10: Restrictions-graph for the atomic section in Figure 4.9. [The graph has the nodes {map} and {set}, with an edge {map} → {set} and a self-loop on {set}.]

Figure 4.11: Restrictions-graph for two atomic sections: the section in Figure 4.1 and the section in Figure 4.7. [The graph has the nodes {m}, {map}, {q,queue} and {s1,s2,set}; the objects of {m} and of {map} must be locked before those of {s1,s2,set}.]

The restrictions-graph is computed for a set of atomic sections. We write G(S) to denote the restrictions-graph for a set S of atomic sections. Figure 4.11 shows a restrictions-graph for the atomic sections from Figure 4.1 and Figure 4.7.
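A restrictions-graph, together with the acyclicity test required by the enforcement algorithm of the next section, can be sketched as follows. This is an illustrative data structure (not the compiler's internal representation); nodes are the equivalence classes, written here as strings:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RestrictionsGraph {
    private final Map<String, Set<String>> succ = new HashMap<>();

    // Record that instances of u may have to be locked before instances of v.
    void addEdge(String u, String v) {
        succ.computeIfAbsent(u, k -> new HashSet<>()).add(v);
        succ.computeIfAbsent(v, k -> new HashSet<>());
    }

    // Standard DFS cycle check; the technique of Section 4.2.3 is
    // applicable only when the graph is acyclic.
    boolean acyclic() {
        Map<String, Integer> state = new HashMap<>(); // 0 new, 1 on stack, 2 done
        for (String n : succ.keySet())
            if (state.getOrDefault(n, 0) == 0 && cycleFrom(n, state))
                return false;
        return true;
    }

    private boolean cycleFrom(String n, Map<String, Integer> state) {
        state.put(n, 1);
        for (String m : succ.get(n)) {
            int s = state.getOrDefault(m, 0);
            if (s == 1 || (s == 0 && cycleFrom(m, state))) return true;
        }
        state.put(n, 2);
        return false;
    }
}
```

The graph of Figure 4.8 is acyclic, whereas the self-loop created by the loop in Figure 4.9 makes the graph of Figure 4.10 cyclic.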
4.2.3 Enforcing OS2PL on Acyclic Graphs

We now describe an algorithm to insert locking code into a set of atomic sections S to ensure that all transactions (from S) follow the OS2PL protocol. This technique is applicable as long as the restrictions-graph G(S) is acyclic. In Section 4.2.5, we show how we handle programs with cyclic restrictions-graphs.

We sort the nodes in G(S) using a topological sort. This determines a total order ≤ts on the equivalence classes. We define the relations < and ≤ on the variables in PVar as follows. We say that x < y iff [x] <ts [y], and that x ≤ y iff [x] ≤ts [y]. Note that ≤ is only a preorder on PVar and not a total order. Variables belonging to different equivalence classes are always ordered by <, whereas variables belonging to the same equivalence class are never ordered by <. The relation < is used to statically determine the order in which ADT instances belonging to different equivalence classes are to be locked. However, we cannot do the same for ADT instances belonging to the same equivalence class. Instead, we dynamically determine the order in which ADT instances belonging to the same equivalence class are locked, as described below.

Locking Code Insertion  Consider any statement l: x.f(...) in an atomic section that invokes an ADT method. We define the set LS(l) to be the set of variables y such that

1. y ≤ x, and

2. there exists a (feasible) path, within the same atomic section, from l to some statement of the form l': y.g(...), i.e., a statement that invokes an ADT method using y.

1 LV2(x,y) {
2   if(unique(x)<unique(y)) {
3     LV(x); LV(y);
4   } else {
5     LV(y); LV(x);
6 }}

Figure 4.12: Locking two equivalent variables in a unique order.
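The unique(x) identifiers used by LV2 can be approximated in Java with System.identityHashCode, standing in for "a unique identifier such as the memory address". This is a sketch only: identity hash codes can in principle collide, so a real implementation would need a genuinely unique id. The point is that both argument orders yield the same acquisition sequence:

```java
// Arrange two objects of the same equivalence class in the global
// dynamic locking order, as LV2 does in Figure 4.12.
class LockOrder {
    static int unique(Object x) { return System.identityHashCode(x); }

    static Object[] ordered(Object a, Object b) {
        return unique(a) < unique(b) ? new Object[]{a, b}
                                     : new Object[]{b, a};
    }
}
```

Two transactions calling LV2 with the arguments swapped therefore lock the two objects in the same order, which is what rule C3 requires within an equivalence class.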
The set LS(l) identifies the variables that we wish to lock before statement l. We insert locking code for all variables in this set as follows:

• If y < y', then the locking code for y is inserted before the locking code for y'.

• If y and y' are in the same equivalence class, the locking order is determined dynamically (since we do not calculate a static order for such variables). This is done by using unique object identifiers (e.g., their memory addresses). Let unique(y) denote the unique identifier of the object pointed to by y. These identifiers are used by the inserted code to (dynamically) determine the order in which the variables are handled. Figure 4.12 demonstrates the case of two variables; in the general case, objects are sorted to obtain the proper order. (For simplicity, we have omitted the handling of null pointers, which are straightforward to handle.)

Figure 4.13 shows the atomic section of Figure 4.7 with the inserted code, and Figure 4.14 shows the atomic section of Figure 4.1 with the inserted code. For both atomic sections, we used the graph in Figure 4.11.

1 void g(Map m, int key1, int key2, Queue q) {
2   atomic {
3     LOCAL_SET.init(); // prologue
4     LV(m); Set s1 = m.get(key1);
5     LV(m); Set s2 = m.get(key2);
6     if(s1!=null && s2!=null) {
7       LV2(s1,s2); s1.add(1);
8       LV(s2); s2.add(2);
9       LV(q); q.enqueue(s1);
10    }
11    foreach(t : LOCAL_SET) t.unlockAll(); // epilogue
12 }}

Figure 4.13: The atomic section from Figure 4.7 with the non-optimized locking code. The locking was created by using the order: m < s1,s2 < q.

1 atomic {
2   LOCAL_SET.init(); // prologue
3   LV(map); set=map.get(id);
4   if(set==null) {
5     set=new Set(); LV(map); map.put(id, set);
6   }
7   LV(map); LV(set); set.add(x);
8   if(flag) {
9     LV(map); LV(queue); queue.enqueue(set);
10    LV(map); map.remove(id);
11  }
12  foreach(t : LOCAL_SET) t.unlockAll(); // epilogue
13 }

Figure 4.14:
The atomic section from Figure 4.1 with the non-optimized locking code. The locking was created by using the order: map < set < queue.

4.2.4 Optimizations

The algorithm described above is simplistic in some respects. In this section we present a sequence of code transformations whose goal is to reduce the overhead of the synthesized code and to increase the parallelism permitted by the concurrency control. In particular, we present transformations that remove inserted code that can be shown to be redundant, and transformations that move calls to unlockAll so as to release locks on objects as early as possible (as determined by a static analysis). Figure 4.15 shows the optimized version of Figure 4.14. In the sequel, we use the term "path" to denote feasible execution paths within a single atomic section.

Removing Redundant LV(x)  In some cases, the code LV(x) inserted at a location l may be redundant. Our compiler removes redundant instances of LV(x) by repeatedly using the following rules:

• If the object pointed to by x at l is locked along all (feasible) paths from the beginning of the atomic section to l, then the code LV(x) has no effect at l and can be removed. For example, in Figure 4.14, LV(map) can be removed from line 9 because the Map has already been locked.

• If the object pointed to by x at l is never used along any feasible path from l to the end of the atomic section, then the code LV(x) is not required at l and can be removed.

Figure 4.16 shows the code from Figure 4.14 after removing redundant instances of LV(x).

Removing Redundant LOCAL_SET Usage  Our algorithm uses LOCAL_SET to avoid locking the same object multiple times and to ensure that all objects are unlocked before the end of the atomic section.
1 atomic {
2   map.lockAll(); set=map.get(id);
3   if(set==null) {
4     set=new Set(); map.put(id, set);
5   }
6   set.lockAll(); set.add(x);
7   if(flag) {
8     queue.lockAll(); queue.enqueue(set); queue.unlockAll();
9     map.remove(id);
10  }
11  map.unlockAll(); set.unlockAll();
12 }

Figure 4.15: The optimized version of Figure 4.14. Note that a large portion of the locking code is removed, and the set LOCAL_SET is not explicitly used. Also, the object pointed to by queue is unlocked before the end of the section.

1 atomic {
2   LOCAL_SET.init(); // prologue
3   LV(map); set=map.get(id);
4   if(set==null) {
5     set=new Set(); map.put(id, set);
6   }
7   LV(set); set.add(x);
8   if(flag) {
9     LV(queue); queue.enqueue(set);
10    map.remove(id);
11  }
12  foreach(t : LOCAL_SET) t.unlockAll(); // epilogue
13 }

Figure 4.16: The code from Figure 4.14 after removing redundant instances of LV(x).

Often, we can achieve these goals without using LOCAL_SET. LOCAL_SET is not needed for a pointer variable x if the following conditions hold:

(1) There is no path containing an occurrence of LV(x) and another occurrence of LV(y) where x and y may point to the same object.

(2) The value of x is never modified along any path from an occurrence of LV(x) to the end of the atomic section.

(3) The value of x is null at the end of any path to the end of the atomic section that contains no occurrence of LV(x).

1 atomic {
2   if(map!=null) map.lockAll(); set=map.get(id);
3   if(set==null) {
4     set=new Set(); map.put(id, set);
5   }
6   if(set!=null) set.lockAll(); set.add(x);
7   if(flag) {
8     if(queue!=null) queue.lockAll(); queue.enqueue(set);
9     map.remove(id);
10  }
11  if(map!=null) map.unlockAll();
12  if(set!=null) set.unlockAll();
13  if(queue!=null) queue.unlockAll();
14 }

Figure 4.17: The code from Figure 4.16 after removing the code that uses LOCAL_SET.
Because of (1), we know that re-locking is not possible; and because of (2) and (3), we know that we can release all the objects used via x by calling "if(x!=null) x.unlockAll()" at the end of the atomic section. If the conditions are satisfied for x, we replace all instances of LV(x) with

  if(x!=null) x.lockAll()

and, at the end of the section, we insert the code

  if(x!=null) x.unlockAll()

If, after all applications of the above transformation, the set LOCAL_SET is not used for any variable in an atomic section, we remove the set and the corresponding prologue and epilogue. Figure 4.17 shows the code from Figure 4.16 after applying this optimization.

Early Lock Release  Our basic algorithm unlocks all objects at the end of the atomic section. In some cases, it is possible to unlock some objects at an earlier program point (before the end of the atomic section) without violating the locking protocol. We now describe the conditions under which we perform such an early lock release. It is safe to move any instance of "if(x!=null) x.unlockAll()" occurring at the end of the atomic section to a program point l if the following conditions are satisfied:

(1) The object pointed to by x is not used between l and the end of the atomic section.

(2) No object is locked between l and the end of the atomic section.

(3) Every path to the end of the atomic section passes through l or ends with a null value for x.

1 atomic {
2   if(map!=null) map.lockAll(); set=map.get(id);
3   if(set==null) {
4     set=new Set(); map.put(id, set);
5   }
6   if(set!=null) set.lockAll(); set.add(x);
7   if(flag) {
8     if(queue!=null) queue.lockAll(); queue.enqueue(set);
9     if(queue!=null) queue.unlockAll(); // this line was moved
10    map.remove(id);
11  }
12  if(map!=null) map.unlockAll();
13  if(set!=null) set.unlockAll();
14 }

Figure 4.18: The code from Figure 4.17 after moving an unlockAll operation.
The compiler tries to find the earliest program point l (as measured by the length of the shortest path from the beginning of the atomic section to l) that satisfies these conditions. If such a point l is found, we move the code "if(x!=null) x.unlockAll()" to location l. Because of (1) and (2), we know that the protocol rules are not violated. Because of (3), we know that the relevant object will eventually be unlocked. Figure 4.18 shows the code from Figure 4.17 after applying this optimization. Note that the unlocking code of "queue" has been moved to line 9 of Figure 4.18.

Removing Redundant If-Statements  In some cases, the inserted if-condition "if(x!=null)" is not needed. For any location l and variable x for which we can prove that x is never null at l, we remove the condition "if(x!=null)" from l. Figure 4.15 shows the code from Figure 4.18 after applying this optimization.

Choosing a Good Locking-Order  When sorting the acyclic restrictions-graph using a topological sort (Section 4.2.3), several orders are possible. The chosen order may affect the generated locking. For example, for Figure 4.1's code, if "queue" precedes "map" then the Queue will be locked before the Map — as a result, it may act as a global lock. In order to find a good ordering, we consider several possible orders of the restrictions-graph (using a variant of the topological sort algorithm in [83]). Given an order, we count the number of variables y for which we insert locking code for y (e.g., LV(y)) before a statement x.f(...) where y ≢ x (in this case y < x). Intuitively, this number represents the number of "early lockings". We choose a locking-order for which this number is minimal. Note that this is only a heuristic. User hints or dynamic analysis (as in Autotuner [49]) can potentially be used to find better orderings.
4.2.5 Handling Cycles via Coarse-Grained Locking

We now show how we can enforce atomicity and deadlock-freedom when the restrictions-graph G(S) has cycles. We say that an atomic section a is a serial-section if G({a}) is cyclic; Figure 4.9 shows an example of a serial-section.

The Idea

The idea is to find n+1 disjoint sets of atomic sections S1, ..., Sn and B such that S = S1 ∪ ... ∪ Sn ∪ B and, for every 1 ≤ i ≤ n, the graph G(Si) is acyclic. (Note that any serial-section must be included in the set B.) For each set Si we enforce the locking protocol by using the technique in the previous sections — hence, the protocol is independently enforced for every set Si. We further add coarse-grained synchronization to make sure that transactions executing atomic sections a and a′ are allowed to execute concurrently iff there exists Si such that a, a′ ∈ Si. This ensures that all concurrent transactions always lock ADT instances in the same order and, hence, follow OS2PL. Any atomic section in B is always executed in isolation (hence, all the serial-sections are executed in isolation).

Simplified Version via a Global Read/Write Lock

In our compiler, we have implemented a simplified version in which we only find two sets S1 and B (i.e., n = 1), and use the technique in Section 4.2.3 on the atomic sections in S1. If B is not empty, we use a global read/write lock (denoted by RW) as follows. At the beginning of an atomic section in S1, we acquire RW in read-mode; at the beginning of an atomic section in B, we acquire RW in write-mode. In both cases, we release RW at the end of the atomic section. This ensures that atomic sections from S1 can be executed in parallel (while following the protocol), and atomic sections from B cannot be executed in parallel.

Discussion

Note that with the above modification, the generated code does not follow OS2PL.
Instead, it follows a generalization of the OS2PL protocol that is sufficient to guarantee both serializability and deadlock-freedom.

void lockReadOnly();    @{(size),(contains,*)}
void lockAdd();         @{(add,*)}
void lockValue(int i);  @{(add,i),(remove,i),(contains,i)}

Figure 4.19: Locking methods from Figure 4.3 and their locking specifications. Each method is annotated with a symbolic set that describes the operations locked by the method.

4.3 Using Specialized Locking Operations

The algorithm presented in Section 4.2 utilizes only the simple locking operations lockAll() and unlockAll(). In this section, we describe how to extend the earlier algorithm to utilize more fine-grained locking operations, such as lockAdd() and lockValue(1), which can be used to lock a subset of an ADT's operations. For example, this algorithm generates the locking code shown in Figure 4.2 (instead of the locking code in Figure 4.15).

Annotation Language

We use the annotation language from Section 3.4.1. Each locking method p (except lockAll and unlockAll) is annotated with a symbolic set. This symbolic set defines the operations locked by an invocation of p (their meaning is identical to Section 3.4.1).

Example 4.3.1 The methods in Figure 4.19 are annotated with the following symbolic sets: SY1 = {(size), (contains, ∗)}, SY2 = {(add, ∗)}, and SY3 = {(add, i), (remove, i), (contains, i)}. The method lockReadOnly() is annotated with SY1; hence, the locking operation (lockReadOnly) locks all the operations with the methods size and contains. The method lockAdd() is annotated with SY2; hence, the locking operation (lockAdd) locks all the operations with the method add. The method lockValue(int i) is annotated with SY3; hence, for any integer v, the locking operation (lockValue,v) locks the operations (add,v), (remove,v), and (contains,v).
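For illustration only, the specifications of Figure 4.19 could be rendered as a Java annotation; the @Locks type below is our own invention, and the thesis itself uses the annotation language of Section 3.4.1 rather than Java annotations:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical rendering of Figure 4.19's locking specifications. The @Locks
// annotation and its string encoding of symbolic sets are illustrative only.
@Retention(RetentionPolicy.RUNTIME)
@interface Locks { String[] value(); }

interface LockableSet {
    @Locks({"(size)", "(contains,*)"})
    void lockReadOnly();                 // SY1: the read-only operations

    @Locks({"(add,*)"})
    void lockAdd();                      // SY2: add with any argument

    @Locks({"(add,i)", "(remove,i)", "(contains,i)"})
    void lockValue(int i);               // SY3: the operations on value i
}
```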
Inferring Calls to Locking Operations

Our algorithm for inferring the specialized locking operations to be inserted into an atomic section consists of multiple steps. In the first step, we analyze the atomic sections to infer, for every pointer variable x and code location l, a symbolic set SYx,l that conservatively describes the set of future ADT operations that may be invoked on the ADT instance that x points to. We use a simple backward analysis (abstract interpretation) to compute this information. As in Section 4.2, we do not distinguish between pointer variables belonging to the same equivalence class (i.e., the information computed is the same for all variables in the same equivalence class).

 1 void g(Map m, int key1, int key2, Queue q) {
 2   atomic {                      //{lockAdd()}
 3     Set s1 = m.get(key1);       //{lockAdd()}
 4     Set s2 = m.get(key2);       //{lockAdd()}
 5     if(s1!=null && s2!=null) {  //{lockAdd()}
 6       s1.add(1);                //{lockAdd(), lockValue(2)}
 7       s2.add(2);
 8       q.enqueue(s1);
 9     }
10 }}

Figure 4.20: The code section from Figure 4.7 annotated with the calls inferred for the variables s1 and s2.

In the second step, we use this information to identify a suitable semantic locking operation opx,l. The essential requirement is that the set of operations locked by operation opx,l should be a superset of the operations denoted by SYx,l. In general, there may be many locking operations that satisfy this constraint. Hence, in general, the algorithm may infer several candidates for opx,l, any one of which may be used. These two steps are equivalent to the static algorithm used in Section 3.4.2. But, in contrast to Section 3.4.2, in which all shared memory is treated as a single library, we separately repeat these steps for each equivalence class. Figure 4.20 illustrates the inferred candidate locking operations for the code section from Figure 4.7 for the single equivalence class consisting of the variables s1 and s2.
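The first step above, the backward future-use analysis, can be sketched under strong simplifications for a straight-line section. The {equivalence-class, operation} encoding is ours; the real analysis is an abstract interpretation over the control-flow graph:

```java
import java.util.*;

// Illustrative sketch of the backward pass (not the actual compiler analysis):
// walk a straight-line atomic section in reverse, accumulating for each
// location l the set of operations still to be invoked on a given equivalence
// class. Each statement is abstracted to a {equivalenceClass, operation} pair.
class FutureOpsAnalysis {
    static Map<Integer, Set<String>> futureOps(List<String[]> stmts, String eqClass) {
        Map<Integer, Set<String>> result = new HashMap<>();
        Set<String> future = new HashSet<>();
        // SY at location l contains every operation on eqClass invoked at l or later.
        for (int l = stmts.size() - 1; l >= 0; l--) {
            if (stmts.get(l)[0].equals(eqClass)) future.add(stmts.get(l)[1]);
            result.put(l, new HashSet<>(future));
        }
        return result;
    }
}
```

For the equivalence class {s1, s2} in Figure 4.20, every location up to the last add still sees a future add, which is why lockAdd() is a legal candidate at those locations.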
Note that two possible candidates are inferred for the location between line 6 and line 7. Another example is shown in Figure 4.21: this figure illustrates the inferred candidate locking operations for the code section from Figure 4.7 for the equivalence class consisting of the variable m. Finally, recall that the algorithm presented in Section 4.2 identifies a set of pairs (x, l) such that we insert a locking operation on variable x at program location l. We use the same algorithm now, except that we insert an invocation of the locking operation opx,l instead of lockAll(). Specifically, the locking code macro (Figure 4.5) is modified to take the locking operation to be invoked as an extra parameter. At every place where a call to this macro is inserted for variable x at location l (by Section 4.2's algorithm), we use the inferred call opx,l. For example, in Figure 4.7, for s1 at the location before line 6, we infer the call lockAdd(). Hence, we insert a (conditional) call to s1.lockAdd() at this point (instead of inserting a call to s1.lockAll()).

 1 void g(Map m, int key1, int key2, Queue q) {
 2   atomic {                      //{lockReadOnly()}
 3     Set s1 = m.get(key1);       //{lockReadOnly(), lockKey(key2)}
 4     Set s2 = m.get(key2);
 5     if(s1!=null && s2!=null) {
 6       s1.add(1);
 7       s2.add(2);
 8       q.enqueue(s1);
 9     }
10 }}

Figure 4.21: The code section from Figure 4.7 annotated with the calls inferred for the variable m. The Map ADT contains a locking method lockReadOnly() (which locks its read-only operations) and a locking method lockKey(int k) (which locks the Map operations on key k).

4.4 Implementing ADTs with Semantic Locking

An ADT with semantic locking can be implemented in several different ways. For example, an ADT with the API lockAll(), lockReadOnly(), unlockAll() can be implemented by using a single read-write lock in a straightforward manner¹.

Using the Technique Presented in Section 3.5.
The technique from Section 3.5 can be used to implement ADTs with semantic locking. In order to use this technique, we need to avoid cases in which two threads are simultaneously allowed to invoke non-commutative operations. So, we change the technique from Section 3.5 as follows:

• At the end of a locking operation m (i.e., at the end of a mayUse operation m), we insert code that sometimes blocks the current thread. Let n be the node which is owned by the current thread. We insert code (at the end of m) that waits until all nodes in the following set are unlocked: {n′ | n′ ≠ n ∧ n′ is reachable from n}. This ensures that different threads are never simultaneously allowed to invoke non-commutative operations.

• Since different threads are never simultaneously allowed to invoke non-commutative operations, the optimizations from Section 3.5.2 will not be useful. Hence, it is not required to insert code at the beginning of the standard (base) operations (as described in Section 3.5.2).

• Notice that the extensions from Section 3.5.3 and Section 3.5.4 can be used in a straightforward manner.

¹ Let RW be the read-write lock. lockAll() will lock RW in write-mode, lockReadOnly() will lock RW in read-mode, and unlockAll() will unlock RW.
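Footnote 1's read-write-lock scheme can be sketched in Java as follows; the class name is illustrative, and unlockAll() simply releases whichever mode the calling thread holds:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the footnote's scheme: lockAll() takes RW in write-mode,
// lockReadOnly() takes RW in read-mode, and unlockAll() releases the mode
// that the current thread actually holds. The class name is ours.
class SemanticLockingSet {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

    void lockAll()      { rw.writeLock().lock(); }  // excludes all other operations
    void lockReadOnly() { rw.readLock().lock(); }   // read-only ops may run in parallel

    void unlockAll() {
        if (rw.isWriteLockedByCurrentThread()) rw.writeLock().unlock();
        else rw.readLock().unlock();
    }

    boolean isWriteLocked() { return rw.isWriteLocked(); }
}
```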
[Figure 4.22 (charts): panels (a)–(d) show Graph workloads (70% find successors / 20% insert / 10% remove; 50% insert / 50% remove; 45% find successors / 45% find predecessors / 9% insert; 35% find successors / 35% find predecessors / 20% insert / 10% remove); panels (e)–(f) show Cache workloads (90% get / 10% put; sizes 50K and 5000K); series: Semantic-2PL, Global, 2PL, ForesightLib.]

Figure 4.22: Graph and Cache: throughput (operations/millisecond) as a function of the number of threads (1–16).

4.5 Performance Evaluation

In this section, we present an experimental evaluation of the approach presented in this chapter. We compare the performance of the following synchronization approaches:

1. the approach presented in this chapter (denoted by Semantic-2PL);
2. a single global lock (denoted by Global);
3. a standard deadlock-free two-phase locking which is created by only using the algorithm in Section 4.2 (denoted by 2PL)²;
4. the approach (and the implementation) presented in Chapter 3 (denoted by ForesightLib).

4.5.1 Benchmarks

We use benchmarks in which the atomic sections (composite operations) manipulate several ADTs. We use the 3 benchmarks from Section 3.6.3 (Graph, Cache, and GossipRouter). We also use a Java version of the Intruder benchmark [7, 26]. This is a multi-threaded application that emulates an algorithm for signature-based network intrusion detection [47]. For our study, we use its Java implementation from [7] in which atomic sections are already annotated.

² This represents an implementation of a common two-phase locking in which each ADT is associated with a single standard lock.
ADT Implementations

For the evaluation, we have created 3 types of Maps (a Standard Map, a Weak Map, and a Multi Map³) by changing the implementation discussed in Section 3.7 (as described in Section 4.4). Using the same technique, we have created a Set ADT (its synchronization API is similar to Figure 4.3) and a Queue ADT (which only supports lockAll and unlockAll). The Set and the Queue ADTs are only used in the Intruder benchmark. For each ADT, we have also created a simpler version that only supports lockAll and unlockAll (by using a simple Java lock) — this version is used for the realizations of the standard deadlock-free two-phase locking (2PL).

Methodology

We use the evaluation methodology and workloads described in Section 3.6.

4.5.2 Performance

Graph

The results for the graph benchmark are shown in Figure 4.22(a)–(d). For all workloads, Semantic-2PL is faster than ForesightLib. Both Semantic-2PL and ForesightLib outperform Global and 2PL. Interestingly, in this benchmark (and all benchmarks below), 2PL is even slower than Global; this can be explained by the overhead of the multiple locks used by the two-phase locking realization.

Tomcat's Cache

The results for the Cache benchmark are shown in Figure 4.22(e)–(f). In this benchmark, ForesightLib is much faster than the other synchronization approaches. ForesightLib is faster than Semantic-2PL because of the optimization in Section 3.5.2 (see Section 3.7.3) — without this optimization, ForesightLib is almost identical to Semantic-2PL.

GossipRouter

The results for the GossipRouter benchmark are shown in Figure 4.23. In these results, Semantic-2PL is slightly faster than ForesightLib. Both Semantic-2PL and ForesightLib outperform Global and 2PL. Interestingly, in this benchmark, Semantic-2PL is realized by (also) using coarse-grained synchronization (as discussed in Section 4.2.5).
In fact, some of the atomic sections are never run in parallel (our compiler identifies them as serial-sections). According to the results, our approach still provides scalable performance. This can be explained by the fact that the serial-sections are rarely executed.

Intruder

In the Intruder benchmark, we have used the workload which is represented by the configuration "-a 10 -l 256 -n 16384 -s 1" (see [7]). The results for this benchmark are shown in Figure 4.24. These results show that Semantic-2PL is much faster than the other synchronization approaches.

³ See Section 3.7.1.

[Figure 4.23 (chart): speedup for Semantic-2PL, Global, 2PL, and ForesightLib on 1–16 cores; 5000 messages per client, 16 clients.]

Figure 4.23: GossipRouter. Speedup over a single-core execution.

[Figure 4.24 (chart): speedup for Semantic-2PL, Global, 2PL, and ForesightLib on 1–16 threads.]

Figure 4.24: Intruder. Speedup over a single-threaded execution.

Since the Intruder benchmark uses Sets and Queues, the implementation from Section 3.7 is not applicable for this benchmark. Hence, in order to compare to ForesightLib, we have extended the implementation from Section 3.7 to support Sets and Queues. As shown in the results, this extension does not provide scalable performance. We have not found a way to extend Section 3.7's implementation in a way that provides better performance. This is because, for most of their execution time, the transactions may call operations of a global Queue which are not right-movers with respect to each other (this is similar to the Queue shown in Figure 4.1) — as a result, for most of their execution time, the transactions are not able to run in parallel.

Chapter 5

Related Work

5.1 Synchronization Protocols

Synchronization protocols are used in databases and shared memory systems to guarantee correctness of concurrently executing transactions [21, 86].
Many of them are based on locks (and are therefore sometimes called locking protocols).

Two-Phase Locking Protocol

A widely used locking protocol is the two-phase locking (2PL) protocol [38], which guarantees atomicity of transactions but does not guarantee deadlock-freedom. In the 2PL protocol, locking is done in two phases: in the first phase, locks are only allowed to be acquired (releasing locks is forbidden); in the second phase, locks are only allowed to be released. These restrictions require that locks are held until the final lock is obtained, thus preventing early release of a lock even when holding it is no longer required. This limits parallelism, especially in the presence of long transactions (e.g., a tree traversal must hold the lock on the root until the final node is reached). Semantic-aware variants of the 2PL protocol are also described in the literature (e.g., see [21, 86]): these variants are designed to exploit semantic properties of the shared state for the sake of improved performance. The approach described in Chapter 4 is based on such a variant of 2PL — in particular, the compiler described in Chapter 4 is the first general-purpose compiler (which does not rely on rollback mechanisms) that enforces a semantic-aware variant of the 2PL protocol. Several other existing approaches [51, 61, 62] are also based on similar variants of the 2PL protocol, but these approaches rely on speculative executions and rollback mechanisms.

Non-Two-Phase Locking Protocols

Other (non-2PL) locking protocols rely on the shape of the shared objects graph — each node of this graph represents a disjoint data item which may be concurrently accessed by several threads (or processes). Most non-two-phase protocols (e.g., [59, 80]) were designed for databases in which the shape of the shared objects graph does not change during a transaction, and thus are not suitable for more general cases with dynamically changing graphs.
[17, 28, 63] show non-2PL protocols that can handle dynamically changing graphs, but each one of these protocols is applicable to either a tree or a DAG. In contrast, our domination locking protocol (Chapter 2) is more general, and it does not explicitly require any specific shape. Interestingly, even when the shape of the shared graph is a tree or a DAG, the technique in Chapter 2 does not guarantee any protocol from [17, 28, 63]. Therefore, the domination locking protocol is still required in order to automatically handle libraries in which the shared memory is a single tree. Moreover, none of the existing techniques are able to enforce any protocol from [17, 28, 63] on dynamic data structures (like the data structures mentioned in Chapter 2). Hence, we believe that the technique in Chapter 2 is the first automatic technique to add non-speculative fine-grain synchronization to such dynamic data structures.

Commutativity and Movers

The approach in Chapter 3 is based on right-movers, whereas most synchronization approaches are based on commutativity. Indeed, [61] shows that many synchronization schemes can be based on either right-movers or left-movers — in [61], they use a variant of a "static" right-mover, which is a special case of our definition of a dynamic right-mover.

Locking Mechanisms

Locking mechanisms are widely used for synchronization; some of them utilize semantic properties of shared operations (e.g., [30, 60]). Usually, these mechanisms do not allow several threads to hold locks which correspond to non-commutative operations. An interesting locking mechanism is shared-ordered locking [13], which allows several threads to hold locks which correspond to non-commutative operations. Such locking mechanisms can be seen as special cases of libraries with foresight-based synchronization.

5.2 Concurrent Data Structures

Many sophisticated concurrent data structures (e.g., [4, 54, 70]) were developed and integrated into modern software libraries (e.g., see [8, 9, 11, 35]).
These data structures ensure atomicity of their basic operations, while hiding the complexity of synchronization inside their libraries. Unfortunately, as shown in [79], employing concurrent data structures in client code is error prone. The problem stems from the inability of concurrent data structures to ensure atomicity of client operations composed from several data structure operations. In Chapter 3 and Chapter 4, we focus on enabling efficient and automatic atomicity of client operations composed from several data structure operations. This prevents the errors reported in [79] without the need for the library to directly support composite operations, as suggested in [3].

5.3 Automatic Synchronization

Automatic Lock Inference

There has been a lot of work on inferring locks for implementing atomic sections. Most of the algorithms in the literature infer locks that follow the 2PL locking protocol [29, 31, 37, 44, 57, 68]. The algorithms in [37, 57, 68] employ a 2PL variant in which all locks are released at the end of a transaction. In these algorithms, deadlock is prevented by statically ordering locks and rejecting certain programs. The algorithms in [29, 44] use a 2PL variant in which all locks are acquired at the beginning of transactions and released at the end of transactions. In these algorithms, deadlock is prevented by using a customized locking protocol at the beginning of atomic sections. As described above, 2PL limits parallelism, as all locks must be held until the final lock is acquired.
Our algorithms for inferring mayUse operations (Chapter 3 and Chapter 4) are similar to these algorithms, but with the following differences: (i) we deal with mayUse operations, which can be seen as generalizations of lock-acquire and lock-release operations — this enables utilizing semantic properties of shared operations; (ii) lock inference algorithms usually need to consider the structure of a dynamically manipulated state; in Chapter 3, we avoid this by considering a single shared library that can be statically identified.

Deadlock Avoidance

Wang et al. [85] describe a static analysis and accompanying runtime instrumentation that eliminates the possibility of deadlock from multi-threaded programs using locks. Their tool adds additional locks that dominate any potential locking cycle, but it requires as a starting point a program that already has the locks necessary for atomicity.

Ownership Types

Boyapati et al. [22] describe an ownership type system that guarantees data race freedom and deadlock freedom, but not atomicity. Their approach can prevent deadlocks by relying on a partial order of objects, and it also permits dynamically changing this partial order. Interestingly, the domination locking protocol also relies on the intuition of dynamic ownership, where exposed objects dominate hidden objects.

Semantic Conflict Detection

In data-based approaches to conflict detection, a dependence is inferred between two transactions if they both access the same location and at least one of the accesses is a write — such approaches are often imprecise and can lead to spurious conflicts/dependences [51]. A semantics-based approach to identifying dependences/conflicts between transactions (e.g., one that identifies two high-level operations as commuting even though they access and modify the same data) can enable greater parallelism. This idea is quite old and was proposed early on for database transaction implementations (e.g., see [21, 77, 86]).
Similar ideas have also motivated the development of various software synchronization techniques (e.g., [51, 61, 62]) — all these approaches require the use of a rollback mechanism. The approaches in Chapter 3 and Chapter 4 are both semantics-based approaches that do not use any rollback mechanism. The approach in [51] is a semantics-based approach which utilizes a semantic-aware variant of the 2PL protocol. This approach is similar to the approach presented in Chapter 4. A notable difference between them is the way deadlocks are handled: in [51], deadlocks are dynamically detected and resolved by aborting transactions (and using a rollback mechanism), whereas in Chapter 4 the compiler statically ensures deadlock-freedom and never uses a rollback mechanism. In [51], an invocation of an operation p implicitly locks a set of operations (that contains p and an operation that cancels the effect of p), whereas in Chapter 4 a transaction is able to lock operations regardless of the operations that have already been invoked by this transaction — this enables a more versatile interaction between the automatic synchronization algorithm and the locking that is manually implemented inside the data structure; moreover, the data structure designer does not have to make sure that the effect of each operation can be canceled by invoking a single inverse operation. Additionally, in [51] all locks are released at the end of the transactions, while the algorithm in Chapter 4 permits unlocking operations at an earlier point of the transactions. In contrast to [51] and Chapter 4, which permit using several independent libraries (and data structures), in Chapter 3 the shared state has to be represented as a single data structure.
Moreover, the synchronization in Chapter 3 is practically implemented as a variant of the tree locking protocol [86]; this is in contrast to the existing semantics-based approaches, which are typically based on simple variants of the 2PL protocol.

Transactional Memory

Transactional memory approaches (TMs) dynamically resolve inconsistencies and deadlocks by rolling back partially completed transactions. The TM programming model can be implemented as an extension to the cache coherence protocol [53] or as a code transformation [52]. Preserving the ability to roll back requires that transactions be isolated from the rest of the system, which prohibits them from performing irrevocable actions such as I/O and operating-system calls. Software transactions are also prohibited from calling libraries that have not been transformed by the TM. Ad-hoc proposals for specific forms of I/O are present in many TMs [71], but in the general case at most one transaction at a time can safely perform an irrevocable action [87]. Rollback-free concurrency control schemes such as ours, in contrast, do not limit concurrent I/O (and other irrevocable actions). A rollback-free TM was recently proposed in [67]. But this TM does not permit concurrent execution of write transactions — write transactions are always executed sequentially, in a manner similar to [87]. Unfortunately, in spite of a lot of effort and many TM implementations (e.g., see [48]), existing TMs have not been widely adopted for practical concurrent programming [27, 34, 36, 69, 89].

Chapter 6

Conclusions and Future Work

In this thesis, we show three novel approaches for enforcing atomicity by using fine-grained synchronization. We show that our approaches are applicable to a selection of real-life concurrent programs. The approaches enable efficient and scalable synchronization by combining compilation techniques, run-time techniques, and libraries with specialized synchronization.
All our approaches utilize properties of the programs: in Chapter 2, the shape of shared memory is utilized, whereas the approaches in Chapter 3 and Chapter 4 utilize information about future invocations of library operations (foresight). In particular, we show that such preliminary information is useful for effective synchronization.

Generic Locks

In many cases, the combination of a lock (e.g., a read-write lock) and the memory protected by the lock can be seen as a library with foresight-based synchronization. Hence, in a sense, the synchronization presented in Chapter 3 generalizes the standard notion of a lock. In the future, it might be interesting to develop additional synchronization techniques that are based on such generic forms of locks.

Using Speculation

We showed approaches that avoid any speculation (i.e., rollbacks and aborts are never used). This indicates that non-speculative synchronization can be effective (despite the fact that many modern synchronization techniques are based on speculation). In the future, it might be interesting to investigate the applicability of our approaches to synchronization that utilizes speculation. One potential method to use speculation is to combine domination locking with optimistic synchronization for read-only transactions. Such a method can, for example, add version numbers to shared objects. Read-write transactions would synchronize between themselves via domination locking, without any rollbacks, whereas read-only transactions would use the version numbers to ensure consistent reads. Version numbers could either be managed locally, by incrementing them on each commit (e.g., as in [25]), or globally using some timestamp scheme (e.g., as in [32]). The local scheme would provide better scalability for writers, while the global scheme admits very efficient read-only transactions.
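As a rough illustration of the local version-number scheme (our own sketch, not an implementation from this thesis), a writer that already holds an object's lock bumps its version on commit, and a read-only transaction validates by re-reading the version and retrying on a mismatch:

```java
// Illustrative sketch of the proposed hybrid. Writers are assumed to run
// under domination locking (so at most one writer touches the node at a
// time); readers are optimistic and lock-free.
class VersionedNode {
    volatile long version;
    volatile int value;

    // Writer side: assumed to hold this node's lock via domination locking.
    void write(int v) {
        value = v;
        version++;        // committing bumps the node's local version
    }

    // Reader side: read the version, read the value, and validate that no
    // writer committed in between; retry otherwise.
    int readConsistent() {
        while (true) {
            long v = version;
            int val = value;
            if (version == v) return val;
        }
    }
}
```

A production-quality variant would also need to mark writes in progress (as in [25] and [32]); this sketch only shows the validate-and-retry idea.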
Contention management in such an approach would be easier than in purely optimistic schemes such as transactional memory [48], because read-only transactions can fall back to domination locking after experiencing too many rollbacks.

New Synchronization Protocols

The existence of novel synchronization protocols (e.g., in Chapter 2 and Chapter 3) indicates that, in spite of a lot of research on synchronization protocols (see, for example, [20, 86]), there is still room for new synchronization protocols.

Automatic Libraries with Foresight

In Chapter 3 and Chapter 4, internal library synchronization is implemented manually. It might be useful to develop static algorithms that automatically extend a given library implementation with internal synchronization. In [43], we show an example of a semi-automatic algorithm that produces the semantic locking discussed in Chapter 4.

Software Verification

Novel synchronization protocols can provide a basis for software verification. For example, [17] describes a verification technique based on special cases of domination locking (dynamic DAG and tree locking); by using domination locking, their analysis may be simplified and extended thanks to the weaker conditions of domination locking.

Our approaches and their realizations can be seen as a set of tools to deal with common programming scenarios in which effective synchronization is required. We believe that similar synchronization approaches are a promising research direction, in particular semi-automatic approaches which are based on combinations of compile-time and run-time techniques.

Bibliography

[1] http://wala.sourceforge.net.
[2] sourceforge.net/projects/tammi.
[3] gee.cs.oswego.edu/dl/jsr166/dist/jsr166edocs/jsr166e/ConcurrentHashMapV8.html.
[4] docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html.
[5] www.devdaily.com/java/jwarehouse/apache-tomcat-6.0.16/java/org/apache/el/util/ConcurrentCache.java.shtml.
[6] docs.oracle.com/javase/6/docs/api/java/util/WeakHashMap.html.
[7] sites.google.com/site/deucestm/.
[8] Guava libraries. code.google.com/p/guava-libraries/.
[9] Java API specification. http://docs.oracle.com/javase/7/docs/api/.
[10] JGroups toolkit. www.jgroups.org/index.html.
[11] libcds, concurrent data structure library. http://libcds.sourceforge.net/.
[12] WALA. http://wala.sourceforge.net.
[13] D. Agrawal and A. El Abbadi. Constrained shared locks for increasing concurrency in databases. In Selected papers of the ACM SIGMOD symposium on Principles of database systems, 1995.
[14] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and Inkeri Verkamo. Advances in knowledge discovery and data mining, chapter Fast discovery of association rules, pages 307–328. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.
[15] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006.
[16] C.R. Aragon and R.G. Seidel. Randomized search trees. Foundations of Computer Science, Annual IEEE Symposium on, 0:540–545, 1989.
[17] H. Attiya, G. Ramalingam, and N. Rinetzky. Sequential verification of serializability. In POPL ’10: Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 31–42, New York, NY, USA, 2010. ACM.
[18] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, December 1986.
[19] R. Bayer and M. Schkolnick. Concurrency of operations on B-Trees. Acta Informatica, 9:1–21, 1977.
[20] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[21] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[22] Chandrasekhar Boyapati, Robert Lee, and Martin Rinard.
Ownership types for safe programming: preventing data races and deadlocks. In Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA ’02, pages 211–230, New York, NY, USA, 2002. ACM.
[23] Peter Brass. Advanced Data Structures. Cambridge University Press, New York, NY, USA, 2008.
[24] Nathan Bronson. Composable Operations on High-Performance Concurrent Collections. PhD thesis, Stanford University, December 2011.
[25] Nathan Grasso Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. A practical concurrent binary search tree. In PPOPP, pages 257–268, 2010.
[26] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC, 2008.
[27] Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee. Software transactional memory: Why is it only a research toy? Queue, 6(5):46–58, September 2008.
[28] Vinay K. Chaudhri and Vassos Hadzilacos. Safe locking policies for dynamic databases. In PODS ’95: Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 233–244, New York, NY, USA, 1995. ACM.
[29] Sigmund Cherem, Trishul Chilimbi, and Sumit Gulwani. Inferring locks for atomic sections. In PLDI, 2008.
[30] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with readers and writers. Commun. ACM, 14(10), October 1971.
[31] Dave Cunningham, Khilan Gudka, and Susan Eisenbach. Keep off the grass: Locking the right path for atomicity. In CC, pages 276–290, 2008.
[32] David Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In DISC, pages 194–208, 2006.
[33] Simon Doherty, David L. Detlefs, Lindsay Groves, Christine H. Flood, Victor Luchangco, Paul A. Martin, Mark Moir, Nir Shavit, and Guy L. Steele, Jr. DCAS is not a silver bullet for nonblocking algorithm design.
In SPAA '04: Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 216–224, New York, NY, USA, 2004. ACM.
[34] Aleksandar Dragojević, Pascal Felber, Vincent Gramoli, and Rachid Guerraoui. Why STM can be more than a research toy. Commun. ACM, 54(4):70–77, April 2011.
[35] Joe Duffy. Concurrent Programming on Windows. Addison-Wesley, 2008.
[36] Joe Duffy. A (brief) retrospective on transactional memory, 2010. http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-transactional-memory/.
[37] Michael Emmi, Jeffrey S. Fischer, Ranjit Jhala, and Rupak Majumdar. Lock allocation. In POPL, pages 291–296, 2007.
[38] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a database system. Commun. ACM, 19:624–633, November 1976.
[39] Erich Gamma, Richard Helm, Ralph E. Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading, MA, 1995.
[40] Guy Golan-Gueta, Nathan Bronson, Alex Aiken, G. Ramalingam, Mooly Sagiv, and Eran Yahav. Automatic fine-grain locking using shape properties. In OOPSLA, 2011.
[41] Guy Golan-Gueta, G. Ramalingam, Mooly Sagiv, and Eran Yahav. Concurrent libraries with foresight. In PLDI, 2013.
[42] Guy Golan-Gueta, G. Ramalingam, Mooly Sagiv, and Eran Yahav. Automatic semantic locking. In PPoPP, 2014.
[43] Guy Golan-Gueta, G. Ramalingam, Mooly Sagiv, and Eran Yahav. Automatic scalable atomicity via semantic locking. In PPoPP, 2015.
[44] Khilan Gudka, Tim Harris, and Susan Eisenbach. Lock inference in the presence of large libraries. In ECOOP, pages 308–332. Springer, 2012.
[45] Rachid Guerraoui and Michal Kapalka. Principles of Transactional Memory. Synthesis Lectures on Distributed Computing Theory. Morgan & Claypool Publishers, 2010.
[46] Leo J. Guibas and Robert Sedgewick. A dichromatic framework for balanced trees.
In Proceedings of the 19th Annual Symposium on Foundations of Computer Science, pages 8–21, Washington, DC, USA, 1978. IEEE Computer Society.
[47] Bart Haagdorens, Tim Vermeiren, and Marnix Goossens. Improving the performance of signature-based network intrusion detection sensors by multi-threading. In Information Security Applications, pages 188–203, 2005.
[48] Tim Harris, James Larus, and Ravi Rajwar. Transactional Memory, 2nd edition. Synthesis Lectures on Computer Architecture, 5(1), 2010.
[49] Peter Hawkins, Alex Aiken, Kathleen Fisher, Martin Rinard, and Mooly Sagiv. Concurrent data representation synthesis. In PLDI, 2012.
[50] M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit. A provably correct scalable concurrent skip list. In OPODIS, 2006.
[51] Maurice Herlihy and Eric Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In PPoPP, pages 207–216, 2008.
[52] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software transactional memory for dynamic-sized data structures. In PODC, pages 92–101, 2003.
[53] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In ISCA, pages 289–300, 1993.
[54] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, February 2008.
[55] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, July 1990.
[56] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, 1990.
[57] Michael Hicks, Jeffrey S. Foster, and Polyvios Pratikakis. Lock inference for atomic sections. In Proceedings of the First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, June 2006.
[58] Guoliang Jin, Wei Zhang, Dongdong Deng, Ben Liblit, and Shan Lu.
Automated concurrency-bug fixing. In OSDI, 2012.
[59] Zvi M. Kedem and Abraham Silberschatz. A characterization of database graphs admitting a simple locking protocol. Acta Inf., 16:1–13, 1981.
[60] Henry F. Korth. Locking primitives in a database system. J. ACM, 30:55–79, January 1983.
[61] Eric Koskinen, Matthew Parkinson, and Maurice Herlihy. Coarse-grained transactions. In POPL, pages 19–30, 2010.
[62] Milind Kulkarni, Keshav Pingali, Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew. Optimistic parallelism requires abstractions. In PLDI, 2007.
[63] Vladimir Lanin and Dennis Shasha. Tree locking on changing trees. Technical report, 1990.
[64] T. Lev-Ami and M. Sagiv. TVLA: A framework for Kleene based static analysis. In Static Analysis Symposium (SAS), volume 1824 of Lecture Notes in Computer Science, pages 280–301. Springer-Verlag, 2000.
[65] Ondřej Lhoták and Laurie Hendren. Scaling Java points-to analysis using Spark. In Proceedings of the 12th International Conference on Compiler Construction, CC '03, 2003.
[66] Richard J. Lipton. Reduction: a method of proving properties of parallel programs. Commun. ACM, 18(12):717–721, December 1975.
[67] Alexander Matveev and Nir Shavit. Towards a fully pessimistic STM model. In TRANSACT, 2012.
[68] Bill McCloskey, Feng Zhou, David Gay, and Eric Brewer. Autolocker: synchronization inference for atomic sections. In POPL, pages 346–358, 2006.
[69] Paul E. McKenney. Is parallel programming hard, and, if so, what can you do about it? Linux Technology Center, IBM Beaverton, August 2012.
[70] Maged M. Michael and Michael L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In PODC, pages 267–275, 1996.
[71] J. Eliot B. Moss. Open nested transactions: Semantics and support. Poster at the 4th Workshop on Memory Performance Issues (WMPI-2006), February 2006.
[72] Ramanathan Narayanan, Berkin Özışıkyılmaz, Joseph Zambreno, Gokhan Memik, and Alok Choudhary.
MineBench: A benchmark suite for data mining workloads. In 2006 IEEE International Symposium on Workload Characterization, pages 182–188, 2006.
[73] Flemming Nielson, Hanne R. Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1999.
[74] Otto Nurmi and Eljas Soisalon-Soininen. Uncoupling updating and rebalancing in chromatic binary search trees. In Proceedings of the Tenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS '91, pages 192–198, New York, NY, USA, 1991. ACM.
[75] Christos H. Papadimitriou. The serializability of concurrent database updates. J. ACM, 26(4):631–653, 1979.
[76] M. Sagiv, T. Reps, and R. Wilhelm. Parametric shape analysis via 3-valued logic. ACM Trans. on Prog. Lang. and Systems (TOPLAS), 24(3):217–298, 2002.
[77] Peter M. Schwarz and Alfred Z. Spector. Synchronizing shared abstract types. ACM Trans. Comput. Syst., 2(3):223–250, August 1984.
[78] Ohad Shacham. Verifying Atomicity of Composed Concurrent Operations. PhD thesis, Tel Aviv University, 2012.
[79] Ohad Shacham, Nathan Bronson, Alex Aiken, Mooly Sagiv, Martin Vechev, and Eran Yahav. Testing atomicity of composed concurrent operations. In OOPSLA, 2011.
[80] A. Silberschatz and Z. M. Kedem. A family of locking protocols for database systems that are modeled by directed graphs. IEEE Transactions on Software Engineering, SE-8(6):558–562, November 1982.
[81] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting heaps. SIAM J. Comput., 15:52–69, February 1986.
[82] Alexandru Sălcianu and Martin Rinard. Purity and side effect analysis for Java programs. In VMCAI, pages 199–215, 2005.
[83] Yaakov L. Varol and Doron Rotem. An algorithm to generate all topological sorting arrangements. The Computer Journal, 24(1):83–84, 1981.
[84] Jons-Tobias Wamhoff, Christof Fetzer, Pascal Felber, Etienne Rivière, and Gilles Muller.
FastLane: Improving performance of software transactional memory for low thread counts. In PPoPP, 2013.
[85] Yin Wang, Stéphane Lafortune, Terence Kelly, Manjunath Kudlur, and Scott A. Mahlke. The theory of deadlock avoidance via discrete control. In POPL, pages 252–263, 2009.
[86] Gerhard Weikum and Gottfried Vossen. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
[87] Adam Welc, Bratin Saha, and Ali-Reza Adl-Tabatabai. Irrevocable transactions and their applications. In SPAA, pages 285–296, 2008.
[88] Hongseok Yang, Oukseh Lee, Josh Berdine, Cristiano Calcagno, Byron Cook, and Dino Distefano. Scalable shape analysis for systems code. In CAV, 2008.
[89] Richard M. Yoo, Yang Ni, Adam Welc, Bratin Saha, Ali-Reza Adl-Tabatabai, and Hsien-Hsin S. Lee. Kicking the tires of software transactional memory: Why the going gets tough. In SPAA, 2008.

Tel Aviv University
The Raymond and Beverly Sackler Faculty of Exact Sciences
School of Computer Science

Automatic Fine-Grained Synchronization of Concurrent Programs

Guy Golan Gueta

under the supervision of Prof. Mooly Sagiv and Prof. Eran Yahav
and the consultation of Dr. G. Ramalingam

A thesis submitted for the degree of Doctor of Philosophy
Submitted to the Senate of Tel Aviv University
April 2015

Abstract

One of the central challenges in writing concurrent software is synchronization: ensuring that concurrent accesses and updates to shared data do not collide with each other in ways that produce undesirable results. Atomicity is a central correctness property of synchronization. The atomicity property refers to code sections and states that, in every run of the program, these sections appear as if they executed atomically. In many cases, implementing synchronization that guarantees atomicity efficiently and scalably is considered hard.

In this thesis we address the problem of enforcing the atomicity property by constructing several synchronization protocols and developing techniques for enforcing them on code sections. The techniques we employ use static and dynamic information about the programs, and are based on applying compile-time techniques together with techniques that operate at run time. In particular, we present three approaches
for performing synchronization: (1) an approach that exploits the shape of the shared memory in order to turn a sequential library (one that contains no synchronization) into an atomic library (a library in which, from the point of view of its client, every operation appears to execute atomically). This approach is based on the Domination Locking protocol, a fine-grained synchronization protocol designed for synchronizing object-oriented software that performs manipulations of pointers to objects. We show a way to enforce Domination Locking in libraries in which the shared memory has the shape of a dynamic forest. (2) An approach that turns an atomic library into a transactional library. A transactional library is a library that makes it possible to guarantee atomicity of a sequence of operations (in contrast to an atomic library, which guarantees atomicity of individual operations only). The idea in this approach is to build a library that exploits information obtained from its clients about the way in which they are going to use it. This information restricts the cases the library must take into account, and thereby makes it possible to synchronize the sequences of operations effectively. This approach is based on a new synchronization protocol built around the notion of a dynamic right-mover. In addition, we show that in many cases occurring in practical Java programs, a compiler can automatically compute the information that the client needs to pass to the library. (3) An approach that makes it possible to work with several transactional libraries together. This approach applies to a special case of transactional libraries in which the synchronization is based on locks that exploit semantic properties of the library operations.

In the thesis we formulate the approaches formally and show that they guarantee atomicity without producing synchronization errors such as deadlock. In addition, we implement the approaches and show that they lead to efficient and scalable synchronization.

Extended Summary

Chapter 1 – Introduction

Concurrency is a common property of software systems because it makes it possible to shorten the response times of operations, to increase the throughput of operations, and to provide higher utilization of multicore machines. Nevertheless, writing effective concurrent software is considered a difficult and error-prone task, because of the need to take into account many subtle interactions between parts of the program that run in parallel.
Atomicity. Atomicity is a fundamental correctness property of code sections in concurrent programs. Intuitively, a code section S is atomic in a program if, for every run of the program, there exists a run (identical or different) with equivalent behavior in which the section S executes contiguously, without being interleaved with other parts of the program. In other words, an atomic code section is one that can be viewed as if it executes in isolation from the rest of the program. Atomic sections make concurrent programs easier to understand, since one may assume (for the purpose of program correctness) that these sections always execute contiguously (as if they run to completion without interruption).

The relevant literature (on database systems and shared-memory systems) defines and studies several different variants of atomicity, where each variant has its own semantic properties and targets a certain set of behaviors and considerations. For example, linearizability is a variant of the atomicity property used to describe implementations of shared libraries (and, equivalently, shared objects): intuitively, an operation is linearizable if it can be viewed as taking effect instantaneously at some point between the operation's invocation and its completion. Linearizability does not refer (directly) to the way the library is implemented; instead, it refers to the library's behavior from the point of view of its client.

The problem. In this thesis we address the enforcement of atomicity of code sections by automatically creating efficient and scalable synchronization. One of the central challenges here is to guarantee atomicity in a scalable way, restricting parallelism only where necessary. The synchronization must have low enough overhead to be worth using; that is, it must be more efficient than simple synchronization alternatives (such as a single global lock that enforces atomicity by preventing parallel execution of the atomic sections).
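To make the problem concrete, here is a small Java illustration (not taken from the thesis): each individual operation of a ConcurrentHashMap is atomic on its own, yet a composite get-then-put sequence loses updates under concurrency, and the simple global-lock fix serializes every composite operation in the program, which is exactly the kind of unscalable baseline discussed above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: each individual operation of a ConcurrentHashMap is
// atomic, but a composite operation built from several of them is not.
public class CompositeAtomicity {
    static final Map<String, Integer> map = new ConcurrentHashMap<>();

    // A composite operation: the get and the put are each atomic, yet
    // another thread may interleave between them, losing updates.
    static int incrementUnsynchronized(String key) {
        Integer old = map.get(key);
        int next = (old == null ? 0 : old) + 1;
        map.put(key, next);
        return next;
    }

    // The simple but unscalable alternative: one global lock (the class
    // monitor) that serializes every composite operation.
    static synchronized int incrementGlobalLock(String key) {
        return incrementUnsynchronized(key);
    }

    // Runs 4 threads x 10,000 global-lock increments; returns the final count.
    static int run() throws InterruptedException {
        map.clear();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) incrementGlobalLock("c");
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return map.get("c");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("final count = " + run()); // 40000 with the global lock
    }
}
```

Replacing `incrementGlobalLock` with `incrementUnsynchronized` in `run()` would typically yield a count below 40,000, because interleaved get/put pairs overwrite each other.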
Synchronization solutions for enforcing atomicity that are deployed in practice are usually written manually, specifically for particular programs; the problem with such solutions is that they tend to contain many bugs. Automatic synchronization techniques let the programmer mark the sections that need to be atomic, after which the compiler (in cooperation with the runtime environment) enforces their atomicity automatically. Despite the potential convenience of the existing automatic techniques, they are not in widespread use, for various reasons including high runtime overhead, poor performance, and limited ability to support irrevocable operations (such as I/O operations).

Customized synchronization. In this thesis we present several approaches to the automatic enforcement of atomicity, where each approach targets a certain class of programs. The idea is to produce synchronization that enforces atomicity by exploiting specific properties of the programs. For each approach we describe a dedicated synchronization protocol and show a way to guarantee it by combining compile-time and runtime techniques. The synchronization protocols are designed to guarantee atomicity efficiently and scalably without leading to synchronization errors such as deadlock, all without using speculation or rollback mechanisms.

Chapter 2 – Automatic Addition of Fine-Grained Locking

In Chapter 2 we present an approach that exploits the shape of the shared memory in order to turn a sequential library (one that contains no synchronization) into an atomic library (in particular, into a linearizable library). This approach is based on the paper "Automatic Fine-Grain Locking using Shape Properties," presented at OOPSLA 2011.

A library is a module that encapsulates shared memory together with a set of procedures that can be called by threads executing in parallel. Given the code of a library implementation, our goal is to automatically add fine-grained locking that guarantees that the code of each procedure is atomic while allowing a high degree of parallelism in the execution of the procedures. In particular, we are interested in synchronization in which every shared object has its own lock, and locks may be released before the procedures finish executing (an ability sometimes called early lock release). The central idea of this approach is to use the restricted shape of the object graph to enable the automatic creation of the fine-grained locking.
The Domination Locking protocol. The approach is based on a new fine-grained locking protocol called Domination Locking (DL for short). The DL protocol is a set of conditions that guarantee atomicity and deadlock freedom. It is designed to handle recursive data structures (with dynamic manipulations) by using properties of paths in such data structures.

The DL protocol distinguishes between two kinds of objects: exposed objects and hidden objects. Exposed objects serve as the interface between the library code and its client code (the code that uses the library); pointers to these objects pass between the library code and the client code. A common example of such an object is the root object of a data structure. Hidden objects, in contrast, are objects managed by the library whose existence is not exposed to the library's client; a common example is the objects below the root of the data structure. The protocol exploits the fact that every procedure must start from exposed objects (one or more) in order to traverse the object graph and reach hidden objects.

The protocol requires that the exposed objects passed as parameters to a procedure be locked in a manner similar to the two-phase locking protocol. Hidden objects, in contrast, are handled differently: a thread t may lock a hidden object u only if the objects whose locks t holds dominate u (we say that a set S of objects dominates an object u if every path in the object graph from an exposed object to u contains an object of S). In particular, hidden objects may be locked by a procedure even after it has released locks on other objects (in this way the protocol permits early release of locks). This protocol is a strict generalization of several related fine-grained locking protocols, such as dynamic tree locking and DAG locking.

Automatic enforcement of Domination Locking. In this chapter we present an automatic technique for enforcing the DL protocol on the code of a given sequential library. The technique is applicable to libraries in which the shared memory is an object graph that has the shape of a dynamic forest. The technique permits the shape of the object graph to change dynamically, as long as its shape is a forest between different invocations of the procedures (it suffices that this holds for the original library in runs without concurrency).
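As a concrete point of comparison, the classic hand-over-hand (lock-coupling) discipline that DL strictly generalizes can be sketched on a singly linked list. This sketch is illustrative, not the thesis's generated code: the sentinel head is the single exposed object, a hidden node is locked only while the lock on its unique predecessor is held (so the held locks always dominate the node being acquired), and predecessor locks are released early, before the operation finishes.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hand-over-hand (lock-coupling) traversal of a singly linked list: a
// simple instance of the discipline that Domination Locking generalizes.
public class HandOverHandList {
    static final class Node {
        final ReentrantLock lock = new ReentrantLock();
        final int key;
        Node next;
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    // The sentinel head is the single exposed object; all other nodes are hidden.
    private final Node head = new Node(Integer.MIN_VALUE, null);

    // Insert at the front: only the exposed sentinel needs to be locked.
    public void add(int key) {
        head.lock.lock();
        try { head.next = new Node(key, head.next); }
        finally { head.lock.unlock(); }
    }

    public boolean contains(int key) {
        Node pred = head;
        pred.lock.lock();
        Node curr = pred.next;
        while (curr != null) {
            curr.lock.lock();   // allowed: the held lock on pred dominates curr
            pred.lock.unlock(); // early lock release, before the operation ends
            if (curr.key == key) { curr.lock.unlock(); return true; }
            pred = curr;
            curr = curr.next;
        }
        pred.lock.unlock();
        return false;
    }

    public static void main(String[] args) {
        HandOverHandList list = new HandOverHandList();
        list.add(3);
        list.add(7);
        System.out.println(list.contains(7)); // true
        System.out.println(list.contains(4)); // false
    }
}
```

Because every lock is acquired while its unique predecessor is still held, concurrent traversals cannot pass each other, and no thread ever waits for a lock "behind" it, which is what rules out deadlock in this simple case.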
The technique implements the following locking scheme (described informally): at run time, a procedure holds a lock on each of the objects pointed to directly by its local variables (this set of objects is called the immediate scope). When an object leaves the immediate scope, the procedure must release the object's lock if the object has at most one predecessor in the object graph (that is, if it does not violate the forest shape of the object graph). If such an object has several predecessors, then we know that at some later point in the run this situation will change, and therefore the object will eventually be released (because the library has the shape of a forest at the end of a sequential run of a procedure). This locking scheme is implemented by a simple code transformation that adds reference counters to the library code.

Evaluation. We show that this technique adds fine-grained locking that leads to efficient and scalable synchronization for several data structures for which it is very hard to produce such locking manually. We demonstrate its applicability on two balanced trees (a treap and a red-black tree), a skew heap, and two implementations of data structures specialized for their applications.

Chapter 3 – Transactional Libraries

Linearizable libraries guarantee that their operations appear to execute atomically. Clients of libraries often need a sequence of several operations to appear to execute atomically (code that produces such a sequence of operations is called a composite operation). In Chapter 3 we address the extension of linearizable libraries into libraries that support atomicity of arbitrary composite operations (the code that determines the sequence of operations belongs to the library's client). We present a new approach in which a library guarantees atomicity of composite operations by exploiting information obtained from the client code. We call such libraries transactional libraries.

Our basic methodology requires the client to mark the code section of the composite operation for which atomicity is required, and to give the library declarative information about the library operations (procedures) it is going to use at the various points of the composite operation's code; this is information about the potential future operations that may be performed within the composite operation. The library is responsible for guaranteeing the atomicity of the composite operation, and it may exploit the information about future operations for effective synchronization.
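As a deliberately crude illustration of this methodology (the class and method names below are hypothetical, not the thesis's API, and the thesis's protocol is far finer-grained), a library can exploit even one bit of foresight — "may this composite operation write?" — to choose cheaper synchronization: composite operations declared read-only then run fully in parallel under a shared read lock, while potential writers are serialized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of foresight: the client declares, before the
// composite operation starts, which operations it may invoke; the library
// only has to be safe for the declared cases.
public class ForesightMap<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReadWriteLock rw = new ReentrantReadWriteLock();

    public final class Txn implements AutoCloseable {
        private final boolean mayWrite;
        Txn(boolean mayWrite) {
            this.mayWrite = mayWrite;
            // Foresight drives the choice of lock: readers share, writers serialize.
            if (mayWrite) rw.writeLock().lock(); else rw.readLock().lock();
        }
        public V get(K k) { return map.get(k); }
        public void put(K k, V v) {
            if (!mayWrite) throw new IllegalStateException("put was not declared in advance");
            map.put(k, v);
        }
        @Override public void close() {
            if (mayWrite) rw.writeLock().unlock(); else rw.readLock().unlock();
        }
    }

    // The client's foresight declaration for the whole composite operation.
    public Txn begin(boolean mayWrite) { return new Txn(mayWrite); }

    public static void main(String[] args) {
        ForesightMap<String, Integer> m = new ForesightMap<>();
        try (ForesightMap<String, Integer>.Txn t = m.begin(true)) {  // declares: may write
            t.put("a", 1);
        }
        try (ForesightMap<String, Integer>.Txn t = m.begin(false)) { // declares: read-only
            System.out.println(t.get("a")); // 1
        }
    }
}
```

The thesis's static analysis plays the role of inserting such declarations automatically into client code, so that the client programmer does not have to provide the foresight by hand.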
The approach presented in this chapter is based on the paper "Concurrent Libraries with Foresight," presented at PLDI 2013.

Synchronization based on information about the future. In this chapter we present a formalization of the approach. We formally state the goals of the approach and give sufficient correctness conditions for it. As long as the client and the library satisfy the conditions, all composite operations are guaranteed to execute atomically without leading to deadlock. Our correctness conditions are general and permit a large space of implementations. These conditions are based on the notion of a dynamic right-mover, which generalizes standard notions such as the static right-mover and commutativity. Our approach separates the library implementation from its client; hence, the client's correctness does not depend on the way the library implementation uses the information about the future. The client only needs to ensure that the information passed to the transactional library is correct.

Automatic inference. In addition, we present a static analysis that infers the information required by our approach. This analysis allows a compiler to automatically add, to the client code, calls that pass the required information about future operations to the library. This simplifies the task of writing composite operations.

An implementation technique. In this chapter we also present a general technique for extending existing linearizable libraries into transactional libraries. The technique is based on a new variation of the tree locking protocol in which the tree structure is determined by the semantics of the library operations.

We use this implementation technique to build a general-purpose Java library supporting several Map data structures. Our library permits composite operations that work simultaneously with several map instances (at this stage we focus on maps because prior work has shown that composite operations over maps are very common in the code of software systems).

Evaluation. We use the maps library and the static analysis to enforce atomicity of composite operations, including ones that work with several map instances. We show that our approach is applicable to a variety of composite operations found in (open-source) Java programs. In addition, we demonstrate the performance potential of the approach by showing that it provides efficient and scalable synchronization.
Chapter 4 – Composing Transactional Libraries Using Semantic Locking

In Chapter 4 we present an approach for handling composite operations that use several transactional libraries together. Part of the chapter is based on the short paper "Automatic Semantic Locking," presented at PPoPP 2014.

The presented approach combines the approach of Chapter 3 with known approaches for static inference of locks. In this approach, we restrict the synchronization that may be implemented inside the transactional libraries to synchronization with lock-like characteristics; this synchronization is similar to synchronization based on the semantic locks described in the database literature. In this chapter, the information about potential future operations is expressed by locking the set of operations that may be invoked (in this context, an operation is identified by a procedure together with the specific values passed to it as parameters). The synchronization in the transactional library must ensure that two composite operations never simultaneously hold locks on non-commutative operations.

In this chapter we describe a static algorithm that enforces atomicity of code sections that use several such transactional libraries. The algorithm is based on a semantic version of the two-phase locking protocol. We implement the static algorithm and show that it produces efficient and scalable synchronization.

Chapter 5 – Related Work

This chapter surveys relevant prior work and explains its connections to the material presented in this thesis.

Chapter 6 – Conclusions

In this thesis we present three new approaches for the automatic enforcement of atomicity by adding fine-grained synchronization to code sections. We show that the approaches are applicable to a variety of real-world concurrent programs. Our approaches produce efficient and scalable synchronization by combining compile-time techniques, runtime techniques, and libraries with customized synchronization.

All the approaches we present exploit semantic properties of programs: the approach of Chapter 2 exploits the shape of the shared memory, while the approaches of Chapters 3 and 4 exploit information about the potential future operations of the library invoked from its client code. In particular, we show that advance semantic information about programs is useful and enables effective synchronization.

In many cases, the combination of a lock (for example, a single-writer/multiple-readers lock) and the shared memory it protects can itself be viewed as a transactional library. Hence, in a certain sense, the synchronization presented in Chapter 3 generalizes the standard notion of a lock.
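The commutativity-based conflict rule underlying Chapter 4's semantic locking can be made concrete with a small sketch. The names and the commutativity test below are hypothetical illustrations, not the thesis's implementation: a composite operation locks, up front, the set of operations it may invoke, and two such sets may be held concurrently only if every pair of operations drawn from them commutes (here, map-style operations are assumed to commute when they are both reads or touch different keys).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of semantic locking: composite operations lock sets
// of potential operations, and conflict exactly when two held sets contain
// a pair of non-commutative operations.
public class SemanticLock {
    public static final class Op {
        final String proc; final Object key; final boolean isRead;
        public Op(String proc, Object key, boolean isRead) {
            this.proc = proc; this.key = key; this.isRead = isRead;
        }
        // Crude commutativity spec: reads always commute; otherwise the
        // operations commute only when they touch different keys.
        boolean commutesWith(Op o) {
            return (isRead && o.isRead) || !key.equals(o.key);
        }
    }

    private final Set<Set<Op>> held = new HashSet<>();

    // Non-blocking acquire: succeeds only if ops commutes with every held set.
    public synchronized boolean tryAcquire(Set<Op> ops) {
        if (conflicts(ops)) return false;
        held.add(ops);
        return true;
    }

    // Blocking acquire, in the style of semantic two-phase locking.
    public synchronized void acquire(Set<Op> ops) throws InterruptedException {
        while (conflicts(ops)) wait();
        held.add(ops);
    }

    public synchronized void release(Set<Op> ops) {
        held.remove(ops);
        notifyAll();
    }

    private boolean conflicts(Set<Op> ops) {
        for (Set<Op> h : held)
            for (Op a : h)
                for (Op b : ops)
                    if (!a.commutesWith(b)) return true;
        return false;
    }
}
```

In the two-phase style described above, a code section would acquire the operation sets of all the transactional libraries it uses before its first library call and release them only at the end, so that no two sections ever simultaneously hold non-commutative operations.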
In addition, our approaches are not based on speculative synchronization (in particular, they make no use of rollback mechanisms). This shows that synchronization without speculation can be effective, even though many modern synchronization techniques are based on speculation. In the future, it may be interesting to explore the applicability and benefit of our ideas in combination with the use of speculation.