Combining Techniques Application for Tree Search Structures
RAYMOND AND BEVERLY SACKLER FACULTY OF EXACT SCIENCES
BLAVATNIK SCHOOL OF COMPUTER SCIENCE

Thesis submitted in partial fulfillment of the requirements for the M.Sc. degree in the School of Computer Science, Tel-Aviv University

by Vladimir Budovsky

The research work for this thesis has been carried out at Tel-Aviv University under the supervision of Prof. Yehuda Afek and Prof. Nir Shavit

June 2010

CONTENTS

1. Introduction
   1.1 Flat Combining
   1.2 Skip Lists
2. The Flat Combined Skip Lists
   2.1 Naive Flat Combined Skip List
   2.2 Flat Combined Skip List with Multiple Combiners
   2.3 Flat Combined Skip List with "Hints"
3. Performance
   3.1 Performance Comparison of Flat Combined Skip Lists vs JDK ConcurrentSkipListSet
   3.2 Flat Combining Mechanism Experimental Verifications
4. Conclusions

LIST OF FIGURES

1.1 Skip list of height 4. May be considered either as a collection of "fat" nodes or as a 2-d list
1.2 Skip list traversal with key 12. Traversed predecessors are shown; start level is 3.
2.1 Multi-combiner skip list. Every node with height ≥ 3 is a combiner node
3.1 Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.2 Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.3 Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.4 Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.5 FC skip list implementation vs multi-lock one, naive implementations, uniform keys distribution
3.6 FC skip list implementation vs multi-lock one, naive implementations, high access locality
3.7 FC skip list implementation vs multi-lock one, hints implementations, uniform keys distribution
3.8 FC skip list implementation vs multi-lock one, hints implementations, high access locality
3.9 Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution
3.10 Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality
3.11 Hints mechanism success rate for pure update workloads
3.12 The connection between FC intensity and throughput per thread for pure update workloads
3.13 Lock-free skip list CAS per update, CAS success rate and throughput per thread for pure update workloads

LISTINGS

2.1 Set of Integers Interface
2.2 Flat combining definitions
2.3 Node definition
2.4 Wait-free contains is the same for all skip lists
2.5 add Naive implementation
2.6 scanAndCombine common implementation
2.7 Physical add and remove Naive implementation
2.8 Multi-combiner remove implementation
2.9 Optimistic (hinted) FCRequest and add implementation
2.10 Optimistic (hinted) doAdd and verify implementation
3.1 Optimistic (hinted) multi-lock add method implementation

ACKNOWLEDGEMENTS

I would like to thank all those who made this thesis possible. I am extremely grateful to my advisors, Prof. Yehuda Afek and Prof. Nir Shavit, who introduced me to the world of multiprocessors and distributed algorithms, and whose supervision and support enabled me to advance my understanding of the subject. My sincere thanks to Ms. Moran Tzafrir for teaching me what a researcher's everyday work is about and for supplying me with an arsenal of essential tools for my work. Finally, I am grateful to my family, and especially to my sister Elena, for their patience and encouragement.

ABSTRACT

Flat combining (FC) is a new synchronization paradigm that dramatically reduces synchronization costs. As was recently shown, this technique brings significant performance gains for several popular parallel data structures, such as stacks, queues, and shared counters. Moreover, applying the combining paradigm keeps the code as simple as code synchronized via a single global lock. However, the question of its applicability to other classes of parallel data structures has not yet been answered. This work applies the FC paradigm to binary tree-like data structures. As shown below, combining is hardly suitable for these cases. The limits of FC use are studied, and a criterion for its applicability is justified.

1. INTRODUCTION

Multi- and many-core computers are becoming more and more common these days. We witness recent developments of computer chips with tens of cores that consume no more space and energy than a desktop processor.
In light of this trend, the development of scalable and correct data structures becomes extremely important. The simplest and most straightforward solution is to derive a concurrent data structure from a sequential one using a global lock as the synchronization primitive. Unfortunately, this solution does not scale even for a relatively small number of cores. Another approach is to design fine-grained synchronization schemes using multiple locks or non-blocking read-modify-write atomic operations. This method usually requires a full algorithm redesign and reimplementation. An additional drawback of fine-grained and, especially, lock-free synchronization is its high complexity. It is very difficult to formally prove the correctness of such data structures (see, for example, the proofs in [3] and [4]).

1.1 Flat Combining

The flat combining [7] programming paradigm achieves a high level of concurrency while preserving code simplicity. The main idea behind flat combining is to attach a public request registry to an existing sequential data structure. Each thread, before accessing the shared data, publishes its request in the registry and then tries to acquire the global lock. The winning thread becomes the "combiner", scans the registry and performs all the requests it finds. The other threads simply wait for the results of their fulfilled requests, spinning on a thread-local Done flag. There are several benefits to this strategy:

• Low synchronization cost compared to a global lock, since there is only one round of competition for acquiring the shared lock, and every thread - winning or losing - returns with its request performed.

• The combiner can use its knowledge of all the requests and fulfill some of them without accessing the data structure. For a stack, for example, the combiner may collect push/pop pairs and return the results to the appropriate callers. This technique is well known and is called elimination.
For a shared counter, the combiner can calculate the total counter change and update the data structure only once. This technique, called combining, is also widely used. The variants of the FC algorithm are described in detail in Chapter 2 (The Flat Combined Skip Lists). Flat combining has proven very efficient for data structures with "hot spots", such as a stack head, queue ends, a priority queue head, and so on. It also shows good results when synchronization costs are high. For example, lock-free synchronous queues [16] demonstrate good throughput but moderate scalability, which can be improved using elimination or FC techniques. However, the question of FC's usefulness for data structures without pronounced bottlenecks and high synchronization costs remains open. This work studies the applicability of flat combining to binary tree-like data structures - those with O(log n) access time that allow range operations.

Fig. 1.1: Skip list of height 4. May be considered either as a collection of "fat" nodes or as a 2-d list

1.2 Skip Lists

Tree search structures are probably the most popular and widespread ones. It is hard to find a computer science or software engineering area that does not use them. Their practical applications start with the most popular red-black tree [6], used in nearly every algorithms library, including the C++ STL [17] and the Java™ SDK [18], and the AVL tree [1], which is very popular for search-dominated workloads; continue with the various B-trees [2], which are useful for block-organized memories; and finish with specialized suffix tries, splay trees, spatial search trees, persistent trees, etc. Since all of the above algorithms deal with large amounts of data, and many of them run inside operating systems or serve as search indexes inside databases, distributed and multi-threaded solutions for search trees are the focus of many research and commercial projects.
A comprehensive survey of concurrent binary search trees is given in [13]. The common problem with all the search trees mentioned above is that they are either static (they do not allow add/remove without a full rebuild) or need a re-balancing mechanism after updates in order to preserve logarithmic access time. In most cases, the scope of re-balancing is unknown prior to the update, which makes the design of fine-grained synchronization for binary search trees a very complicated task. That is why skip lists were chosen as the basic data structure for this research. There were several reasons for the decision:

• The skip list is simple and has no re-balancing overheads, which simplifies measurements.

Fig. 1.2: Skip list traversal with key 12. Traversed predecessors are shown; start level is 3.

• The skip list is the only known concurrent lock-free binary search structure.

The skip list was invented [15] in 1990 as a probabilistic alternative to binary search trees. A skip list is a linked list of "fat" nodes (Figure 1.1), each of which has a randomly chosen height (number of levels). Every node has a unique key, and the nodes appear ordered in the list. Each node is connected at each level to its successor at the same level. The random level is chosen using a geometric distribution: the probability that a node has layer i, i ≥ 0, is 1/p^i, p > 1. So every node has layer 0, and if a node has layer i, it also has layer i+1 with probability 1/p. In practice, p is usually chosen between 2 and 4. Such a distribution gives an O(log N) expectation for the maximal node height, and between every two nodes of height k, p−1 nodes of height k−1 are expected to appear. It is useful to add two immutable nodes, head and tail, with the highest possible level, and to maintain the actual highest level (start level) on every add or delete.
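As an illustration, the geometric level generation just described can be sketched as follows. The class name, the choice p = 2, and the use of java.util.Random are assumptions made for the sketch only; the thesis implementations use the fast generator of [12] instead.

```java
import java.util.Random;

// Illustrative sketch of geometric level generation with p = 2:
// a node that has level i also gets level i+1 with probability 1/2,
// capped at MAX_LEVEL. Not the thesis generator.
public class RandomLevel {
    static final int MAX_LEVEL = 32;
    static final Random rnd = new Random();

    public static int randomLevel() {
        int height = 1;                 // every node has at least layer 0
        while (height < MAX_LEVEL && rnd.nextInt(2) == 0) {
            ++height;                   // continue upward with probability 1/p = 1/2
        }
        return height;                  // node height = number of layers
    }
}
```

With p = 2, roughly half of the nodes get height 1, a quarter height 2, and so on, giving the expected O(log N) maximal height.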
Alternatively, the skip list may be represented as a collection of sorted lists with unique keys L1, L2, ..., Lk, such that i > j ⇒ Li ⊆ Lj, where all nodes with equal keys form "vertical" lists. The latter representation is especially convenient for lock-free implementations, where all updates are implemented through atomic read-and-update operations. Denote the next node after node n at level l as next_l(n), and the key of n as key(n). The simple sequential list works in the following way:

• Initially, the empty list contains head and tail with keys −∞ and +∞ correspondingly. The head node is connected to tail at every possible level, and the actual start level is 0.

• List traversal with key k starts from node n = head at level l = start level, and proceeds at this level searching for the pair of nodes (pred, succ) such that next_l(pred) == succ and key(pred) < k ≤ key(succ). Then set l = l − 1 and n = pred, and repeat the search. The process continues until level 0 is reached. Figure 1.2 illustrates the pred nodes observed during a traversal with key 12.

• contains(k) simply runs the traversal with key k. It is unnecessary to proceed to the bottom: once the desired key is found, the traversal is interrupted and the found node is returned.

• add(k) starts by generating a random height h, as described above. After that, the traversal algorithm is performed, collecting the h bottom pred and succ nodes. If the node is not found (for a pure set implementation), a new node of height h with key k is linked to the collected nodes.

• remove(k) starts by running the traversal. Once a node succ with key(succ) == k is observed at its highest level h, all traversed pred nodes are collected. After reaching the bottom level, all collected next_i(pred) references are set to next_i(succ), and the succ node's memory is freed.

• After every update operation, start level is verified and updated if needed.
There are two cases: when adding a node with height h > start level, start level is set to h; and when removing a node of start level height, find the highest level h such that next_h(head) ≠ tail and set start level to h. Note that the traversal algorithm performs O(1) expected steps at each level, and that the number of levels is expected to be logarithmic in the number of nodes; therefore, the skip list has expected logarithmic access time. The above scheme, short of some small variations, is used in most lock-based concurrent skip lists, and our implementations use it as well. The differences between the implementations ([14], [8], [11]) concern the locking schemes and the state flags devised to preserve consistency, linearizability [10] and the skip list invariants. Lock-free skip lists, in contrast, cannot maintain the skip list invariants - that would require multi-location read-and-update atomic operations, unsupported on most existing platforms. The lock-free implementations ([5], [9]) use relaxed skip list algorithms, where the question of a node's existence is answered only at the bottom list level, the other levels are regarded as a sort of index allowing the bottom level to be reached in expected logarithmic time, and the skip list structure may be violated at particular execution moments.

2. THE FLAT COMBINED SKIP LISTS

All our FC skip list variants are implemented both in Java and C++ with minimal differences. The C++ implementations require memory management and explicit memory barriers, while in the Java implementations the memory barriers are introduced implicitly through volatile flags' store/load operations. We have chosen to present only the Java implementations in order to avoid memory management issues and to have a clear and standard competitor - all performance comparisons use the Java SDK lock-free ConcurrentSkipListSet [18].
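The implicit barriers mentioned above deserve a short illustration. In the sketch below (the class and field names are illustrative, not the thesis code), the requesting thread writes its plain fields first and publishes them with a volatile store of opcode; the combiner's volatile load of opcode then guarantees, by the Java memory model's happens-before rule, that it observes the earlier plain writes.

```java
// Minimal sketch of the volatile publish/consume handshake used by the FC lists:
// plain writes become visible to the combiner once the volatile opcode is read.
public class Handshake {
    static final int NONE = 0, ADD = 1;

    int key;                      // plain field, written before publication
    volatile int opcode = NONE;   // volatile publication flag

    public void publish(int k) {
        key = k;                  // plain write
        opcode = ADD;             // volatile write: releases the plain write above
    }

    // Combiner side: a volatile read of opcode acquires the plain writes.
    public Integer consume() {
        if (opcode == ADD) {      // volatile read
            int k = key;          // guaranteed to see the value written in publish()
            opcode = NONE;        // signal completion back to the requester
            return k;
        }
        return null;              // nothing published
    }
}
```

The same pattern appears in the listings below as "Volatile write, from here combiner sees it".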
The flat combined skip list implements the simplest integer set interface:

Listing 2.1: Set of Integers Interface

public interface SimpleIntSet {
    /**
     * Add item to map
     * @param key - key to add;
     * @return true if added,
     * false if the key already exists on the map
     */
    boolean add(int key);

    /**
     * Removes item from the map
     * @param key - key to remove;
     * @return true if removed,
     * false if the key does not exist on the map
     */
    boolean remove(int key);

    /**
     * Verify if the item is on the map
     * @param key
     * @return true if item exists, false otherwise
     */
    boolean contains(int key);
}

The add and remove methods use the flat combining paradigm, while the contains method is implemented wait-free. The coexistence of flat combining and wait-free methods requires special treatment of linearization points, since the flat combining data is invisible to the lock-free contains. Define FCData and FCRequest:

Listing 2.2: Flat combining definitions

class FCRequest {
    int key;                        // Key
    boolean response;               // Operation result
    volatile int opcode = NONE;     // Action
}

class FCData {
    public FCRequest[] requests;    // Submitted requests
    public AtomicInteger lock;      // FC node lock
}

The FCData may be attached to one or several skip list nodes. The skip list node class is:

Listing 2.3: Node definition

class Link {
    ...
    public Link next;
    public Node node;
    public Link up;
    public Link down;
}

class Node {
    ...
    public int numLevels() {        // Node height
        return links.length;
    }

    // Node is FC when it has FC data
    public boolean isFCNode() {
        return fc_data != null;
    }

    public Link at(int index) {     // Get link at level
        return links[index];
    }

    public Link bottom() {          // The bottom link
        return links[0];
    }

    public Link top() {             // The top link
        return links[links.length - 1];
    }

    public final int key;
    public volatile boolean deleted = false;
    public volatile boolean fully_connected = false;
    public FCData fc_data;

    // 2D list of links with random access
    // Link contains references to the next, up and down links
    private Link[] links;
}

Up to this point, the skip list is the regular single-threaded one, save for two details: the deleted and fully_connected flags, and the FCData reference (which is non-null for flat combining nodes). The contains method is also very similar to the single-threaded implementation:

Listing 2.4: Wait-free contains is the same for all skip lists

public boolean contains(int inKey) {
    int level = start_level;        // Adaptable start level
    Link pred = head.at(level);
    Link curr = null;
    for (; level >= 0; --level, pred = pred.down) {
        curr = pred.next;
        while (inKey > curr.node.key) {
            pred = curr;
            curr = pred.next;
        }
        if (inKey == curr.node.key)
            return (!curr.node.deleted && curr.node.fully_connected);
    }
    return false;
}

The only distinguishing detail is the check of the deleted and fully_connected flags. The difference comes with the add and remove implementations. We will present implementations of several flat combined list variants.

2.1 Naive Flat Combined Skip List

The first and simplest implementation is the Naive FC list. It has exactly one combiner node (the head node).
The thread performing an add or remove action:

1. Puts its FCRequest into the head node's FCData.
2. Tries to acquire the lock.
3. If it succeeds, scans and fulfills the requests.
4. Otherwise, the thread spins on its own request completion flag and checks the lock state. If the request is fulfilled, the thread returns with the desired result; otherwise, if the lock is unlocked, it continues from 2.

Listing 2.5 presents the add method implementation.

Listing 2.5: add Naive implementation

public boolean add(int key) {
    // Put my request into the head node's fc_data
    FCRequest my_request = head.fc_data.requests[ThreadId.getThreadId()];
    my_request.key = key;
    // Volatile write, from here the combiner sees it
    my_request.opcode = ADD;
    AtomicInteger lock = head.fc_data.lock;
    do {
        if (0 == lock.get() &&                  // TTAS lock
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanAndCombine(head);
            lock.set(0);                        // Unlock
            return my_request.response;
        } else {
            do {
                Thread.yield();                 // Give up processor
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}

The remove method differs from the above only by the REMOVE opcode. All the work is performed within the scanAndCombine method, which is the same for all the following implementations:

Listing 2.6: scanAndCombine common implementation

protected void scanAndCombine(Node fc_node) {
    for (FCRequest curr_req : fc_node.fc_data.requests) {
        switch (curr_req.opcode) {
        case ADD:
            curr_req.response = doAdd(fc_node, curr_req.key,
                                      curr_req.pred_ary, curr_req.succ_ary);
            curr_req.opcode = NONE;             // Release waiting thread
            break;
        case REMOVE:
            curr_req.response = doRemove(fc_node, curr_req.key,
                                         curr_req.pred_ary, curr_req.succ_ary);
            curr_req.opcode = NONE;             // Release waiting thread
            break;
        }
    }
}

Here the combiner thread scans all the requests and performs the modifications. Both doAdd and doRemove receive containers for the predecessor and successor nodes - a technical detail that allows memory reuse in the case of the Naive list, but that is used differently in the other implementations. Besides this, the fc_node parameter indicates the start node for the search - it is not relevant for the single-combiner list, but it is important for the multi-combiner one, described below. The doAdd/doRemove methods act exactly as in the single-threaded skip list:

Listing 2.7: Physical add and remove Naive implementation

private boolean doAdd(Node fc_node, int key,
                      RandomAccessList<Link> pred_ary,
                      RandomAccessList<Link> succ_ary) {
    // The new node's height has to be known in advance
    // in order to restrict the nodes' collection.
    int top_level = randomLevel();
    // Find placement and nodes to connect.
    Node found_node = find(fc_node, key, pred_ary, succ_ary,
                           top_level, true);
    if (found_node == null) {                   // Node not on map
        Node new_node = new Node(key, top_level, false);
        Link new_link = new_node.bottom();
        RandomAccessList<Link>.BiDirIterator predIter = pred_ary.begin();
        RandomAccessList<Link>.BiDirIterator succIter = succ_ary.begin();
        // Connect new node
        for (int level = 0; level < top_level;
             ++level, new_link = new_link.up) {
            new_link.next = succIter.data;
            predIter.data.next = new_link;
            predIter = predIter.next();
            succIter = succIter.next();
        }
        // Linearization point
        new_node.fully_connected = true;
        return true;
    }
    return false;
}

private boolean doRemove(Node fc_node, int key,
                         RandomAccessList<Link> pred_ary,
                         RandomAccessList<Link> succ_ary) {
    // Find the node to delete and its predecessors.
    Node found_node = find(fc_node, key, pred_ary, succ_ary,
                           fc_node.numLevels(), false);
    if (found_node != null) {
        int top_level = found_node.numLevels();
        // Get the link on the top level
        Link lnk = found_node.top();
        // Topmost predecessor
        RandomAccessList<Link>.BiDirIterator predIter = pred_ary.rbegin();
        found_node.deleted = true;              // Logical delete
        for (int level = 0; level < top_level;
             ++level, lnk = lnk.down, predIter = predIter.prev()) {
            // Physical delete
            predIter.data.next = lnk.next;
        }
        return true;
    }
    return false;
}

In this implementation we use the fast random number generator described in [12]; a similar one is adopted in the JDK's lock-free list. Consider the properties of the above skip list implementation.

Property 2.1.1. The Naive skip list is deadlock free.

Proof. The implementation uses only one lock. Therefore, a deadlock-free implementation of the lock implies deadlock freedom of the data structure.

Property 2.1.2. Naive skip list update operations do not overlap each other and have a strict total order.

Proof.
Consider two arbitrary update operations on the list. All modifications are performed by the combiner thread during a combining session (Listing 2.6). The combining sessions are strictly ordered by the single lock and do not overlap; so, if the operations belong to different sessions, their order is defined by the lock acquisition order. Otherwise, if the updates belong to the same session, their order is defined by the combining algorithm - the combiner performs the updates sequentially, and no two modifications overlap.

Proposition 2.1.1. The Naive skip list is linearizable.

Proof. Select the linearization points for the skip list updates:

• For add: the point in Listing 2.7 where the fully_connected flag is set to true.

• For remove: the point in Listing 2.7 where the deleted flag is set to true.

Use the linearizability of OptimisticSkipList proved in [8]. Note that, by Property 2.1.2, all updates performed on our skip list may be regarded as performed by a single dedicated thread. Therefore, since the initial preconditions are identical for both OptimisticSkipList and the Naive list, and the modifications of the next references and of the deleted and fully_connected flags appear in program order exactly as in OptimisticSkipList, the Naive skip list state may be considered exactly equal to an OptimisticSkipList state in which all modifications on the list are performed by a single thread. Then, for each possible concurrent run on the Naive skip list, there is a run on OptimisticSkipList in which both skip lists' states, defined by the next references and the flags, are identical at every point in time, and so the OptimisticSkipList linearization order is applicable to the Naive skip list.

Fig. 2.1: Multi-combiner skip list. Every node with height ≥ 3 is a combiner node

As expected, flat combining in this implementation exposes a sequential bottleneck, very comparable to a global lock. In Chapter 3 (Performance) this estimation is verified.
2.2 Flat Combined Skip List with Multiple Combiners

The second attempt is the introduction of several combiners, which allows several modifications to be made simultaneously and, therefore, improves scalability. The multi-combiner skip list is implemented with statically distributed immutable combiners. The idea is to divide the skip list into non-intersecting parts, such that every part is managed by some combiner node. The multi-combiner skip list is shown in Figure 2.1. Suppose we start from an initially filled skip list of size N and have to add c < N combiners. We choose some height hc such that the number of nodes with height h ≥ hc is at least c, and make them combiner nodes by attaching FCData to each one. In this work, only static multi-combiner skip lists are studied. Dynamic lists may be devised by altering the hc value - the process requires consecutive locking of all FC node layers, converting the needed layer to combiners/non-combiners, and re-scheduling all pending combining requests. Since, by its essence, flat combining has to use a very small number of combiners (otherwise it does not differ from a sort of fine-grained synchronization), the process is rare and not expensive. The multi-combiner skip list acts very similarly to the single-combiner one. As mentioned earlier, the contains method is exactly the same, while the single difference in add/remove is that the requests are placed in the appropriate combiner nodes instead of the head. The updating thread:

1. Finds the combiner node fc_node responsible for the modification area.
2. Puts its FCRequest into fc_node's FCData.
3. Tries to acquire the FCData lock.
4. If it succeeds, scans and fulfills the requests.
5. Otherwise, spins on its own request completion flag and checks the lock state. If the request is fulfilled, it returns with the desired result; otherwise, if the lock is unlocked, it continues from 3.
Listing 2.8: Multi-combiner remove implementation

public boolean remove(int key) {
    // Get the responsible combiner
    Node fc_node = findCombiner(key);
    // Put my request into the node's fc_data
    FCRequest my_request = fc_node.fc_data.requests[ThreadId.getThreadId()];
    my_request.key = key;
    // Volatile write, from here the combiner sees it
    my_request.opcode = REMOVE;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        // TTAS lock
        if (0 == lock.get() &&
            lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanAndCombine(fc_node);
            // Unlock
            lock.set(0);
            return my_request.response;
        } else {
            do {
                Thread.yield();
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}

The findCombiner method is wait-free and is implemented similarly to contains. It has three differences:

1. The search goes down only to the lowest combiner level and does not proceed to the bottom.
2. The search returns the lowest combiner predecessor of the key.
3. Since combiners are immutable, there is no need to check their deleted flag.

The multi-combiner skip list's properties are similar to the Naive list's.

Property 2.2.1. The multi-combiner skip list is deadlock free.

Proof. As follows from the algorithm, no thread ever tries to hold more than one lock at a time. Hence, deadlock is impossible.

Practically, the multi-combiner design divides the data structure into a disjoint set of single-combiner Naive lists. Call these lists combining clusters, and the combiner responsible for a cluster the cluster head. Then the properties of Naive FC lists apply to every combining cluster.
Instead of a strict total order, all update operations of the multi-combiner list form a strict partial order in which operations on different clusters are commutative: the operations can be reordered without affecting the final state of the data structure.

Proposition 2.2.1. The multi-combiner skip list is linearizable.

Proof. Follows from the linearizability of each cluster and the fact that linearizability is compositional (Theorem 1 from [10]).

The multi-combiner skip list scales much better than the single-combiner one, but still performs a lot of work sequentially. The next attempt reduces this sequential part of the execution with a hints mechanism.

2.3 Flat Combined Skip List with "Hints"

The hints mechanism is inspired by the optimistic skip list [8]. The idea is to collect, in a wait-free "optimistic" manner, the links that have to be updated, to acquire the lock, to verify (and re-find, if needed) the links, and then to perform the update. Listing 2.9 shows the FCRequest structure supplemented with "hints", together with the add method.

Listing 2.9: Optimistic (hinted) FCRequest and add implementation

class FCRequest {
    int key;                           // Key
    boolean response;                  // Operation result
    volatile int opcode = NONE;        // Action
    int top_level;                     // Hints size
    RandomAccessList<Link> pred_ary;   // Collected hints
    RandomAccessList<Link> succ_ary;   // Collected hints
}

public boolean add(int key) {
    // Get responsible combiner
    Node fc_node = findCombiner(key);
    FCRequest my_request = fc_node.fc_data.req_ary[ThreadId.getThreadId()];
    // We have to know the level prior to find in order
    // to restrict the hints size
    int top_level = randomLevel();
    Node found_node;
    do {
        // Find placement and fill hints data
        found_node = find(fc_node, key, my_request.pred_ary,
                          my_request.succ_ary, top_level, true, true);
    } while (found_node != null && found_node.deleted);
    // Node already exists
    if (found_node != null)
        return false;
    // Put my request into the node's fc_data
    my_request.top_level = top_level;
    my_request.key = key;
    // Volatile write; from here the combiner sees it
    my_request.opcode = ADD;
    AtomicInteger lock = fc_node.fc_data.lock;
    do {
        // TTAS lock
        if (0 == lock.get() && lock.compareAndSet(0, 0xFF)) {
            // Perform all found requests
            scanAndCombine(fc_node);
            // Unlock
            lock.set(0);
            return my_request.response;
        } else {
            do {
                Thread.yield();
                // Somebody did my work
                if (my_request.opcode == NONE)
                    return my_request.response;
            } while (0 != lock.get());
        }
    } while (true);
}

The internal doAdd and doDelete (Listing 2.10) methods are also slightly modified, since we have to verify and, if needed, re-fill the collections of predecessors and successors. The verify method checks that all collected nodes are correct, i.e. that they are non-deleted and fully connected, that each predecessor's next reference points to the appropriate successor, and that the collected nodes' keys suit the requested key.

Listing 2.10: Optimistic (hinted) doAdd and verify implementation

private boolean doAdd(Node fc_node, int key, int top_level,
                      RandomAccessList<Link> pred_ary,
                      RandomAccessList<Link> succ_ary) {
    Node found_node = null;
    // Verify data and re-fill if needed
    if (!verify(key, pred_ary, succ_ary, top_level)) {
        found_node = find(fc_node, key, pred_ary, succ_ary,
                          top_level, true, false);
    }
    // From here, as in the Naive list
    ...
}

protected boolean verify(int key,
                         RandomAccessList<Link> predAry,
                         RandomAccessList<Link> succAry,
                         int top_level) {
    RandomAccessList<Link>.BiDirIterator predIter = predAry.begin();
    RandomAccessList<Link>.BiDirIterator succIter = succAry.begin();
    for (int iLevel = 0; iLevel < top_level;
         ++iLevel, predIter = predIter.next(), succIter = succIter.next()) {
        Link pred = predIter.data;
        Link next = succIter.data;
        if (pred.node.deleted || next.node.deleted
            || !pred.node.fully_connected || !next.node.fully_connected
            || pred.next != next
            || pred.node.key >= key || next.node.key < key)
            return false;
    }
    return true;
}

Like its predecessors, the hinted skip list is deadlock free and linearizable. Deadlock freedom is obvious, since this implementation uses exactly the same locking scheme as the previous ones. Linearizability may be derived from the fact that if verify fails, the hinted skip list algorithm is identical to the naive one. Otherwise, a verify success guarantees that the state of all memory to be updated is identical to the state at the moment the data was collected; therefore all preconditions mentioned in the linearizability proof for the OptimisticSkipList hold, and that proof applies to the hinted skip list as well.

The hints mechanism is applied to both single- and multi-combiner lists.
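The collect/verify/re-find cycle above can be illustrated on a much simpler structure. The following sketch applies the same idea to a single-level sorted linked list; HintNode, HintedList and addWithHint are illustrative names, not the thesis code, and the lock acquisition between collection and verification is elided (marked by a comment), so the sketch is single-threaded.

```java
// Single-level sketch of optimistic "hints": collect pred/succ without
// a lock, then verify before splicing, re-finding only if the hint is stale.
class HintNode {
    final int key;
    volatile HintNode next;
    volatile boolean deleted;
    HintNode(int key, HintNode next) { this.key = key; this.next = next; }
}

class HintedList {
    // Sentinels: head with minimal key, tail with maximal key.
    final HintNode head = new HintNode(Integer.MIN_VALUE,
                              new HintNode(Integer.MAX_VALUE, null));

    // Wait-free collection phase: last node with key strictly below `key`.
    HintNode findPred(int key) {
        HintNode pred = head;
        while (pred.next.key < key) pred = pred.next;
        return pred;
    }

    // Verification phase (would run under the lock): the hint is usable
    // only if nothing moved between collection and locking.
    boolean verify(int key, HintNode pred, HintNode succ) {
        return !pred.deleted && !succ.deleted
            && pred.next == succ
            && pred.key < key && succ.key >= key;
    }

    boolean addWithHint(int key) {
        HintNode pred = findPred(key);          // optimistic, no lock held
        HintNode succ = pred.next;
        // ... lock acquisition would happen here ...
        if (!verify(key, pred, succ)) {         // hint stale: re-find
            pred = findPred(key);
            succ = pred.next;
        }
        if (succ.key == key) return false;      // already present
        pred.next = new HintNode(key, succ);    // splice in
        return true;
    }
}
```

In the real skip list the same check runs per level over the pred_ary/succ_ary collections, which is exactly what the loop in verify of Listing 2.10 does.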
As is shown in Chapter 3 (Performance), the optimistic approach is very efficient, especially when the update rate is not high.

3. PERFORMANCE

For the performance verification we use the skip lists described above and several additional data structures designed to measure the impact of flat combining. The JDK ConcurrentSkipListSet by Doug Lea is used as the main competitor; it is currently one of the most efficient and scalable skip list implementations.

Computations were performed on a Sun SPARC Enterprise T5140 server powered by two UltraSPARC T2 Plus processors. Each processor contains eight cores running eight hardware threads each, which gives 128 hardware threads per system in total.

The benchmarked algorithms are denoted as follows:

FC-Naive-0 - "Naive" FC list with 0 non-head combiners.

FC-Hints-64 - "Hinted" FC list with at least 64 non-head combiners; the combiner distribution algorithm was described in Section 2.2.

JDK - JDK ConcurrentSkipListSet (based on ConcurrentSkipListMap).

ML-0, ML-64 - "Multi-lock" skip lists with 0 and 64 non-head locks respectively; a data structure designed to isolate the combining effect from the combiner distribution effect. Generally, it is the multi-combiner skip list in which the FCData structures are substituted with simple locks. The updating thread locks the appropriate "locking" node, makes the update and releases the lock, instead of running the whole combining algorithm.

ML-hints-0, ML-hints-64 - "Multi-lock" optimistic skip lists with 0 and 64 non-head locks respectively, using the hints mechanism exactly as the flat combining ones do.

FC-Ideal-64 - An artificial FC list derived from the FC list with hints. Here we assume that the hints are always successful, so the combiner's only work is to update the next references. This data structure gives an indication of the maximal FC skip list performance when the combiner fulfills all its requests sequentially.

Experiments were performed on data structures with an initial size of about 20000 keys.
Before selecting this size, the base skip list implementations were roughly benchmarked over a wide range of sizes, from one hundred to a few million keys. The relations between the run times of the different skip list implementations were very similar across sizes, so any initial size was representative enough to show the qualitative differences between the algorithms.

The access locality factor was introduced to simulate different workloads. Suppose that the experiment is performed for the key space S = {1, 2, ..., N}. The access locality factor k, 1 ≤ k ≤ N, is defined in the following way: the keys in the benchmark are uniformly selected from Sk = {t, t + 1, ..., t + N/k}, where t is selected uniformly from S at the start of the run and is changed slowly during the execution. An access locality factor of 1 therefore corresponds to uniformly distributed keys from S. Increasing the factor means that the keys are selected from a smaller interval, so the contention increases.

3.1 Performance Comparison of Flat Combined Skip Lists vs JDK ConcurrentSkipListSet

The first group of benchmarks compares the throughput of the flat combining skip list implementations with that of the JDK ConcurrentSkipListSet. Figure 3.1 presents the benchmark results for "Naive" flat combining with uniformly distributed keys. The graphs show that the single-combiner implementation fails to compete with the JDK list even for read-dominated loads, while the implementation with 64 combiners shows scalability even for write-only loads. The picture changes dramatically when the workload locality increases. Figure 3.2 depicts the same data structures when all requests are selected from 1/128 of the total key space. In this case the naive FC skip list loses to the JDK one even for read-dominated workloads once the number of running threads grows large enough, and multiple combiners do not help.
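The access locality factor defined above can be sketched as a key generator. This is an illustrative sketch, not the benchmark harness: the class name and the drift policy (re-draw the window start every WINDOW_MOVES keys) are assumptions, since the thesis only says t "is changed slowly".

```java
import java.util.Random;

// Sketch of the access-locality key generator: keys are drawn uniformly
// from a window of size N/k that drifts slowly over the key space S = {1..N}.
class LocalityKeys {
    final int n;            // size of the key space S
    final int window;       // N / k, the interval keys are drawn from
    final Random rnd;
    int t;                  // current window start, uniform over S
    int draws = 0;
    static final int WINDOW_MOVES = 10_000;   // assumed drift period

    LocalityKeys(int n, int k, long seed) {
        this.n = n;
        this.window = n / k;
        this.rnd = new Random(seed);
        this.t = 1 + rnd.nextInt(n);
    }

    int nextKey() {
        if (++draws % WINDOW_MOVES == 0) {
            t = 1 + rnd.nextInt(n);            // slow drift of the window
        }
        // uniform in {t, ..., t + window - 1}, wrapped back into 1..n
        int key = t + rnd.nextInt(window);
        return ((key - 1) % n) + 1;
    }
}
```

With k = 1 the window covers the whole space, matching the "uniform keys distribution" runs; with k = 128 it reproduces the 1/128 high-locality workload of Figure 3.2.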
The next group of runs deals with the improved optimistic skip list using the "hints" mechanism described in Chapter 2 (The Flat Combined Skip Lists). Figure 3.3 shows the benchmark results for uniformly distributed requests, while Figure 3.4 depicts the runs with high-locality access. The graphs show a significant performance gain due to the optimistic approach. For read-dominated workloads, both single- and multi-combiner lists perform better than the JDK list for all workload localities. For higher update rates, the multi-combiner list competes well with the JDK data structure, while the single-combiner one shows a lack of scalability, especially for high access locality. So far we can conclude that at least the "hinted" variant of the combining skip list is a simple and effective alternative to the JDK solution.

It is clear enough that for read-dominated workloads a lock-free list performs worse than lists with lock-protected updates and a lock-free contains. The first reason for the more effective read is that the FC lists' contains (Listing 2.4) performs only two volatile reads, while lock-free implementations require all next references to be volatile and therefore need log N volatile reads. The second reason is that all known lock-free skip list implementations decide about a node's presence only after reaching the bottom skip list level, while our implementation stops as soon as a node with the desired key is found on any level. However, it is not yet clear what impact the combiner mechanism has on the presented results.

Fig. 3.1: Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution

Fig. 3.2: Naive FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality

Fig. 3.3: Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution

Fig. 3.4: Hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality

3.2 Flat Combining Mechanism Experimental Verification

In this section we experimentally verify the FC impact on skip list behavior in depth. The first experiments compare the flat combining implementations with a specially designed multi-lock skip list. The multi-lock skip list is derived from the flat combining one by replacing FCData with a simple lock. It has single- and multi-lock implementations, exactly as the FC skip list has, and may be extended with the "hints" mechanism as well. The multi-lock skip list with hints add method is shown in Listing 3.1. The doAdd method called inside the critical section is identical to the flat combined one presented in Listing 2.10.

Listing 3.1: Optimistic (hinted) multi-lock add method implementation

public boolean add(int key) {
    // Get responsible lock node
    Node lock_node = findLockNode(key);
    // We have to know the level prior to find in order
    // to restrict the hints size
    int top_level = randomLevel();
    // Thread-local hints lists
    int thread_id = ThreadId.getThreadId();
    RandomAccessList<fc_java.MultiLockSkipListFH.Link> succ_ary =
        this.succ_ary[thread_id];
    RandomAccessList<fc_java.MultiLockSkipListFH.Link> pred_ary =
        this.pred_ary[thread_id];
    Node found_node;
    do {
        found_node = find(lock_node, key, pred_ary, succ_ary,
                          top_level, true, true);
    } while (found_node != null && found_node.deleted);
    if (found_node != null)
        return false;
    // Acquire lock and perform modification
    AtomicInteger lock = lock_node.node_lock;
    do {
        // TTAS lock
        if (0 == lock.get() && lock.compareAndSet(0, 0xFF)) {
            doAdd(thread_id, lock_node, key, pred_ary,
                  succ_ary, top_level);
            // Release lock
            lock.set(0);
            return true;
        } else {
            // Give up the processor
            Thread.yield();
        }
    } while (true);
}

Instead of placing a request and running the flat combining algorithm, the updating thread finds the appropriate lock node, acquires the lock and performs the change.

The following graphs compare the multi-lock and FC Naive skip lists. We can see that for both low (Figure 3.5) and high (Figure 3.6) localities, and for all update rates, both lists behave very similarly. The multi-lock skip list even tends to perform slightly better than its FC counterpart for low access locality. This may be explained by the additional overhead that flat combining introduces: the combiner thread has to read and maintain the FC registry and to write back the operation results. All this, if not compensated by the FC gains described above, leads to a performance decrease.

The benchmarks of the hints versions of the multi-lock and FC skip lists are shown in Figures 3.7 and 3.8 for low and high access locality. Introducing the hints mechanism improves the performance of both lists but does not change the ratio between the algorithms: both behave very similarly, with a slight preference for the multi-lock skip list at low access locality.

As mentioned before, flat combining, besides relieving the contention bottleneck, allows using the knowledge about all pending requests to optimize data structure updates. For tree-like data structures, and for skip lists in particular, the elimination and combining techniques can be applied to optimize the data structure traversal, but it is very hard to use them to optimize the data structure update. For the next group of experiments, we assumed that the traversal is perfectly optimized, i.e. that our hints mechanism never fails.
In practice, we replaced the verify method from Listing 2.10 with one that always returns true, and supplied every node with additional dummy next references. The combiner, instead of writing to the real next references, updates an equal quantity of dummy ones. These benchmarks are presented in Figures 3.9 and 3.10, and show that the FC skip list with an ideal hints mechanism competes well with the lock-free one, failing only for high access locality and more than 50% update rate; hence verifying and improving the hints mechanism makes sense.

The next graph (Figure 3.11) shows the efficiency of our hints mechanism. As the graph shows, the hints are very close to ideal for uniform access and fall to about 50% failures when the number of threads grows to 64. This result explains the scalability turning point between 16 and 32 threads for high access locality and high update rate. Note that a turning point also exists for the ideal hints list, but it appears slightly later and is not as sharp. So the problematic scalability of the FC list is probably caused by flat combining itself.

Fig. 3.5: FC skip list implementation vs multi-lock one, naive implementations, uniform keys distribution

Fig. 3.6: FC skip list implementation vs multi-lock one, naive implementations, high access locality

Fig. 3.7: FC skip list implementation vs multi-lock one, hints implementations, uniform keys distribution

Fig. 3.8: FC skip list implementation vs multi-lock one, hints implementations, high access locality

Fig. 3.9: Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, uniform keys distribution

Fig. 3.10: Ideal hints FC skip list implementation vs JDK lock-free ConcurrentSkipListSet, high access locality

Fig. 3.11: Hints mechanism success rate for pure update workloads

Fig. 3.12: The connection between FC intensity and throughput per thread for pure update workloads

The next two benchmarks, performed for a pure update workload, are intended to answer the question of why the lock-free list scales better than the FC one. To estimate the flat combining load we introduce the FC intensity, a factor showing the additional combiner work. It is calculated in the following way:

    FC intensity = (⟨fulfilled requests per FC session⟩ − 1) / ⟨number of threads⟩

This number is 0 for a single-threaded execution and tends to 1 for a large number of threads, when one combiner fulfills the requests of all other threads. For example, with 32 threads and an average of 17 fulfilled requests per FC session, the intensity is (17 − 1)/32 = 0.5.

Figure 3.12 shows the FC intensity together with the throughput per thread for different numbers of combiners and workload localities; an increase in FC intensity is followed by a decrease in throughput (note that ideal scalability is a horizontal line). The jump of intensity between 16 and 32 threads corresponds well with the graphs in Figures 3.3 and 3.4 for the 50% add / 50% remove workload. The jump may be explained as follows: starting from some number of threads, the combiner has no time to complete all the requests during the period in which a released thread prepares its next request, so the competition for the lock never stops. On the other hand, for 64 combiners and low locality, the jump does not happen and the algorithm is scalable.

Figure 3.13 shows the lock-free list statistics for a pure update workload. As the graphs show, the CAS success rate never drops below 75% and the CAS count is as small as 1.5 - 2.5 CAS operations per update, which explains the algorithm's good scalability.

Fig. 3.13: Lock-free skip list CAS per update, CAS success rate and throughput per thread for pure update workloads

4. CONCLUSIONS

We studied several approaches to applying the flat combining technique to skip-list-based maps.
As the skip list example showed, for structures allowing concurrent updates, fine-grained and especially lock-free synchronization is preferable to FC. This conclusion does not completely deny the usefulness of FC for such structures, since for read-dominated workloads and for several update request distributions flat combining behaves better than lock-free synchronization. It is also possible that the FC approach will show better scalability on different hardware. A breakthrough could also come from improvements to the FC algorithm. It is possible, for example, to transform FC into a sort of job dispatcher: having all the requests, it can form mutually non-conflicting groups, so the waiting threads can execute them without synchronization. Such a design faces the problem of additional FC overhead for sorting and analyzing the requests, but may be applicable to NUMA or client-server architectures.

It would also be interesting to study FC implementations for other popular data structures, such as B-trees or Red-Black trees, where lock-free alternatives do not exist and fine-grained locking requires complicated read-write locks. FC's benefits of simplicity and proven linearizability may be valuable in these cases.

Another, albeit auxiliary, data structure - the multi-lock skip list - may be interesting in itself. It showed characteristics as good as the FC skip list, but it is simpler, needs less memory and gives more uniform latency for update requests. The idea of building a small index protected by locks (locked or FC layers) over an entirely wait-free data structure body can replace hand-over-hand fine-grained synchronization schemes for tree-like structures.

BIBLIOGRAPHY

[1] Adelson-Velskii, G. M., and Landis, E. M. An algorithm for the organization of information. Soviet Math. Doklady, 3 (1962), 1259–1263.

[2] Bayer, R., and McCreight, E. Organization and maintenance of large ordered indices.
In SIGFIDET '70: Proceedings of the 1970 ACM SIGFIDET (now SIGMOD) Workshop on Data Description, Access and Control (New York, NY, USA, 1970), ACM, pp. 107–141.

[3] Colvin, R., Groves, L., Luchangco, V., and Moir, M. Formal verification of a lazy concurrent list-based set algorithm. In CAV (2006), pp. 475–488.

[4] Doherty, S., Groves, L., Luchangco, V., and Moir, M. Formal verification of a practical lock-free queue algorithm. In FORTE (2004), Springer, pp. 97–114.

[5] Fraser, K. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory, 2003. Also available as Technical Report UCAM-CL-TR-579.

[6] Guibas, L. J., and Sedgewick, R. A dichromatic framework for balanced trees. In SFCS '78: Proceedings of the 19th Annual Symposium on Foundations of Computer Science (Washington, DC, USA, 1978), IEEE Computer Society, pp. 8–21.

[7] Hendler, D., Incze, I., Shavit, N., and Tzafrir, M. Flat combining and the synchronization-parallelism tradeoff. In SPAA (2010), pp. 355–364.

[8] Herlihy, M., Lev, Y., Luchangco, V., and Shavit, N. A simple optimistic skiplist algorithm. In SIROCCO '07: Proceedings of the 14th International Conference on Structural Information and Communication Complexity (Berlin, Heidelberg, 2007), Springer-Verlag, pp. 124–138.

[9] Herlihy, M., and Shavit, N. The Art of Multiprocessor Programming. Morgan Kaufmann, 2008.

[10] Herlihy, M. P., and Wing, J. M. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (1990), 463–492.

[11] Lotan, I., and Shavit, N. Skiplist-based concurrent priority queues. In Proc. of the 14th International Parallel and Distributed Processing Symposium (IPDPS) (2000), pp. 263–268.

[12] Marsaglia, G. Xorshift RNGs. Journal of Statistical Software 8, 14 (2003), 1–6.

[13] Moir, M., and Shavit, N. Concurrent data structures. In Handbook of Data Structures and Applications, D. Metha and S. Sahni, Eds. (2007), pp. 47-14–47-30. Chapman and Hall/CRC Press.

[14] Pugh, W. Concurrent maintenance of skip lists. Tech. rep., University of Maryland at College Park, College Park, MD, USA, 1990.

[15] Pugh, W. Skip lists: a probabilistic alternative to balanced trees. Commun. ACM 33 (June 1990), 668–676.

[16] Scherer, III, W. N., Lea, D., and Scott, M. L. Scalable synchronous queues. Commun. ACM 52, 5 (2009), 100–111.

[17] Stepanov, A., and Lee, M. The Standard Template Library. Tech. rep., WG21/N0482, ISO Programming Language C++ Project, 1995.

[18] Sun Microsystems, Inc. Java Platform, Standard Edition, Version 6. 4150 Network Circle, Santa Clara, CA 95054, USA, 2006.