Scalable Transaction Processing through Data-oriented Execution

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Ippokratis Pandis
Diploma, Computer Engineering & Informatics, University of Patras, Greece
M.Sc., Information Networking, Carnegie Mellon University

Carnegie Mellon University
Pittsburgh, PA
May 2011

Keywords: Database management systems, transaction processing, multicore and multisocket hardware, scalability, contention, data-oriented execution, physiological partitioning.

Dedicated to my family. Andreas, Niki, Titina and Soula.

Abstract

Data management technology changes the world we live in by providing efficient access to huge volumes of constantly changing data and by enabling sophisticated analysis of those data. Recently there has been an unprecedented increase in the demand for data management services. In parallel, we have witnessed a tremendous shift in the underlying hardware technology toward highly parallel multicore processors. In order to cope with the increased demand and user expectations, data management systems need to fully exploit the abundantly available hardware parallelism. Transaction processing is one of the most important and challenging database workloads, and this dissertation contributes to the quest for scalable transaction processing software.

Our research shows that in a highly parallel multicore landscape, rather than improving single-thread performance, system designers should prioritize reducing the critical sections whose contention grows unbounded as hardware parallelism increases. In addition, this thesis describes solid improvements to conventional transaction processing technology. New transaction processing mechanisms show gains by avoiding the execution of unbounded critical sections in the lock manager through caching, and in the log manager by downgrading the critical sections to composable ones. More importantly, this dissertation shows that conventional transaction processing has inherent scalability limitations due to the unpredictable access patterns caused by the request-oriented execution model it follows. Instead, it proposes adopting a data-oriented execution model, and shows that transaction processing systems designed around data-oriented execution break the inherent limitations of conventional execution. The data-oriented design paves the way for transaction processing systems to maintain scalability for the foreseeable future; as hardware parallelism increases, the benefits only grow. In addition, the principles used to achieve scalability can be applied to other software systems facing similar scalability challenges as the shift to multicore hardware continues.

Acknowledgments

This dissertation wouldn't have been completed without the significant contributions of many people, to whom I owe a lot. Below is a list of people who helped me during the past six years. The list is long but most probably missing some people to whom I am indebted. First and foremost, I would like to thank my academic advisor Natassa Ailamaki. I cannot find words to describe how influential Natassa has been for me. She was an excellent, extremely patient and inspirational advisor as well as a person I could trust and rely on. Her energy and passion for the field will always be an example for me. Natassa, that coffee at Starbucks changed my life. Thank you.
Before asking Goetz Graefe to join my committee as the external member, I knew him only as a prominent member of the database systems research community with very important contributions in the field. Goetz's interest in my thesis significantly improved the overall work, especially this document. Greg Ganger and Christos Faloutsos not only were valuable members of my committee, but also helped me in a rough moment of my PhD. I remained at school and I owe much of that to Greg, Christos and the members of their research groups (PDL and DB@CMU). Babak Falsafi and Stavros Harizopoulos have been excellent collaborators and two people I would seek out for their advice. Jingren Zhou and Shimin Chen both honored me by selecting me to spend a summer working with them and learning a lot at Microsoft Research and Intel Research, respectively. Various collaborators contributed significantly to this work: Ryan Johnson, Nikos Hardavellas, Pınar Tözün, Naju Mancheril, Debabrata Dash, Manos Athanassoulis, Radu Stoica, Miguel Branco, Danica Porobic, and Dimitris Karampinas. Ryan especially has been a key support in most of my work, both directly through joint research and indirectly through innumerable and round-the-clock discussions. Members of labs at CMU and EPFL have provided valuable feedback and support on various papers: Michael Abd El Malek, Kyriaki Levanti, Mike Fredman, Tom Wenisch, Brian Gold, Ioannis Alagiannis and others. In addition, three members of the administrative staff of CMU helped a lot: Joan Digney, Karen Lindenfelser and Charlotte Yano. Joan helped to improve the quality of this dissertation even though it was not her responsibility. Some special people honored me with their friendship and support during the years in Pittsburgh and Lausanne: Kyriaki Levanti, Michael Abd El Malek, Panickos Neofytou, Thodoris Strigkos, Leonidas Georgakopoulos, and Iris Safaka, as well as numerous friends from back home: Angela Paschali, Vassilis Papadatos, Haris Markakis, Dora Kaggeli, and Valentinos Georgiou, to name just a few. Most importantly, I would like to thank my family. They encouraged and supported me during this long and stressful period. They mean the world to me.

Thesis Committee:
Anastasia Ailamaki (CMU & EPFL), Chair
Christos Faloutsos (CMU)
Gregory Ganger (CMU)
Goetz Graefe (HP Labs)

This research has been supported by grants and equipment from Intel and Sun; a Sloan research fellowship; an IBM faculty partnership award; NSF grants CCR-0205544, CCR-0509356, IIS-0133686, and IIS-0713409; an ESF EurYI award; and Swiss National Foundation funds.

Contents

Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Data management and transaction processing
1.2 The emergence of multicore hardware
1.3 Limitations of conventional transaction processing
1.4 Focus of this dissertation
1.5 Not all serial computations are the same
1.6 Improving the scalability of conventional designs
1.7 Data-oriented transaction execution
1.8 Thesis statement and contributions
1.9 Roadmap
I Scalability of transaction processing systems
2 Background: Transaction Processing
2.1 The concept of transaction and transaction processing
2.2 A typical transaction processing engine
2.2.1 Transaction management
2.2.2 Logging and recovery
2.2.3 Access methods
2.2.4 Metadata management
2.2.5 Buffer pool management
2.2.6 Concurrency control
2.3 I/O activities in transaction processing
2.3.1 Logging
2.3.2 On-demand reads & evictions
2.3.3 Dirty page write-backs
2.3.4 Summary and a note about experimental setups
2.4 OLTP Workloads and Benchmarks
2.4.1 TPC-A
2.4.2 TPC-B
2.4.3 TPC-C
2.4.4 TPC-E
2.4.5 TATP
3 Scalability Problems in Database Engines
3.1 Introduction
3.2 Critical sections inside a database engine
3.3 Scalability of existing engines
3.3.1 Experimental setup
3.3.2 Evaluation of performance and scalability
3.3.3 Ramifications
3.4 Conclusion
II Addressing scalability bottlenecks
4 Critical Sections
4.1 Introduction
4.2 Communication patterns and critical sections
4.2.1 Types of communication
4.2.2 Categories of critical sections
4.2.3 How to predict and improve scalability
4.3 Enforcing critical sections
4.3.1 Synchronization primitives
4.3.2 Alternatives to locking
4.3.3 Choosing the right approach
4.3.4 Discussion and open issues
4.4 Handling problematic critical sections
4.4.1 Algorithmic changes
4.4.2 Changing synchronization primitives
4.4.3 Both are needed
4.5 Conclusions
5 Attacking Un-scalable Critical Sections
5.1 Introduction
5.2 Shore-MT: a reliable baseline
5.2.1 Critical section anatomy of Shore-MT
5.3 Avoid un-scalable critical sections with SLI
5.3.1 Speculative lock inheritance
5.3.2 Evaluation of SLI
5.4 Downgrading log buffer insertions
5.4.1 Log buffer designs
5.4.2 Evaluation of log buffer re-design
5.5 Related work
5.5.1 Reducing lock overhead and contention
5.5.2 Handling logging-related overheads
5.6 Conclusion
III Re-architecting transaction processing
6 Data-oriented Transaction Execution
6.1 Introduction
6.1.1 Thread-to-transaction vs. Thread-to-data
6.1.2 When DORA is needed
6.1.3 Contributions and chapter organization
6.2 Contention in the lock manager
6.3 A Data-ORiented Architecture for OLTP
6.3.1 Design overview
6.3.2 Challenges
6.3.3 Improving I/O and microarchitectural behavior
6.3.4 Prototype Implementation
6.4 Performance Evaluation
6.4.1 Experimental Setup and Workloads
6.4.2 Eliminating Contention in the Lock Manager
6.4.3 Intra-transaction Parallelism
6.4.4 Maximizing Throughput
6.4.5 Secondary index accesses
6.4.6 Transactions with joins
6.4.7 Limited hardware parallelism
6.4.8 Anatomy of critical sections
6.5 Weaknesses
6.6 Related Work
6.7 Conclusion
7 Physiological Partitioning
7.1 Introduction
7.1.1 Multi-rooted B+Trees
7.1.2 Dynamically-balanced physiological partitioning
7.1.3 Contributions and organization
7.2 Communication patterns in OLTP
7.3 Shared-everything vs. physical vs. logical partitioning
7.4 Physiological partitioning
7.4.1 Design overview
7.4.2 Multi-rooted B+Tree
7.4.3 Heap page accesses
7.4.4 Page cleaning
7.4.5 Benefits of physiological partitioning
7.5 Need and cost of dynamic repartitioning
7.5.1 Static partitioning and skew
7.5.2 Repartitioning cost
7.5.3 Splitting non-clustered indexes
7.5.4 Splitting clustered indexes
7.5.5 Moving fewer records
7.5.6 Example of repartitioning cost
7.5.7 Cost of merging two partitions
7.6 A dynamic load balancing mechanism for PLP
7.6.1 Monitoring
7.6.2 Deciding new partitioning
7.6.3 Using control theory for load balancing
7.7 Evaluation
7.7.1 Experimental setup
7.7.2 Page latches and critical sections
7.7.3 Reducing index and heap page latch contention
7.7.4 Impact on scalability and performance
7.7.5 MRBTrees in non-PLP systems
7.7.6 Transactions with joins in PLP
7.7.7 Secondary index accesses
7.7.8 Fragmentation overhead
7.7.9 Overhead and effectiveness of DLB
7.7.10 Overhead of updating secondary indexes for DLB
7.7.11 Summary
7.8 Related work
7.8.1 Critical Sections
7.8.2 B+Trees and alternative concurrency control protocols
7.8.3 Load balancing
7.8.4 PLP and future hardware
7.9 Conclusions
8 Future Direction and Concluding Remarks
8.1 Hardware/data-oriented software co-design
8.1.1 Hardware enhancements
8.1.2 Co-design for energy-efficiency
8.2 Summary and conclusion
Bibliography
List of Figures

1.1 Number of hardware contexts per chip
1.2 Conventional and data-oriented access patterns
1.3 Dissertation roadmap based on the number and type of critical sections
2.1 Components of a transaction processing engine
2.2 An OLTP installation and I/O activities
3.1 Scalability of four popular open-source database engines
3.2 Efficiency comparison for several storage engines
3.3 Accuracy of Amdahl's Law
4.1 Communication patterns and types of critical sections
4.2 Mutex sensitivity analysis – contention
4.3 Mutex sensitivity analysis – duration
4.4 Reader-writer lock sensitivity analysis
4.5 Usage space of critical section types
4.6 Algorithmic changes and tuning combine to give best performance
5.1 Shore-MT scalability
5.2 Efficiency on TPC-C transactions
5.3 Breakdown of critical sections of the conventional designs
5.4 Contention and overhead in the lock manager
5.5 SLI in a nutshell
5.6 Example of SLI-induced deadlock
5.7 Breakdown of overhead due to lock manager vs rest of system
5.8 Lock manager bottleneck
5.9 Suitability of locks for use with SLI
5.10 Analysis of SLI-eligible locks
5.11 CPU utilization breakdown with SLI active
5.12 Performance improvement due to SLI, for TATP and TPC-B
5.13 Log buffer designs
5.14 Contention in the baseline log buffer
5.15 Sensitivity analysis of the C-Array
5.16 Sensitivity to the number of slots in C-Array
5.17 Performance improvement by hybrid log buffer design
6.1 Comparison of access patterns
6.2 Throughput per hardware context for baseline and DORA
6.3 Time breakdown of baseline and DORA running as load increases
6.4 Time breakdowns of baseline and DORA on TATP and TPC-C OrderStatus
6.5 Inside the lock manager
6.6 Breakdown of time spent in the lock manager
6.7 DORA as a layer on top of the storage manager
6.8 A transaction flow graph for TPC-C Payment
6.9 Execution example of a transaction in DORA
6.10 Locks acquired, by type, in Baseline and DORA
6.11 Performance of baseline and DORA on TATP, TPC-B and TPC-C OrderStatus
6.12 Single-transaction response times
6.13 Performance on a transaction with high abort rate
6.14 Maximum throughput under perfect admission control
6.15 Performance on aligned secondary index scans
6.16 Performance on non-aligned secondary index scans
6.17 Transactions with joins
6.18 Behavior on limited hardware parallelism
6.19 Context switches on limited hardware parallelism
6.20 Anatomy of critical sections for Baseline and DORA
7.1 Breakdown of critical sections for the PLP variants
7.2 Page latch breakdown for three OLTP benchmarks
7.3 Shared-everything vs. physical- vs. logical-partitioning
7.4 Variations of physiological partitioning
7.5 Throughput of a statically partitioned system
7.6 Splitting a partition in PLP-Leaf
7.7 Splitting a partition in PLP-Partition
7.8 A two-level histogram for MRBTrees and the aging algorithm
7.9 Deciding new partition ranges example
7.10 Average number of page latches acquired
7.11 Time breakdown per transaction in an insert/delete-heavy benchmark
7.12 Time breakdown per TPC-C StockLevel transaction
7.13 Time breakdown per transaction in TPC-B with false sharing on heap pages
7.14 Throughput in two multicore machines
7.15 Impact of MRBTree in non-PLP systems
7.16 Time breakdown with frequent parallel SMOs
7.17 Throughput when running the TPC-C StockLevel transaction
7.18 Performance on transactions with secondary index scans
7.19 Space overhead of the PLP variations
7.20 Overhead of DLB under normal operation
7.21 DLB in action
7.22 Partitions before & after the repartitioning
7.23 Overhead of updating secondary indexes during repartitioning

List of Tables

7.1 Repartitioning costs for splitting a partition into two
7.2 Cost when splitting a partition of 466MB in half
7.3 Average index probe time for a hot record, as skew increases
Average record probes per sec for a hot record, as skew increases. . . . . . . . . . . . . . . . . . . . . . . . . 156 160 181 181 xx LIST OF TABLES 1 Chapter 1 Introduction 1.1 Data management and transaction processing Mark Weiser, the father of ubiquitous computing, in his seminal article “The computer for the 21st century” wrote: “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.” [Wei99]. Data management is one of those technologies. Data processing and the dissemination of information, enabled and backed by data management technologies, are changing the world we live in. Consider the recent uprisings in the Arab world, which were greatly influenced by the social websites Facebook 1 and Twitter 2 [Bea11]. Or the fact that Wall Street now operates the majority of its trading at high frequency: trades are made automatically in data centers following analysis of large volumes of data with decisions being made almost instantaneously [ZL11]. Across all the world’s activities, we see that data analysis capabilities, enabled by data management systems, have changed centuries-old operations such as the way we practice medicine (e.g. [Lof96]), play sports [Mal11], or trade agricultural products [Kel11], to name just few. The database management market itself sees consistent yearly growth, with revenues of nearly $19 billion in 2008 [GSHS09]. At the same time, there is evidence that in the recent years the amount of data managed in various markets has increased almost exponentially. For example, in 2007, the social website Facebook collected 15TBs of data; 3 years later, in 2010, it collected 700TBs [TSJ+ 10]. The online auction and shopping website eBay 3 daily ingests 50TB of new data to its database [Raw10]. Databases become larger, they are used even more frequently, and increasingly more complex algorithms are employed for data processing. The pressure is on the data management systems which need to perform efficiently and respond to requests in a timely manner. 1 2 3 http://www.facebook.com http://www.twitter.com http://www.ebay.com CHAPTER 1. INTRODUCTION 2 While we experience an explosion in the amounts of data processed and in the number of data-centered applications available, the underlying hardware technologies are also changing tremendously. The rise of multicore hardware and the emergence of non-volatile storage technologies (such as flash-based devices and phase change memories) are fundamentally changing data processing capabilities across all types of applications. A primary target for change on the software side is the need to adjust and prioritize scalability in order to utilize abundantly available hardware parallelism. One of the most challenging database workloads is transaction processing [GR92]. The main characteristic of this type of workload is that it consists of a multitude of concurrent requests which typically touch only a small portion of a multi-gigabyte database in a largely unpredictable way. The concurrent requests need to complete consistently and in isolation from any other, while the changes made need to be durable. Transaction processing systems need to provide both high throughput and low response times. Unfortunately, the transaction processing model has remained largely the same for the past three decades, and that imposes some inherent scalability difficulties. 
This dissertation contributes to the quest for scalable transaction processing software within a multiprocessor node. It studies the interaction between modern hardware and transaction processing workloads; proposes a methodological way to analyze the scalability of software systems; and makes solid improvements in essential transaction processing components, such as locking and logging. More importantly, it shows that the conventional transaction execution model has fundamental scalability problems due to its chaotic data access patterns. It then shows that the data-oriented transaction execution model does not have such limitations and can maintain scalability as hardware parallelism increases.

1.2 The emergence of multicore hardware

Let's begin by looking at the evolution of computer hardware. Since 1965 processor technology has largely followed or outpaced Gordon Moore's prediction that transistor counts within a single chip will double every year or two [Moo65]. In order to translate this biennial increase in the transistor budget into performance, computer architects have followed two avenues:

• Gradually increasing the complexity of the processors by employing aggressive microarchitectural technologies (long execution pipelines, out-of-order execution, sophisticated branch prediction, super-scalar execution, etc. [HP02]).

• Clocking the processors at ever higher frequencies, testing the endurance of hardware materials, such as silicon.

Unfortunately, aggressive microarchitectural optimizations started giving diminishing results on commercial workloads, such as databases [ADHW99, RGAB98, BGB98, MDO94, LW92]. Even worse, sometime around 2005 we reached the limits of material science. Silicon transistors couldn't be clocked at higher frequencies because they would melt, while each processor was (uneconomically) drawing 100 Watts or more of power. Due to the power and thermal caps, processor vendors made a historic change of course. Until around 2005, the focus of each chip design was single-thread performance (trying to accomplish a single task as efficiently as possible using all the resources of the chip). From that point on, processor vendors had to rely on the thread-level parallelism of software systems to improve performance. Vendors began to use growing transistor budgets to exponentially increase the number of processing cores or hardware contexts per chip, rather than making single cores exponentially more complex [ONH+96, BGM+00, DLO05]. Figure 1.1 shows historic evidence of the explosion of on-chip parallelism in every major processor line (note the logarithmic scale on the y-axis).

Figure 1.1: Evolution of the number of hardware contexts per chip for some major processor lines. We observe an exponential increase in on-chip parallelism since circa 2005.

Multicore processors now dominate the hardware landscape and greatly affect software system performance. The pressure is now on the software manufacturers, who can no longer expect the hardware to provide all the performance improvements. According to Amdahl's Law [Amd67], the speedup a software system can achieve on parallel hardware depends on the fraction of its execution that can be parallelized; to achieve high performance on such hardware, software must therefore provide exponentially increasing parallelism. Unfortunately, this is a very difficult task, and software systems typically become bottlenecked long before they manage to saturate the underlying hardware (e.g. [BWCM+10]).
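Stated as a formula, the constraint is the following; the symbols s and N are introduced here only for illustration, and the 8% serial fraction used in the example is the figure reported for the most scalable engine in the Chapter 3 study.

```latex
% Amdahl's Law: speedup on N cores when a fraction s of the execution is serial.
\[
  \mathrm{Speedup}(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,},
  \qquad
  \lim_{N\to\infty} \mathrm{Speedup}(N) \;=\; \frac{1}{s}.
\]
% Example: for s = 0.08 (an 8% serial component), the speedup can never exceed
% 1/0.08 = 12.5x, no matter how many hardware contexts the chip provides.
```

In other words, once the serial fraction dominates, adding cores buys little; shrinking the serial work is what matters.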
1.3 Limitations of conventional transaction processing

Transaction processing systems experience scalability problems with the advent of exponentially increasing hardware parallelism. As we will see in Chapter 3, open-source transaction processing systems were caught off guard by the move to increased parallelism. There we study the scalability of the most popular open-source database engines (throughout the thesis we use the terms "transaction processing system", "database engine" and "storage manager" interchangeably) when we use them to run a perfectly scalable transactional workload. We see that none of the database engines manages to scale its performance on a multicore chip with 32 hardware contexts. Even the most scalable of the database engines has a huge 8% serial component, which allows it to utilize no more than ∼16 cores effectively. In the past this was a significant degree of parallelism, with parallel systems containing a limited number of processors (typically only 4-8). In contrast, emerging multicores may contain 64-512 hardware contexts [JN07, STH+10, KSSF10], with that number projected to continue increasing.

This result is surprising because transaction processing is very well-suited for execution on parallel hardware. Transactional workloads exhibit abundant parallelism at the request level, and over the past three decades the database systems community has done exceptional work: transaction processing systems excel at exploiting concurrency—support for multiple in-progress operations—to interleave the execution of a large number of transactions. Unfortunately, internal bottlenecks prevent the systems from translating their high concurrency into proportionally high execution parallelism [JPH+09].

There are two reasons why transaction processing systems face scalability problems and fail to exhibit unbounded execution parallelism. First, they are exceptionally complex software systems. In order to provide its core services, a typical transaction processing system has tightly coupled components which interact with each other very frequently and sometimes measure thousands or even millions of lines of code. Second, because of the way conventional transaction processing systems assign work to their worker threads, transactional workloads result in totally unpredictable data access patterns [PJHA10, SWH+04]. That is, under conventional execution, each incoming transaction is assigned to a worker thread, a mechanism we refer to as thread-to-transaction assignment.

Figure 1.2: Comparison of the access patterns of conventional and data-oriented execution. On the left are the accesses caused by the conventional thread-to-transaction assignment of work policy. On the right are the accesses caused by the (data-oriented) thread-to-data assignment of work policy.
The access pattern of each transaction, and consequently of each thread, however, is arbitrary and totally uncoordinated. The end result is that concurrent threads read and update data from the entire address space in a random fashion, as shown in Figure 1.2 (left). This figure plots the concurrent thread accesses to the records of a table of a conventional transaction processing system as it runs a standardized transactional benchmark; each access is color-coded to indicate which thread performs it (more details about this figure are in Chapter 6).

To ensure data integrity during shared, uncoordinated accesses, each thread enters a very large number of contentious critical sections in the short lifetime of each transaction it executes. For example, to complete one of the simplest transactions possible, which probes for a Customer and updates her balance, a modern conventional transaction processing system needs to enter more than 70 critical sections or points of serialization (see the left-most bar of Figure 1.3). Even though with huge effort and extremely careful and inspired engineering the complexity of transaction processing systems can be tamed, the unpredictability of the accesses remains. In other words, no matter how well engineered a conventional transaction processing system is, system designers are forced to be overly pessimistic and clutter the transaction processing codepaths with a very large number of points of serialization [JPA08, PTJA11]. This imposes a considerable overhead on single-thread performance [HAMS08]. Even worse, some of these serializations eventually become impediments to scalability. Thus, we argue that because of its inherent scalability problems conventional transaction execution is doomed, and there is a need to fundamentally change the way database engines process transactions.

1.4 Focus of this dissertation

Given the difficulty of improving the scalability of conventional transaction processing systems, recently there has been an emergence of designs which exploit specific application characteristics in order to provide scalable performance. For example, many web applications, like Facebook and Twitter, can tolerate stale data and inconsistencies. Such applications can be served by systems that provide only eventual consistency guarantees [Vog09] or, in general, do not guarantee some of the ACID properties (atomicity, consistency, isolation, and durability) [GR92]. Other applications access data only by using record key identifiers. To serve such applications, several key-value stores [DHJ+07] have been implemented, including BigTable [CDG+06], HBase (http://hbase.apache.org/), CouchDB (http://couchdb.apache.org/), Tokyo Cabinet (http://fallabs.com/tokyocabinet/), Redis (http://redis.io/), Cassandra (http://cassandra.apache.org/) and DynamoDB [Vog12], to name just a few. Those systems provide only a subset of the database functionality, expose a limited get()/put() key-value interface, and are designed for scalability, reliability, high availability, and ease of deployment on clusters of multiple nodes, rather than a single multicore node.

This dissertation focuses on the scalability of "traditional" transaction processing systems within a multicore node. We are seeking transaction processing system designs which can replace existing systems without requiring changes to legacy application code. We are interested in systems that maintain the ACID properties and do not restrict the data management functionality (e.g. the ability to perform joins) or the supported interface (e.g. to key-value accesses only).
Also, we are interested in scaling up transaction processing performance within a single multicore node. Scaling out to a cluster of nodes is a mostly orthogonal problem and outside the scope of this work; for example, one could exploit the results of this dissertation to implement the building blocks of a scale-out solution.

In terms of workloads, we are interested in transactional workloads that consist of multiple concurrent short-running transactions. Such workloads exhibit high concurrency at the application level and put pressure on the transaction processing system. Data analysis workloads, which consist of a few long-running queries, exhibit low concurrency, put pressure on the query execution component, and are of no interest for this work. As a matter of fact, specialized data management systems are increasingly popular for serving such workloads (e.g. column stores [SAB+05, BZN05]). In addition, and in contrast with data analysis workloads, it is realistic to assume that transactional workloads are not I/O-bound [SMA+07, JPS+10]. That is, as main memories become cheaper and larger, the working set of most transactional workloads tends to be memory-resident, with the only I/Os made to provide durability (flushing the log buffer and writing back dirty pages). This is contrary to "big data" analysis applications that operate on terabytes or petabytes of data and are often I/O-bound.

In Part I, after an introduction to transaction processing systems (or database engines), we show that the performance of conventional open-source database engines suffers on highly parallel hardware due to their poor scalability. The rest of the dissertation consists of two main parts. The first discusses improvements to the scalability of conventional transaction processing designs. The second re-architects traditional transaction processing models in order to break the aforementioned inherent limitations.

1.5 Not all serial computations are the same

On our quest for scalable transaction processing, one of the first challenges we met was discovering how to quantify the scalability of a system. It is impractical to start a system redesign based only on performance observations on an available parallel hardware machine. Given the rate at which hardware parallelism increases, once the software system redesign and implementation are completed, a new generation of more parallel processors will be available and new bottlenecks may have emerged. That is, not only do we need to identify the bottlenecks in current multicore hardware, but we need to be able to predict potential problems in future processor generations.

One reliable way to predict the scalability of various transaction processing designs is by profiling the serial computations (or critical sections) executed during a single transaction and categorizing them based on their behavior. Behavior differs based on whether the contention for a specific critical section increases with the number of processing cores (or running threads in the system) or not. Using this criterion we see there are two main types of critical sections: those whose contention remains steady (or fixed) no matter how many processing cores are in the system, and those whose contention grows without bounds as hardware parallelism increases. We refer to the latter type of critical sections as unbounded.

A third, special type of critical section is the composable one. As first observed by Moir et al. [MNSS05], in certain cases multiple threads can combine their operations and enter a critical section only once, where normally the critical section would have been entered by each thread individually. For example, consider a concurrent stack where threads can push() and pop() items. While accessing the stack is a critical section—if two threads concurrently modify the head of the stack, behavior will be unpredictable—a push() and a pop() can combine their requests off the critical path without the need to execute the critical section, as sketched below.
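The following is a minimal, hypothetical sketch of such combining on a concurrent stack. It is not the algorithm of [MNSS05]; a real elimination scheme would use multiple exchange slots, adaptive backoff and careful memory management, but the sketch captures the essential point: a matched push()/pop() pair never enters the lock-protected critical section.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

// Minimal sketch of a combining ("elimination") stack. A push() may hand its
// item directly to a concurrent pop() through a single exchange slot, so the
// pair never enters the critical section protecting the shared stack.
class CombiningStack {
public:
    void push(int32_t v) {
        // Tag the published value with a unique sequence number so that a late
        // retraction cannot accidentally remove someone else's publication.
        uint64_t tagged = (uint64_t(seq_.fetch_add(1) + 1) << 32) | uint32_t(v);
        uint64_t empty = kEmpty;
        if (slot_.compare_exchange_strong(empty, tagged)) {
            for (int i = 0; i < 4096; ++i)
                if (slot_.load() != tagged) return;          // a pop() combined with us
            uint64_t mine = tagged;
            if (!slot_.compare_exchange_strong(mine, kEmpty)) return;  // taken late
        }
        // Slow path: the ordinary critical section on the shared stack.
        std::lock_guard<std::mutex> g(m_);
        data_.push_back(v);
    }

    std::optional<int32_t> pop() {
        // Fast path: grab an item that a concurrent push() has published.
        uint64_t cur = slot_.load();
        if (cur != kEmpty && slot_.compare_exchange_strong(cur, kEmpty))
            return int32_t(uint32_t(cur));                   // combined, no lock taken
        // Slow path: the ordinary critical section.
        std::lock_guard<std::mutex> g(m_);
        if (data_.empty()) return std::nullopt;
        int32_t top = data_.back();
        data_.pop_back();
        return top;
    }

private:
    static constexpr uint64_t kEmpty = 0;
    std::atomic<uint64_t> slot_{kEmpty};   // single exchange slot
    std::atomic<uint32_t> seq_{0};
    std::mutex m_;
    std::vector<int32_t> data_;            // the shared stack itself
};
```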
We refer to the latter type of critical sections as unbounded. A third, special, type of critical sections are composable. As first observed by Moir et al. [MNSS05], in certain cases multiple CHAPTER 1. INTRODUCTION 8 threads can combine their operations and enter a critical only once, whereas normally the critical section would have been entered by each thread individually. For example, consider a concurrent stack where threads can push() and pop() items. While accessing the stack is a critical section—if two threads concurrently modify the head of the stack, behavior will be unpredictable—a push() and a pop() can combine their requests off the critical path without the need to execute the critical section. Of the three types of critical sections it is clear that only the unbounded ones impose threat to the scalability of the system. The other two types, fixed and composable, aggravate only the single-thread performance. Furthermore, employing the wrong synchronization primitive may have severe impact on both performance and scalability. Thus, the keys to scalable software designs are (a) to reduce the number of unbounded serial computations through algorithmic changes, and (b) to enforce the serial computations using appropriate synchronization primitives. The main message of Chapter 4 is that the numerous critical sections in transaction processing impose significant overhead even in single-thread performance. At the same time, not all critical sections are the same; different types impose different threats, if any, to scalability. To achieve a scalable design we need to drastically reduce the number of unbounded serial computations. By the end of this thesis, we will provide evidence that analyzing the number and type of critical sections is a reliable indicator of the scalability of systems. More elaborate (and scalable) designs execute on average fewer unbounded critical sections, as shown in Figure 1.3. 1.6 Improving the scalability of conventional designs There are three ways to reduce the frequency of unbounded critical sections in a software system and improve its performance: • Avoid. The system can avoid executing unbounded critical sections, for example through caching. Section 5.3 of Part II presents Speculative Lock Inheritance (or SLI), an example of avoiding the execution of unbounded critical sections in the lock manager through caching. SLI detects, at run-time, which database locks are “hot” (where there is contention for acquiring and releasing them) and makes sure the transaction executing threads cache those “hot” locks across transactions. The execution model does not change, since each thread in the system still executes the same codepaths and tries to acquire the same database locks from the centralized lock manager. It just happens to find 1.6. IMPROVING THE SCALABILITY OF CONVENTIONAL DESIGNS 80 Uncategorized 70 CSs per Transaction 9 Message passing 60 Xct mgr 50 40 Aether log mgr 30 Log mgr 20 Metadata 10 Bpool 0 Page Latches Chapter 5a Chapter 5b Chapter 6 Chapter 7 Conventional SLI & Aether Data-oriented Physiological Lock mgr Figure 1.3: Comparison of the number and type of critical sections executed for the completion of a very simple transaction from the various designs presented in this dissertation. The unbounded critical sections are the bars with solid fills. The fewer the unbounded critical sections, the more scalable the corresponding design is. 
It just happens to find the "hot" locks stored in a thread-local cache, avoiding the interaction with the centralized lock manager and reducing contention (a sketch of this idea follows the list below).

• Downgrade. Unbounded critical sections can be downgraded into fixed or composable ones. By doing so, single-thread performance is not affected, but scalability improves. Section 5.4 of Part II presents a concrete example of downgrading a class of unbounded critical sections in the log manager. An essential component of any transaction processing system, the log manager records all the changes made in the database and ensures that the system can recover in the event of a crash. If treated naively, however, log buffer inserts can become a bottleneck, since all the concurrent threads need to record their changes in the same main-memory log buffer. But because requests to append entries into a log buffer can be combined to form requests for larger appends, we are able to downgrade these unbounded critical sections to composable ones and achieve better scalability.

• Re-architect. The most drastic measure we can take to improve scalability is to completely eliminate the need to execute contention-prone codepaths (codepaths that enter many unbounded critical sections) by modifying the entire execution model. We follow this direction in Part III, which is the main contribution of this dissertation.
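Returning to the Avoid bullet above, the sketch below illustrates the lock-caching idea behind SLI in heavily simplified form. The LockManager, the heat-based hot-lock detection and every name here are hypothetical stand-ins rather than the Shore-MT interfaces; the real mechanism, including lock modes and how inheritance-induced deadlocks are handled, is described in Chapter 5.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Grossly simplified stand-in for a centralized lock manager: every acquire
// and release is a critical section on one global mutex.
class LockManager {
public:
    // Returns true if the named lock has recently seen many acquisitions
    // (our crude, hypothetical notion of a "hot" lock).
    bool acquire(const std::string& name) {
        std::lock_guard<std::mutex> g(m_);
        return ++heat_[name] > kHotThreshold;
    }
    void release(const std::string& name) {
        std::lock_guard<std::mutex> g(m_);
        if (heat_[name] > 0) --heat_[name];
    }
private:
    static constexpr int kHotThreshold = 8;      // hypothetical tuning knob
    std::mutex m_;
    std::unordered_map<std::string, int> heat_;  // crude contention estimate
};

// Speculative Lock Inheritance, simplified: each worker thread keeps the locks
// that proved hot in a thread-local cache and carries ("inherits") them into
// its next transaction, skipping the centralized acquire/release round trips.
thread_local std::unordered_map<std::string, bool> sli_cache;

void sli_acquire(LockManager& lm, const std::string& name) {
    if (sli_cache.count(name)) return;           // hit: no lock-manager interaction
    bool hot = lm.acquire(name);
    if (hot) sli_cache[name] = true;             // remember as hot for the next xct
}

void sli_release_at_commit(LockManager& lm, const std::string& name) {
    if (sli_cache.count(name)) return;           // keep holding the hot lock
    lm.release(name);                            // cold lock: release normally
}
```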
Part II is dedicated to improving the scalability of conventional transaction processing. We make two solid improvements to essential components of any transaction processing system. However, the second bar of Figure 1.3 suggests that the problem remains: no matter the optimizations, the conventional system still executes a large number of unbounded critical sections (Figure 1.3 shows the execution of over 35 of them), with the danger that some of them become bottlenecks.

1.7 Data-oriented transaction execution

In a highly parallel multicore landscape, we need to approach transaction processing from a different perspective. One radical approach, proposed by Stonebraker et al. [SMA+07], is H-Store. H-Store is a shared-nothing design [Sto86] of single-threaded main-memory database instances within a node, which relies on replication to maintain durability. Since each database instance is accessed by a single thread, H-Store eliminates critical sections altogether. Unfortunately, since shared-nothing systems physically partition the data, H-Store delivers poor performance when the workload triggers distributed transactions [Hel07, JAM10, CJZM10, PJZ11] or when skew causes load imbalance [CJZM10, PJZ11]. Further, repartitioning to rebalance load requires the system to physically move and reorganize all affected data. These weaknesses become especially problematic as partitions become smaller and more numerous in response to multicore hardware. Thus, aggressive shared-nothing designs, such as H-Store, solve the scalability problems of only a limited set of applications, for example applications whose access patterns do not exhibit sudden changes and are easily partitionable.

Because aggressive shared-nothing designs cannot adequately serve all transactional workloads, we need a design that maintains the desired shared-everything properties (e.g. all the data in a single address space, no need to execute distributed transactions), but also allows us to drastically reduce the number of unbounded critical sections. Based on the observation that uncoordinated accesses to data lead to scalability problems in conventional shared-everything designs, we propose a thread-to-data policy for assigning work to threads. Under this policy, each transaction is decomposed into a set of smaller actions according to the data region each action accesses. Then each action is routed to a thread responsible for that data region. Transactions flow from one thread to another as they access different data. In essence, instead of pulling data (database records) to the computation (transaction), the thread-to-data policy distributes the computation to wherever the data is mapped; it pushes the computation to the data, as sketched below.
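The following sketch shows the thread-to-data assignment in its simplest possible form. The names and structures are hypothetical simplifications; Chapter 6 describes the actual design (DORA), including the transaction flow graphs that drive routing and the proper synchronization of the inbound queues, which is omitted here.

```cpp
#include <deque>
#include <functional>
#include <vector>

// One "action" of a decomposed transaction: a piece of work that touches only
// records whose routing key falls inside a single logical partition.
struct Action {
    long key;                     // routing key (e.g., a customer id)
    std::function<void()> work;   // the reads/writes against that partition
};

// One executor thread per logical partition. Because only this thread ever
// touches the partition's records, most record-level critical sections
// disappear; only the inbound queue needs synchronization (omitted here).
struct PartitionExecutor {
    long lo, hi;                  // logical key range [lo, hi) owned by this thread
    std::deque<Action> inbox;     // actions routed to this partition
};

class Router {
public:
    explicit Router(std::vector<PartitionExecutor>* execs) : execs_(execs) {}

    // Thread-to-data assignment: push the computation to the data. A transaction
    // flows from executor to executor as its actions touch different partitions.
    void dispatch(const Action& a) const {
        for (auto& ex : *execs_) {
            if (a.key >= ex.lo && a.key < ex.hi) {
                ex.inbox.push_back(a);   // enqueue; the owning thread will run it
                return;
            }
        }
    }

private:
    std::vector<PartitionExecutor>* execs_;
};
```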
This simple change in the execution model breaks the limitations of conventional processing. Once the system is guaranteed that a single thread will access a specific region of the database during a period of time, it can be overly optimistic and avoid executing all the critical sections it normally would. Physical separation of the data is also avoided, since the partitioning is only logical. Figure 1.2 clearly visualizes the difference between the conventional thread-to-transaction and the thread-to-data execution models; the former results in chaotic, uncoordinated accesses, while the latter's accesses are coordinated and easy to predict.

In Part III we present two designs that eliminate significant sources of unbounded critical sections. Both optimizations are enabled once we adopt the thread-to-data execution model and exploit the resulting coordinated accesses. In particular, Chapter 6 shows how we can distribute the formerly centralized lock management service and make it thread-local (see the third bar in Figure 1.3). Chapter 7 extends the data-oriented design with modifications to the physical layout of the database, so that physical accesses map to the logical partitioning. The physiologically partitioned or PLP design eliminates the need to employ unbounded critical sections for logical operations (in the lock manager) as well as for physical operations, such as page latching. To eliminate the need to acquire page latches, PLP employs a new access method, which we call MRBTree. The MRBTree access method consists of multiple independent sub-trees (regular B+trees [BM70]) connected to a "root" page that maps the logical partitioning of the data-oriented system. Overall, PLP executes almost an order of magnitude fewer unbounded critical sections than the most scalable conventional design, as shown in the right-most bar of Figure 1.3. In addition, because data-oriented execution is based on logical-only partitioning, systems that adopt it can react relatively quickly and balance load in response to changes in the access patterns. In Chapter 7 we also show how a system that follows the thread-to-data policy can dynamically and efficiently adapt to load changes, a big advantage over shared-nothing designs, which employ physical partitioning.

1.8 Thesis statement and contributions

This dissertation contributes to the quest for scalable transaction processing. The thesis statement is simple and reads as follows:

THESIS STATEMENT: To break the inherent scalability limitations of conventional transaction processing, systems should depart from the traditional thread-to-transaction execution model and adopt a data-oriented one. As hardware parallelism increases, data-oriented design paves the way for transaction processing systems to maintain scalability. The principles used to achieve scalability can also be applied to other software systems facing similar scalability challenges as the shift to multicore hardware continues.

This thesis makes the following main contributions:

• We show that in a highly parallel multicore landscape, system designers should primarily focus on reducing the number of unbounded critical sections in their systems, rather than on improving single-thread performance.

• We make two solid improvements in conventional transaction processing technology by avoiding the execution of certain unbounded critical sections in the lock manager through caching, and by downgrading log buffer inserts from unbounded to composable critical sections.

• We show that conventional transaction processing has inherent scalability limitations due to the unpredictable access patterns caused by the request-oriented execution model it follows. Instead, we suggest a data-oriented execution model and show that it breaks the inherent limitations of conventional processing. The final design eliminates the need to execute critical sections related to both logical operations (such as locking) and physical operations (such as page latching).

1.9 Roadmap

The overall goal of this thesis is to improve the scalability of transaction processing systems. Most chapters attack specific problems and are therefore fairly self-contained. The thesis is divided into three parts. In Part I we introduce background information about transaction processing systems and provide evidence that conventional transaction processing systems face significant scalability problems. These are due to the complexity and the unpredictability of access patterns inherent to the conventional transaction processing execution model. Readers familiar with transaction processing may skim through Chapter 2.

Part II is dedicated to improving the scalability of conventional transaction processing. We first describe the various types of critical sections and underline the need to enforce the appropriate synchronization primitive at each critical section (Chapter 4). Next we attack specific scalability problems of conventional database engines. Chapter 5 presents two concrete examples of how to remove bottlenecks from within two significant database engine components: the centralized lock manager, which enforces concurrency control, and the log manager, which is responsible for the recovery of the system in case of crashes. All prototyping and evaluations are done using the Shore-MT storage manager [JPH+09], a multithreaded storage manager we implemented for the needs of our research.

The main part of this thesis is Part III (Chapters 6–7), which makes the case for data-oriented transaction execution. We first argue that conventional transaction processing and its thread-to-transaction assignment of work policy have inherent scalability limitations. We then make a case in favor of a thread-to-data work assignment policy and design a data-oriented system that eliminates a major source of unbounded critical sections: those related to the centralized lock manager. Chapter 7 extends the design to eliminate page latching—the biggest remaining component of unbounded critical sections (see the third bar in Figure 1.3). Chapter 7 also shows that systems based on the data-oriented transaction execution model can easily re-partition at run time and adapt to load changes, a big advantage over designs that apply physical partitioning to the data.
Part I
Scalability of transaction processing systems

Chapter 2
Transaction processing: properties, workloads and typical system design

This chapter presents background information about transaction processing. It defines the concept of a transaction, briefly describes the structure of a typical transaction processing system and the various I/O activities that take place when processing transactions, and discusses the requirements of transaction processing workloads, presenting five representative transaction processing benchmarks. (This chapter draws material from various sources, most notably [RG03, GR92, HSH07].)

2.1 The concept of transaction and transaction processing

In general, transaction processing refers to the database operations corresponding to a business transaction. These operations range from tiny to fairly complex, are often fixed, and execute concurrently with many other requests. A database transaction is the basic unit of change in a database. According to Ramakrishnan and Gehrke [RG03]: "A transaction is any one execution of a user program in a DBMS. (Executing the same program several times will generate several transactions.) This is the basic unit of change as seen by the DBMS: Partial transactions are not allowed, and the effect of a group of transactions is equivalent to some serial execution of all transactions."

A transaction processing system is a software system that serves database transactions. Such a system is expected to maintain four important properties of database transactions, known by the acronym ACID [GR92]. When ACID is maintained, every database transaction obeys the following properties:

1. Atomicity. Either all the effects (modifications) of a transaction remain once it completes, or none of them do. The atomicity requirement has "all or nothing" semantics: to the outside world, from the application down to the concurrently running transactions, a committed transaction appears indivisible and atomic, while an aborted transaction appears as if it never happened, no matter what the transaction did before the abort.

2. Consistency. Every transaction leaves the database in a consistent state. (It is the responsibility of the application that issues the transaction to ensure that the transaction itself is correct.) The execution of a transaction transforms the database from one consistent state to another, while aborted transactions do not change the state.

3. Isolation. Transactions cannot interfere with each other. Moreover, depending on the level of isolation the application has requested, the effects of an incomplete transaction may or may not be visible to other transactions. Providing isolation is the main goal of concurrency control.

4. Durability. The effects (modifications) of successfully completed transactions must persist.

The concurrency of requests from the application and the need to maintain the ACID properties complicate the design of transaction processing systems.
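To illustrate the "all or nothing" contract from the application's point of view, the toy example below drives a hypothetical transactional interface; the Engine class and its method names are invented for this sketch and are not the API of any particular system.

```cpp
#include <stdexcept>
#include <unordered_map>

// Toy single-threaded "engine" illustrating the ACID contract described above.
// Undo information is kept so that abort() restores the pre-transaction state.
class Engine {
public:
    void begin()  { undo_.clear(); active_ = true; }
    void commit() { undo_.clear(); active_ = false; }        // changes stay
    void abort() {                                            // changes vanish
        for (const auto& [id, old] : undo_) balances_[id] = old;
        undo_.clear();
        active_ = false;
    }
    long read_balance(long id) { return balances_[id]; }
    void write_balance(long id, long v) {
        if (!active_) throw std::logic_error("no active transaction");
        undo_.emplace(id, balances_[id]);   // remember the first pre-image only
        balances_[id] = v;
    }
private:
    std::unordered_map<long, long> balances_;
    std::unordered_map<long, long> undo_;
    bool active_ = false;
};

// Either both balance updates survive, or neither does ("all or nothing").
bool transfer(Engine& db, long from, long to, long amount) {
    db.begin();
    try {
        long src = db.read_balance(from);
        if (src < amount) { db.abort(); return false; }       // as if nothing happened
        db.write_balance(from, src - amount);
        db.write_balance(to, db.read_balance(to) + amount);
        db.commit();
        return true;
    } catch (...) {
        db.abort();                                            // roll back partial work
        return false;
    }
}
```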
• Logging and recovery: Make sure that the database can recover to a consistent state in the event of a crash, discarding partially-performed work without side effects.
• Buffer pool management: Give the illusion that the system has infinite memory at its disposal.
These services, in turn, access even lower-level services. In general, the transaction processing engine is the most complicated part of any database system. It consists of many sub-components which are very tightly coupled with each other and often measure thousands of lines of code. Figure 2.1 shows the major components of a typical transaction processing system. The following subsections highlight these components, briefly explaining their functionality and how they are usually implemented.
Figure 2.1: Components of a transaction processing engine (transaction management, lock manager, log manager, metadata manager, access methods, free space management, memory management, buffer pool, latching), sitting between the application and the storage.
2.2.1 Transaction management
The transaction processing engine maintains information about all active transactions, especially the newest and oldest in the system, in order to coordinate services such as checkpointing and recovery. In addition, it allows threads to attach to and detach from transaction contexts and does all the related bookkeeping (e.g. a transaction cannot commit if more than two threads are attached to it). Checkpointing allows the log manager to discard old log entries, saving space and shortening recovery time. However, no transactions may begin or end during some phases of checkpoint generation, producing a potential bottleneck unless checkpoints are very fast.
2.2.2 Logging and recovery
The log manager records all the operations performed in the database into the database log. Logging modifications ensures that they are not lost if the system fails before the buffer pool flushes those changes to disk. The log also allows the database to roll back modifications in the event of a transaction abort. Most transaction processing engines follow the ARIES scheme [MHL+ 92], which weaves logging, buffer pool management, and concurrency control into a comprehensive recovery scheme. The log usually consists of two parts: the persistently stored log file and one or more main-memory log buffers. Transactions log their modifications in the main-memory log buffer(s), which are flushed to disk regularly. All the log entries of a transaction need to be flushed to disk before that transaction commits.
2.2.3 Access methods
One of the most important functions of the transaction processing engine is to maintain the database's various data structures on disk and in memory. The transaction processing engine must manage disk space efficiently across many insertions and deletions, in the same way that malloc() and free() manage memory. It is especially important that pages which are scanned regularly by transactions be allocated sequentially to improve disk access times; table reorganizations are occasionally necessary in order to improve data layout on disk. In general, there are three types of data to be managed: (a) heap files, (b) index structures, and (c) metadata.
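To make the log manager interface of Section 2.2.2 more concrete, the sketch below shows a minimal write-ahead log buffer. It is only an illustration under simplifying assumptions, not Shore-MT code: all names (LogManager, append, commit_wait, flush) are hypothetical, the buffer never wraps, and the actual write to stable storage is omitted. It captures the two rules stated above: records are first appended to a main-memory buffer, and a transaction may not commit before its log records are durable.

    #include <cstddef>
    #include <cstdint>
    #include <condition_variable>
    #include <mutex>
    #include <vector>

    // Minimal write-ahead logging sketch (hypothetical names, not Shore-MT code).
    // Transactions append records to a main-memory buffer; a commit may not
    // return before every record up to the transaction's commit LSN is durable.
    class LogManager {
    public:
        // Append a log record and return the LSN just past it.
        uint64_t append(const void* rec, size_t len) {
            std::lock_guard<std::mutex> guard(mutex_);   // serializes all inserts
            const char* bytes = static_cast<const char*>(rec);
            buffer_.insert(buffer_.end(), bytes, bytes + len);
            next_lsn_ += len;
            return next_lsn_;
        }

        // Block a committing transaction until its commit record is on disk.
        void commit_wait(uint64_t commit_lsn) {
            std::unique_lock<std::mutex> guard(mutex_);
            flushed_cv_.wait(guard, [&] { return durable_lsn_ >= commit_lsn; });
        }

        // Called by a flusher thread: harden the buffer and wake up committers.
        void flush() {
            std::unique_lock<std::mutex> guard(mutex_);
            // ... write buffer_ to the log file and fsync() it (omitted) ...
            durable_lsn_ = next_lsn_;
            buffer_.clear();
            flushed_cv_.notify_all();
        }

    private:
        std::mutex mutex_;                  // protects everything below
        std::condition_variable flushed_cv_;
        std::vector<char> buffer_;          // main-memory log buffer
        uint64_t next_lsn_ = 0;             // next byte offset to assign
        uint64_t durable_lsn_ = 0;          // everything below this is on disk
    };

Note that the single mutex serializing every append in this sketch is precisely the kind of critical section whose contention grows with hardware parallelism; Chapter 5 shows how log buffer inserts can be downgraded to composable critical sections.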
A proper selection of indexes (part of physical design) and an effective optimizer can reduce query execution times by factors of a thousand or more by avoiding unnecessary disk scans when only one value is needed.
Heap data. Normal, unordered database records are stored in heap files. These provide sequential access to unknown (sets of) records, or random access to records whose location is known through other means (such as the result of an index probe).
Index data. Indexes provide key-based access to data, either based on a candidate key (unique) or on a key attribute which may map to many tuples. The most common index types are B+trees [BM70] and hash-based structures [FNPS79, Lit80]. The former give O(log n) access time to ordered data, while the latter give O(1) access time to unordered data. B+trees can also be used for range scans, whereas hash-based structures cannot; as a result, B+trees are the most frequently used indexing technique.
Metadata. The transaction processing engine maintains metadata about the physical layout of data on disk, especially as it relates to space management. This metadata is similar to file system metadata.
2.2.4 Metadata management
The transaction processing engine stores meta-information about the different objects (heap files, index structures) that store the database. Applications make heavy use of this metadata. From the perspective of the transaction processing engine, database metadata (such as the data dictionary or catalog) is just another type of data. The transaction processing engine ensures that changes to metadata and free space do not corrupt running transactions, while also servicing a high volume of requests, especially for metadata. Part of the metadata is updated very infrequently: typically, the set of objects (heap files and index structures) in use in a database, along with their structure (number of columns, their order, and data types), changes very rarely. Thus, systems are usually optimistic and cache the metadata throughout the connection session.
2.2.5 Buffer pool management
The buffer pool manager presents the rest of the system with the illusion that the entire database resides in main memory, similar to an operating system's virtual memory manager. The buffer pool is a set of "frames," each of which can hold one page of data from disk. When an application requests a database page not currently in memory, it must wait while the buffer pool manager fetches it from disk. If there is no free frame for the newly fetched page, the buffer pool needs to evict another page, following a replacement policy (e.g. Least-Recently Used, CLOCK [Smi78], 2Q [KCK+ 00], or ARC [MM03]). Transactions "pin" in-use pages in the buffer pool to prevent them from being evicted, and unpin them when finished. As part of the logging and recovery protocol, the buffer pool manager and log manager are responsible for ensuring that modified pages are flushed to disk (preferably in the background) so that changes to in-memory data become durable. In order to quickly find any requested database page, buffer pools are typically implemented as large hash tables. Operations within the hash table must be protected from concurrent structural changes caused by evictions, usually with per-bucket mutex locks.
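The following sketch illustrates this organization: a hash table with one mutex per bucket, and pin counts that keep in-use frames from being evicted. It is a minimal illustration under simplifying assumptions (eviction, disk I/O, and page latching are omitted), and all names are hypothetical rather than taken from any particular engine.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <mutex>

    // Sketch of a buffer-pool hash table with one mutex per bucket (hypothetical
    // names; eviction, disk I/O, and page latching are omitted).
    struct Frame {
        uint64_t page_id = 0;
        std::atomic<int> pin_count{0};       // pinned frames must not be evicted
        // ... the page latch and the page image itself would live here ...
    };

    class BufferPool {
    public:
        static constexpr size_t kBuckets = 1024;

        // Look up a page and pin it. The critical section protects only the
        // bucket's chain, not the page contents.
        Frame* fix(uint64_t page_id) {
            Bucket& b = buckets_[page_id % kBuckets];
            std::lock_guard<std::mutex> guard(b.mutex);
            for (Frame* f : b.chain) {
                if (f->page_id == page_id) {
                    f->pin_count.fetch_add(1);   // pin before leaving the bucket
                    return f;                    // caller latches the page next
                }
            }
            return nullptr;   // miss: caller must pick a victim and read from disk
        }

        void unfix(Frame* f) {
            f->pin_count.fetch_sub(1);           // frame becomes evictable again
        }

    private:
        struct Bucket {
            std::mutex mutex;                    // per-bucket critical section
            std::list<Frame*> chain;             // frames hashing to this bucket
        };
        std::array<Bucket, kBuckets> buckets_;
    };

Note that fix() pins the frame while still holding the bucket mutex, so that the frame cannot be recycled between the lookup and the subsequent latch acquisition.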
Hash collisions and hot pages can cause contention among threads for the hash buckets; growing memory capacities and hardware context counts increase the frequency of page requests, and hence the pressure the buffer pool must deal with. Finally, the buffer pool must flush dirty pages and identify suitable candidates for eviction without negatively impacting application requests (either by evicting the wrong candidates or by preventing applications from accessing pages in memory).
2.2.6 Concurrency control
Database engines must enforce logical consistency at the transaction level, ensuring that transactions do not interfere with the correctness of other concurrent transactions. One of the most intuitive (and restrictive) consistency models is "two-phase locking" (2PL), which dictates that a transaction may not acquire any new locks once it has released any. This scheme is sufficient to ensure that all transactions appear to execute in some serial order, though it can also restrict concurrency. An even more restrictive alternative is "strict two-phase locking" (strict 2PL), which in addition to 2PL dictates that all locks (shared and exclusive) are released only after the transaction commits or aborts. In order to balance the overhead of locking with concurrency, the transaction processing engine also provides hierarchical locks. For example, to modify a single row a transaction acquires a database lock, table lock, and row lock; meanwhile, transactions which access a large fraction of a table may reduce overhead by "escalating" to coarser-grained locking at the table level. An increasing number of database engines also support an alternative to pessimistic locking known as multiversioned buffer management [BG83], which provides the application with snapshot isolation [BJK+ 97]. These schemes allow writers to update copies of the data rather than waiting for readers to finish. Copying avoids the need for most low-level locking and latching because older versions remain available to readers. Multiversioning is highly effective for long queries which would otherwise conflict with many short-running update transactions, but performs poorly under contention. In addition, many transactions update only a few bytes per record accessed, and multiversioning imposes the cost of copying an entire database page per record. Finally, snapshot isolation suffers from certain non-intuitive isolation anomalies that have only partly been addressed to date [JFRS07, AFR09].
Figure 2.2: An OLTP installation. There are four main components: a (possibly multicore) processor, ample main memory, a storage subsystem that stores the database, and a storage subsystem that maintains the database log. There are three main types of I/O activity: (1) logging, (2) write-backs of dirty pages, and (3) ad-hoc reads of random pages.
2.3 I/O activities in transaction processing
Transaction processing systems need to handle efficiently the various I/O activities that occur when processing transactions. In fact, handling I/O efficiently has been one of the biggest concerns of transaction processing system designers over the past decades, especially when database servers were mostly uniprocessors and mechanical hard drives were very slow.
But, the emergence of multicore processors and other technologies, such as flashbased storage devices, had brought up other issues, such as the scalability on highly parallel hardware, which is the main topic of this dissertation. In this section we analyze the possible I/O activities during transaction processing and make the case that for a significant range of transactional applications, our study is valid even though it is performed on machines that do not have very efficient (and expensive) I/O subsystems. In the following subsections, we identify the different I/O activities and categorize their resulting I/O patterns. The three main types of I/O activities during transaction processing, depicted in Figure 2.2, are the transactional logging, the ad-hoc reads and evictions of random pages, and the write-backs of dirty pages. 2.3.1 Logging The first type of I/O activity is the log writing. The transaction processing system maintains a log to ensure the ACID properties over crashes or hardware failures. The log is a file of modifications done to the database. Before a transaction commits, the redo log records generated by the transaction must be flushed to stable media [MHL+ 92]. The log is typically implemented as a circular buffer of constant size. When the log needs to wrap around (the stable storage allocated for log becomes full) the oldest log records will be truncated to make space for new records. For databases whose working set fits in main-memory, which is a common case in modern database servers with large main memories, the log flushing is the only I/O activity that needs to take place synchronously during the execution of a transaction. The system must ensure that a transaction’s log records reach non-volatile storage before committing. With access times in the order of milliseconds, a log flush to magnetic media can easily become the longest part of a transaction. Further, log flush delays become serial if the log device is overloaded by multiple small requests. Fortunately, The I/O pattern of the log writing are essentially sequential appends, which can be easily handled by modern solidstate drives. Thus, log flush I/O times become less important as fast solid-state drives gain popularity [BJB09, LMP+ 08, Che09], and when using techniques such as group commit [HSL+ 89, RD89]. 24 CHAPTER 2. BACKGROUND: TRANSACTION PROCESSING Even when the log storage device can sustain the write rate needed by the system, transactional log flushing causes at least two problems: (a) the actual I/O wait time during which all the locks are held possibly reducing concurrency, and (b) the context switches required to block and unblock the thread at either end of the flush. To that end, our work on Aether logging [JPS+ 10, JPS+ 11] presents two techniques that remove those potential problems. We briefly discuss those techniques in the next two paragraphs. For more details about those two mechanism and performance evaluation, the interested reader is referred to [JPS+ 10] and [JPS+ 11]. Early Lock Release. To handle the problem of long waits for locks held during log flushes, DeWitt et al. [DKO+ 84] observe that a transaction’s locks can be released before its commit record is written to disk, as long as it does not return results to the client before becoming durable. Other transactions which read data updated by a pre-committed transaction become dependant on it and must not be allowed to return results to the user until both their own and their predecessor’s log records have reached the disk. 
Serial log implementations preserve this property naturally, because the dependant transaction’s log records must always reach the log later than those of the pre-committed transaction and will therefore become durable later also. Formally, as shown in [SSY95], the system must meet two conditions for early lock release to preserve recover-ability: (a) Every dependant transaction’s commit log record is written to the disk after the corresponding log record of pre-committed transaction; and (b) when a pre-committed transaction is aborted all dependant transactions must also be aborted. Most systems meet this condition trivially; they do no work after inserting the commit record, except to release locks, and therefore can only abort during recovery when all uncommitted transactions roll back. Early Lock Release (ELR) removes log flush latency from the critical path by ensuring that only the committing transaction must wait for its commit operation to complete; having released all held database locks, others can acquire these locks immediately and continue executing. In spite of its potential benefits modern database engines do not implement ELR and to our knowledge this is the first paper to analyze empirically ELR’s performance. We hypothesize that this is largely due to the effectiveness of asynchronous commit [Ora05, Pos10], which obviates ELR and which nearly all major systems do provide. However, systems which do not sacrifice durability can benefit strongly from ELR under workloads which exhibit lock contention and/or long log flush times. Flush Pipelining. Optimizations such as group commit [RD89] focus on improving I/O wait time without addressing thread scheduling. On the other hand, ELR decreases the 2.3. I/O ACTIVITIES IN TRANSACTION PROCESSING 25 wait time of other transactions that wait on locks held by the transaction that does the log flush. Still the requesting transaction must still block for its log flush I/O and be rescheduled as the I/O completes. Unlike I/O wait time, which the OS can overlap with other work, each scheduling decision consumes several microseconds of CPU time which cannot be overlapped. To eliminate the scheduling bottleneck (and thereby increase CPU utilization and throughput), the database engine must decouple the transaction commit from thread scheduling. Flush Pipelining is a technique which allows agent threads to detach from transactions during log flush in order to execute other work, resuming the transaction once the flush is completed. Flush Pipelining operates as follows. First, agent threads commit transactions asynchronously (without waiting for the log flush to complete). However, unlike asynchronous commit they do not return immediately to the client but instead detach from the transaction, encode its state at the log and continue executing other transactions. A daemon thread triggers log flushes using policies similar to those used in group commit (e.g. “flush every X transactions, L bytes logged, or T time elapsed, whichever comes first”). After each I/O completion, the daemon notifies the agent threads of newly-hardened transactions, which eventually reattach to each transaction, finish the commit process and return results to the client. Transactions which abort after generating log records must also be hardened before rolling back. The agent threads handle this case as relatively rare under traditional (non-optimistic) concurrency control and do not pass the transaction to the flush daemon. 
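The sketch below shows, under simplifying assumptions, the control flow just described: agent threads hand off committed-but-not-yet-durable transactions to a flush daemon and immediately continue with other work, while the daemon applies a group-commit-style flush policy and notifies completions. All names (FlushPipeline, submit, daemon_loop, and the helper functions) are hypothetical; this is not the Aether or Shore-MT implementation, and client notification and re-attachment are reduced to stubs.

    #include <chrono>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>

    // Sketch of the flush-pipelining control flow (hypothetical names): agent
    // threads hand committed-but-not-yet-durable transactions to a daemon and
    // continue with other work; the daemon flushes the log and signals completion.
    struct PendingCommit {
        uint64_t commit_lsn;   // LSN of the transaction's commit record
        uint64_t xct_id;       // used to re-attach and reply to the client later
    };

    class FlushPipeline {
    public:
        // Agent thread: enqueue and return immediately to run other transactions.
        void submit(PendingCommit pc) {
            std::lock_guard<std::mutex> g(mutex_);
            pending_.push(pc);
            cv_.notify_one();
        }

        // Daemon thread: group-commit-style policy ("flush every X transactions,
        // L bytes logged, or T time elapsed, whichever comes first").
        void daemon_loop() {
            using namespace std::chrono_literals;
            for (;;) {
                std::unique_lock<std::mutex> g(mutex_);
                cv_.wait_for(g, 5ms, [&] { return pending_.size() >= 32; });
                if (pending_.empty()) continue;
                const uint64_t target = pending_.back().commit_lsn;
                g.unlock();
                flush_log_up_to(target);               // one I/O covers the group
                g.lock();
                while (!pending_.empty() && pending_.front().commit_lsn <= target) {
                    notify_hardened(pending_.front().xct_id);  // agent re-attaches
                    pending_.pop();                            // and replies
                }
            }
        }

    private:
        void flush_log_up_to(uint64_t /*lsn*/) { /* write and fsync the log */ }
        void notify_hardened(uint64_t /*xct_id*/) { /* wake the owning agent */ }

        std::mutex mutex_;
        std::condition_variable cv_;
        std::queue<PendingCommit> pending_;
    };

The key point is that no agent thread ever blocks on the log I/O itself; the only waiting happens inside the daemon.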
(Note that most transaction rollbacks that are not due to deadlocks arise because of invalid inputs; these usually abort before generating any log records and do not have to be considered.) When combined with ELR, Flush Pipelining provides the same throughput as asynchronous commit without sacrificing any safety. In summary, fast solid-state drives and techniques such as Early Lock Release and Flush Pipelining handle the performance problems related to log flushing, which may be the only source of I/O for transactional databases that fit in main memory.
2.3.2 On-demand reads & evictions
The second type of I/O activity is the on-demand reads of random pages. If the OLTP database is larger than the buffer pool, the database pages to be accessed by a transaction may be missing from the buffer pool. In such a situation, the system issues read I/O requests on demand to retrieve the required pages into main memory for processing. The execution of the transaction blocks until the required pages are brought into memory. The resulting I/Os are small random reads, equal in size to the database page size. While the transaction is blocked, another transaction starts running on the CPU, keeping it utilized. As long as there are many concurrent requests that can be served, the CPUs of the system will be fully utilized. Solid-state drives provide two orders of magnitude higher random read bandwidth than mechanical disk drives; consequently, to keep a machine with solid-state drives fully utilized, the number of outstanding transactions needs to be two orders of magnitude smaller than if the system had mechanical hard drives. If context switches become a significant fraction of the execution time, a system can bypass this problem by employing asynchronous non-blocking I/Os. There is also the case where the buffer pool is full and decides to evict a dirty page, which needs to be written back. In that case the resulting I/O is again a random write. Such cases are less frequent than the random reads because the buffer pool manager will typically prefer to replace a clean rather than a dirty page. The reason is simple: in order to replace a clean page the system has only to copy the new page into the frame occupied by the old (clean) page, whereas in order to replace a dirty page the system first has to write it back to stable storage, doubling the I/Os performed to bring a new page into the buffer pool. Since we are mostly interested in servers with ample main memory, we expect the evictions of dirty pages to be extremely rare.
2.3.3 Dirty page write-backs
The final type of I/O activity is the dirty page write-backs forced by a checkpoint or log truncation. A database page in the buffer pool becomes dirty if it is modified by a transaction. Dirty pages are not written back to stable storage synchronously during transaction execution. Instead, page-cleaning threads perform the write-backs asynchronously in the background. In this way, multiple transactions may make modifications to (different parts of) the same page in the buffer pool, reducing the number of write-back I/Os. Additionally, those I/Os do not add to the response time of the transaction, which should be kept as low as possible. A dirty page, however, will eventually have to be written back if its associated redo log records are to be truncated because of a log wrap-around or a checkpoint call.
A checkpoint is forced when the log exceeds the size from which the database could recover within the time defined by the application as the acceptable recovery interval. (In case of a failure, recovery starts from the last checkpoint [MHL+ 92].) Hence, the mean time to recovery, specified by the application, determines the size of the log and the checkpoint frequency. The higher the checkpoint frequency (or the smaller the log size), the larger the pressure on the underlying storage system and the larger the probability of the system becoming I/O-bound. On the other hand, the smaller the mean time to recovery, the better for the application. Therefore the system should be able to apply frequent checkpoints without detriment to performance. Dirty pages are flushed to the same location on stable storage they came from. Since the dirty pages may be distributed across the entire database, the resulting I/O activity is typically a set of small random writes. The page cleaners try to find consecutive dirty pages so that they can write them to disk in larger blocks, but their approach is opportunistic and offers no guarantees.
2.3.4 Summary and a note about experimental setups
To summarize, during OLTP we encounter three basic types of I/O activity: logging, which consists of sequential appends; random on-demand page reads on buffer pool misses and possible buffer pool evictions; and write-backs of dirty pages, which page cleaners try to coalesce into larger writes of consecutive pages. Software techniques, such as Early Lock Release and Flush Pipelining, in combination with fast solid-state drives which provide enough sequential write bandwidth, prevent I/O flush wait times from being the bottleneck in transactional workloads. For example, in our experiments the maximum log write rate we encountered was around 180MB/sec (in Section 5.4.2). This was achieved when all 64 hardware contexts of a multicore machine were fully utilized running an update-heavy transactional workload. (Most workloads also contain read-only transactions that reduce the pressure for log I/O.) Such a sequential write bandwidth can be easily handled by solid-state drives. For example, one recently announced solid-state drive is reported to provide up to 520MB/sec sequential write bandwidth [Int12]. Also, throughout our experimentation we did not observe a case where the page cleaners could not keep up with the log writing rate. At the same time, solid-state drives provide low latency and a large number of random I/O operations per second (for example, the Intel SSD 520 is reported to provide 80K IOPS [Int12]), so the number of concurrent transactions needed to keep a multicore server fully utilized is relatively small, two orders of magnitude smaller than on machines with magnetic hard drives. With the previous analysis in mind, we believe that I/O will not prevent transaction processing systems from being CPU-bound. That is why this dissertation focuses on the performance and scalability of transaction processing systems when multicore servers are fully utilized. The majority of the combinations of hardware and transaction processing systems we use throughout this dissertation are capable of delivering high performance on the benchmarks described in the next section, as long as the I/O sub-system allows. The demand on the I/O sub-system scales with throughput.
To by-pass this problem, and yet have meaningful analysis we put the database files in a file system in main-memory. Threads that need to perform and I/O still have to context switch, but we ensure that I/Os are not the bottleneck. In some other experiments with systems that we have access to the source code (such as Shore-MT [JPH+ 09], DORA [PJHA10] and PLP [PTJA11]), we modify the system to impose a 6 msec penalty for each I/O operation. The artificial delay simulates a high-end disk array having many spindles, such that all requests can proceed in parallel but must each still pay the cost of a disk seek. This arrangement is somewhat pessimistic because it assumes every access requires a full seek even if there is some sequential component to the access pattern, but it ensures that all aspects of the transaction processing system are exercised. We observed that the quantitative analysis did not change regardless if we were keeping the data in main-memory or imposing the artificial delay. 2.4 Transactional processing workloads and benchmarks The concurrency of requests and the need to maintain the ACID requirements are the main characteristics of transactional workloads. Recently, due to the high increase in the need for data management services, and because some applications can tolerate it, systems relax ACID requirements (e.g. [Vog09]) or drop “traditional” data processing capabilities of database management systems (e.g. the key-value stores [CDG+ 06, DHJ+ 07]). Transaction processing benchmarks are the gold standards for performance, and they are used for marketing purposes. The following subsections describe several important database transaction processing benchmarks mentioned and used throughout this dissertation. 2.4.1 TPC-A The first widely-accepted database benchmark was formalized in 1985 [A+ 85]. That specification included three workloads, of which the “DebitCredit” stressed the database engine. The DebitCredit benchmark was an instant success, soon database and hardware vendors took to reporting extraordinary results, often achieved by removing key constraints from the specification. Therefore, in 1988 a consortium of analysts and hardware, operating system, and database system vendors formed the Transaction Processing Performance Council (TPC) in order to enforce some order in database benchmarking. Its first benchmark specification, TPC-A, essentially formalized the DebitCredit benchmark. TPC-A is very simple. It models deposits and withdrawals on random bank accounts, with the associated double-entry accounting on a database that contains 10k Branches, 100k Tellers, and 10M Accounts. It also captures the entire system, including terminals and 2.4. OLTP WORKLOADS AND BENCHMARKS 29 network. Transactions usually originate from their “home” Branch, but can go anywhere; conflicts are also possible, requiring the system to recover occasionally from failed transactions. An important aspect of this benchmark is its scaling rule: for a result to be valid, the database size must be proportional to the reported throughput. Simple though it was, the DebitCredit benchmark highlighted the importance of quantifying the performance and correctness of different systems – early benchmarking showed vast performance differences between different vendors (400x), as well as exposing serious bugs which had lurked, undiscovered, for many years in mature products. 
2.4.2 TPC-B TPC’s second benchmark, TPC-B [TPC94], was also very similar to DebitCredit, but cut out the network and terminal handling to create a database engine stress test. Like DebitCredit, the TPC-B database contains four tables: Branch, Teller, Account, and History, which are accessed in double-entry accounting style as customers make deposits and withdrawals from various tellers. The benchmark consists of a single transaction AccountUpdate and stresses the transaction processing engine heavily, especially logging and concurrency control. 2.4.3 TPC-C For its third benchmark specification, TPC-C [TPC07], the TPC moved away from banking to commerce. TPC-C models an online transaction processing database for a wholesale supplier. It consists of five transactions which follow customer orders from initial creation to final delivery and payment. Below we briefly describe the five transactions and in the parenthesis we show the frequency of each transaction as specified by the benchmark. • New Order (45%). The NewOrder inserts a new sales order into the database. It is a medium-weight transaction with 1% failure rate due to invalid inputs. • Payment (43%). The Payment is a short transaction, very similar to the transaction of TPC-B, which makes a payment on an existing order. • Order Status (4%). The OrderStatus is a read-only transaction which computes the shipping status and the line items of an order. • Delivery (4%). The Delivery is the largest update transaction and also the most contentious. It selects the oldest undelivered orders for each warehouse and marks them as delivered. • Stock Level (4%). The StockLevel is also a read-only transaction. It joins on average 200 order line items with their corresponding stock entries in order to produce a report. CHAPTER 2. BACKGROUND: TRANSACTION PROCESSING 30 The benchmark combines the five transactions listed above at their specified frequencies. The specification lays out strict requirements about response time, consistency, and recoverability in the system, and returned to testing an end-to-end system that includes network and terminal handling. Like the transactions, the database schema is more complex, consisting of nine tables instead of four; where prior benchmark schemas could be represented as a tree the TPC-C schema is a directed acyclic graph. TPC-C stresses the entire stack (database system, operating system and hardware) in several ways. First, it mixes together short and long, read-only and update-intensive transactions, exercising a wider variety of features and situations than previous benchmarks. In addition, the benchmark has significant hotspots, partly from the way transactions access the Warehous table, and partly from the way the Delivery transaction is designed. The resulting contention and deadlocks stress the system’s concurrency control mechanisms. Finally, the database grows throughout the benchmark run, stressing code paths which previous benchmarks had not touched. TPC-C is the most popular OLTP benchmark for over twenty years. All database vendors have published results in TPC’s website, and in several occasions it has been used for marketing purposes.8 2.4.4 TPC-E 9 The goal of TPC with its latest OLTP benchmark, TPC-E, was to make a more realistic than TPC-C [TPC10], which is getting rather old. TPC-E incorporates several features that are found in real-world transaction processing applications but missing in TPC-C, such as check constraints and referential integrity. 
In addition, the TPC-E databases are populated with pseudo-real data based on the year 2000 U.S. and Canada census data and on actual listings on the NYSE and NASDAQ stock exchanges. In this way, TPC-E reflects natural data skews in the real world, addressing the complaint that TPC-C uses random data that do not reflect real-world data distributions. TPC-E models a financial brokerage house. There are three components: customers, brokerage house, and stock exchange. TPC-E's focus is the database system supporting the brokerage house, while the customers and the stock exchange are simulated to drive transactions at the brokerage house. (Examples of TPC-C results used for marketing purposes: http://www.oracle.com/us/solutions/performance-scalability/t3-4-tpc-c-12210-bmark-190934.html and http://www-03.ibm.com/press/us/en/pressrelease/32328.wss. Some of the material of this subsection is presented in [CAA+ 10].) There are 33 tables in TPC-E, over three times as many as in TPC-C. Of the 33 TPC-E tables, 9 record customer account information, 9 record broker and trade information, 11 are market-related, and 4 are dimension tables for addresses and fixed information such as zip codes. 19 of the 33 tables scale with the number of customers, 5 tables grow during TPC-E runs, and the remaining 9 tables are static. TPC-E has 6 read-only and 4 read-write transaction types, compared to 2 read-only and 3 read-write transaction types in TPC-C. 76.9% of the generated transactions in TPC-E are read-only, while only 8% of TPC-C transactions are read-only. This suggests that TPC-E is more read-intensive than TPC-C. TPC-E's test setup is complicated, requiring the development of customer and market drivers. The complexity of TPC-E and the lack of in-depth understanding of it have so far led to its slow adoption by both industry and academia. For example, only one database vendor has posted results of this benchmark on TPC's website. In addition, TPC-E imposes far less stress on the database engine than TPC-C or TPC-B [CAA+ 10]. Since the goal of this dissertation is to improve the scalability of database engines when their internals are under stress, we have elected not to use it.
2.4.5 TATP
The only benchmark we use in this dissertation that is not specified by the Transaction Processing Performance Council is the Telecommunications Application Transaction Processing benchmark, or TATP [NWMR09], also known as "Telecom One" or "TM-1", and as the "Network Database Benchmark" or "NDBB". TATP was originally developed by Nokia as an in-house test to verify the suitability of various database, operating system and hardware offerings for use with Nokia's telecommunications business. TATP consists of seven transactions, operating on four database tables, which implement various Home Location Register operations executed by mobile networks during cell phone calls, including cell tower hand-offs and call forwarding. The transactions are extremely short, usually accessing only 1-4 database rows, and must execute with very low latency even under extreme load. The benchmark is unusual in that many transactions fail, either due to invalid inputs or because they probe for non-existent entries. The very high failure rate, 25% on average, stresses the logging and recovery components of the system and may cause deadlocks. Three of the transactions are read-only while the other four perform updates (execution frequencies in parentheses):
• Get Subscriber Data (35%). The GetSubscriberData is a read-only transaction which retrieves information about the location of a subscriber, accessing a single table.
It is one of the two transactions of the benchmark that are not expected to fail.
• Get New Destination (10%). The GetNewDest is a read-only transaction which retrieves the current call forwarding destination for a subscriber, if any, touching two tables. There is a 75% probability that the subscriber does not have a current call forwarding destination, in which case the transaction fails.
• Get Access Data (35%). The GetAccData is a read-only transaction which returns the subscriber's access validation data by probing a single table. There is a 37.5% probability that this transaction fails.
• Update Subscriber Data (2%). The UpdSubData transaction updates a subscriber's profile, touching two tables, with a 37.5% failure rate.
• Update Location (14%). The UpdLocation transaction updates the current location of a subscriber. It touches a single table and never fails.
• Insert Call Forwarding (2%). The InsCallFwd adds a call forwarding destination. It touches three tables (making it the most complicated transaction of the benchmark) and has a 68.75% failure rate.
• Delete Call Forwarding (2%). The DelCallFwd transaction removes a call forwarding destination. It touches two tables and has a 68.75% failure rate.
Reflecting its origins in cell phone call processing, the benchmark focuses on throughput, very low and predictable response times, and high availability. The description of the transactions shows how stressful this benchmark is for the database engine. All the transactions consist of one or a few index probes and record accesses, with very limited application logic. As a result, during the execution of this benchmark the system spends most of its time inside the database engine. For example, compared with TPC-C, TATP's dataset tends to be memory-resident and the transactions are much shorter, with vastly higher abort rates. TATP stresses the transaction processing engine heavily because the short transactions tend to expose overheads that longer transactions would mask.
Chapter 3
Scalability Problems in Database Engines
Database engines have long been able to efficiently handle multiple concurrent requests. Until recently, however, a computer contained only a few single-core CPUs, and therefore only a few transactions could simultaneously access the database engine's internal structures. This allowed database engines to get away with using non-scalable approaches without any severe penalty. With the arrival of multicore chips, however, this situation is rapidly changing. More and more threads can run in parallel, stressing the internal scalability of the database engine. Systems optimized for high performance at a limited number of cores are not assured similarly high performance at a higher core count, because unanticipated scalability obstacles arise. In this chapter, we first present the major components of a typical database engine, making clear that the codepaths in transaction execution are full of points of serialization. Then, we benchmark four popular open-source database engines (SHORE, BerkeleyDB, MySQL, and PostgreSQL) on a modern multicore machine. We find that all of them suffer in terms of scalability. (This chapter highlights some of our findings, which appeared in EDBT 2009 [JPH+ 09].)
3.1 Introduction
Most database engine designs date back to the 1970's or 1980's, when disk I/O was the predominant bottleneck. Machines typically featured 1-8 (uni)processors (up to 64 at the very high end) and limited RAM.
Single-thread speed, minimal RAM footprint, and I/O subsystem efficiency determined the overall performance of a storage manager. Efficiently multiplexing concurrent transactions to hide disk latency was the key to high throughput. Thus, research focused on efficient buffer pool management, fine-grain concurrency control, and sophisticated caching and logging schemes. Today's database systems face a different environment. Main memories are on the order of several tens of gigabytes, and play the role of disk for many applications whose working set fits in memory [Gra07b, SMA+ 07]. Modern CPU designs all feature multiple processor cores per chip, often with each core providing some flavor of hardware multithreading. For the foreseeable future we can expect single-thread performance to remain the same or increase slowly while the number of available hardware contexts grows exponentially. As a result, the database engine must be able to utilize the dozens of hardware contexts that will soon be available to it. However, as Section 3.2 discusses, the codepath of a typical database engine is full of points of serialization which may hamper scalability. Unfortunately, the internal scalability of database engines has not been tested under such rigorous demands before. To determine how well existing database engines scale, we experiment with four popular open-source systems: SHORE [CDF+ 94], BerkeleyDB (http://www.oracle.com/technology/products/berkeley-db/index.html), MySQL (http://www.mysql.com), and PostgreSQL [SR86]. The latter three engines are all widely deployed in commercial systems. We ran our experiments on a Sun T2000 (Niagara) server [KAO05], a "lean" multicore processor design [HPJ+ 07], featuring eight cores and four hardware thread contexts per core for a total of 32 OS-visible "processors." Our first experiment consists of a micro-benchmark (a small, tightly-controlled application) where each client in the system creates a private table and repeatedly inserts records into it (see Section 3.3.1 for details). The setup ensures there is no contention for database locks or latches, and that there is no I/O on the critical path.
Figure 3.1: Scalability of four popular open-source storage managers (MySQL, BDB, PostgreSQL, SHORE), with normalized throughput on the y-axis and the number of concurrent clients/transactions (up to 32) on the x-axis. A perfectly scalable system at 32 clients (the right-most point of the graph) would achieve 32x the performance of 1 client, more than three times higher than the highest-performing system under test.
Figure 3.1 shows the results of executing this micro-benchmark on each of the four storage managers. The number of concurrent threads varies along the x-axis, with the corresponding throughput for each engine on the y-axis. As the number of concurrent threads grows from 1 to 32, throughput in a perfectly scalable system should increase linearly. However, none of the four systems scales well, and their behavior varies from arriving at a plateau (PostgreSQL and SHORE) to a significant drop in throughput (BerkeleyDB and MySQL). These results suggest that, with core counts doubling every two years, none of these transaction processing systems is ready for the multicore era; even though database systems have utilized parallel hardware for decades (e.g.
[DG92, SKPO88, DGS+ 90, BAC+ 90, AVDBF+ 92]). It should be noted that Figure 3.1 summarizes the situation ca. 2009. The situation has improved markedly since we first ran these experiments. Developers of the various engines have focused on improving their scalability, sometimes reporting that they used techniques presented in this dissertation. In retrospect these results are understandable because when these engines were developed, internal scalability was not a bottleneck and designers did not foresee the coming shift to multicore hardware. At the time, it would have been difficult to justify spending considerable effort in this area. Today, however, internal scalability of database engines is the key to performance as core counts continue increasing. The rest of this chapter is structured as follows. In Section 3.2, we briefly overview the major components of a database engine and list the kinds of critical sections they use. Then, in Section 3.3, we measure the scalability of popular database engines, and we conclude in Section 3.4. 3.2 Critical sections inside a database engine Database engines purposefully serialize transaction threads in three ways. Database locks enforce consistency and isolation between transactions by preventing other transactions from accessing the lock holder’s data. Locks are a form of logical protection and can be held for long duration (potentially several disk I/O times). Latches protect the physical integrity of database pages in the buffer pool, allowing multiple threads to read them simultaneously, or a single thread to update them. Transactions acquire latches just long enough to perform physical operations (at most one disk I/O), depending on locks to protect that data until transaction commit time. Locks and latches have been studied extensively [ACL87, GL92]. Database locks are especially expensive to manage, prompting proposals for hardware acceleration [Rob85]. Critical sections form the third source of serialization. Database engines employ many complex, shared data structures; critical sections (usually enforced with semaphores or mutex 36 CHAPTER 3. SCALABILITY PROBLEMS IN DATABASE ENGINES locks) protect the physical integrity of these data structures in the same way that latches protect page integrity. Unlike latches and locks, critical sections have short and predictable duration’s because they seldom span I/O requests or complex algorithms [JPA08, HM93]; often the thread only needs to read or update a handful of memory locations. For example, a critical section might protect traversal of a linked list. Critical sections abound throughout the codepath of typical database engines. For example, in Shore-MT (the storage manager we present next in Chapter 5), we estimate that a TPC-C Payment transaction – which only touches 4-6 database records (see Section 2.4.3) – enters roughly one hundred critical sections before committing. Under these circumstances, even uncontended critical sections are important because the accumulated overhead can contribute a significant fraction of overall cost. The rest of this section presents an overview of major storage manager components and lists the kinds of critical sections they use. Buffer pool manager. The buffer pool manager maintains a pool for in-memory copies of in-use and recently-used database pages and ensures that the pages on disk and in memory are consistent with each other. The buffer pool consists of a fixed number of frames which hold copies of disk pages and provide latches to protect page data. 
The buffer pool uses a hash table that maps page IDs to frames for fast access, and a critical section protects the list of pages at each hash bucket. Whenever a transaction accesses a persistent value (data or metadata) it must locate the frame for that page, pin it, then latch it. Pinning prevents the pool manager from evicting the page while a thread acquires the latch. Once the page access is complete, the thread unlatches and unpins the page, allowing the buffer pool to recycle its frame for other pages if necessary. Page misses require a search of the buffer pool for a suitable page to evict, adding yet another critical section. Overall, acquiring and releasing a single page latch requires at least 3-4 critical sections, and more if the page gets read from disk. Lock manager. Database locks preserve isolation and consistency properties between transactions. Database locks are hierarchical, meaning that a transaction wishing to lock one row of a table must first lock the database and the table in an appropriate intent mode. Hierarchical locks allow transactions to balance granularity with overhead: fine-grained locks allow high concurrency but are expensive to acquire in large numbers. A transaction which plans to read many records of a table can avoid the cost of acquiring row locks by escalating to a single table lock instead. However, other transactions which attempt to modify unrelated rows in the same table would then be forced to wait. The number of possible locks 3.3. SCALABILITY OF EXISTING ENGINES 37 scales with the size of the database, so the storage engine maintains a lock pool very similar to the buffer pool. The lock pool features critical sections that protect the lock object free list and the linked list at each hash bucket. Each lock object also has a critical section to “pin” it and prevent recycling while it is in use, and another to protect its internal state. This means that to acquire a row lock, a thread enters at least three critical sections for each of the database, table, and row lock. Log manager. The log manager ensures that modified pages in memory are not lost in the event of a failure: all changes to pages are logged before the actual change is made, allowing the page’s latest state to be reconstructed during recovery. Every log insert requires a critical section to serialize log entries and another to coordinate with log flushes. An update to a given database record often involves several log entries due to index and metadata updates that go with it. Free space management. The storage manager maintains metadata which tracks disk page allocation and utilization. This information allows the storage manager to allocate unused pages to tables efficiently. Each record insert (or update that increases record size) requires entering several critical sections to determine whether the current page has space and to allocate new pages as necessary. Note that the transaction must also latch the free space manager’s metadata pages and log any updates. Transaction management. The system maintains a total order of transactions in order to resolve lock conflicts and maintain proper transaction isolation. Whenever a transaction begins or ends this global state must be updated. In addition, no transaction may commit during a log checkpoint operation, in order to ensure that the resulting checkpoint is consistent. Finally, multi-threaded transactions must serialize the threads within a transaction in order to update per-transaction state such as lock caches. 
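To make the lock manager's hierarchy more concrete, the sketch below shows the database-table-row descent described above. It is a deliberately simplified illustration with hypothetical names: there is no waiting, no lock modes beyond an enum, and the single mutex stands in for the several critical sections (free list, hash bucket, pin, and lock state) that a real lock manager enters per acquisition.

    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>

    // Sketch of hierarchical (intent) lock acquisition with hypothetical names.
    // A real lock manager also tracks requests per transaction, supports waiting
    // and upgrades, performs lock escalation, and detects deadlocks.
    enum class LockMode { IS, IX, S, X };   // intent-shared, intent-exclusive, ...

    class LockManager {
    public:
        // In a real engine each acquire() is itself several critical sections:
        // find-or-create the lock object, pin it, then update its request queue.
        void acquire(const std::string& resource, LockMode mode) {
            std::lock_guard<std::mutex> g(mutex_);
            locks_[resource] = mode;        // grossly simplified: no waiting
        }

        void release_all() {
            std::lock_guard<std::mutex> g(mutex_);
            locks_.clear();                 // strict 2PL: release at commit/abort
        }

    private:
        std::mutex mutex_;
        std::map<std::string, LockMode> locks_;
    };

    // To update one row, a transaction descends the hierarchy:
    inline void lock_row_for_update(LockManager& lm, uint64_t row_id) {
        lm.acquire("db", LockMode::IX);                                   // database
        lm.acquire("db/accounts", LockMode::IX);                          // table
        lm.acquire("db/accounts/" + std::to_string(row_id), LockMode::X); // row
    }

Counting the calls in lock_row_for_update makes it easy to see why acquiring a single row lock translates into roughly nine critical sections in a conventional engine.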
3.3 Scalability of existing engines Obviously the actual number and behavior of critical sections differs depending on the specific implementation of the database engine under test. That’s why in this section we measure the internal scalability of various database engines. We begin by describing the experimental environment and then we proceed to the actual evaluation. 38 3.3.1 CHAPTER 3. SCALABILITY PROBLEMS IN DATABASE ENGINES Experimental setup All experiments were conducted using a Sun T2000 (Niagara) server [KAO05, DLO05] running Solaris 10. The Niagara chip has an aggressive multi-core architecture with 8 cores clocked at 1GHz; each core supports 4 thread contexts, for a total of 32 OS-visible “processors.” The 8 cores share a common 3MB L2 cache and each of them is clocked at 1GHz. The machine is configured with 16GB of RAM and its I/O subsystem consists of a RAID-0 disk array with 11 15kRPM disks. We relied heavily on the Sun Studio development suite, which integrates compiler, debugger, and performance analysis tools. Unless otherwise stated every system is compiled using version 5.9 of Sun’s CC. All profiler results were obtained using the ‘collect’ utility, which performs sample-based profiling on unmodified executables and imposes very low overhead (< 5%). We evaluate four open-source database engines: PostgreSQL [SR86], MySQL, BerkeleyDB, and SHORE [CDF+ 94]. PostgreSQL v8.1.4. PostgreSQL is an open source database management system providing a powerful optimizer and many advanced features. We used a Sun distribution of PostgreSQL optimized specifically for the T2000. We configured PostgreSQL with a 3.5GB buffer pool, the largest allowed for a 32-bit binary.4 The client drivers make extensive use of SQL prepared statements. MySQL v5.1.22-rc. MySQL is a very popular open-source database server recently acquired by Sun. We configured and compiled MySQL from sources using InnoDB as the underlying transactional storage engine. InnoDB is a full transactional database engine (unlike the default, MyISAM). Client drivers use dynamic SQL syntax calling stored procedures because we found they provided significantly better performance than prepared statements. BerkeleyDB v4.6.21. BerkeleyDB is an open source, embedded database engine currently developed by Oracle and optimized for C/C++ applications running known workloads. It provides full database engine capabilities but client drivers link against the database library and make calls directly into it through the C++ API, avoiding the overhead of a SQL front end. BerkeleyDB is fully reentrant but depends on the client application to provide multithreaded execution. We note that BerkeleyDB is the only storage engine without rowlevel locking; its page-level locks can severely limit concurrency in transactional workloads. 4 The release notes mention sub-par 64-bit performance 3.3. SCALABILITY OF EXISTING ENGINES 39 SHORE v5.0.1. SHORE was developed at the University of Wisconsin in the early 1990’s and provides features that all modern DBMS use: full concurrency control and recovery with two-phase row-level locking and write-ahead logging, along with a robust implementation of B+Tree indexes. The SHORE database engine is designed to be either an embedded database or the back end for a “value-added server” implementing more advanced operations. Client driver code links directly to the database engine and calls into it using the API provided for value-added servers. Client code must use the threading library that SHORE provides. 
For comparison and validation of the results, we also present measurements from a commercial database manager (DBMS "X"). (Licensing restrictions prevent us from disclosing the vendor.) All database data resides on the RAID-0 array, with log files sent to an in-memory file system. The goal of our experiments is to exercise all the components of the database engine (including I/O, locking and logging), but without imposing I/O bottlenecks. Unless otherwise noted, all database engines were configured with 4GB buffer pools. We are interested in two metrics: throughput (e.g. transactions per second) and scalability (how throughput varies with the number of active threads). Ideally an engine would be both fast and scalable. Unfortunately, as we will see, database engines tend to be either fast or scalable, but not both. Our micro-benchmark repeatedly inserts records into a database table backed by a B-tree index. Each client uses a private table; there is no logical contention and no I/O on the critical path. (All the engines use asynchronous page cleaning and generated more than 40MB/sec of disk traffic during the tests.) Transactions commit every 1000 records, with one exception: we observed a severe bottleneck in log flushes for MySQL/InnoDB and modified its version of the benchmark to commit every 10000 records in order to allow a meaningful comparison against the other engines. Record insertion stresses primarily the free space manager, buffer pool, and log manager. In order to extract the highest possible performance from each storage manager, we customized our benchmarks to interface with each storage manager directly through its respective C API. Client code executed on the same machine as the database server, but we found the overhead of clients to be negligible (< 5%).
3.3.2 Evaluation of performance and scalability
We begin by benchmarking each database engine under test and highlight the most significant factors that limit its scalability. Due to lock contention in the transactional benchmarks, the internals of the engines do not face the kind of pressure they do on the insert-only benchmark. Thus we use the latter to expose the scalability bottlenecks at high core counts and to highlight the expected behavior of the transactional benchmarks as the number of hardware contexts per chip continues to increase.
Figure 3.2: Comparison of the efficiency (throughput per client, in tps/client, on a log scale) with which several storage engines (DBMS "X", PostgreSQL, MySQL, BDB, SHORE) execute transactions as the number of concurrent clients increases up to 32. Ideally, per-thread throughput remains steady as more threads (or utilization) join the system.
Figure 3.2 compares the scalability of the various engines when we run the insert-only micro-benchmark. This figure shows the efficiency of the system, measured in transactions per second per thread, plotted on a log-y axis. Higher is better, and a perfectly scalable system would maintain the same efficiency as thread counts increase. We use a log-y scale on the graphs because it shows scalability clearly without masking absolute performance. A linear y-axis is misleading because two systems with the same scalability will have differently-sloped lines, making the faster one appear less scalable than it really is. In contrast, a log-y graph gives the same slope to curves having the same scalability.
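For concreteness, the sketch below shows the shape of the per-client driver loop for the insert-only micro-benchmark described above. The storage-manager interface (open_table, begin_xct, insert, commit_xct) is hypothetical and reduced to stubs; in the actual experiments each engine was driven through its own C API, and the commit interval was 1000 records (10000 for MySQL/InnoDB).

    #include <cstdint>
    #include <string>
    #include <thread>
    #include <vector>

    // Sketch of the per-client driver for the insert-only micro-benchmark.
    // The storage-manager interface below is a hypothetical stand-in; the actual
    // experiments drove each engine through its own C API.
    struct StorageManager {
        void* open_table(const std::string& /*name*/) { return nullptr; }  // stub
        void  begin_xct() {}                                               // stub
        void  insert(void* /*table*/, uint64_t /*key*/,
                     const std::string& /*payload*/) {}                    // stub
        void  commit_xct() {}                                              // stub
    };

    // Each client inserts into its own private table, so there is no logical
    // contention: any serialization happens inside the engine's internals.
    void client_thread(StorageManager& sm, int client_id, uint64_t num_records,
                       uint64_t commit_every /* 1000; 10000 for InnoDB */) {
        void* table = sm.open_table("private_table_" + std::to_string(client_id));
        sm.begin_xct();
        for (uint64_t i = 0; i < num_records; ++i) {
            sm.insert(table, i, "payload");
            if ((i + 1) % commit_every == 0) {   // commit every N records
                sm.commit_xct();
                sm.begin_xct();
            }
        }
        sm.commit_xct();
    }

    void run_benchmark(StorageManager& sm, int num_clients) {
        std::vector<std::thread> clients;
        for (int c = 0; c < num_clients; ++c)
            clients.emplace_back([&sm, c] { client_thread(sm, c, 100000, 1000); });
        for (auto& t : clients) t.join();
        // Throughput = total commits / elapsed time; efficiency = throughput/client.
    }

Because every client works on its own table, any loss of scalability observed in this loop comes from serialization inside the engine, not from logical contention.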
To gain better insight into what is going on, we profile runs with multiple concurrent clients (16 or 24) stressing the storage engine. We then collect the results and interpret the call stacks to identify the operations where each system spends its time.
PostgreSQL. PostgreSQL suffers a loss of parallelism due to three main factors. First, contention for log inserts causes threads to block (XLogInsert). Second, calls to malloc add more serialization during transaction creation and deletion (CreateExecutorState and ExecutorEnd). Finally, transactions block while trying to lock index metadata (ExecOpenIndices), even though no two transactions ever access the same table. Together these bottlenecks account for only 10-15% of total thread time, but that is enough to limit scalability.
MySQL. MySQL/InnoDB is bottlenecked in two spots. The first is the interface to InnoDB: threads remain blocked in a function called srv_conc_enter_innodb for around 39% of the total execution time. The second is log flushing: in another function, labeled log_preflush_pool_modified_pages, the system again experiences blocking time equal to 20% of the total execution time (even after increasing the transaction length to 10K inserts). We also observe that MySQL spends a non-trivial fraction of its time in two malloc-related functions, take_deferred_signal and mutex_lock_internal. This suggests a potential for improvement by avoiding excessive use of malloc (trash stacks, object re-use, thread-local malloc libraries, etc.).
BerkeleyDB. BDB spends the majority of its time either testing the availability of a mutex or trying to acquire it: the system spends over 80% of its processing time in two functions named db_tas_lock and lock_try. Presumably the former is a spinning test-and-set lock while the latter is the test of the same lock. Together these likely form a test-and-test-and-set mutex primitive [RS84], which is supposed to scale better than the simple test-and-set. The excessive use of test-and-test-and-set (TATAS) locking explains the high performance of BDB in low-contention cases, since TATAS locks impose very little overhead under low contention, but they fail miserably under high contention. BerkeleyDB employs coarse-grained page-level locking, which by itself imposes scalability problems. The two callers for lock acquisition are bam_search and bam_get_root, so under high contention BDB spends most of its time trying to acquire the latches for tree probes. Additionally, we see that the system spends a significant amount of time blocked waiting in pthread_mutex_lock and cond_wait, most probably because the pthread mutexes are used as a fallback for acquiring the highly contended locks (i.e. spin-then-block).
DBMS "X". Unfortunately, the commercial database engine is significantly harder to profile, lacking debug symbols and making all system calls in assembly code rather than relying on standard libraries. However, we suspect from the system's I/O behavior and CPU utilization that log flush times are the main barrier to scalability.
SHORE. SHORE suffers multiple scalability bottlenecks, which is unsurprising since it was designed to multiplex all user-level threads over a single kernel thread.
This section highlights the bottlenecks we observed in existing database engines, bottlenecks which developers cannot ignore if the goal is a truly scalable system on emerging many-core hardware.
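As a reference for the primitive named above, the following sketch shows what a test-and-test-and-set (TATAS) spinlock typically looks like. This is a generic illustration, not BerkeleyDB's actual code: threads spin on a plain read until the lock looks free and only then attempt the atomic exchange, which keeps the overhead negligible under low contention but leads to the collapse observed above once many threads hammer the same lock.

    #include <atomic>

    // Generic test-and-test-and-set (TATAS) spinlock sketch. Waiters spin on a
    // plain load (the "test") and attempt the atomic exchange (the
    // "test-and-set") only when the lock appears free, which keeps the cache
    // line shared while waiting.
    class TatasLock {
    public:
        void lock() {
            for (;;) {
                // test: spin locally while the lock appears to be held
                while (locked_.load(std::memory_order_relaxed)) { /* spin */ }
                // test-and-set: try to grab it; retry if another thread won
                if (!locked_.exchange(true, std::memory_order_acquire))
                    return;
            }
        }

        void unlock() {
            locked_.store(false, std::memory_order_release);
        }

    private:
        std::atomic<bool> locked_{false};
    };

A production-quality version usually adds exponential backoff or a spin-then-block fallback onto a pthread mutex, matching the behavior we observed in the BDB profile.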
[Figure 3.3: Measured bottleneck sizes and scalability vs. predictions by Amdahl's Law. The plot shows scale-up for P=32 as a function of the degree of serialization s, with the measured points for PostgreSQL (8%), MySQL (59%), and BerkeleyDB (80%) overlaid on the predicted curve. As the PostgreSQL case illustrates, what appears to be a small bottleneck can still hamper scalability as the number of concurrent clients increases.]

3.3.3 Ramifications

We presented the major components of a typical database engine and made clear that the codepaths in transaction execution are full of critical sections. As a result, and as can be seen in the preceding Section 3.3.2, all the major open source database engines suffer scalability bottlenecks which prevent them from exploiting multicore hardware. Some of the bottlenecks (such as in PostgreSQL and DBMS "X") do not appear so large. However, Amdahl's Law [Amd67] captures just how difficult it can be to extract scalable performance from parallel hardware. To illustrate, Figure 3.3 overlays Amdahl's Law with a scatter plot of the scalability and bottleneck sizes for three of the database engines we evaluated in the previous section (we could not measure the size of the bottleneck for DBMS "X", and SHORE's bottleneck is absurdly large). As can be seen, the predicted and measured impacts of the bottlenecks are very close, verifying that small bottlenecks have a disproportionate impact on scalability. Even PostgreSQL, which suffers only an 8% bottleneck, utilizes fewer than 10 cores. Indeed, with a serial fraction of s = 0.08, Amdahl's Law bounds the speedup on P = 32 processors to 1/(0.08 + 0.92/32), or roughly 9.2x, which matches the measured behavior.

3.4 Conclusion

This chapter shows that the codepath of database engines is cluttered with a large number of critical sections, and underscores the importance of focusing on them since they hamper scalability. The rest of the dissertation investigates ways to boost the scalability of database engines. We will show that this is doable, even when starting from such unpromising results as the ones shown here.

Part II
Addressing scalability bottlenecks

Chapter 4
Critical sections in transaction processing: categorization and implementation

As discussed in the previous part, the serial computations, or "critical sections", are what determine the scalability of database engines. In this chapter, we make the key observation that not all critical sections constitute equal threats to the scalability of the system. The most worrisome are those whose contention increases along with the hardware parallelism. System designers should make the removal of those unscalable critical sections their top priority. Furthermore, we observe that, in practice, critical sections are so numerous and so short that enforcing them contributes a significant or even dominating fraction of their total cost, and tuning them directly improves the system's performance. In general, in order to ameliorate the impact of critical sections, we should both make algorithmic changes and employ proper synchronization primitives. (This chapter is based on material presented at [PTJA11] and [JPA08].)

4.1 Introduction

Ideally, a database engine would scale perfectly, with throughput remaining (nearly) proportional to the number of clients, even for a large number of clients, until the machine is fully utilized. In practice, several factors limit database engine scalability. Disk and compute capacities often limit the amount of work that can be done in a given system, and badly-behaved applications generate high levels of lock contention and limit concurrency.
However, these bottlenecks are all largely external to the database engine; within the storage manager itself, threads share many internal data structures. Whenever a thread accesses a shared data structure, it must prevent other threads from making concurrent modifications, or data races and corruption will result. These protected accesses are known as critical sections, and they can reduce scalability, especially in the absence of other, external bottlenecks. As Chapter 3 discussed, the codepaths of typical database engines are cluttered with critical sections. Out of this large number of critical sections, only those whose contention increases with the hardware parallelism pose a threat to scalability. On the other hand, given their sheer number, even uncontended critical sections are important because their combined overhead can contribute a significant fraction of the overall transaction cost. The literature abounds with synchronization approaches and primitives which could be used to enforce critical sections, each with its own strengths and weaknesses. The database system developer must choose the most appropriate approach for each type of critical section encountered during the tuning process or risk lowering performance significantly. To our knowledge there is only limited prior work that addresses the scalability and performance impact of critical sections, leaving developers to learn which primitives are most useful by trial and error. As the database developer optimizes the system for scalability, algorithmic changes are required to reduce the number of threads contending for a particular critical section. Additionally, we find that the method by which existing critical sections are enforced is a crucial factor in overall performance and, to some extent, scalability. Database code exhibits extremely short critical sections, such that the overhead of enforcing them is a significant or even dominating fraction of their total cost. Reducing the overhead of enforcing critical sections directly impacts performance and can even take critical sections off the critical path without the need for costly changes to algorithms. The contributions of this chapter are the following: 1. We distinguish between the different categories of critical sections, which pose different threats to the scalability of the system. The most worrisome are those whose contention increases along with the hardware parallelism. 2. We make a thorough performance comparison of the various synchronization primitives in a software system developer's toolbox and highlight the best ones for practical use. 3. Using a prototype database engine as a test-bed, we show that the combination of employing the appropriate synchronization primitives and making algorithmic changes can drastically improve scalability. The rest of this chapter is structured as follows. In Section 4.2 we make the key observation that not all points of synchronization constitute equal threats to the scalability of the system. Then, Section 4.3 presents and evaluates the most common types of synchronization approaches and identifies the most useful ones for enforcing the types of critical sections found in database code. Finally, Section 4.4 discusses how one can potentially improve critical section capacity, and Section 4.5 concludes.
4.2 Communication patterns and critical sections

Traditional transaction processing systems excel at providing high concurrency, or the ability to interleave multiple concurrent requests or transactions over limited hardware resources. However, as chip manufacturers continue to stamp out as many processing cores as possible onto each chip, performance increasingly depends on execution parallelism, or the ability for multiple requests to make forward progress simultaneously in different execution contexts. Even the smallest serialization on the software side therefore impacts scalability and performance [HM08]. Unfortunately, recent studies show that high concurrency in transaction processing systems does not necessarily translate into sufficient execution parallelism [JPH+09, JPS+10], due to the high degree of irregular and fine-grained communication these systems exhibit. In this section we categorize the types of communication that can occur in an OLTP system. Communication matters because, in order to be performed correctly, it imposes some kind of serialization: the system needs to execute critical sections. Critical sections, in turn, fall into different categories depending on the nature of the communication they protect and the contention they tend to trigger in the system. We will use this categorization in later chapters to analyze the execution of a shared-everything system as we evolve its design (e.g., Chapter 5, Chapter 6, and Chapter 7).

4.2.1 Types of communication

Transaction processing systems employ several different types of communication and synchronization. Database locking operates at the logical (application) level to enforce isolation and atomicity between transactions. Page latching operates at the physical (database page) level to enforce the consistency of the physical data stored on disk in the face of concurrent updates from multiple transactions. Finally, at the lowest level, critical sections protect various code paths which must execute serially to protect the consistency of the system's internal state. Critical sections are traditionally enforced by mutex locks, atomic instructions, etc. We note that locks and latches, which form a crucial part of the system's internal state, are themselves protected by critical sections, so analyzing the behavior of critical sections captures nearly all forms of communication in the DBMS.

4.2.2 Categories of critical sections

Because a transaction processing system cannot always eliminate communication entirely without giving up important features, we must find ways to achieve scalability while still allowing some communication. In order to guide this search, we break communication patterns into three types: unbounded, fixed, and cooperative.

[Figure 4.1: We classify the communication, and the resulting critical sections, into three patterns: unbounded or unscalable (e.g. locking, latching), fixed or point-to-point (e.g. operator pipelining), and cooperative or composable (e.g. logging). Only unbounded communication poses scalability problems; the other two mostly add overhead to single-threaded execution.]

We illustrate the three patterns in Figure 4.1 and briefly describe them in the following paragraphs.
Unbounded or unscalable This type of pattern, shown on the left side of Figure 4.1, arises when the number of threads in a point of communication is roughly proportional to the degree of parallelism in the system. Unbounded or unscalable communication has the highly undesirable tendency to affect every thread in the system. As hardware parallelism increases the degree of contention for the corresponding critical sections that coordinate unbounded communication also increases without bounds. No matter how efficient or infrequent the communication, exponentiallyincreasing parallelism will eventually expose it as a bottleneck. In other words, making these critical sections shorter or less frequent provides a little slack but does not fundamentally improve scalability. Globally shared data structures, which multiple threads update concurrently, fall directly to this category. Thus, in a naive implementation (or in an implementation not optimized for high hardware parallelism, like the systems presented in Chapter 3), unbounded communication can easily dominate. Fixed The fixed communication pattern, shown in the middle of Figure 4.1, resides at the other extreme of the spectrum, and involves a constant or near-constant number of threads regardless of the degree of parallelism. The pattern itself limits the amount of contention which 4.2. COMMUNICATION PATTERNS AND CRITICAL SECTIONS 51 can arise for the corresponding critical sections, because contention is independent of the underlying hardware and depends only on the (fixed) number of threads which communicate. Grid-based simulations in scientific computing (including several from the SPLASH-2 benchmark suite [WOT+ 95]) exemplify this type of communication, with each simulated object communicating only with its nearest neighbors in the grid. Peer-to-peer networks (e.g. [SMK+ 01, RD01]) also employ fixed or near-fixed communication patterns. In data management applications an example of fixed communication are producerconsumer pairs. Producer-consumer pairs are frequent in business intelligence workloads which execute long-running queries consisting of multiple database operators and exhibiting intra-query parallelism. For example, each operator in a query produces data consumed by the operator right above in the query execution plan. The execution of the producer operator (and the sub-tree below it in the execution plan) and of the consumer (and everything above) can be parallelized using the exchange operator [Gra90]. Such operator pipelining does not cause contention because of the fixed number of threads communicating. On the other hand, the parallelism in transaction processing systems comes from the concurrent execution of different requests (transactions), rather than from a single request, as the intraquery parallelism in business intelligence workloads. Transactions are typically executed by a single thread since they have very narrow execution plans and touch few data. Each thread in the system acting on behalf of an independent transaction competes with the other threads to access shared data structures, rather than to pass data to other threads. Thus, in transaction execution fixed communication in not as frequent. Cooperative or composable A third kind of communication pattern, which we call cooperative and show on the right side of Figure 4.1, arises when threads which wait for some resource can cooperate with each other to reduce contention. 
A canonical example of cooperative communication arises in the context of a parallel LIFO queue where threads push() and pop() items. While accessing the head of the queue is a critical section (if two threads modify it concurrently, the result is unpredictable behavior or corruption), pairs of push() and pop() requests which encounter delays can cooperate by combining their requests and eliminating each other directly, without competing further for the underlying data structure [MNSS05]. Cooperative communication results in critical sections that are highly resistant to contention because threads take advantage of queuing delays to combine their requests. Requests which combine drop out, making the communication self-regulating: adding more threads to the system gives more opportunity for threads to combine rather than competing directly for the critical section.

Examining these three types of communication suggests that unbounded communication is the main threat to scalability. Neither of the other two types allows contention to grow without bound, even though they sometimes add significant overhead to single-thread performance. Therefore the designers of transaction processing systems (or of any type of software system in general) should focus on eliminating unbounded communication.

4.2.3 How to predict and improve scalability

Since scalability should be the most important goal of modern software systems, the two questions that arise are how to predict and how to improve the scalability of software systems.

Predict. Of the three types of critical sections we discussed, it is clear that only the un-scalable ones pose a threat to the scalability of the system. The other two types, fixed and composable, degrade only single-thread performance. Thus, a reliable way to measure and predict the scalability of a system is to count and categorize the critical sections it normally executes.

Improve. The real key to scalability lies in two directions.
• First, any communication which is not necessary should be eliminated. For example, in the storage manager we use for the rest of this work (see Section 5.2), some transaction-related statistics were kept in a shared memory area accessed by all the threads in the system; it did not take long before maintaining those statistics became an obstacle to scalability (see the sketch below).
• Second, all unbounded communication should be eliminated or converted to either the fixed or the composable type, thus removing the potential for bottlenecks to arise.

While we worry about scalability, we should not completely ignore single-thread performance. There are many different synchronization primitives which can be used to enforce a critical section, and depending on the type of communication, the expected contention, and the size (or length) of the critical section, different implementations are more appropriate. We discuss this issue next.

4.3 Enforcing critical sections

The literature abounds with different synchronization primitives and approaches, each with different overhead (the cost to enter an uncontended critical section) and scalability (whether, and by how much, that overhead increases under contention). Unfortunately, efficiency and scalability tend to be inversely related: the cheapest primitives are unscalable, and the most scalable ones impose high overhead; yet both metrics impact the performance of a database engine.
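As an illustration of the first direction above (eliminating unnecessary communication), the shared transaction-statistics counters mentioned in Section 4.2.3 can be replaced by per-thread counters that are aggregated only when the statistics are actually read. The sketch below is purely illustrative; the class and names are hypothetical and not taken from any particular storage manager.

    #include <array>
    #include <atomic>
    #include <cstdint>

    // Hypothetical sketch: each worker updates only its own slot, so the common
    // path involves no inter-thread communication; readers pay the (rare) cost
    // of summing all slots when statistics are reported.
    class PerThreadCounter {
        static constexpr int kMaxThreads = 64;
        struct alignas(64) Slot { std::atomic<uint64_t> v{0}; };   // padded to avoid false sharing
        std::array<Slot, kMaxThreads> slots_;
    public:
        void add(int thread_id, uint64_t n) {
            slots_[thread_id].v.fetch_add(n, std::memory_order_relaxed);
        }
        uint64_t read() const {                      // infrequent: e.g. when dumping statistics
            uint64_t sum = 0;
            for (const Slot& s : slots_) sum += s.v.load(std::memory_order_relaxed);
            return sum;
        }
    };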
Next we present a brief overview of the types of primitives available to the designer. 4.3.1 Synchronization primitives The most common approach to synchronization is to use a synchronization primitive to enforce the critical section. There is a wide variety of primitives to choose from, all more or less interchangeable with respect to correctness. Blocking mutex. All operating systems provide heavyweight blocking mutex implementations. Under contention these primitives deschedule waiting threads until the holding thread releases the mutex. These primitives are fairly easy to use and understand, in addition to being portable. Unfortunately, due to the cost of context switching and their close association with the kernel scheduler, they are not particularly cheap or scalable for the short critical sections we are interested in. Test-and-set spinlocks. Test-and-set (TAS) spinlocks are the simplest mutex implementation. Acquiring threads use an atomic operation such as a compare-and-swap to simultaneously lock the primitive and determine if it was already locked by another thread, repeating until they lock the mutex. A thread releases a TAS spinlock using a single store. Because of their simplicity TAS spinlocks are extremely efficient. Unfortunately, they are also among the least-scalable synchronization approaches because they impose a heavy burden on the memory subsystem. Variants such as test-and-test-and-set (TATAS) [RS84], exponential back-off [And90], and ticket-based [RK79] approaches reduce the problem somewhat, but do not solve it completely. Backoff schemes, in particular, are hardware-dependent and difficult to tune. Queue-based spinlocks. Queue-based spinlocks organize contending threads into a linked list queue where each thread spins on a different memory location. The thread at the head of the queue holds the lock, handing off to a successor when it completes. Threads compete only long enough to append themselves to the tail of the queue. The two best-known queuing spinlocks are MCS [MCS91a] and CLH [Cra93, MLH94], which differ mainly in how they manage their queues. MCS queue links point toward the tail, while CLH links point toward the head. Queuing improves on test-and-set by eliminating the burden on the memory system and also by decoupling lock contention from lock hand-off. Unfortunately, each thread is responsible to allocate and maintain a queue node for each lock it acquires. CHAPTER 4. CRITICAL SECTIONS 54 Memory management can quickly become cumbersome in complex code, especially for CLH locks, which require heap-allocated state. Reader-writer locks. In certain situations, threads enter a critical section only to prevent other threads from changing the data to be read. Reader-writer locks allow either multiple readers or one writer to enter the critical section simultaneously, but not both. While operating systems typically provide a reader-writer lock, we find that the pthreads implementation suffers from extremely high overhead and poor scalability, making it useless in practice. The most straightforward reader-writer locks use a normal mutex to protect their internal state; more sophisticated approaches extend queuing locks to support reader-writer semantics [MCS91b, KSUH93]. A note about convoys. Some synchronization primitives, such as blocking mutex and queue-based spinlocks, are vulnerable to forming stable quasi-deadlocks known as convoys [BGMP79]. Convoys occur when the lock passes to a thread that has been descheduled while waiting its turn. 
Other threads must then wait for that thread to be rescheduled, increasing the chances of further preemptions. The result is that the lock sits nearly idle even under heavy contention. Recent work [HSIS05] has provided a preemption-resistant form of queuing lock, at the cost of additional overhead which can put medium-contention critical sections squarely on the critical path. However, as [JSAM10] shows, proper scheduling can eliminate the problem of convoys due to lock preemptions.

4.3.2 Alternatives to locking

Under certain circumstances critical sections can be enforced without resorting to locks. For example, independent reads and writes to a single machine word are already atomic and need no further protection. Other, more sophisticated approaches such as optimistic concurrency control and lock-free data structures allow larger critical sections as well.

Optimistic concurrency control. Many data structures feature read-mostly critical sections, where updates occur rarely and often come from a single writer. The readers' critical sections are often extremely short, and overhead dominates their overall cost. Under these circumstances, optimistic concurrency control (OCC) schemes can improve performance dramatically by assuming no writer will interfere during the operation. The reader performs the operation without enforcing any critical section, then afterward verifies that no writer interfered (e.g. by checking a version stamp). In the rare event that the assumption did not hold, the reader blocks or retries. The main drawbacks of OCC are that it cannot be applied to all critical sections (since side effects are unsafe until the read is verified), and that unexpectedly high writer activity can lead to livelock as readers endlessly block or abort and retry.

Lock-free data structures. Much current research focuses on lock-free data structures [Her91] as a way to avoid the problems that come with mutual exclusion (e.g. [Mic02, FR04]). These schemes usually combine optimistic concurrency control and atomic operations to produce data structures that can be accessed concurrently without enforcing critical sections. Unfortunately, there is no known general approach to designing lock-free data structures; each must be conceived and developed separately, so database engine designers have a limited library to choose from. In addition, lock-free approaches can suffer from livelock unless they are also wait-free, and they may or may not be faster than lock-based approaches under low and medium contention (many papers provide only asymptotic performance analysis rather than benchmark results).

Transactional memory. Transactional memory approaches enforce critical sections using database-style "transactions" which complete atomically or not at all. This approach eases many of the difficulties of lock-based programming and has been widely researched. Unfortunately, software-based approaches [ST95] impose too much overhead for the tiny critical sections we are interested in, while hardware approaches [HM93, RG02] generally suffer from complexity, lack of generality, or both, and have not been adopted (it is no surprise that one of the most high-profile attempts to release a transactional memory processor, Sun ROCK [DLMN09], was abandoned just before going to market). Finally, we note that transactions do not inherently remove contention; at best, transactional memory can serialize critical sections with very little overhead.
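To make the version-stamp validation described above concrete, the following is a minimal, illustrative sketch of an optimistic read; the structure and names are invented for this example and real engines differ in the details. The writer makes the version odd while it updates, so a reader retries if it observes an odd or changed version.

    #include <atomic>
    #include <cstdint>

    struct VersionedSlot {
        std::atomic<uint64_t> version{0};   // even = stable, odd = writer in progress
        std::atomic<uint64_t> payload{0};   // atomic so the speculative read is race-free
    };

    // Optimistic read: no latch is taken; the reader retries if a writer interfered.
    uint64_t occ_read(const VersionedSlot& s) {
        for (;;) {
            uint64_t v1 = s.version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                                    // writer active: retry
            uint64_t value = s.payload.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            uint64_t v2 = s.version.load(std::memory_order_relaxed);
            if (v1 == v2) return value;                              // nobody interfered: success
        }                                                            // retries are rare if writes are rare
    }

    // Writer side: bracket the update with version increments (assumes a single
    // writer, or an external mutex among writers).
    void occ_write(VersionedSlot& s, uint64_t v) {
        s.version.fetch_add(1, std::memory_order_acq_rel);   // version becomes odd
        s.payload.store(v, std::memory_order_relaxed);
        s.version.fetch_add(1, std::memory_order_release);   // version becomes even again
    }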
4.3.3 Choosing the right approach

This subsection evaluates the different synchronization approaches using a series of microbenchmarks that replicate the kinds of critical sections found in the code of a transaction processing system. We present the performance of the various approaches as we vary three parameters: contended vs. uncontended accesses, short vs. long duration, and read-mostly vs. mutex critical sections. We then use the results to identify the primitives which work best in each situation. Each microbenchmark creates N threads which compete for a lock in a tight loop over a one-second measurement interval (typically 1-10M iterations). The metric of interest is the cost per iteration per thread, measured in nanoseconds of wall-clock time. Each iteration begins with a delay of To ns to represent time spent outside the critical section, followed by an acquire operation. Once the thread has entered the critical section, it delays for Ti ns to represent the work performed inside the critical section, then performs a release operation. All delays are measured to 4 ns accuracy using the machine's cycle count register (high-resolution time counters); we avoid unnecessary memory accesses to prevent unpredictable cache misses or contention for hardware resources. For each scenario we compute an ideal cost by examining the time required to serialize Ti plus the overhead of a memory barrier, which is always required for correctness. Experiments involving readers and writers are set up exactly the same way, except that readers are assumed to perform their memory barrier in parallel, and threads use a pre-computed array of random numbers to determine whether they should perform a read or a write operation. All of our experiments were performed using a Sun T2000 machine, which contains one Sun Niagara I processor, running Solaris 10. The Sun Niagara I chip [KAO05] is a multi-core architecture with 8 cores; each core provides 4 hardware contexts for a total of 32 OS-visible "processors". Cores communicate through a shared 3MB L2 cache.

Contention

[Figure 4.2: Performance of mutex locks (pthread, ppmcs, mcs, tatas, ideal) as the contention varies. Lower cost is better.]

Figure 4.2 compares the behavior of four mutex implementations as the number of threads in the system varies along the x-axis. The y-axis gives the cost of one iteration as seen by one thread. In order to maximize contention, we set both To and Ti to zero; threads spend all their time acquiring and releasing the mutex. TATAS is a test-and-set spinlock variant. MCS and ppMCS are the original and preemption-resistant MCS locks, respectively, while pthread is the native pthread mutex. Finally, "ideal" represents the lowest achievable cost per iteration, assuming that the only overhead of enforcing the critical section comes from the memory barriers which must be present for correctness.

[Figure 4.3: Performance of mutex locks as the duration of the critical section varies. Lower cost is better.]

As the degree of contention for a particular critical section changes, different synchronization primitives become more appealing.
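For reference, the TATAS variant is conceptually a one-word lock that spins on a plain read and attempts the atomic exchange only when the lock looks free. The sketch below is illustrative, written with C++ atomics; the lock actually used in the experiments differs in details such as padding and backoff.

    #include <atomic>

    // Test-and-test-and-set (TATAS) spinlock: spin on the cached value first, and
    // only attempt the atomic exchange when the lock appears free, which reduces
    // the burden on the memory subsystem compared to a plain test-and-set.
    class TATASLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            for (;;) {
                while (locked_.load(std::memory_order_relaxed)) { /* local spin */ }
                if (!locked_.exchange(true, std::memory_order_acquire))
                    return;                                   // won the race: lock acquired
            }
        }
        void unlock() {
            locked_.store(false, std::memory_order_release);  // a single store releases it
        }
    };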
The native pthread mutex is both expensive and unscalable, making it unattractive. TATAS is by far the cheapest for a single thread, but quickly falls behind as contention increases. We also note that all test-and-set variants are extremely unfair, as the thread which most recently released the lock is likely to re-acquire it before other threads can respond. In contrast, the queue-based locks give each thread equal attention.

Duration

Another factor of interest is the performance of the various synchronization primitives as the duration of the critical section varies (under medium contention) from extremely short to merely short. We assume that a long, heavily-contended critical section is a design flaw which must be addressed algorithmically. Figure 4.3 shows the cost of each iteration as 16 threads compete for each mutex. The inner and outer delays both vary by the amount shown along the x-axis (keeping contention steady). We see the same trends as before, with the main change being the increase in the ideal cost (due to the critical section's contents). As the critical section increases in length, the overhead of each primitive matters less; however, ppMCS and TATAS still impose 10% higher cost than MCS, while pthread more than doubles the cost.

Reader-writer ratio

[Figure 4.4: Performance of reader-writer locks (tatas (R/W), tatas, MCS (R/W), mcs, occ, ideal) as contention (left) and reader-writer ratio (right) vary.]

The last parameter we study is the ratio between readers and writers. Figure 4.4 (left) characterizes the performance of several reader-writer locks when subjected to 7 reads for every write and with To and Ti both set to 100 ns. The cost per iteration is shown on the y-axis as the number of competing threads varies along the x-axis. The TATAS mutex and the MCS mutex apply mutual exclusion to both readers and writers. The TATAS rwlock extends a normal TATAS mutex to use a read/write counter instead of a single "locked" flag. The MCS rwlock comes from the literature [KSUH93]. OCC lets readers increment a simple counter as long as no writers are around; if a writer arrives, all threads (readers and writers) serialize through an MCS lock instead. We observe that reader-writer locks are significantly more expensive than their mutex counterparts, due to the extra complexity they impose. For very short critical sections and low reader ratios, a mutex actually outperforms the rwlock; even for the 100 ns case shown here, the MCS lock is a usable alternative. Figure 4.4 (right) fixes the number of threads at 16 and varies the reader ratio from 0 (all writes) to 127 (mostly reads) with the same delays as before. As we can see, the MCS rwlock performs well for high reader ratios, but the OCC approach dominates it, especially for low reader ratios. For the lowest read ratios, the MCS mutex performs the best – the probability of multiple concurrent reads is too low to justify the overhead of a rwlock.

[Figure 4.5: The space of critical section types. Each corner of the cube is marked with the appropriate synchronization primitive to use for that type of critical section.]

4.3.4 Discussion and open issues

The microbenchmarks from the previous section illustrate the wide range in performance and scalability among the different primitives.
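For reference, a simplified sketch of an MCS-style queue lock, the scalable alternative in these comparisons, is shown below. It is illustrative only, written with C++ atomics; the per-acquire queue node must stay alive until release, which is precisely the memory-management burden noted in Section 4.3.1, and real implementations also handle preemption (as ppMCS does).

    #include <atomic>

    // MCS queue lock: contenders enqueue a node and each spins on its own flag,
    // so lock hand-off touches only the successor's cache line.
    struct MCSNode {
        std::atomic<MCSNode*> next{nullptr};
        std::atomic<bool>     waiting{false};
    };

    class MCSLock {
        std::atomic<MCSNode*> tail_{nullptr};
    public:
        void lock(MCSNode& me) {
            me.next.store(nullptr, std::memory_order_relaxed);
            me.waiting.store(true, std::memory_order_relaxed);
            MCSNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
            if (prev) {                                           // queue was non-empty
                prev->next.store(&me, std::memory_order_release); // link behind predecessor
                while (me.waiting.load(std::memory_order_acquire)) { /* local spin */ }
            }
        }
        void unlock(MCSNode& me) {
            MCSNode* succ = me.next.load(std::memory_order_acquire);
            if (!succ) {
                MCSNode* expected = &me;                          // try to mark the queue empty
                if (tail_.compare_exchange_strong(expected, nullptr,
                                                  std::memory_order_acq_rel))
                    return;
                while (!(succ = me.next.load(std::memory_order_acquire))) { /* wait for link */ }
            }
            succ->waiting.store(false, std::memory_order_release); // hand off to successor
        }
    };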
From the contention experiment we see that the TATAS lock performs best under low contention due to having the lowest overhead; for high contention, the MCS lock is superior due to its scalability. The experiment also highlights how expensive it is to enforce critical sections. The ideal case (memory barrier alone) costs 50 ns, and even TATAS costs twice that. The other alternatives cost 250 ns or more. By comparison a store costs roughly 10 ns, meaning critical sections which update only a handful of values suffer more than 80% overhead. As the duration experiment shows, pthread and TATAS are undesirable even for longer critical sections that amortize the cost somewhat. Finally, the reader-writer experiment demonstrates the extremely high cost of reader-writer synchronization; a mutex outperforms rwlocks at low read ratios by virtue of its simplicity, while optimistic concurrency control wins at high ratios. Figure 4.5 summarizes the results of the experiments, showing which of the three synchronization primitives to use under what circumstances. We note that, given a suitable algorithm, the lock free approach might be best. The results also suggest that there is much room for improvement in the synchronization primitives that protect small critical sections. Hardware-assisted approaches (e.g. [RG01]) and implementable transactional memory might be worth exploring further in order to reduce CHAPTER 4. CRITICAL SECTIONS 60 overhead and improve scalability. Reader-writer primitives, especially, do not perform well as threads must still serialize long enough to identify each other as readers and check for writers. All the knowledge collected from the experiments of this section can be used for improving the scalability of database engines, whose codepath is cluttered with critical sections. 4.4 Handling problematic critical sections By definition, critical sections limit scalability by serializing the threads which compete for them. Each critical section is simply one more limited resource in the system that supports some maximum throughput. Database engine designers can potentially improve critical section capacity (i.e. peak throughput) by changing how they are enforced or by altering algorithms and data structures. 4.4.1 Algorithmic changes Algorithmic changes can address bottleneck critical sections in three ways: 1. By reducing how often threads enter them. Ideally problematic critical sections would never be executed. For example, in Section 5.3 we are going to see how we remove a significant obstacle to scalability by avoiding interacting with the lock manager through caching. 2. By downgrading them to a category less threatening to scalability. According to the discussion of Section 4.2, a fixed contention or composable critical section is more desired than an unscalable one. For example, in Section 5.4 we are going to see how we improve the scalability by downgrading the critical section for inserting entries to the log buffer from unscalable to composable. 3. By breaking them into several “smaller” ones in a way that it both reduces the length of it and distributes contending threads as well (ideally, each thread can expect an uncontended critical section). For example, buffer pool managers typically distribute critical sections by hash bucket so that only probes for pages in the same bucket must be serialized. In theory, algorithmic changes are the superior approach for addressing critical sections because they can remove or distribute critical sections to ease contention. 
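As an illustration of the third technique in the list above (breaking a critical section into smaller, distributed ones), the sketch below partitions a page table into hash buckets, each protected by its own short latch, so that only probes to the same bucket serialize. The names and structure are hypothetical and are not the actual Shore-MT buffer pool code.

    #include <array>
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>

    // Illustrative sketch of distributing one large critical section over hash
    // buckets, in the spirit of the buffer-pool example above.
    class PartitionedPageTable {
        static constexpr size_t kBuckets = 256;
        struct Bucket {
            std::mutex latch;                              // one short critical section per bucket
            std::unordered_map<uint64_t, void*> frames;    // page id -> frame
        };
        std::array<Bucket, kBuckets> buckets_;
        Bucket& bucket_for(uint64_t pid) { return buckets_[pid % kBuckets]; }
    public:
        void* find(uint64_t pid) {
            Bucket& b = bucket_for(pid);
            std::lock_guard<std::mutex> guard(b.latch);    // only same-bucket probes serialize
            auto it = b.frames.find(pid);
            return it == b.frames.end() ? nullptr : it->second;
        }
        void insert(uint64_t pid, void* frame) {
            Bucket& b = bucket_for(pid);
            std::lock_guard<std::mutex> guard(b.latch);
            b.frames[pid] = frame;
        }
    };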
Unfortunately, developing new algorithms is challenging and time consuming, with no guarantee of a breakthrough for a given amount of effort. In addition, even the best-designed algorithms will 4.4. HANDLING PROBLEMATIC CRITICAL SECTIONS Throughput (ktps) 100 61 Algorithm (bpool) Algorithm (lock, log) Tuning (MCS) Tuning (TATAS) 10 Algorithm (bpool) Baseline 1 1 Concurrent Threads 10 100 Figure 4.6: Algorithmic changes and tuning combine to give best performance. eventually become bottlenecks again if the number of threads increases enough, or if nonuniform access patterns cause hotspots. 4.4.2 Changing synchronization primitives The other approach for improving critical section throughput is, as we saw in Section 4.3.3, by altering how they are enforced. Because the critical sections we are interested in are so short, the cost of enforcing them is a significant – or even dominating – fraction of their overall cost. Reducing the cost of enforcing a bottleneck critical section can improve performance a surprising amount. Also, critical sections tend to be encapsulated by their surrounding data structures, so the developer can change how they are enforced simply by replacing the existing synchronization primitive with a different one. These characteristics make critical section tuning attractive if it can avoid or delay the need for costly algorithmic changes. 4.4.3 Both are needed Figure 4.6 illustrates how algorithmic changes and synchronization tuning combined give the best performance. It presents the performance of several stages of tuning a modern storage manager, with throughput plotted against varying thread count in log-log scale. These numbers came from the experience of converting SHORE [CDF+ 94] to Shore-MT [JPH+ 09] (see Section 5.2). The process began with a thread-safe but very slow version of SHORE and repeatedly addressed critical sections until internal scalability bottlenecks had all been removed. The changes involved algorithmic and synchronization changes in all the major components of the storage manager, including logging, locking, and buffer pool management. CHAPTER 4. CRITICAL SECTIONS 62 The figure shows the performance and scalability of Shore-MT at various stages of tuning. Each thread repeatedly runs transactions which insert records into a private table. These transactions exhibit no logical contention with each other but tend to expose many internal bottlenecks. Note that, in order to show the wide range of performance the y-axis of the figure is log-scale; the final version of Shore-MT scales nearly as well as running each thread in an independent copy of Shore-MT. The “Baseline” curve at the bottom represents the thread-safe but unoptimized SHORE; the first optimization (bpool) was algorithmic and replaced the central buffer pool mutex with one mutex per hash bucket. As a result, scalability improved from one thread to nearly four, but single-thread performance did not change. The second, tuning, optimization (TATAS) replaced the expensive pthread mutex protecting buffer pool buckets with a fast test and set mutex (see Section 4.3.1 for details about synchronization primitives), doubling throughput for a single thread. The third, tuning, optimization (MCS) replaced the test-and-set mutex with a more scalable MCS mutex, allowing the doubled throughput to persist until other bottlenecks asserted themselves at four threads. 
The next line (lock, log) represents the performance of Shore-MT after algorithmic changes to the lock and log management code, at which point the buffer pool again became a bottleneck. Because the critical sections involved were already as efficient as possible, another algorithmic change was required (bpool2). This time the open-chained hash table was replaced with a cuckoo hash table [PR01] to further reduce contention for hash buckets, improving scalability from 8 to 16 threads and beyond. This example illustrates how both proper algorithms and proper synchronization are required to achieve the highest performance. In general, tuning primitives improves performance significantly, and sometimes scalability as well; algorithmic changes improve scalability and might help or hurt performance (more scalable algorithms tend to be more expensive). Finally, we note that the two tuning optimizations each required only a few minutes to apply, while each of the algorithmic changes required days or weeks to design, implement, and debug. The performance impact and ease of reducing critical section overhead make tuning an important part of the optimization process. 4.5 Conclusions This chapter focused on the critical sections which determine the scalability of any software system. In order to reliably predict the behavior of a system in high parallelism, one needs to not only count but also categorize the critical sections that are executed. Different types 4.5. CONCLUSIONS 63 of critical sections impose different threat to scalability. The definitely unwanted are the unscalable critical sections for which the contention increases with the hardware parallelism. The other two categories, fixed and composable, mainly lower single-thread performance. At the same time, the choice of synchronization primitives significantly affects performance as a large part of the execution is computation-bound. We observe that even uncontended critical sections sap performance because of the overhead they impose and we identify a small set of especially useful synchronization primitives. Database system developers can then utilize this knowledge to select the proper synchronization tool for each critical section and maximize performance. The bottomline is that critical sections impose obstacles to scalability and high overhead to single-thread (uncontended) execution. To ameliorate the impact of critical sections we should both provide algorithmic changes and employ proper synchronization primitives. 64 CHAPTER 4. CRITICAL SECTIONS 65 Chapter 5 Attacking un-scalable critical sections in a conventional design In Chapter 3, we showed that multicore hardware has caught database engines off guard, since their codepaths are full of, mainly un-scalable, critical sections. This chapter presents two mechanisms that boost the scalability of conventional transaction processing by reducing the number of un-scalable critical sections. In particular, we show how we can avoid a large number of un-scalable critical sections in the lock manager using Speculative Lock Inheritance, a mechanism that detects database locks which experience contention at run-time and caches them across transactions. We also show how a simple observation allows us to downgrade the log buffer inserts from being un-scalable critical sections to composable. For all the prototyping and evaluation we use Shore-MT, a multithreaded version of SHORE. 
ShoreMT constitutes a reliable baseline for our work, since compared to other database engines, it exhibits superior scalability and 2-4 times higher absolute throughput.1 5.1 Introduction Multicore hardware poses a unique challenge, providing exponentially growing parallelism which software must exploit in order to benefit from hardware advances. In the past, a primary use of concurrency in database engines was to overlap delays due to I/O and logical conflicts. Today, however, the high number of threads which can enter the database simultaneously puts new pressure on the internal scalability of the database engine. Unfortunately, none of the existing database engines provides the kind of scalability required for modern multicore hardware; Chapter 3 highlights how current offerings top out at 8-12 cores, failing to utilize the 32 or more hardware contexts which modern hardware makes available. While most database applications are inherently parallel, current database engines are designed under the assumption that only a limited number of threads will access 1 This chapter highlights findings which we presented in EDBT 2009 [JPH+ 09], VLDB 2009 [JPA09] , VLDB 2010 [JPS+ 10] and VLDBJ 2011 [JPS+ 11]. 66 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS their internal data structures at any given instant. Even when the application is executing parallelizable tasks, and even if disk I/O is off the critical path, serializing accesses to these internal data structures impedes scalability. The main problem is that the code paths of conventional transaction processing are full of critical sections most of them being un-scalable, according to the categorization of Section 4.2.2. In order to ameliorate the scalability of databases systems, the system designers need to study the codepaths and attack the sources of un-scalable critical sections. In this chapter, we attack two significant sources of un-scalable critical sections in conventional transaction processing. In particular, we show how we can avoid a large number of un-scalable critical sections in the lock manager using Speculative Lock Inheritance, a mechanism that detects database locks which experience contention at run-time and caches them across transactions. In addition, we show how a simple observation allows us to downgrade the log buffer inserts from being un-scalable critical sections to composable, and present the corresponding log buffer implementation. All the prototyping and evaluation are done using the Shore-MT storage manager [JPH+ 09]. Before we present the two mechanisms that reduce the number of un-scalable critical sections, we show that Shore-MT constitutes a reliable baseline system for our work, since compared to other database engines, it exhibits superior scalability and 2-4 times higher absolute throughput. After we integrate each of the two mechanisms to Shore-MT, we show the breakdown of critical sections for the execution of a simple transaction and the performance on a highly parallel multicore machine. As we move to more elaborate and scalable designs the number of un-scalable critical sections drops and performance increases. That observation validates our claim in Section 4.2.3 that one can measure and predict the scalability of a transaction processing system by analyzing the number and type of the critical sections at its codepath. The contributions of this chapter are three-fold: 1. 
We briefly present Shore-MT, a multithreaded version of the SHORE storage manager [CDF+94], which we use as a reliable baseline for the rest of our work and which we make available to the research community (at http://diaswww.epfl.ch/shore-mt/). 2. We present Speculative Lock Inheritance, or SLI. SLI boosts the scalability of transaction processing by detecting database locks that encounter contention at run time and passing those "hot" locks across transactions, thus avoiding the execution of a significant portion of unscalable critical sections. 3. We present a solution for the contention encountered on insertions into the main-memory log buffer of transaction processing systems, by presenting a log buffer implementation based on consolidation of requests. This technique improves scalability by converting the unscalable critical sections for log buffer inserts into composable ones.

The rest of this chapter is structured as follows. Section 5.2 briefly presents Shore-MT. Section 5.3 shows how SLI helps us avoid executing un-scalable critical sections inside the centralized lock manager. Section 5.4 shows how a log buffer implementation based on consolidation of requests allows us to downgrade the un-scalable critical sections for inserting records into the main-memory log buffer to composable ones; and Section 5.6 concludes.

5.2 Shore-MT: a reliable baseline

Since Chapter 3 showed that none of the available open source database engines manages to scale its performance on highly parallel multicore hardware, we set out to create one of our own based on the SHORE storage manager [CDF+94]. We selected SHORE as our target for optimization for two reasons. First, SHORE supports all the major features of modern database engines: full transaction isolation, hierarchical and row-level locking [GR92], a CLOCK buffer pool with replacement and prefetch hints [Smi78], B-tree indexes with key-value locking [Moh90], and ARIES-style logging and recovery [MHL+92]. Additionally, SHORE has previously been shown to behave like commercial engines at the instruction level [ADH01], making it a good open-source platform for comparing against closed-source engines. This exercise resulted in Shore-MT, which scales far better than its open source peers while also achieving superior single-thread performance. During the process of implementing Shore-MT, we completely ignored single-thread performance and focused only on removing the bottlenecks of SHORE. In order to compare the scalability of Shore-MT against its open- and closed-source peers, we use the microbenchmark described in Section 3.3.1, as well as the Payment and NewOrder transactions from the TPC-C benchmark (see Section 2.4.3). Shore-MT scales commensurately with the hardware we make available to it, setting an example for other systems to follow. In Figure 5.1, we plot the results of the microbenchmark from Figure 3.2 in Chapter 3, but this time also showing results for Shore-MT. While single-threaded SHORE did not scale at all, Shore-MT exhibits excellent scaling. Moreover, at 32 clients, it scales better than DBMS "X", a popular commercial DBMS. While our original goal was only to achieve high scalability, we also achieved nearly a 3x speedup in single-thread performance over SHORE. Shore-MT attains a healthy performance
lead over the other engines (BerkeleyDB outperforms the other systems at first, but its performance drops precipitously for more than four clients). We attribute the performance improvement to the fact that database engines spend so much time in critical sections: the process of shortening or eliminating critical sections, and reducing synchronization overheads, also had the side effect of shortening the single-thread code path.

[Figure 5.1: Scalability of Shore-MT vs. several open-source database engines (Shore-MT, DBMS "X", PostgreSQL, MySQL, BDB, SHORE; throughput per client vs. concurrent clients). A scalable system maintains steady per-thread performance as load increases.]

As a further comparison, Figure 5.2 shows the performance of the three fastest database engines running the New Order (left) and Payment (right) transactions of TPC-C. Again, Shore-MT achieves the highest performance while scaling as well as the commercial system for New Order. (Some of the performance advantage of Shore-MT is likely due to its clients being directly embedded in the engine, while the other two engines communicate with clients using local socket connections. We would be surprised, however, if a local socket connection imposed 100% overhead per transaction.) In New Order, all three systems encounter logical contention for the STOCK and ITEM tables, causing a significant dip in scalability across the board around 16 clients. Payment, in contrast, imposes no application-level contention, allowing Shore-MT to scale all the way to 32 threads. With Shore-MT, we have a database engine that performs better than its open source peers and whose performance is limited only by the underlying hardware parallelism. That is, on the Niagara I multicore processor, with its 8 physical cores and 32 hardware contexts, Shore-MT's performance scales almost optimally and there are no indications of any potential bottlenecks.

[Figure 5.2: Efficiency of Shore-MT, DBMS "X" and PostgreSQL for varying thread counts, when executing TPC-C NewOrder (left) and Payment (right) transactions; note the logarithmic scale on the y-axis. High, steady performance is better.]

5.2.1 Critical section anatomy of Shore-MT

However, the un-scalable critical sections in Shore-MT's codepaths are still numerous. The left-most bar of Figure 5.3 shows the breakdown of the critical sections executed on average by Shore-MT for the completion of the very simple UpdLocation transaction of the TATP benchmark (see Section 2.4.5); we execute the transaction 1000 times and instrument each critical section to obtain this breakdown. When Shore-MT executes this simple transaction, it still enters over 70 critical sections, with nearly 60 of them being un-scalable. Thus, even though on the Niagara I multicore processor Shore-MT's performance is limited by the hardware and it seems that we cannot further optimize its design, Figure 5.3 shows that there are lurking bottlenecks. These bottlenecks are immediately exposed when we move our experimentation to an even more parallel machine, such as the second-generation Niagara chip, which contains 64 hardware contexts. The component which contributes the majority of the critical sections is the lock manager, while another significant source of critical sections is the log manager.
In the following two sections, we present two mechanisms that handle those lurking bottlenecks.

[Figure 5.3: Comparison of the number and type of critical sections (lock manager, page latches, buffer pool, metadata, log manager, Aether log manager, transaction manager, uncategorized) executed on average for the completion of a simple transaction, TATP UpdateLocation, in Shore-MT, with SLI, and with SLI & Aether. The un-scalable critical sections are the bars with solid fills. Shore-MT's lock manager and log manager codepaths are the sources of a large fraction of the un-scalable critical sections.]

5.3 Avoiding un-scalable critical sections in the lock manager with SLI

Virtually all database engines use some form of hierarchical locking [GR92] to allow applications to trade off concurrency and overhead. For example, requests which access large amounts of data can acquire coarse-grained locks to reduce overhead at the risk of reduced concurrency. At the same time, small requests can lock precisely the data (e.g. records) which they access and maximize concurrency with respect to other, independent requests. The lock hierarchy is crucial for application scalability because it allows efficient fine-grained concurrency control at the logical level. Ironically, however, hierarchical database locking causes a new scalability problem while addressing the first one: in order to access individual objects, all transactions must acquire intention locks in the upper levels of the hierarchy, contending with each other to update the internal lock state of each such database lock. The increased hardware concurrency leads to bottlenecks in the centralized lock manager, especially as hierarchical locking forces many threads to repeatedly update the state of a few hot locks. Physical contention causes locking-related bottlenecks even for scalable database applications which cause few logical conflicts. Because of the inherent behavior of hierarchical locking, we expect that every system will eventually encounter this kind of contention within the lock manager, if it has not done so already.

[Figure 5.4: Lock manager overheads (LM contention, other contention, LM overhead, computation) as system load varies from 2% to 98%. Contention consumes CPU time without increasing performance.]

Figure 5.4 highlights how contention for database locks impacts performance as we increase the load on a multicore system running the TATP benchmark (see Section 2.4.5 for the benchmark description and Section 5.3.2 for the experimental setup). The x-axis varies the load on the system from very light (left) to very heavy (right), while the y-axis shows the fraction of CPU time each transaction spends in the lock manager (not counting time spent blocked on I/O or on true lock conflicts). This figure, and those that follow in this subsection, define overhead and contention as the useful and useless work, respectively, performed by the system when processing transactions. We can make two observations from Figure 5.4. First, under light load the useful work due to the lock manager is around 10-15% (a relatively small fraction of the total), corroborating other studies, such as [HAMS08].
Second, nearly all contention in the system arises within the lock manager, and that contention component grows rapidly, eventually accounting for nearly 75% of the transaction's CPU time under heavy load. Figure 5.4, as well as the left-most bar of Figure 5.3, suggests that to improve scalability we must focus on eliminating contention within the lock manager. Though database designers are often willing to sacrifice consistency or other properties if it improves performance [Hel07, JFRS07, Vog09], our goal is to design a mechanism that does not change transaction consistency semantics or introduce other anomalies. It must be transparent and automatic, and it must impose minimal performance penalty under light load or when there is no contention in the lock manager. Next we present speculative lock inheritance, a technique that reduces contention within the lock manager and achieves the aforementioned goals.

[Figure 5.5: Agent threads which detect contention at the lock manager retain hot locks beyond transaction commit, passing them directly to the transactions which follow.]

5.3.1 Speculative lock inheritance

The key to reducing contention within the lock manager with speculative lock inheritance is the observation that virtually all transactions request high-level locks in compatible modes; even requests for exclusive access to particular rows or pages in the database generate compatible intention locks higher up, and transactions which require coarse-grained exclusive access are extremely rare in scalable workloads. Further, in the absence of intervening updates, it makes no semantic difference whether a shared-mode (SH) or intention-shared-mode (IS) lock is released and re-acquired or simply held continuously. Either way a transaction will see the same unchanged object, and other transactions are free to interleave their reads of the object as well. Speculative lock inheritance, or SLI, exploits the lack of logical contention for hot, shared database locks to reduce physical contention for their internal lock state. As Figure 5.5 shows, SLI allows a completing transaction to pass on some of the locks it acquired to the transactions which follow and are going to be executed by the same worker (agent) thread. This avoids a pair of release and acquire calls to the lock manager for each such lock. During the lock release phase of transaction commit, the transaction's agent thread identifies promising candidate locks and places them in a thread-local lock list instead of releasing them. It then initializes the next transaction's lock list with these previously acquired locks, hoping that the new transaction will use some of them. Successful speculation improves performance in two ways. First, a transaction which inherits useful locks makes fewer lock requests, with correspondingly lower overhead and better response time; short transactions amortize the cost of the lock acquire over many row accesses instead of just one. Second, other transactions which do request the lock will face less contention in the lock manager. In the following subsections we elaborate on the details of the speculative lock inheritance mechanism.
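As a rough, illustrative sketch only (the names are invented, and most of the machinery described in the following subsections, such as the recursive parent check and the invalidation protocol, is omitted), the commit-time decision and the fast reclaim path might look like this:

    #include <atomic>

    // Hypothetical lock-request states used by the sketch; the subsections below
    // describe the criteria and the invalidation protocol this approximates.
    enum class ReqState { GRANTED, INHERITED, INVALID };

    struct LockRequest {
        std::atomic<ReqState> state{ReqState::GRANTED};
        bool is_coarse;        // page-level or higher in the hierarchy
        bool is_shared_mode;   // S, IS, IX, ...
        bool is_hot;           // latch contention observed recently
        bool has_waiters;      // another transaction waits (e.g. for exclusive access)
    };

    // Commit path: keep promising locks instead of releasing them through the
    // lock manager (an approximation of the candidate criteria listed later).
    bool keep_for_inheritance(LockRequest& r) {
        if (r.is_coarse && r.is_hot && r.is_shared_mode && !r.has_waiters) {
            r.state.store(ReqState::INHERITED, std::memory_order_release);
            return true;       // moved to the agent thread's private list
        }
        return false;          // released normally
    }

    // Next transaction: reclaim an inherited lock with a single CAS, without
    // entering the lock manager at all; failure means the lock was invalidated.
    bool try_reclaim(LockRequest& r) {
        ReqState expected = ReqState::INHERITED;
        return r.state.compare_exchange_strong(expected, ReqState::GRANTED,
                                               std::memory_order_acq_rel);
    }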
Instead, it changes the request status from granted to inherited and moves it from the transaction’s private list to a different private list owned by the transaction’s agent thread. When the agent thread executes its next transaction, it pre-populates the new transaction’s lock cache with the inherited locks. The speculation succeeds if the new transaction attempts to request an inherited lock: it will find the request already in its cache, update its status from inherited back to granted, and add it to its lock list as if it had just acquired it. The status update uses an atomic compare-and-swap operation and does not require calling into the lock manager, allocating requests, or updating latch-protected lock state. Inheritance fails harmlessly if the transaction does not make use of the lock(s) it inherited: they do not cause overhead during transaction execution and the transaction simply releases them at commit time along with the locks it did use. If another transaction encounters an inconveniently inherited lock request and an atomic compare-andswap to invalid state succeeds, it simply unlinks the request from the queue and continues. Future attempts to reclaim the lock will fail, and the next time the owning agent completes a transaction it will deallocate any invalid requests it finds. Lock inheritance is a very lightweight operation regardless of whether it eventually succeeds or not. In the worst case a transaction does not use the lock it inherited, and pays the cost of releasing the lock which the previous transaction avoided. Both invalidations and garbage collection are performed only when a transaction is already traversing the queue and add only minimal overhead. In the best case the lock manager will be completely relieved of requests for hot locks, with a corresponding boost to performance. Criteria for inheriting locks The speculative lock inheritance mechanism uses five criteria to identify candidate locks which are likely to benefit subsequent transactions with minimal risk of reducing concurrency: 1. The lock is page-level or higher in the hierarchy (no record-level locks). 2. The lock is “hot” (i.e. we observed contention for the latch protecting it). 3. The lock is held in a shared or intention mode (e.g. S, IS, IX). 4. No other transaction is waiting on the lock (e.g. to lock it exclusively). 74 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS 5. The previous conditions recursively hold for the lock’s parent, if any. The first two criteria favor locks which are likely to be reused by a subsequent transaction. Very fine-granularity locks such as row locks are so numerous that the overhead of tracking them outweighs the benefits, while a lock which has only one outstanding request at release time is unlikely to have another request arrive in the near future. We detect a “hot” lock by tracking what fraction of the most recent several acquires encountered latch contention and enabling SLI when the ratio crosses a tunable threshold. 6 Criteria 3 and 4 ensure SLI does not hurt performance and concurrency or lead to starvation. The last criterion ensures that SLI maintains the hierarchical locking protocol. Ensuring correctness SLI preserves consistency semantics by only passing shared- or intention-mode locks from one transaction to another. Assuming the first transaction acquired its locks in a consistent way, the new transaction will inherit consistent locks. 
In addition, such lock modes ensure that the previous transaction did not change the corresponding data objects. From the perspective of a new transaction, an inherited lock request looks just like any other request that happened to be granted with no intervening updates since it was last released. Two-phase locking semantics are preserved because the inheritance is not finalized until the new transaction actually requests the lock. If an exclusive request arrives before then it invalidates the inheritance and the inheriting transaction must make a normal request. Therefore, from a semantic perspective an inherited lock was released and reacquired; only the underlying implementation has changed. From the perspective of both the inheriting and any competing transactions which arrive after the original transaction completes, the request was granted in the same order it would have been had SLI not intervened. A mixture of inherited and non-inherited locks is consistent and serializable for the same reasons. SLI preserves the hierarchical locking protocol by only inheriting locks whose parents are also eligible. Any inherited lock “orphaned” when its parent is invalidated will also be invalidated before any transaction tries to use it, thus avoiding the case where a low-level lock is held without appropriate locks on its ancestors. A transaction could also potentially acquire locks in a different order than expected if it inherits locks which it would have requested later than the beginning of the transaction. For example, Figure 5.6 shows how SLI could potentially induce new types of deadlocks between transactions that are otherwise well-behaved. During normal execution (left), transaction agents T1 and T2 both acquire lock L2 followed by L1. Whichever agent requests the 6 To achieve that we have to slightly modify the latch implementation. 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 75 Figure 5.6: Example of SLI-induced deadlock. lock second has to wait until the other commits its current transaction, but no deadlock is possible. However, enabling SLI (right) allows T1 to inherit L1 from a previous transaction. If agents could not invalidate inherited but not-yet-used locks, T1 would have effectively acquired its locks in reverse order and could deadlock with T2. Fortunately, SLI-induced deadlocks can easily be avoided because the transaction must still reclaim the lock before accessing the data; if any exclusive request arrives for an inherited lock before the inheriting transaction first requests it, the lock manager invalidates the inheritance. Once the request has been reclaimed the transaction has effectively acquired the lock in its natural order and conflicting requests will have the same risk of deadlock as in the unmodified system. Non-uniform locking patterns One concern with SLI is that for real-world workloads there might exist locking patterns which interfere with normal operations, preventing it from achieving its full potential. Next, we discuss two potential patterns, the moving hotspot and the bimodal workload, and show that they do not prevent SLI from being effective. Many workloads do not access data uniformly over time. Instead, the object of interest shifts, and contention with it. A common example is a table (such as a history or log) with heavy append traffic. For a given page of the table, for instance, high contention will disappear as soon as the page fills and transactions begin inserting records in a different page. 
This moving target presents two potential difficulties for SLI. First, old unnecessarily inherited locks might pollute transaction caches and lock lists, and waste space in the lock manager’s hash table. Second, newly hot locks will not be inherited at first, leading to 76 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS contention. Fortunately, neither problem occurs in practice because SLI has a short memory: if transactions do not use inherited locks their agent thread will release them quickly; if new sources of contention appear, SLI will quickly begin inheriting the problematic locks. A bimodal workload consists of two groups of transactions which access different sets of locks. If the distribution of transactions to agent threads is random, a high fraction of transactions will not utilize the locks they inherited from the previous, possibly different, transaction type. Given the short memory and inheritance criteria, the lock manager may stop inheritance even though it would be beneficial to continue. There are several potential ways to make SLI resistant to this sort of workload: • Identify groups of transaction types which acquire similar locks, and bias the assignment of transactions to agent threads so that similar transactions execute with the same agents most of the time. This approach would require either application developer assistance or some form of cluster identification based on observing which high-level locks each transaction type tends to acquire. • Apply a small hysteresis or momentum which prevents the lock manager from dropping inheritance just because one transaction did not use the lock. This approach is straightforward and inexpensive to implement using only local knowledge, but would tend to increase the number of useless locks which pass between transactions. • Do nothing. The fewer locks in common different transaction types acquire, the less contention their requests will cause and the less opportunity SLI has in the first place. Additionally, because contention tends to grow quadratically, 7 even a minor reduction in the number of threads competing for a lock request provides a significant improvement. In our experimentation we find that the third approach works well in practice, though as the number of cores per chip continues to increase, contention may grow to the point that only a subset of the total threads are required to cause significant contention. Passing information with agent threads While implementing SLI, we noticed that SLI’s technique to pass information across transactions executed by the same agent thread is very handy and can be used to remove other sources of un-scalable critical sections. In our prototype we employ this technique to avoid accessing the metadata database pages frequently and practically eliminate the critical sections associated with the metadata component. 7 If N threads all contend for the same object, each can expect to wait for N/2 threads, for O(N 2 ) total time wasted blocking or spinning. 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 77 In a naive design and for transactions that do few operations per object accessed, accessing metadata data pages may constitute a non-negligible source of critical sections. For example in our baseline implementation, each transaction maintains a data structure which caches access information about the various database objects the transaction had to access so far, so that in case it has to re-access them to do that quickly. 
This metadata information can be the page id of the root of an index or the page id of the first page of heap file, and is being stored in regular database pages. 8 When an agent thread accesses a metadata database page it has to enter a critical section. In the extreme case, if a transaction does a single record probe per table (e.g. a transaction that probes for a single customer), the critical section for accessing the metadata information constitutes a significant fraction of the critical sections for that object. Obviously the more operations per object accessed, the smaller that component. To address the problem of critical sections for accessing metadata, each agent thread populates the transaction-local metadata data structure as usual, but at transaction completion it does not destroy that data structure. Instead, the agent thread passes the populated data structure to the next transaction it will serve, a la SLI. The next transaction finds the metadata data structure pre-populated. At the infrequent case where the metadata information of a database object is stale (e.g. an object has been destroyed) the access will return with an error. The agent thread needs to refresh its metadata information and re-attempt before aborting. As we will see in the evaluation (Section 5.3.2), this technique removes a significant source of un-scalable critical sections for short transactions, while its overhead is negligible since metadata information changes very infrequently. 5.3.2 Evaluation of SLI We evaluate several individual transactions and transaction mixes on a multicore machine to identify both the opportunity for and the effectiveness of speculative lock inheritance. We make use of four metrics to determine the effectiveness of SLI. First, we consider the numbers and types of locks which are responsible for contention, using software counters. Second, we use profiling tools to identify bottlenecks (or lack of them) through time breakdowns. Third, we consider the resulting anatomy of critical sections, and, finally, we measure system throughput to quantify the performance impact of SLI. We perform all experiments on a Sun Niagara II machine running Solaris 10. The Niagara II chip [JN07] contains 8 cores each supporting 8 hardware contexts, for a total of 64 OS8 That is, a database page belongs either to a heap file, an index or contains metadata. 78 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS CPU utilization (out of 64) 64 48 32 LM contention Other contention LM overhead Computation 16 0 Figure 5.7: Execution time breakdown for the baseline system running transactions from the TATP, TPC-B, and TPC-C benchmark, each at the load giving peak performance. High bars with low contention are best. visible “CPUs.”. That is twice as many as the Niagara I machine which we used for the evaluation of baseline Shore-MT in Section 5.2. Opportunity analysis To demonstrate the potential for SLI, we perform a two-fold opportunity analysis. We first profile the baseline system under high load and produce a breakdown of overheads arising out of the lock manager, giving an upper bound on the performance improvement SLI could achieve. We also examine the number and types of locks acquired by transactions to verify the basic underlying idea behind SLI, which is to exploit shared and intention locks to reduce contention. Lock manager overhead and contention. 
To identify the magnitude of contention within the lock manager, we profile transactions and transaction mixes from the TATP, TPC-B and TPC-C benchmarks. First, we find the load (number of concurrent clients) which maximizes performance for each workload and, then, for that load we plot the time breakdown. Figure 5.7 shows the normalized work breakdown extracted from the profiler output. The height of each bar shows the number of hardware contexts utilized at peak performance. Note that many transactions peak long before utilizing all 64 available contexts. Each column in the graph shows the fraction of CPU time a transaction spent in both work and contention, both inside and outside the lock manager. The results confirm that the lock manager is a large bottleneck in the system, especially for the smaller transactions, such as those from TATP. As expected, the largest TPC-C transactions do not suffer from the lock manager bot- 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 79 Throughput (ktps) TATP 60 50 TPC-C Payment 40 TPC-B 30 20 10 0 0 16 32 48 Hardware contexts utilized 64 Figure 5.8: Impact of lock manager bottleneck as load varies. Performance should be increasing monotonically for utilization under 64 hardware contexts. tleneck: Stock queries a large amount of data and thus amortizes the cost of acquiring high-level locks; Delivery is not only large but also introduces true lock contention which blocks transactions so they do not compete for the lock manager. The measured (useful) lock manager overheads range between 10-20%, corroborating the results in [HAMS08]. The profiler results also indicate that the lock manager bottleneck is smaller for mixes of transactions which access a wider variety of tables, like the TATP and TPC-C Mix), even though transaction size has a far stronger effect. Mixing different transactions together reduces the bottleneck for two reasons: different access patterns spread contention over more types of locks and agents running long transactions spend less time in the lock manager, easing pressure. For the workloads with small transactions we expect the bottleneck to grow over time as more cores per chip allow multiple different hotspots in the lock manager at the same time. Distributing hotspots over multiple tables will not eliminate contention in the long run because, even if the number of heavily-accessed tables in a workload grows over time, we do not expect it to grow uniformly or nearly as fast as core counts. Figure 5.8 illustrates the impact in performance of the lock manager bottleneck as we increase the load on the system along the x-axis from near idle to saturated. Each data series shows the throughput achieved at different CPU utilizations for the TATP and TPC-B benchmarks, as well as for TPC-C Payment transactions. For small numbers of hardware contexts utilized, we see throughput increasing nearly linearly and the system scales well. However, as the number of hardware contexts increases past 32 contexts the lock manager bottleneck begins to impact performance, and by 48 contexts the bottleneck becomes severe enough CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS 80 Breakdown by lock type 120% 6 7 5 10 6 100% 80% 60% 40% 16 8 6 9 6 19 19 37 117 274 1099 68 114 Hot, X-High Hot, X-Row Hot, S-High Hot, S-Row Cold High Cold Row 20% 0% Figure 5.9: Breakdown of SLI-related characteristics for locks acquired by each workload. SLI works with shared- or intention-mode locks, both types are grouped together as S-High and S-Row. 
that throughput starts to drop – the system is unable to utilize effectively the additional processing power available to it. Opportunity for lock inheritance. Two conditions must hold in order for SLI to improve system scalability: most contention in the system should center around the internal state of hot locks, and those locks must be inherit-able, meeting the SLI’s criteria, for passing them from one transaction to the next. The previous section illustrates that, for short transactions, the lock manager is indeed the primary source of contention in the system. We now analyze lock access patterns to evaluate the opportunity for SLI to reduce that contention (and verify its hypothesis). The analysis considers three characteristics: hot vs. cold lock, shared vs. exclusive requests and row-level locks vs. those higher in the hierarchy. SLI targets hot, shared- or intention-mode, high-level locks. We are not interested in cold locks because they do not cause contention within the lock manager, SLI cannot work with exclusive lock modes because it would impact concurrency, and we hypothesize that row-level locks which are hot and in shared or intention mode are too rare to be worth considering. Therefore, SLI will have the most potential to improve performance if a large fraction of locks meet the inheritance criteria and if most remaining locks are cold. We note that it is entirely possible for many transactions to wait on “cold” locks, especially in badly-behaved workloads. However, true lock contention serializes transactions, and the resulting low concurrency reduces contention for the lock’s internal state, making SLI unnecessary. 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI Breakdown by lock type 120% 4 5 3 6 100% 4 9 5 4 5 4 12 81 12 12 49 30 345 31 40 Not Inherited 80% Discarded 60% Invalidated 40% Upgraded 20% Used 0% Figure 5.10: Breakdown of outcomes for locks which SLI could choose to pass between transactions. Figure 5.9 shows a breakdown of the types of locks acquired by each transaction or transaction mix. The number at the top of each column is the average number of locks acquired per transaction. SLI targets locks which are both hot and inherit-able. Hot locks which remain cannot be handled by SLI and ideally contribute only a small fraction of the total. As expected, the smallest transactions acquire few locks but most of those locks are inherit-able and many are hot. As transactions acquire more and more locks the number of hot and inherit-able locks does not increase as quickly, indicating lower contention in the lock manager and less opportunity for SLI. We observe that, for the workloads analyzed here, there are very few, if any, hot non-inheritable locks and that transactions with the most hot and inherit-able locks also experience the highest contention in the lock manager according to Figure 5.8. Together these indicate that indeed SLI has the potential to reduce or eliminate the lock manager bottleneck. We note that, though there are relatively few hot and inheritable locks in the breakdown, transactions inheriting them will have a disproportionate impact in reducing contention. Row-level locks, though numerous, are not usually hot, and even less often both hot and inherit-able. Effectiveness of lock inheritance We first examine the effectiveness of SLI in passing locks between transactions. When inheritance is effective most hot locks in the system are inherited and used by succeeding transactions. 
SLI will not eliminate fully the lock manager bottleneck if hot locks are not inherited, or remain unused and are discarded, or are invalidated before transactions reclaim them. 82 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS CPU utilization (out of 64) 64 48 32 16 SLI contention LM contention Other contention SLI overhead LM overhead Computation 0 Figure 5.11: Breakdown of CPU utilization for each workload when SLI active and the machine is saturated. Lower contention is better. Figure 5.10 shows the breakdown of outcomes for only the hot locks in the system for each transaction and mix. SLI is selective, passing only hot locks between transactions. For shorter transactions most locks are hot, though a significant fraction of them are invalidated and cannot be used. The longest transactions have virtually no hot locks because they acquire so many that relatively little time per transaction goes to any one request. We also note that mixing multiple transaction types increases the number of locks which are invalidated, and also increases the number of useless locks which transactions eventually discard, potentially due to bimodal patterns discussed previously. However, as we will see in the next section, the locks which are successfully inherited are also the ones responsible for most of the lock manager bottleneck. Time breakdowns with SLI We expect SLI to work best when there is a heavy load of many small and usually nonconflicting transactions. Workloads under low load or with transactions which are large or conflicting, will not benefit nearly as much. Figure 5.11 shows the work breakdown of transactions when SLI is active. Significantly, none of the transactions has a large contribution from lock manager contention any more. This indicates that SLI is effective in identifying and passing the locks which cause most lock manager contention. We also note that SLI has low overhead. Even in the worst case it adds only 5% overhead usually with a corresponding decrease in lock manager overhead. For example, locks which are inherited but never used must still be released, and that overhead counts toward SLI, not the lock manager. In most cases, contention in the lock manager is replaced by useful work, 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 83 suggesting a significant performance improvement. However, the NewOrder transaction sees a shift of contention from the lock manager to other areas, mostly the space manager. As expected from Figure 5.7, the two large TPC-C transactions are virtually unchanged by SLI because they did not have a significant lock manager component to begin with. Ideally, SLI would also reduce transaction overhead by avoiding calls to the lock manager. However, we observe this effect to be negligible (< 4%) for even the shortest transactions, given the large fraction of locks which are not inherited. 9 Overall, when SLI is active transactions spend 75% or more of the time doing useful work even though the system is fully loaded, in contrast with Figure 5.7. For example, in TATP SLI exhibits lower contention at 95% machine utilization than baseline does at 60% utilization. Thus, SLI gives large speedup for this workload because it not only eliminates contention for existing load, but allows load to increase without the contention returning. Anatomy of critical sections and performance impact with SLI The primary goal of SLI is to reduce contention within the lock manager so it does not impede scalability. 
SLI achieves this goal by avoiding to interact with the lock manager of the acquisition/release of some “hot” locks. The second bar of Figure 5.3 shows the anatomy of critical sections for the simple TATP UpdateLocation transaction, when SLI is active. We observe that the number of critical sections entered on average is significantly reduced. Two are the reasons for this reduction. First, SLI is efficient in picking the right locks to inherit, so transactions need to interact with the lock manger less frequently. Also, as we discussed in Section 5.3.1, the metadata information is propagated from one transaction to the next, reducing the interaction with the metadata manager as well. The final result is that the number of un-scalable critical sections drops by 30% (from 60 to 40). Given the significant reduction of the number of un-scalable critical sections, we expect a significant boost in overall performance. Figure 5.12 compares performance of the baseline system with SLI on two benchmarks, TATP and TPC-B, which consist of short running transactions that put pressure on the lock manager component. As expected, TATP and TPC-B with their short transactions benefit significantly from SLI. Baseline system stops scaling somewhere between 32 and 48 hardware contexts. With SLI enabled, the system’s performance in TATP increases almost linearly up to 64 hardware contexts, as many as the machine has. In TPC-B performance increases again up to 64 hardware contexts, but this 9 Chapter 6 presents a technique that not only eliminates contention on the lock manager, but also significantly reduces locking overheads. 84 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS Throughput (ktps) 100 TATP TATP (SLI) 80 TPC-B 60 TPC-B (SLI) 40 20 0 0 16 32 48 Hardware Contexts Utilized 64 Figure 5.12: Performance improvement due to SLI, for TATP and TPC-B. time the increase is not linear because logging becomes the bottleneck. In the next section (Section 5.4) we focus on the logging-related problem(s). 5.4 Downgrading log buffer insertions to composable critical sections The log manager, an essential component of any transaction processing system which ensures the system ability to recover from crashes [MHL+ 92], is another potential source of bottlenecks. 10 The logging-related problem we are focusing on in this section is the log record inserts to the main memory log buffer. The log buffer inserts belong to the un-scalable type of critical section. As hardware parallelism increases, a large number of threads simultaneously attempt to insert to a centralized log buffer and the contention becomes a significant and growing fraction of total execution time. Continuing the analysis of the critical sections of the TATP UpdateLocation transaction, the second bar of Figure 5.3 shows that when SLI is enabled the log buffer inserts constitute around the 20% of the un-scalable critical sections. Where current hardware trends generally reduce other logging-related bottlenecks (e.g. solid state drives reduce I/O latencies [LMP+ 08, Che09, JPS+ 10]), each successive processor generation aggravates contention for log buffer inserts. We therefore consider the log buffer inserts as the most challenging logging-related problem with respect to future scalability. 10 See [JPS+ 10] and [JPS+ 11] for a detailed discussion on logging-related bottlenecks on multicore and multisocket hardware. 5.4. 
DOWNGRADING LOG BUFFER INSERTIONS 5.4.1 85 Log buffer designs Most database engines use some variant of ARIES [MHL+ 92], which assigns each log record a unique log sequence number (LSN). The LSN encodes a record’s disk address, acts as a timestamp for data pages written to disk, and serves as a pointer to log records both in memory and on disk. It is also convenient for LSN to serve as addresses in the log buffer, so that generating an LSN also reserves buffer space. In order to keep the database consistent in spite of repeated failures, ARIES imposes strict ordering constraints on LSN generation. While a total ordering is not technically required for correctness, valid partial orders tend to be too complex and interdependent to be worth pursuing as a performance optimization. 11 . Because of its serial nature, LSN generation and the accompanying log inserts impose serious limitation on parallelism in the system. In this section we attack the problem at its root, developing techniques which allow LSN generation to proceed in parallel. We achieve parallelism by adapting the concept of “elimination”, introduced at [ST97], to allow the system to generate sequence numbers in groups. An especially desirable effect of this grouping is that increased load leads to larger groups rather than causing contention. We also explore the performance trade-offs that come from decoupling the LSN generation process from the actual log buffer insert operation. We begin by considering the basic log insertion algorithm, which consists of three distinct phases: 1. LSN generation and log buffer acquire. The thread first claims the space it will eventually fill with the intended log record 2. Log record insertion. The thread copies the log record in the buffer space it has claimed. 3. Log buffer release. The transaction releases the buffer space, which allows the log manager to write the record to disk. Baseline implementation A straightforward log insert implementation acquires a central mutex before performing all three phases and the mutex is released at the same time as the buffer. That is, in the straightforward implementation there is a single un-scalable critical section which every thread needs to execute for every log insert it makes. This approach is attractive 12 for its simplicity: log inserts are relatively inexpensive, and in the monolithic case buffer release is simplified to a mutex release. Further, even though 11 12 We explore this option at [JPS+ 11] . There are indications that popular systems employ this simple design. One of them is PostgreSQL [Pos11] . 86 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS Figure 5.13: Illustrations of several log buffer designs. The baseline system can be optimized for shorter critical path (D), fewer threads attempting log inserts (C), or both (CD). LSN generation is fully serial, it is also short and predictable (barring exceptional situations such as buffer wraparound or full log buffer, which are comparatively rare). The monolithic log insert suffers a major weakness because it serializes buffer fill operations, even though buffer regions never overlap, adding their cost directly to the critical path. In addition, log record sizes vary significantly, making copying costs unpredictable. Figure 5.13 (B) illustrates how a single large log record can impose long delays on later threads. 
This situation arises frequently in our system because the distribution of log records has two strong peaks at 40B and 264B (a 6x difference) and the largest log records can occupy several KB each. To permanently eliminate contention for the log buffer, we seek to make the cost of accessing the log independent of both the sizes of the log records being inserted and the number of threads inserting them. The following subsections explore both approaches and propose a hybrid solution which combines them. Consolidating buffer allocation A log record consists of a standard header followed by an arbitrary payload. Log buffer allocation is composable in the sense that two successive requests also begin with a log header and end with an arbitrary payload. We exploit this composability by allowing threads to combine their requests into groups, carve up and fill the group’s buffer space off the critical path, and finally release it back to the log as a unit. To this end we extend the idea of elimination-based backoff [HSY04, MNSS05], a hybrid approach combining elimination trees 5.4. DOWNGRADING LOG BUFFER INSERTIONS 87 [ST97] with backoff. Threads which encounter contention back off, but instead of sleeping or counting cycles they congregate at an elimination array, a set of auxiliary locations where they attempt to combine their requests with those of others. When elimination is successful threads satisfy their requests without returning to the shared resource at all, making the backoff very effective. For example, stacks are amenable to elimination because push() and pop() requests which encounter each other while backing off can cancel each other directly via the elimination array and leave. Similarly, threads which encounter contention for log inserts back off to a consolidation array and combine their requests before reattempting the log buffer. We use the term “consolidation” instead of “elimination” because, unlike with a stack or counter, threads must still cooperate after combining their requests so that the last to finish can release the group’s buffer space. Like an elimination array, any number of threads can consolidate into a single request, effectively bounding contention at the log buffer to the number of array entries protecting the log buffer, rather than the number of threads in the system. The net effect of consolidation is that only the first thread from each group competes to acquire buffer space from the log, and only the last thread to leave must wait to release it. Figure 5.13 (C) depicts the effect of consolidation; the first thread to arrive is joined by two others while it waits on the log mutex and all three proceed in parallel once the mutex acquire succeeds. However, as the figure also shows, consolidation leaves significant wait times because only buffer fill operations within a group proceed in parallel; operations between groups are still serialized. Given enough threads in the system, at least one thread of each group is likely to insert a large log record, delaying later groups. Decoupling buffer fill and delegating release Because buffer fill operations are not inherently serial (records never overlap) and have variable costs, they are highly attractive targets to move off the critical path. All threads which have acquired buffer regions can safely fill those regions in any order as long as they release their regions in LSN order. We therefore modify the original algorithm so that threads release the mutex immediately after acquiring buffer space. 
Buffer fill operations thus become pipelined, with a new buffer fill starting as soon as the next thread can acquire its own buffer region. Decoupling log inserts from holding locks results in a non-trivial buffer release operation which becomes a second critical section. Like LSN generation, buffer release must be serialized to avoid creating gaps in the log. Log records must be written to disk in LSN order because recovery must stop at the first gap it encounters; in the event of a crash any com- 88 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS mitted transactions beyond a gap would be lost. No mutex is required, but before releasing its own buffer region, each thread must wait until the previous buffer has been released. With pipelining in place, arriving threads can overlap their buffer fills with that of a large log record, without waiting for it to finish first. Figure 5.13 (D) illustrates the improved concurrency that results, with significantly reduced wait times at the buffer acquire phase. Under most circumstances, log record sizes do not vary enough that threads wait for previous ones to release the buffer, but high skew in the record size distribution will limit scalability because a very large record will force small ones which follow to wait for it to complete. A further optimization (not shown in the figure) allows threads to delegate their buffer release to a predecessor which has still not completed. To summarize the delegated buffer release protocol, threads which would normally have to wait for a predecessor instead attempt to mark their buffer as abandoned using an atomic compare-and-swap operation. Threads which succeed in abandoning their buffer before the predecessor notifies them are free to leave, forcing the predecessor to release all buffers that would have waited for it. In addition to making the system much less sensitive to large log inserts, it also improves performance because a single thread releases groups of buffers in a tight loop rather than communicating the releases with other threads. Putting it all together: a hybrid log buffer In the previous two subsections we outlined (a) a consolidation array which reduces the number of threads entering the log insert critical section, and (b) a decoupled buffer fill which allows threads to pipeline buffer fills outside the critical section. Neither approach eliminates all contention by itself, but the two are orthogonal and can be combined easily. Consolidating groups of threads limits log contention to a constant that does not depend on the number threads in the system, while providing a degree of buffer insert pipelining (within groups but not between them). Decoupling buffer fill operations allows pipelining between groups and reduces the log critical section length by moving buffer outside, thus making performance relatively insensitive to log record sizes. The resulting design, shown in Figure 5.13 (CD), achieves bounded contention for threads in the buffer acquire stage and maximum pipelining of all operations. As we will see in the evaluation section, the hybrid version consistently outperforms the other configurations by combining their best features. 5.4.2 Evaluation of log buffer re-design This section details the sensitivity of the consolidation array based techniques to various parameters. 5.4. 
DOWNGRADING LOG BUFFER INSERTIONS 89 Experimental setup To isolate the log buffer inserts from any other logging-related bottlenecks we are using a modified version of Shore-MT where we integrated the optimizations described in [JPS+ 10], namely Early Lock Release and Flush Pipelining. In addition, to eliminate contention in the lock manager and focus on logging, we employ SLI (see previous section). We run the TATP, TPC-B and TPC-C benchmarks as well as a log insert microbenchmark. For that microbenchmark, we extract a subset of Shore-MT’s log manager as an executable which supports only log insertions without flushes to disk or performing other work, thereby isolating the log buffer performance. We then vary the number of threads, the log record size and distribution, and the timing of inserts. All results report the average of 10 30-second runs unless stated otherwise. We do not report variance because all measurements were within 2% of the mean. Measurements come from timers in the benchmark driver as well as Sun’s profiling tools. Profiling is highly effective at identifying software bottlenecks even in the early stages before they begin to impact performance, because problematic functions can be seen to shift their position in the timing breakdowns. All experiments were performed on a Sun Niagara II machine with 64 hardware contexts and 64GB of main memory running Solaris 10. Because our focus is on the logging subsystem, and because modern transaction processing workloads are largely memory resident [SMA+ 07], we use memory-resident data sets, while disk still provides durability. Log buffer contention First, to set the stage, we measure log buffer contention. Already from the second bar of Figure 5.3 we expect the log buffer inserts to be a potential scalability bottleneck. Figure 5.14 shows the time breakdown for Shore-MT using its baseline log buffer implementation as an increasing number of clients submit the UpdateLocation transaction from TATP. As the load increases, the time each transaction spends contenting for the log buffer increases at a point which the log buffer contention becomes the bottleneck taking more than 35% of the execution time. This problem will only grow as processor vendors release more parallel multi-core hardware. Impact of log buffer optimizations (microbenchmarks) A database log manager should be able to sustain any number of threads regardless of the size of the log records they insert, limited only by memory and compute bandwidth. Next, through a series of microbenchmarks we determine how well the log buffer designs proposed in Section 5.4.1 meet these goals. In each experiment we compare the baseline CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS 90 CPU time (secs) 100% 80% Log mgr. contention 60% Other contention 40% Log mgr. work 20% Useful work 0% 2% 13% 25% 38% 50% 63% 75% 88% 97% Load Figure 5.14: Breakdown of the execution time of Shore-MT with two log optimizations (ELR and flush pipelining) enabled, running TATP UpdateLocation transactions as load increases. The log buffer inserts become the bottleneck. implementation with the consolidation array (C), decoupled buffer insert (D), and the hybrid solution combining the two optimizations (CD). We examine scalability with respect to both thread counts and log record sizes and we analyze how the consolidation array’s size impacts its performance. Further experiments explore the impact of skew in the record size distribution and of changing the number of slots in the slot array. 
Scalability with respect to thread count. The most important metric of a log buffer is how many insertions it can sustain per unit time, or the bandwidth which the log can sustain at a given average log insert size. It is important because core counts grow exponentially while log record sizes are application- and DBMS-dependent and are fixed. The average record size in our workloads is about 120 bytes and a high-performance application generates between 100 and 200MBps of log, or between 800K and 1.6M log insertions per second. Figure 5.15 (left) shows the performance of the log insertion microbenchmark for records of an average size of 120B as the number of threads varies along the x-axis. Each data series shows one of the log variants. We can see that the baseline implementation quickly becomes saturated, peaking at roughly 140MB/s and falling slowly as contention increases further. 13 Due to its complexity, the consolidation array starts out with lower throughput than the baseline. But once contention increases, the threads combine their requests and performance scales linearly. In contrast, decoupled insertions avoid the initial performance penalty and perform better, but eventually the growing contention degrades performance and perform worst than the consolidation array. Finally, the hybrid approach combines the 13 Notice that even such bandwidth would saturate any mechanical disk drive 5.4. DOWNGRADING LOG BUFFER INSERTIONS 91 Throughput (GB/s) 100 CD in L1 CD 10 C D 1 Baseline 0.1 0.01 1 4 Thread count 16 64 12 120 1200 12000 Log record size (bytes) Figure 5.15: Sensitivity analysis of the consolidation array with respect to thread count and log record size. The hybrid design combines the benefits of both optimizations. best properties of both optimizations, eliminating most of the startup cost from (C) while limiting the contention which (D) suffers. The drop in scalability near the end is a hardware limitation, as described in Section 5.4.2. Overall, we see that while both consolidation and decoupling are effective at reducing contention, both have limitations which we overcome by combining the two, achieving near-linear scalability. Scalability with respect to log record size. In addition to thread counts, log record sizes also have a strong influence on the performance of the log buffer. In the case of the baseline and consolidated variants, larger record sizes increase the critical section length; in all cases, however, larger record sizes decrease the number of log inserts one thread can perform because it must copy an increasing amount of data per insertion. Figure 5.15 (right) shows the impact of these two factors, plotting sustained bandwidth achieved by 64 threads as they insert log records ranging between 48B and 12KB (the largest record size in Shore-MT). As log records grow the baseline performs better, but there is always enough contention that makes all other approaches more attractive. The consolidated variant (C) performs better at small records sizes as it can handle contention much better than the decoupled record insert (D). But once the records size is over 1KB, contention becomes low and the decoupled insert variant fares better as more log inserts can be pipelined at the same time. The hybrid variant again significantly outperforms its base components across the whole range, but in the end all three become bandwidth-limited as they saturate the machine’s memory system. 92 CHAPTER 5. 
ATTACKING UN-SCALABLE CRITICAL SECTIONS Figure 5.16: Sensitivity to the number of slots and thread count in the consolidation array. Lighter colors indicate higher bandwidth. Finally, we modify the microbenchmark so that threads insert their log records repeatedly into the same thread-local storage, which is L1 cache resident. With the memory bandwidth limitation removed, the hybrid variant continues to scale linearly with record sizes until it becomes CPU-limited at roughly 21GBps (nearly 20x higher throughput than modern systems can reach). Sensitivity to slot array size. Our last microbenchmark analyzes whether (and by how much) the performance of the consolidation array is affected by the number of available slots. Ideally the performance should depend only on the hardware and be stable as thread counts vary. Figure 5.16 shows a contour map of the space of slot sizes and thread counts, where the height of each data point is its sustained bandwidth. Lighter colors indicate higher bandwidth, with contour lines marking specific throughput levels. We achieve peak performance with 3-4 slots, with lower thread counts peaking with fewer and high thread counts requiring a somewhat larger array. The optimal slot number corresponds closely with the number of threads required to saturate the baseline log which the consolidation array protects. Based on these results we fix the consolidation array size at four slots to favor high thread counts; at low thread counts the log is not on the critical path of the system and its peak performance therefore matters much less than at high thread counts. Anatomy of critical sections and impact in overall performance To complete the experimental analysis, we measure the impact of the log buffer optimization in the overall system performance. The right-most bar of Figure 5.3 shows the anatomy of 5.4. DOWNGRADING LOG BUFFER INSERTIONS 93 TATP-UpdateLocation TPC-B Throughput (Ktps) 175 150 125 100 75 50 25 0 Throughput (Ktps) Hybrid (CD) Opt. Baseline Baseline Hybrid (CD) Opt. Baseline Baseline 80 60 40 20 0 0 10 20 30 40 #CPU Utilized 50 60 0 10 20 30 40 50 60 #CPUs utilized Figure 5.17: Overall performance improvement provided by the hybrid log buffer design when the systems run TATP UpdateLocation transactions (left) and the TPC-B benchmark (right). The optimized baseline contains the ELR and Flush Pipelining optimizations [JPS+ 10]. The hybrid log buffer achieves the highest performance and displays no lurking bottleneck. critical sections for the simple TATP UpdateLocation transaction, when, on top of SLI, we employ the hybrid log buffer design (in this graph we refer the hybrid design as “Aether”). The only difference with the second bar is that now the majority of the critical sections related to the log manager are composable, instead of un-scalable. Hence, we expect better scalability. Figure 5.17 captures the scalability of Shore-MT running TATP UpdateLocation transactions (left) and the TPC-B benchmark (right). We plot throughput as the number of client threads varies along the x-axis. The hybrid (consolidated) log buffer design improves performance by 7% and 15% respectively by eliminating log contention. The performance improvements seem to be modest. This happens simply because the peak transaction execution rate (which is achieved when the machine is saturated) does not generate enough log bandwidth for the hybrid design to significantly outperform the baseline, which Figure 5.15 shows that it can sustain approximately 140MBps. 
Nevertheless, by converting the log buffer inserts from un-scalable critical sections to composable, the hybrid log buffer displays no lurking logging-related bottlenecks and our microbenchmarks suggest that it has significant headroom to accept additional log traffic as systems scale in the future. 94 5.5 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS Related work There is a broad set of literature on scaling the performance of database systems in general, and transaction processing system in particular. Up until recently, however, the majority of the studies focused on scaling out the performance rather than scaling up. Since the two techniques we presented in this section (speculative lock inheritance and consolidated log buffer inserts) affect the lock and the log manager, in the following two subsections we briefly describe work related with those two significant transaction processing components. 5.5.1 Reducing lock overhead and contention The guiding concept of speculative lock inheritance – not releasing locks between transactions – appears in Rdb/VMS [Jos91] as a way to reduce network communication costs. Locks in this distributed database physically migrate to nodes whose transactions acquire them. The authors highlight very briefly a “lock carry-over” optimization which allows a node to avoid the overhead of returning the lock to its home node when transactions complete by caching it locally, as long as no conflicting lock requests have arrived. Each carry-over saves at least one round trip over the network in the event the lock is reused by a later transaction, improving the performance of a two-node system by over 60%. In this chapter, we apply the concept of lock carry-over to the single-node Shore-MT engine to solve the problem of contention for lock state, which did not exist with the high network overheads and low node counts (1-3 in the evaluation) experienced by Rdb/VMS. We also detail an implementation designed for modern database engines running on multicore hardware with shared memory and caches, and where transactions, not nodes, hold locks. SLI allows a centralized lock manager to distribute requests among the many threads that would otherwise contend with each other. IBM’s DB2 provides a performance tuning registry variable, DB2 KEEPTABLELOCK [IBM11], which allows transactions or even connections to retain read-mode table locks between uses, again exploiting the idea of not releasing locks unless necessary. However, transactions only benefit from the setting if they repeatedly release and reacquire the same locks, and the documentation notes that retaining table locks for the life of a connection leads to “poor concurrency” because other transactions cannot make updates until the connection closes. The setting is disabled by default. Multiversioned buffer pools [BJK+ 97] allow writers to update copies of pages rather than waiting for readers to finish. Copying avoids the need for low-level locking because older versions remain available to readers, but it does not remove the need for hierarchical locks or the corresponding contention which SLI addresses. In addition, for the common case where a transaction updates only a few bytes per record accessed, multiversioning imposes 5.6. CONCLUSION 95 the cost of copying an entire database page per record. Finally, multiversioning provides “snapshot isolation,” which suffers from certain non-intuitive update anomalies that are only partly addressed to date [JFRS07, AFR09]. 
5.5.2 Handling logging-related overheads Logging is one of the most important components of a database system, but also is one of the most complicated. Even in a single-threaded database engine the overhead of logging is significant. For example, Harizopoulos et al. [HAMS08] report that in a single-threaded database engine logging accounts for roughly 12% of the total time in a typical OLTP workload. Virtually all database engines employ some variant of ARIES [MHL+ 92], a sophisticated write-ahead logging system which integrates concurrency control with transaction rollback and disaster recovery, and allows the system to recover fully even if recovery is interrupted repeatedly by new crashes. To achieve its high robustness with good performance, ARIES couples tightly with the rest of the system, particularly the lock and buffer pool managers, and has a strong influence on the design of access methods such as B+Tree indexes [Moh90, ML92]. Main-memory database engines [DKO+ 84] impose a special challenge for log implementations because the log is the only I/O operation of a given transaction. Not only is the I/O time responsible for a large fraction of total response time, but short transactions also lead to high concurrency and contention for the log buffer. Some proposals go so far as to eliminate the log (and its overheads) altogether [SMA+ 07], replicating each transaction to multiple database instances and relying on hot fail-over to maintain durability. However, replication has its own large set of challenges [GHOS96], and it is a field of active research [TA10]. 5.6 Conclusion In this chapter we detailed two mechanisms which address scalability bottlenecks in two essential components of any transaction processing system, the lock manager and the log manager. Both mechanisms provide significant improvements in the performance of the baseline system, and reduce the number of un-scalable critical sections. Unfortunately, no matter the optimizations, the transaction execution codepath still contains a large number of critical sections. Figure 5.3 shows that the optimal design still enters 35 un-scalable critical sections. With hardware parallelism doubling each processor generation, eventually some of those critical sections will hamper scalability. This suggests that for embarrassingly 96 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS parallel execution, we need to depart from the conventional execution model and investigate more radical approaches, which is the topic of the next part (Part III). Part III Re-architecting transaction processing 97 99 Chapter 6 Data-oriented Transaction Execution While hardware technology has undergone major advancements over the past decade, transaction processing systems have remained largely unchanged. The number of cores on a chip grows exponentially, following Moore’s Law, allowing for an ever-increasing number of transactions to execute in parallel. As the number of concurrently-executing transactions increases, contended critical sections become scalability burdens. In typical transaction processing systems the centralized lock manager is often the first contended component and scalability bottleneck. In this chapter, we take a more rigorous approach against scalability bottlenecks of conventional transaction processing. We identify the conventional thread-to-transaction assignment policy as the primary cause of contention. 
Then, we design DORA, a system that decomposes each transaction to smaller actions and assigns actions to threads based on which data each action is about to access. DORA’s design allows each thread to mostly access thread-local data structures, minimizing interaction with the contention-prone centralized lock manager. Built on top of a conventional storage engine, DORA maintains all the ACID properties. Evaluation of a prototype implementation of DORA on a multicore system demonstrates that DORA eliminates any contention related to the lock manager and attains up to 4.8x higher throughput than the state-of-the-art storage engine when running a variety of synthetic and real-world OLTP workloads. 1 6.1 Introduction The diminishing returns of increasing on-chip clock frequency coupled with power and thermal limitations have led hardware vendors to place multiple cores on a single die and rely on thread-level parallelism for improved performance. Today’s multicore processors feature 64 hardware contexts on a single chip equipped with 8 cores2 , while multicores targeting spe1 2 This chapter highlights material presented at VLDB 2010 [PJHA10]. Modern cores support multiple hardware contexts, interleaving their instruction streams to improve CPU utilization. CHAPTER 6. DATA-ORIENTED TRANSACTION EXECUTION 100 DISTRICTS Thread-to-transaction (Conventional) Thread-to-data (DORA) 100 100 80 80 60 60 40 40 20 20 0 0 0.2 0.4 0.6 Time (secs) 0.8 0.2 0.4 0.6 Time (secs) 0.8 Figure 6.1: Comparison of the trace of the record accesses by the threads of a system that applies the conventional thread-to-transaction assignment of work policy (left) and a system that applies the thread-to-data policy (right). The data accesses of the conventional threadto-transaction system are uncoordinated and complex. On the other hand, the data accesses of the thread-to-data system are coordinated and show regularity. cialized domains find market viability at even larger scales. With experts in both industry and academia forecasting that the number of cores on a chip will follow Moore’s Law, an exponentially-growing number of cores will be available with each new process generation. As the number of hardware contexts on a chip increases exponentially, an unprecedented number of threads execute concurrently, contending for access to shared resources. Threadparallel applications running on multicores suffer of increasing delays in heavily-contended critical sections, with detrimental performance effects [JPH+ 09]. To tap the increasing computational power of multicores, software systems must alleviate such contention bottlenecks and allow performance to scale commensurately with the number of cores. Online transaction processing (OLTP) is an indispensable operation in most enterprises. In the past decades, transaction processing systems have evolved into sophisticated software systems with code bases measuring in the millions of lines. Several fundamental design principles, however, have remained largely unchanged since their inception. The execution of transaction processing systems is full of critical sections [JPH+ 09]. Consequently, these systems encounter significant performance and scalability problems on highly-parallel hardware. To cope with the scalability problems of transaction processing systems, researchers have suggested employing shared-nothing configurations [DGS+ 90] on a single chip [SMA+ 07] and/or dropping some of the ACID properties [DHJ+ 07, LBD+ 12]. 6.1. 
6.1.1 Thread-to-transaction vs. Thread-to-data

In this chapter, we argue that the primary cause of the contention problem is the uncoordinated data accesses that are characteristic of conventional transaction processing systems. These systems assign each transaction to a worker thread, a mechanism we refer to as thread-to-transaction assignment. Because each transaction runs on a separate thread, threads contend with each other on every single shared data access, entering a very large number of un-scalable critical sections.

The chaotic, uncoordinated access pattern of the thread-to-transaction (i.e. conventional) assignment policy becomes easily apparent with visual inspection. Figure 6.1 (left) depicts the accesses issued by each worker thread of a conventional transaction processing system to each one of the records of the District table in a TPC-C database with 10 Warehouses (a TPC-C database of scaling factor 10). The system is configured with 10 worker threads and the workload consists of 20 clients repeatedly submitting Payment transactions from the TPC-C benchmark [TPC07], while we trace only 0.7 seconds of execution; the system and workload configuration are kept small to enhance the graph's visibility. The access patterns of each transaction, and consequently of each thread, are arbitrary and totally uncoordinated.

Figure 6.1: Comparison of the trace of the record accesses by the threads of a system that applies the conventional thread-to-transaction assignment of work policy (left) and a system that applies the thread-to-data policy (right). The data accesses of the conventional thread-to-transaction system are uncoordinated and complex. On the other hand, the data accesses of the thread-to-data system are coordinated and show regularity.

To ensure data integrity under those uncoordinated accesses, each thread enters a large number of critical sections in the short lifetime of each transaction it executes. Critical sections, however, incur latch acquisitions and releases, whose overhead increases with the number of concurrent threads. Even more worrisome is that some of those critical sections are un-scalable: critical sections whose contention increases with the number of concurrent threads.

To assess the performance overhead of critical section contention, Figure 6.2 depicts the throughput attained by a state-of-the-art storage manager (Shore-MT [JPH+09] and Section 5.2) as the machine utilization increases. The workload consists of clients repeatedly submitting GetSubscriberData transactions from the TATP benchmark [NWMR09] (methodology detailed in Section 6.4.1). As the machine utilization increases, the performance per CPU utilization drops. When utilizing all 64 hardware contexts, the per-hardware-context performance drops by more than 80%. Figure 6.3 (left) shows that the contention within the lock manager quickly dominates the execution of the conventional system. At 64 hardware contexts the system spends more than 85% of its execution time on threads waiting to execute critical sections inside the lock manager.

Figure 6.2: Throughput per hardware context achieved by a conventional system and DORA when they execute the TATP GetSubscriberData transaction. Ideally, per-thread performance does not depend on the number of threads in the system.

Based on the observation that uncoordinated accesses to data lead to high levels of contention, we propose a data-oriented architecture (DORA) to alleviate contention. Rather than coupling each thread with a transaction, DORA couples each thread with a disjoint subset of the database. Transactions flow from one thread to the other as they access different data, a mechanism we call thread-to-data assignment.
DORA decomposes the transactions into smaller actions according to the data they access, and routes them to the corresponding threads for execution. In essence, instead of pulling data (database records) to the computation (transaction), DORA distributes the computation to wherever the data is mapped.

Figure 6.1 (right) illustrates the effect of the data-oriented assignment of work on data accesses. It plots the data access patterns issued by a prototype DORA system, which employs the thread-to-data assignment. The accesses in DORA are coordinated and show regularity. A system adopting thread-to-data assignment can exploit the regular pattern of data accesses, reducing the pressure on contended components. In DORA, each thread coordinates accesses to its subset of data using a private locking mechanism. By limiting thread interactions with the centralized lock manager, DORA eliminates the contention in it (Figure 6.3 (right)) and provides better scalability (Figure 6.2).

Figure 6.3: Time breakdown of the conventional (left) and DORA (right) systems executing TATP GetSubscriberData transactions. Large and/or growing contention indicates poor scalability and performance.

DORA exploits the low-latency, high-bandwidth inter-core communication of multicore systems. Transactions flow from one thread to the other with minimal overhead, as each thread accesses different parts of the database. Figure 6.4 compares the time breakdown of a conventional transaction processing system and a prototype DORA implementation when all 64 hardware contexts of a Sun Niagara II machine [JN07] are utilized running Nokia's TATP benchmark [NWMR09] and OrderStatus transactions from the TPC-C benchmark [TPC07]. From the breakdowns of TATP (left) we see that the DORA prototype eliminates the contention on the lock manager. Also, from the breakdowns of TPC-C OrderStatus (right), we see that DORA substitutes the heavy-weight centralized lock management with a much lighter-weight thread-local locking mechanism.

6.1.2 When DORA is needed

DORA is a novel transaction processing design which is useful for transactional workloads with very high execution rates that put pressure on the components of the transaction processing engine, such as the lock manager, when running on a multicore node. DORA makes no compromises in the consistency level it offers; neither does it require any modifications to the application layer, even though, as we will see, an application that is aware of DORA's partitioning is expected to perform better. Thus, DORA provides a solution for the scalability of "traditional" transaction processing within a single multicore node.

A DORA system can replace existing transaction processing systems without requiring changes in the legacy application code. In addition, it maintains the ACID properties [GR92] and makes no compromises in the data management functionality it provides (e.g. the ability to perform joins) or the supported interface (e.g. not only key-value accesses). If the database cannot fit in a single node and a scale-out solution is needed, one can easily employ the DORA design as the building block of that scale-out solution.
For applications that can tolerate relaxed consistency requirements or limited data management functionality, other solutions (possibly following the popular "NewSQL" or "NoSQL" approach) may also be suitable.

Figure 6.4: Time breakdowns of the conventional and DORA systems when the machine is fully utilized running the TATP workload (left) and TPC-C OrderStatus transactions (right). A large 'work' component indicates high throughput.

6.1.3 Contributions and chapter organization

This chapter, which is the centerpiece of the entire dissertation, makes three contributions.

1. We demonstrate that the conventional thread-to-transaction assignment results in contention at the lock manager that severely limits performance and scalability on multicores.

2. We propose DORA, a data-oriented architecture that exhibits predictable access patterns and allows us to substitute the heavyweight centralized lock manager with a lightweight thread-local locking mechanism. The result is a shared-everything system that scales to high core counts without weakening the ACID properties.

3. We evaluate a prototype DORA transaction execution engine and show that it attains up to 82% higher peak throughput than a state-of-the-art storage manager. Without admission control, the performance benefits for DORA can be up to 4.8x. Additionally, when unsaturated, DORA achieves up to 60% lower response times because it exploits the intra-transaction parallelism inherent in many transactions.

The rest of the chapter is organized as follows. Section 6.2 explains why a conventional transaction processing system may suffer from contention in its lock manager. Section 6.3 presents DORA, an architecture based on the thread-to-data assignment, and Section 6.4 evaluates the performance of a prototype DORA OLTP engine. Section 6.5 discusses weaknesses of DORA. Finally, Section 6.6 presents related work and Section 6.7 concludes.

6.2 Contention in the lock manager

In this section, we explain why in typical OLTP workloads the lock manager of conventional systems is often the first contended component and the obstacle to scalability. A typical OLTP workload consists of a large number of concurrent, short-lived transactions, each accessing a small fraction (ones to tens of records) of a large dataset. Each transaction independently executes on a separate thread. To guarantee data integrity, transactions enter a large number of critical sections to coordinate accesses to shared resources. One of those shared resources is the logical locks. (We use the term "logical locking" instead of the more popular "locking" to emphasize its difference from latching. Latching protects the physical consistency of main-memory data structures; logical locking protects the logical consistency of database resources, such as records and tables.) The lock manager is responsible for maintaining isolation between concurrently-executing transactions, providing an interface for transactions to request, upgrade, and release locks. Behind the scenes it also ensures that transactions acquire proper intention locks, and performs deadlock prevention and detection.
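To make the role of intention locks concrete, the following stand-alone C++ sketch shows the classic compatibility and supremum (least-upper-bound) tables for hierarchical lock modes, the kind of bookkeeping a lock manager consults on every request and release. This is an illustrative example under the textbook IS/IX/S/SIX/X hierarchy; the names and helpers are ours and do not reflect Shore-MT's actual data structures or API.

// Illustrative sketch of hierarchical lock modes, their compatibility, and
// the supremum (least upper bound) of two modes. Stand-alone example; not
// Shore-MT's actual data structures or API.
#include <cassert>
#include <cstdio>

enum class Mode { NL, IS, IX, S, SIX, X };   // NL = not locked

// Classic compatibility matrix for hierarchical (intention) locking.
constexpr bool kCompat[6][6] = {
    //            NL     IS     IX     S      SIX    X
    /* NL  */ { true,  true,  true,  true,  true,  true  },
    /* IS  */ { true,  true,  true,  true,  true,  false },
    /* IX  */ { true,  true,  true,  false, false, false },
    /* S   */ { true,  true,  false, true,  false, false },
    /* SIX */ { true,  true,  false, false, false, false },
    /* X   */ { true,  false, false, false, false, false },
};

bool compatible(Mode a, Mode b) {
    return kCompat[static_cast<int>(a)][static_cast<int>(b)];
}

// Weakest mode that covers both inputs; a lock manager recomputes this
// "supremum" over all granted requests when a request is released.
Mode supremum(Mode a, Mode b) {
    static constexpr Mode kLub[6][6] = {
        { Mode::NL,  Mode::IS,  Mode::IX,  Mode::S,   Mode::SIX, Mode::X },
        { Mode::IS,  Mode::IS,  Mode::IX,  Mode::S,   Mode::SIX, Mode::X },
        { Mode::IX,  Mode::IX,  Mode::IX,  Mode::SIX, Mode::SIX, Mode::X },
        { Mode::S,   Mode::S,   Mode::SIX, Mode::S,   Mode::SIX, Mode::X },
        { Mode::SIX, Mode::SIX, Mode::SIX, Mode::SIX, Mode::SIX, Mode::X },
        { Mode::X,   Mode::X,   Mode::X,   Mode::X,   Mode::X,   Mode::X },
    };
    return kLub[static_cast<int>(a)][static_cast<int>(b)];
}

int main() {
    // A reader holds IS on a table and S on a record; a writer's IX on the
    // same table is compatible with the reader's IS, so both proceed.
    assert(compatible(Mode::IS, Mode::IX));
    // A full-table S lock, however, blocks any IX requester.
    assert(!compatible(Mode::S, Mode::IX));
    assert(supremum(Mode::IS, Mode::IX) == Mode::IX);
    std::puts("lock-mode checks passed");
    return 0;
}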
Next, we describe the lock manager of the Shore-MT storage engine [JPH+09] (for a more detailed discussion of database locking and Shore-MT's lock manager see Section 2.2). Although the implementation details of commercial systems' lock managers are largely unknown, we expect their implementations to be similar. A possibly varying aspect is that of latches. Shore-MT uses a preemption-resistant variation of the MCS queue-based spin-lock [HSIS05]. On the Sun Niagara II machine, our test bed, and for the CPU loads we use in this study (< 130%), spinning-based implementations outperform any known solution involving blocking [JASA09].

The lock manager of Shore-MT is depicted on the left side of Figure 6.5. In Shore-MT every logical lock is a data structure that contains the lock's mode, the head of a linked list of lock requests (granted or pending), and a latch. When a transaction attempts to acquire a lock, the lock manager first ensures the transaction holds higher-level intention locks, requesting them automatically if needed. If an appropriate coarser-grain lock is found, the request is granted immediately. Otherwise, the manager probes a hash table to find the desired lock. Once the lock is located, it is latched and the new request is appended to the request list. If the request is incompatible with the lock's mode, the transaction must block. Finally, the lock is unlatched and the request returns.

Figure 6.5: Overview of a lock manager, with the inset depicting a lock release. Each lock head stores the lock's mode and a list of granted and pending requests; a release recomputes the new lock mode (supremum), processes upgrades, and grants new requests.

Each transaction maintains a list of all its lock requests, in the order that it acquired them. At transaction completion, the transaction releases the locks one by one, starting from the youngest. To release a lock (shown on the right side of Figure 6.5), the lock manager latches the lock and unlinks the corresponding request from the list. Before unlatching the lock, it traverses the request list to compute the new lock mode and to find any pending requests which may now be granted. Due to longer lists of lock requests, the effort required to grant or release a lock grows with the number of active transactions. Frequently-accessed locks, such as table locks, will have many requests in progress at any given point. Deadlock detection imposes additional lock request list traversals.
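As a rough illustration of why the cost of a release grows with the length of the request list, the following C++ sketch mimics the release path just described: latch the lock head, unlink the releasing request, recompute the group mode, and grant pending compatible requests in FIFO order. It is a simplified, hypothetical rendering under stated assumptions (a std::mutex stands in for Shore-MT's preemption-resistant MCS latch, only S and X modes, no actual waking of blocked threads), not the engine's code.

// Simplified sketch of a lock release: latch the lock head, unlink the
// releasing request, recompute the group mode over the granted requests, and
// grant pending requests that are now compatible. Illustrative only.
#include <cstdio>
#include <list>
#include <mutex>

enum class Mode { NL, S, X };

bool compatible(Mode a, Mode b) {
    return a == Mode::NL || b == Mode::NL || (a == Mode::S && b == Mode::S);
}

Mode supremum(Mode a, Mode b) {
    if (a == Mode::X || b == Mode::X) return Mode::X;
    if (a == Mode::S || b == Mode::S) return Mode::S;
    return Mode::NL;
}

struct Request { int txn; Mode mode; bool granted; };

struct LockHead {
    std::mutex         latch;              // protects mode and request list
    Mode               group = Mode::NL;   // supremum of all granted modes
    std::list<Request> requests;           // granted requests, then waiters
};

// Cost is linear in the request-list length, which is why hot locks such as
// table locks become more expensive to release as concurrency grows.
void release(LockHead& lock, int txn) {
    std::lock_guard<std::mutex> hold(lock.latch);
    lock.requests.remove_if([&](const Request& r) { return r.txn == txn; });

    lock.group = Mode::NL;
    for (const Request& r : lock.requests)
        if (r.granted) lock.group = supremum(lock.group, r.mode);

    for (Request& r : lock.requests) {
        if (r.granted) continue;
        if (!compatible(lock.group, r.mode)) break;   // FIFO: stop at first conflict
        r.granted = true;                             // a real manager wakes the waiter here
        lock.group = supremum(lock.group, r.mode);
    }
}

int main() {
    LockHead l;
    l.group = Mode::X;
    l.requests = {{1, Mode::X, true}, {2, Mode::S, false}, {3, Mode::S, false}};
    release(l, 1);   // txn 1 releases its X lock; both S waiters are granted
    for (const Request& r : l.requests)
        std::printf("txn %d granted=%d\n", r.txn, r.granted ? 1 : 0);
    return 0;
}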
The combination of longer lists of lock requests with the increased number of threads executing transactions and contending for locks leads to detrimental results. Figure 6.6 shows where the time is spent inside the lock manager of Shore-MT when it runs the TPC-B benchmark [TPC94] as the system utilization increases on the x-axis. The breakdown covers the time it takes to acquire the locks, the time to release them, and the corresponding contention of each operation. When the system is lightly loaded, it spends more than 85% of the time inside the lock manager on useful work. As the load increases, however, the contention dominates. At 100% CPU utilization, more than 85% of the time inside the lock manager is contention (spinning on latches).

Figure 6.6: Breakdown of time spent in the lock manager when baseline Shore-MT runs the TPC-B benchmark. High contention leads to poor performance.

6.3 A Data-ORiented Architecture for OLTP

In this section, we present the design of an OLTP system which employs a thread-to-data assignment policy. We exploit the coordinated access patterns of this assignment policy to eliminate interactions with the contention-prone centralized lock manager. At the same time, we maintain the ACID properties and do not physically partition the data. We call the architecture data-oriented architecture, or DORA.

6.3.1 Design overview

DORA is implemented as a layer on top of a mostly traditional storage manager, as depicted in Figure 6.7. Its functionality includes three basic operations:

• It binds worker threads to disjoint subsets of the database.
• It distributes the work of each transaction across transaction-executing threads according to the data accessed by the transaction.
• It avoids interactions with the centralized lock manager as much as possible during request execution.

Figure 6.7: DORA is implemented as a layer on top of a storage manager. Its three main components are (a) a resource manager, (b) a dispatcher of actions, and (c) a set of worker threads that execute actions.

Next we describe each operation in detail. We use the execution of the Payment transaction of the TPC-C benchmark as our running example. The Payment transaction updates a Customer's balance, reflects the payment on the District and Warehouse sales statistics, and records it in a History log [TPC07].

Binding threads to data

DORA couples worker threads with data by setting a routing rule for each table in the database. A routing rule is a mapping of sets of records, or datasets, to worker threads, called executors. Each dataset is assigned to one executor, and an executor can be assigned multiple datasets from a single table. The only requirement for the routing rule is that each possible record of the table maps to a unique dataset. With the routing rules, each table is logically decomposed into disjoint sets of records. All data resides in the same buffer pool and the rules imply no physical separation or data movement.

A table's routing rule may use any combination of the fields of the table. The columns used by the routing rule are called the routing fields. The columns of the primary or candidate key do not necessarily have to be the routing fields; any column can be. In practice, however, we have seen them work well as routing fields. For example, the primary key of the Customers table of the TPC-C database consists of the Warehouse id (C_W_ID), the District id (C_D_ID), and the Customer id (C_C_ID). The routing fields may be all those fields or any subset of them. In the Payment transaction example, we assume the Warehouse id is the routing field in each of the four accessed tables.

The routing rules are maintained at runtime by the DORA resource manager. Periodically, the resource manager updates the routing rules to balance load. The resource manager varies the number of executors per table depending on the size of the table, the number of requests for that table, and the available hardware resources. In the following chapter, we discuss the load balancing mechanism in more detail (Section 7.6).
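As a toy illustration of what a routing rule looks like, the C++ sketch below maps a single integer routing field (the Warehouse id of the running example) to executors with a simple range rule. The range policy and all names are assumptions made for illustration only; the prototype's rules are set and rebalanced at runtime by the resource manager and need not be range-based.

// Illustrative sketch of a DORA-style routing rule, assuming a single integer
// routing field (e.g. the Warehouse id) whose domain is range-partitioned
// across the executors assigned to a table.
#include <cstdio>

struct RoutingRule {
    int domain_size;     // number of distinct routing-field values (e.g. warehouses)
    int num_executors;   // executors currently assigned to this table

    // Map a routing-field value to the executor owning its dataset.
    // Every possible value maps to exactly one dataset, as required.
    int executor_for(int routing_value) const {
        int per_executor = (domain_size + num_executors - 1) / num_executors;
        return routing_value / per_executor;
    }
};

int main() {
    // 10 warehouses spread over 4 executors: datasets {0-2}, {3-5}, {6-8}, {9}.
    RoutingRule warehouse_rule{10, 4};
    for (int wh = 0; wh < 10; ++wh)
        std::printf("warehouse %d -> executor %d\n", wh, warehouse_rule.executor_for(wh));
    return 0;
}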
Transaction flow graphs

In order to distribute the work of each transaction to the appropriate executors, DORA translates each transaction into a transaction flow graph. A transaction flow graph is a graph of actions on datasets. An action is a subset of a transaction's code which involves access to a single record or a small set of records from the same table. The identifier of an action identifies the set of records this action intends to access. Depending on the type of the access, the identifier can be a set of values for the routing fields or the empty set. Two consecutive actions can be merged if they have the same identifier (refer to the same set).

Figure 6.8: A possible transaction flow graph for TPC-C Payment.

The more specific the identifier of an action is, the easier it is for DORA to route the action to its corresponding executor. That is, actions whose identifier contains values for all the routing fields are directed to their executor by consulting the routing rule of the table. Actions whose identifier is a subset of the routing field set may map to multiple datasets. In that case, the action is broken into a set of smaller actions, each of them resized to correspond to one dataset. Secondary index accesses typically fall into this category. Finally, actions that do not contain any of the routing fields have the empty set as their identifier. For these secondary actions, the system cannot decide which executor is responsible for them. In Section 6.3.2 we discuss how DORA handles secondary actions, while in Section 6.4.5 we evaluate DORA's performance on transactions with secondary actions.

To control the distributed execution of the transaction and to transfer data between actions with data dependencies, DORA uses objects shared across actions of the same transaction. Those shared objects are called rendezvous points, or RVPs. If there is a data dependency between two actions, an RVP is placed between them. The RVPs separate the execution of the transaction into different phases. The system cannot concurrently execute actions from the same transaction that belong to different phases. Each RVP has a counter initially set to the number of actions that need to report to it. Every executor which finishes the execution of an action decrements the corresponding RVP counter by one. When an RVP's counter becomes zero, the next phase starts. The executor which zeroes a particular RVP initiates the next phase by enqueueing all the actions of that phase to their corresponding executors. The executor which zeroes the last RVP in the transaction flow graph calls for the transaction commit. On the other hand, any executor can abort the transaction at any time and hand it to recovery.

A transaction flow graph for the Payment transaction is shown in Figure 6.8. Each Payment transaction probes a Warehouse and a District record and updates them. In each case, both actions (record retrieval and update) have the same identifier and can be merged. The Customer record, on the other hand, is probed through a secondary index 60% of the time and then updated. That secondary index contains the Warehouse id, the District id, and the Customer's last name. If the routing rule on the Customer table uses only the Warehouse id and/or the District id fields, then the system knows which executor is responsible for this secondary index access.
If the routing rule also uses the Customer id field of the primary key, then the secondary index access needs to be broken into smaller actions that cover all the possible values for the Customer id. If the routing rule uses only the Customer id, then the system cannot decide which executor is responsible for the execution, and this secondary index access becomes a secondary action. In our example, we assume that the routing field is the Warehouse id. Hence, the secondary index probe and the consequent record update have the same identifier and can be merged. Finally, an RVP separates the Payment transaction into two phases, because of the data dependency between the record insert on the History table and the other three record probes. (We should note that in [PTB+11] we present a tool that automatically generates the transaction flow graph of any arbitrary transaction, based on its SQL.)

Payment's specification requires the Customer to be randomly selected from a remote Warehouse 15% of the time. In that case, a shared-nothing system that partitions the database on the Warehouse will execute a distributed transaction with all the involved overheads. DORA, on the other hand, handles such transactions gracefully by simply routing the Customer action to a different executor. Hence, its performance is not affected by the percentage of "remote" transactions.

Executing requests

DORA routes all the actions that intend to operate on the same dataset to one executor. The executor is responsible for maintaining isolation and ordering across conflicting actions. In this sub-section, we describe how DORA executes transactions while avoiding centralized locking and maintaining transaction isolation; a detailed example of the execution of a transaction in DORA is given in the following sub-section (Section 6.3.1).

To maintain isolation and ordering across actions, each executor has three data structures associated with it: a queue of incoming actions, a queue of completed actions, and a thread-local lock table. The actions are processed in the order they enter the incoming queue. To detect conflicting actions the executor uses the local lock table. The conflict resolution happens at the action-identifier level. That is, the inputs to the local lock table are action identifiers. The local locks have only two modes, shared and exclusive. Since the action identifiers may cover only a subset of the routing fields, the locking scheme employed is similar to that of key-prefix locks [Gra07a]. Once an action acquires the local lock, it can proceed without centralized concurrency control.

In regular transaction processing under strict two-phase locking, each transaction releases every lock it acquired after the commit (or abort) log record has been flushed to disk. Similarly, in DORA each transaction holds the local locks it acquired (through the actions it enqueued on every executor) until the transaction commits (or aborts) globally. That is, at the terminal RVP, each transaction first waits for a response from the underlying storage manager that the log flush has completed, which means that the commit (or abort) has completed as well. Then, it enqueues all the actions that participated in the transaction to the completion queues of their executors. Each executor removes entries from its local lock table as actions complete, and serially executes any blocked actions which can now proceed.
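The C++ sketch below pulls the pieces just described together for a single executor: an incoming queue of actions, a thread-local lock table keyed by action identifiers with shared/exclusive modes, a list of blocked actions, and an RVP counter that is decremented as actions complete. It is a single-threaded, hypothetical simulation under simplifying assumptions (no storage-manager calls, no commit/abort path, one local-lock holder per identifier); it is not the prototype's implementation, only an illustration of the mechanism.

// Minimal single-executor simulation of the machinery described above:
// incoming queue, thread-local lock table keyed by action identifiers,
// blocked list, and RVP counters. Illustrative assumptions throughout; the
// real executors run on their own threads and call into the storage manager.
#include <atomic>
#include <cstdio>
#include <deque>
#include <map>
#include <string>

enum class LockMode { Shared, Exclusive };

struct RVP {
    std::atomic<int> pending;                // actions still to report
    explicit RVP(int n) : pending(n) {}
};

struct Action {
    std::string key;                         // action identifier (routing-field values)
    LockMode    mode;
    RVP*        rvp;                         // RVP of this action's phase
};

class Executor {
    std::deque<Action>              incoming;
    std::deque<Action>              blocked;       // waiting on a local lock
    std::map<std::string, LockMode> local_locks;   // thread-local lock table

    bool conflicts(const Action& a) const {
        auto it = local_locks.find(a.key);
        if (it == local_locks.end()) return false;
        return it->second == LockMode::Exclusive || a.mode == LockMode::Exclusive;
    }

public:
    void enqueue(const Action& a) { incoming.push_back(a); }

    // Serve actions in FIFO order; conflicting actions are parked until the
    // holding transaction commits and its local locks are released.
    void run_once() {
        while (!incoming.empty()) {
            Action a = incoming.front();
            incoming.pop_front();
            if (conflicts(a)) { blocked.push_back(a); continue; }
            local_locks[a.key] = a.mode;      // acquire the local lock
            // ... perform the record accesses without centralized locking ...
            if (a.rvp->pending.fetch_sub(1) == 1) {
                // Last action of the phase: this executor would now enqueue the
                // next phase's actions, or call for commit at the terminal RVP.
                std::puts("phase complete");
            }
        }
    }

    // Invoked via the completion queue after the transaction commits or
    // aborts globally: drop the local lock and retry blocked actions.
    void release(const std::string& key) {
        local_locks.erase(key);
        incoming.insert(incoming.end(), blocked.begin(), blocked.end());
        blocked.clear();
        run_once();
    }
};

int main() {
    RVP rvp_a(1), rvp_b(1);                   // one single-action phase per transaction
    Executor districts;                       // executor owning a District dataset
    districts.enqueue({"WH1.D3", LockMode::Exclusive, &rvp_a});   // action of txn A
    districts.enqueue({"WH1.D3", LockMode::Exclusive, &rvp_b});   // txn B, conflicts with A
    districts.run_once();                     // A's action runs, B's action blocks
    districts.release("WH1.D3");              // A commits globally; B's action now runs
    return 0;
}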
Each executor implicitly holds an intent exclusive (IX) lock for the whole table, and does not have to interface with the centralized lock manager in order to re-acquire it for every transaction. Transactions that intend to modify large data ranges which span multiple datasets or cover the entire table (e.g. a table scan, or an index or table drop) enqueue an action to every executor operating on that particular table. Once all the actions are granted access, the "multi-partition" transaction can proceed. In transaction processing workloads such operations already hamper concurrency, and therefore occur rarely in scalable applications.

In addition, in order to avoid frequent interaction with the metadata manager, each executor caches the metadata information that it needs for accessing the datasets it has been assigned. Caching the metadata information on the executor thread has a similar effect to passing information via agent threads, which we described in Section 5.3.1.

Detailed transaction execution example

In this sub-section we describe in detail the execution of one TPC-C Payment transaction, our running example, whose transaction flow graph is shown in Figure 6.8. Figure 6.9 shows the execution flow in DORA. Each circle is color-coded to depict the worker thread (executor or dispatcher) which executes that step. In total there are 12 steps for executing this transaction:

Figure 6.9: Execution example of the TPC-C Payment transaction in DORA.

Step 1. The execution of the transaction starts from the thread that receives the request (e.g. from the network). That thread enqueues the actions of the first phase of the transaction to the corresponding executors. As we see from Figure 6.8, the first phase of Payment consists of three actions that are enqueued to the corresponding Warehouse, District, and Customer executors.

Step 2. The executor consumes actions enqueued to its incoming queue in first-come-first-served order. Once an action reaches the head of the queue, it is picked up by the executor.

Step 3. Each executor probes its local lock table to determine whether it can process the action it is currently serving. If there is a logical lock conflict with a previous action, the action is added to a list of blocked actions. Its execution will resume once the transaction whose action blocks this particular action finishes. Otherwise, the executor executes the action without system-wide concurrency control.

Step 4. Once the action is completed (with a set of operations in the underlying storage manager, without system-wide concurrency control), the executor decrements the counter of the RVP of the first phase (RVP1).

Step 5. If it is the last action to report to the RVP, the executor of the action that zeroed the RVP initiates the next phase by enqueueing the corresponding (single) action to the History table executor.

Step 6. The History table executor follows the same routine, picking the action from the head of its incoming queue.

Step 7. The History table executor probes its local lock table.

Step 8. The Payment transaction inserts a record into the History table and, for a reason we explain in Section 6.3.2, the execution of that action needs to interface with the system-wide (centralized) lock manager.

Step 9. Once the action is completed, the History executor updates the terminal RVP and calls for the transaction commit.
Step 10. When the underlying storage engine returns from the system-wide commit (with the log flush and the release of any centralized locks), the History executor enqueues the identifiers of all the actions back to their executors.

Step 11. The executors pick up the committed action identifiers.

Step 12. The executors remove the entries from their local lock tables, and search the list of pending actions for actions which may now proceed.

The detailed execution example, and especially steps 9-12, shows that the commit operation in DORA is similar to the two-phase commit protocol [GR92], in the sense that the thread that calls the commit (the "coordinator" in 2PC) also sends messages to the various executors (the "participants" in 2PC) to release the local locks. The main difference from traditional two-phase commit is that the messaging happens asynchronously and that the participants do not have to vote. Since all the modifications are logged under the same transaction identifier, there is no need for additional messages and log inserts (the separate "Prepare" and "Commit" messages and records of 2PC). That is, the commit is a one-off operation in terms of logging, but it still involves the asynchronous exchange of a message from the coordinator to the participants for the thread-local locking.

This example shows how DORA converts the execution of each transaction into a collective effort of multiple threads. Also, it shows how DORA minimizes the interaction with the contention-prone centralized lock manager, at the expense of additional inter-core communication bandwidth.

6.3.2 Challenges

In this section we describe three challenges in the DORA design. Namely, we describe how DORA handles record inserts and deletes, how it executes secondary actions, and how it avoids deadlocks.

Record inserts and deletes

Record probes and updates in DORA require only the local locking mechanism of each executor. However, there is still a need for centralized coordination across concurrent record inserts and deletions (executed by different executors) for their accesses to specific page slots. That is, it is safe to delete a record without centralized concurrency control with respect to any reads of this record, because all the probes will be executed serially by the executor responsible for that dataset. But there is a problem with record inserts by other executors. The following interleaving of operations by transaction T1, executed by executor E1, and transaction T2, executed by executor E2, can cause a problem: T1 deletes record R1. T2 probes the page where record R1 used to be and finds its slot free. T2 inserts its record. T1 then aborts. The rollback fails because it is unable to reclaim the slot which T2 now uses. This is a physical conflict (T1 and T2 do not intend to access the same data) which row-level locks would normally prevent and which DORA must address.

To avoid this problem, the insert and delete record operations lock the record id (RID), and along with it the accompanying slot, through the centralized lock manager. Although the centralized lock manager can be a source of contention, typically the row-level locks that need to be acquired due to record insertions and deletions are not contended, and they make up only a fraction of the total number of locks a conventional system would acquire. For example, when executing the Payment transaction DORA needs to acquire only one such lock (for inserting the History record), out of the 19 a conventional system normally acquires.
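A compact way to see the rule is the C++ sketch below: probes and updates run under the executor's local locks alone, while deletes (and inserts that reuse a freed slot) additionally take a centralized row-level lock on the RID, so a slot freed by an uncommitted delete cannot be reused. The stub lock manager and all names are illustrative assumptions, not the storage manager's interface.

// Sketch of the insert/delete rule: probes and updates need only the
// executor's local locks, while deletes and inserts also lock the RID (and
// with it the page slot) through the centralized lock manager, so an
// uncommitted delete can always reclaim its slot on rollback.
#include <cstdio>
#include <set>
#include <string>

struct CentralLockManager {                       // stand-in, not the real interface
    std::set<std::string> held;                   // RIDs locked by in-flight transactions
    bool try_lock(const std::string& rid) { return held.insert(rid).second; }
    void unlock(const std::string& rid)   { held.erase(rid); }
};

struct Executor {
    CentralLockManager* lm;

    void update(const std::string& rid) {
        // Probes and in-place updates are serialized by this executor's local
        // locks; no centralized concurrency control is involved.
        std::printf("update %s under local locking only\n", rid.c_str());
    }

    bool remove(const std::string& rid) {
        // Deletes lock the RID centrally so the freed slot stays reserved
        // until the deleting transaction commits or rolls back.
        if (!lm->try_lock(rid)) return false;
        std::printf("delete %s holding a centralized RID lock\n", rid.c_str());
        return true;
    }

    bool insert_into_slot(const std::string& rid) {
        // Inserts reusing a slot must also take the RID lock; if the slot was
        // freed by a still-uncommitted delete, the insert must wait or pick
        // another slot instead of creating a physical conflict.
        if (!lm->try_lock(rid)) return false;
        std::printf("insert into %s holding a centralized RID lock\n", rid.c_str());
        return true;
    }
};

int main() {
    CentralLockManager lm;
    Executor e1{&lm}, e2{&lm};                        // two executors of the same table
    e1.update("CUST.1.42");                           // no centralized lock needed
    e1.remove("CUST.1.7");                            // T1 deletes, slot stays locked
    if (!e2.insert_into_slot("CUST.1.7"))             // T2 cannot reuse the slot yet
        std::puts("slot still reserved by the uncommitted delete");
    lm.unlock("CUST.1.7");                            // T1 commits (or finishes rollback)
    return 0;
}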
This challenge with inserts and deletes arises because DORA employs partitioning only at the logical level, so some physical conflicts are unavoidable. In the next chapter (Chapter 7), we extend DORA's design to the physical layer. One of the benefits of extending the DORA design to the physical layer is that we eliminate the possibility of physical conflicts and do not have to acquire even those few centralized locks.

Secondary actions

The problem with secondary actions (Section 6.3.1) is that the system does not know which executor is responsible for their execution. To resolve this difficulty, the indexes whose accesses cannot be mapped to executors store the RID as well as all the routing fields at each leaf entry. The RVP-executing thread of the previous phase executes those secondary actions and uses the additional information to determine which executor should perform the access of the record in the heap file.

For example, consider a non-clustered secondary index on the last name of a Customers table and a DORA partitioning that does not use any of the fields of that index as the routing fields. In that case, all the accesses to that secondary index are secondary actions. To access a Customer through that index, DORA follows these steps: (a) Any thread that completed the execution of the previous RVP (or the dispatcher thread) probes the secondary index under normal centralized concurrency control; (b) The probing thread retrieves the RIDs and the routing fields of the records that match the probing criteria; (c) The probing thread groups the matched RIDs according to the routing table and enqueues an action of a special type, carrying the list of RIDs to be accessed, to every partition that contains at least one involved RID; (d) Finally, when the executors of each involved partition dequeue such a special action, they consult their local lock table and proceed directly to the heap file to access the selected records.

Under this scheme uncommitted record inserts and updates are properly serialized by the executor, but deletes still pose a risk of violating isolation. Consider the interleaving of operations by transactions T1 and T2 using a primary index Idx1 and a secondary index Idx2 which is accessed by any thread. T1 deletes Rec1 through Idx1. T1 deletes the entry from Idx2. T2 probes Idx2 and returns not-found. T1 rolls back, causing Rec1 to reappear in Idx2. At this point T2 has lost isolation, because it saw the uncommitted (and eventually rolled back) delete performed by T1. To overcome this danger, we can add a 'deleted' flag to the entries of Idx2. When a transaction deletes a record, it does not remove the entry from the index; any transaction which attempts to access the record will go through its owning executor and find that it was, or is being, deleted. Once the deleting transaction commits, it goes back and sets the flag for each index entry of a deleted record, outside of any transaction. Transactions accessing secondary indexes ignore any entries with a deleted flag, and may safely re-insert a new record with the same primary key.
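The C++ sketch below ties steps (a)-(d) together with the deleted-flag handling for a non-aligned secondary index: probe the index under conventional concurrency control, skip flagged entries, group the matching RIDs by their routing-field value, and hand each group to the executor that owns the corresponding dataset. The leaf layout, the routing rule, and every name here are illustrative assumptions, not the prototype's code.

// Sketch of non-aligned secondary index handling: probe conventionally,
// ignore entries flagged as deleted, group matching RIDs per owning executor,
// and enqueue one special action per involved partition.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct LeafEntry {
    std::string last_name;   // secondary key
    long        rid;         // record id in the heap file
    int         warehouse;   // routing field stored in the leaf entry
    bool        deleted;     // set after the deleting transaction commits
};

struct RoutingRule {                        // value of routing field -> executor id
    int executor_for(int warehouse) const { return warehouse % 4; }
};

// Steps (a)+(b): probe the index and collect the RIDs and routing fields of
// matching, non-deleted entries. Step (c): group them per owning executor.
std::map<int, std::vector<long>>
probe_and_group(const std::vector<LeafEntry>& index,
                const std::string& key, const RoutingRule& rule) {
    std::map<int, std::vector<long>> per_executor;
    for (const LeafEntry& e : index)
        if (e.last_name == key && !e.deleted)
            per_executor[rule.executor_for(e.warehouse)].push_back(e.rid);
    return per_executor;
}

int main() {
    std::vector<LeafEntry> idx = {
        {"SMITH", 101, 1, false}, {"SMITH", 102, 6, false},
        {"SMITH", 103, 6, true},  {"JONES", 104, 2, false},
    };
    // Step (d): each involved executor receives one special action holding
    // the list of RIDs it must fetch from the heap file under its local locks.
    for (const auto& [executor, rids] : probe_and_group(idx, "SMITH", RoutingRule{}))
        std::printf("enqueue %zu RID(s) to executor %d\n", rids.size(), executor);
    return 0;
}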
Because deleted secondary index entries will tend to accumulate over time, we can modify the B-Tree's leaf-split algorithm to first garbage collect any deleted records before deciding whether a split is necessary. For growing or update-intensive workloads, this approach will avoid wasting excessive space on deleted records. If updates are very rare, there will be little potential wasted space in the first place.

In Section 6.4.5 we evaluate the performance of DORA with secondary actions. It is expected that the performance of DORA in workloads with very frequent secondary actions will not be optimal. However, DORA partitioning is only logical, and a DBA or the application designer can easily modify the partitioning to reduce the frequency of secondary actions. For example, in the scenario with the secondary index on the last name of Customers, a simple solution would be to add the last-name field to the set of routing fields. Adding an additional field to the routing fields would only increase the number of datasets (partitions), nothing more; at the same time, it would eliminate any secondary actions. In [PTJA11] and [TPJA11] we show that the cost of repartitioning in DORA is very small, while in [PTB+11] we present a tool that monitors the accesses of the database and alerts when it observes a high frequency of secondary actions.

Deadlock detection

DORA transactions can block on local lock tables. Hence, the storage manager must provide an interface for executors to propagate this information to the deadlock detector. DORA proactively reduces the probability of deadlocks. Whenever a thread is about to submit the actions of a transaction phase, it latches the incoming queues of all the executors it plans to submit to, so that the action submission appears to happen atomically. (There is a strict ordering between executors; threads acquire the latches in that order, avoiding deadlocks on the latches of the incoming queues of executors.) This ensures that transactions with the same transaction flow graph will never deadlock with each other if they are on the same phase. That is, two transactions with the same transaction flow graph which are on the same phase would deadlock only if their conflicting requests were processed in reverse order. But that is impossible, because the submission of the actions appears to happen atomically, the executors serve actions in FIFO order, and the local locks are held until the transaction commits. The transaction which enqueues its actions first will finish before the other.

6.3.3 Improving I/O and microarchitectural behavior

We exploit the regularity and predictability of the accesses in DORA only in order (a) to reduce the interaction with the centralized lock manager and hence reduce the number of expensive latch acquisitions and releases, and (b) to improve single-thread performance by replacing the execution of the expensive lock manager code with a much lighter-weight thread-local locking mechanism. But the potential of the DORA execution does not stop there. Potentially, DORA's predictable access patterns can be exploited to improve both the I/O and the microarchitectural behavior of OLTP.

In particular, the I/O executed during conventional OLTP is random and low-performing. (As evidence, the performance of conventional OLTP systems improves significantly with the use of Flash-based storage technologies, which exhibit high random-access bandwidth [LMP+08].) The DORA executors can buffer I/O requests and issue them in batches, since those I/Os are expected to target pages that are physically close to each other, improving the I/O behavior.
Furthermore, the main characteristic of the micro-architectural behavior of conventional OLTP systems is the very large volume of shared read-modify accesses by multiple processing cores [BW04], accesses which, unfortunately, are also highly unpredictable [SWH+04]. Due to these two reasons, emerging hardware technologies such as reactive distributed on-chip caches (e.g. [HFFA09, BW04]) and/or the most advanced hardware prefetchers (e.g. [SWAF09]) fail to significantly improve the performance of conventional OLTP. Since DORA's design is based on the premise that the majority of the accesses to a specific data region come from a specific thread, we expect a friendlier behavior which can realize the full potential of the latest hardware developments by providing more private and predictable memory accesses.

6.3.4 Prototype Implementation

In order to evaluate the DORA design, we implemented a prototype DORA OLTP engine over our baseline system, the Shore-MT storage manager [JPH+09] (see Section 5.2). Shore-MT is a modified version of the SHORE storage manager [CDF+94] with a multi-threaded kernel. SHORE supports all the major features of modern database engines: full transaction isolation, hierarchical locking, a CLOCK buffer pool with replacement and prefetch hints, B-Tree indexes, and ARIES-style logging and recovery [MHL+92]. We use Shore-MT because it has been shown to scale better than any other open-source storage engine.

Our prototype does not have an optimizer which transforms regular transaction code into transaction flow graphs. Thus, all transactions are partially hard-coded. The database metadata and back-end processing are schema-agnostic and general-purpose, but the code is schema-aware. This arrangement is similar to the statically compiled stored procedures that commercial engines support, converting annotated C code into a compiled object that is bound to the database and directly executed. For example, for maximum performance, DB2 allows developers to generate compiled "external routines" in a shared library for the engine to dlopen and execute directly within the engine's core (see http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp).

The prototype is implemented as a layer over Shore-MT. Shore-MT's sources are linked directly to the code of the prototype. Modifications to Shore-MT were minimal. We added an additional parameter to the functions which read or update records, and to the index and table scan iterators; this flag instructs Shore-MT not to use concurrency control. Shore-MT already has a built-in option to access some resources without concurrency control. In the case of record inserts and deletes, another flag instructs Shore-MT to acquire only the row-level lock and to avoid acquiring the whole lock hierarchy.

Even though, according to DORA's design, a single executor (worker thread) can be assigned multiple datasets from various tables, in the prototype we assign only a single dataset (from a single table) to each executor. Thus, if a transaction, like TPC-C Payment, accesses four tables, it is going to be handled by at least four different executor threads.
That impacts the performance of DORA at low core counts, as we will see in Section 6.4.7 where we evaluate the performance of DORA on a machine with limited hardware parallelism.

6.4 Performance Evaluation

For the evaluation we use one of the most parallel multicore machines available and we compare against Shore-MT, which we label as Baseline. Shore-MT's current performance and scalability make it one of the first systems to face the contention problem on commodity chip multicores. As hardware parallelism increases and transaction processing systems solve other scalability problems, they are expected to similarly face the problem of contention in the lock manager. Our evaluation covers several areas:

• We measure how effectively DORA reduces the interaction with the centralized lock manager and what the impact on performance is (Section 6.4.2).
• We quantify how DORA exploits the intra-transaction parallelism of transactions (Section 6.4.3).
• We compare the peak performance Shore-MT and DORA achieve if a perfect admission control mechanism is used (Section 6.4.4).
• We evaluate the performance of DORA on secondary index accesses that can be either aligned with the partitioning scheme or not (Section 6.4.5).
• We evaluate the performance on complicated transactions with joins (Section 6.4.6).
• We compare how the two systems behave on hardware with limited parallelism (Section 6.4.7).
• We compare the anatomy of critical sections for baseline Shore-MT and DORA (Section 6.4.8).

6.4.1 Experimental Setup and Workloads

Hardware. We perform all our experiments on a Sun T5220 "Niagara II" box configured with 32GB of RAM and running Sun Solaris 10. The Niagara II chip [JN07] contains 8 cores, each capable of supporting 8 hardware contexts, for a total of 64 "OS-visible" CPUs. Each core has two execution pipelines, allowing it to simultaneously process instructions from any two threads. Thus, the chip can process up to 16 instructions per machine cycle, using the many available contexts to overlap delays in any one thread.

I/O Subsystem. When running OLTP workloads on the Sun Niagara II machine, both the baseline Shore-MT system and the DORA prototype are capable of high performance. The demand on the I/O subsystem scales with throughput due to dirty page flushes and log writes. For the random I/O generated, hundreds or even thousands of disks may be necessary to meet the demand. (As an example, consider the top results on the TPC-C OLTP benchmark, at http://www.tpc.org/tpcc/; all of them use I/O subsystems worth hundreds of thousands or millions of dollars.) Given our limited budget and our interest in the behavior of the systems when a large number of hardware contexts are utilized, we store the database and the log on an in-memory file system. This setup exercises all the codepaths of the storage manager yet allows us to saturate the CPU. In addition, preliminary experimentation using high-performing Flash drives indicates that the relative behavior remains the same.

Workloads. We use transactions from three OLTP benchmarks: Nokia's Network Database Benchmark or TATP [NWMR09] (formerly known as TM1), TPC-C [TPC07], and TPC-B [TPC94, A+85]. Business intelligence workloads, such as the TPC-H benchmark [TPC06], spend a large fraction of their time on computations outside the storage engine, imposing little pressure on the transaction processing system components, such as the lock manager. Hence, they are not an interesting workload for this study.

The TATP benchmark consists of seven transactions, operating on four tables, implementing various operations executed by mobile networks.
Three of the transactions are read-only while the other four perform updates. The transactions are extremely short, yet exercise all the codepaths in typical transaction processing. Each transaction accesses only 1-4 records, and must execute with low latency even under heavy load. We use a database of 5M subscribers (∼7.5GB).

The TPC-C benchmark models an OLTP database for a retailer. It consists of five transactions that follow customer orders from creation to final delivery and payment. We set the buffer pool to 4GB and use a TPC-C database of scaling factor 150, a database with 150 Warehouses, which occupies around 20GB on disk. 150 Warehouses can support enough concurrent requests to saturate the machine, but the database is still small enough to fit in the in-memory file system.

The TPC-B benchmark models a bank where customers deposit to and withdraw from their accounts. We use a TPC-B database of scaling factor 100, a database with 100 Branches, which occupies 2GB on disk and fits entirely in the buffer pool.

For each run, the driver code spawns a certain number of clients and the clients start submitting transactions. Although the clients run on the same machine as the rest of the system, they add only a small overhead (<3%). We repeat the measurements multiple times, and the measured relative standard deviation is less than 5%. We compile the sources using the highest level of optimization options of Sun's CC v5.10 compiler (see http://developers.sun.com/sunstudio/documentation/ss12u1/mr/READMEs/c++.html). For measurements that needed profiling, we used tools from the Sun Studio 12 suite (see http://download.oracle.com/docs/cd/E19205-01/821-0304/). The profiling tools impose a certain overhead (∼15%) but the relative behavior between the two systems remains the same.

6.4.2 Eliminating Contention in the Lock Manager

First, we examine the impact of contention on the lock manager for the Baseline system and DORA as they utilize an increasing number of hardware resources. The workload for this experiment consists of clients repeatedly submitting GetSubscriberData transactions of the TATP benchmark. The results are shown in Figure 6.2 and Figure 6.3. Figure 6.2 shows the throughput per CPU utilization of the two systems on the y-axis as the CPU utilization increases. Figure 6.3 shows the time breakdown for each of the two systems. We can see that the contention in the lock manager becomes the bottleneck for the Baseline system, growing to more than 85% of the total execution time. In contrast, for DORA the contention on the lock manager is eliminated. We can also observe that the overhead of the DORA mechanism is small: much smaller than the centralized lock manager operations it eliminates, even when those are uncontended.

It is worth mentioning that GetSubscriberData is a read-only transaction. Yet the Baseline system suffers from contention in the lock manager. That is because threads contend even if they want to acquire the same lock in a compatible mode; acquiring any database lock, even in a compatible mode, needs synchronization.

Next, we quantify how effectively DORA reduces the interaction with the centralized lock manager and the impact on performance. We measure the number of locks acquired by the Baseline and DORA. We instrument the code to report the number and the type of the acquired locks.
Figure 6.10 shows the number of locks acquired per 100 transactions when the two systems execute transactions from the TATP and TPC-B benchmarks, as well as TPC-C OrderStatus transactions. The locks are categorized into three types: the record-level (row-level) locks, the locks of the centralized lock manager that are not at the record level (labeled higher-level), and the thread-local locks DORA uses.

Figure 6.10: Absolute number of locks acquired, categorized by type, when Baseline and DORA execute 100 transactions from various workloads.

In typical OLTP workloads the contention for the row-level locks is limited, because there is a very large number of randomly accessed records. But as we go up in the hierarchy of locks, we expect the contention to increase. For example, every transaction needs to acquire intention locks on the tables it is going to access. Figure 6.10 shows that DORA has only minimal interaction with the centralized lock manager.

Figure 6.10 also gives an idea of how those three workloads behave. TATP consists of extremely short-running transactions. For their execution the conventional system acquires as many higher-level locks as row-level locks. In TPC-B, the ratio of row-level to higher-level locks acquired is 2:1. Consequently, we expect the contention on the lock manager of the conventional system to be smaller when it executes the TPC-B benchmark than TATP. The conventional system is expected to scale even better when it executes TPC-C OrderStatus transactions, which have an even larger ratio of row-level to higher-level locks.

Figure 6.11 confirms our expectations. We plot the performance of both systems on the three workloads. The x-axis is the offered CPU load. We calculate the offered CPU load by adding to the measured CPU utilization the time the threads spend in the runnable queue waiting for a processor to run. We see that the Baseline system experiences scalability problems, most pronounced in the case of TATP. DORA, on the other hand, scales its performance as far as the hardware resources allow. When the offered CPU load exceeds 100%, the performance of the conventional system collapses in all three workloads. This happens because the operating system needs to preempt threads, and in some cases it happens to preempt threads that are in the middle of contended critical sections. The performance of DORA, on the other hand, remains high; further proof that DORA reduces the number of contended critical sections.

Figure 6.11: Performance of Baseline and DORA for the TATP and TPC-B benchmarks, as well as for TPC-C OrderStatus transactions, as load increases along the x-axis, with throughput shown on the y-axis. DORA consistently achieves higher throughput, exhibiting almost linear scalability up to the 64 available hardware contexts.

Figure 6.4 shows the detailed time breakdown for the two systems at 100% CPU utilization for the TATP benchmark and the TPC-C OrderStatus transactions.
DORA outperforms the Baseline system in OLTP workloads independently of whether the lock manager of the Baseline system is contended or not.

6.4.3 Intra-transaction Parallelism

DORA exploits intra-transaction parallelism not only as a mechanism for reducing the pressure on the contended centralized lock manager, but also for improving response times when the workload does not saturate the available hardware. Exploiting intra-transaction parallelism is useful in several cases. For example, it can be useful for applications that exhibit limited concurrency due to heavy contention for logical locks, or for organizations that simply do not utilize their available processing power. A transactional application can exhibit limited concurrency either because it is poorly written (e.g. all transactions update the same few records) or because the database is large enough that the system is I/O-bound. If the system is I/O-bound, exploiting intra-transaction parallelism improves performance because the system issues multiple requests in parallel, which can improve I/O bandwidth.

Figure 6.12: Single-transaction response times. DORA exploits the intra-transaction parallelism, inherent in many workloads, to achieve faster responses.

Response times of intra-parallel transactions

In the experiment shown in Figure 6.12 we compare the average response time per request achieved by the Baseline system and DORA, when a single client submits intra-parallel transactions from the three workloads and the log resides in an in-memory file system. DORA exploits the available intra-transaction parallelism of the transactions and achieves lower response times. For example, TPC-C NewOrder transactions are executed 2.1x faster under DORA. In badly designed applications where some records are extremely "hot", DORA's ability to exploit intra-transaction parallelism will immediately provide a significant performance boost.

Intra-transaction parallelism with aborts

One challenge with intra-transaction parallelism for DORA is transactions with non-negligible abort rates. For example, one of the characteristics of the TATP benchmark is that a large fraction of transactions (around 25%) need to abort due to invalid inputs. In such workloads, DORA may end up executing actions from already-aborted transactions, wasting useful cycles, extending the critical path, and eventually performing poorly.

There are two execution strategies DORA can follow for such intra-parallel transactions with high abort rates. The first strategy is to execute such transactions in parallel and to check frequently for aborts. The second is to serialize their execution. That is, even though there is an opportunity to execute actions from such transactions in parallel, DORA can be pessimistic and execute them serially. This strategy ensures that if an action aborts, no work is wasted by the execution of any other parallel action.

Figure 6.13: Performance when executing the UpdateSubscriberData transaction of TATP, a transaction with a high abort rate. In such workloads DORA should be pessimistic and prefer serial transaction flow graphs.
Figure 6.13 compares the throughput of the Baseline system and two variations of DORA, one with a parallel and one with a serial execution strategy, when an increasing number of clients repeatedly submit UpdateSubscriberData transactions from the TATP benchmark. This transaction, whose parallel and serial transaction flow graphs are depicted on the right side of the figure, consists of two independent actions. One action attempts to update a Subscriber and always succeeds. The other action attempts to update a corresponding SpecialFacility entry and succeeds only 62.5% of the time, failing the rest of the time due to wrong input. The parallel execution is labeled DORA-P, while the serial execution, which first attempts to update the SpecialFacility and only if that succeeds tries to update the Subscriber, is labeled DORA-S. As we can see, the parallel plan is a bad choice for this workload. DORA-P achieves lower performance than even the Baseline, whereas DORA-S scales almost linearly, as expected.

The DORA resource manager monitors the abort rates of entire transactions and of individual actions in each executor, and can adapt to them. For example, when the abort rates are high, DORA can switch to serial execution plans. A simple way to convert an intra-parallel execution plan to a serial one is to insert empty rendezvous points between actions of the same phase of the parallel plan. The higher the abort rate of a specific action, the sooner it should be executed in the serial plan.

Figure 6.14: Maximum throughput achieved by Baseline and DORA for various workloads when a perfect admission control mechanism is applied. The number above each bar is the CPU utilization where the peak occurs. The light bars, labeled Baseline-Ideal and DORA-Ideal, show the projected throughput if the machine were fully utilized and the system kept scaling at the same rate as when it achieved its peak throughput.

6.4.4 Maximizing Throughput

Admission control can limit the number of outstanding transactions and, in turn, limit contention within the lock manager of the system. Properly tuned, admission control allows the system to achieve the highest possible throughput, even if it means leaving the machine underutilized. In Figure 6.14 we compare the maximum throughput Baseline and DORA achieve if the systems employ perfect admission control. For each system and workload we report the CPU utilization at which this peak throughput is achieved. DORA achieves higher peak throughput for all the transactions we study, and this peak is achieved closer to the hardware limits. With the light bars, labeled Baseline-Ideal and DORA-Ideal, we plot the ideal projected throughput if the machine were fully utilized and each system kept scaling at the same rate as when it achieved its peak throughput. In most cases the projected ideal performance of the Baseline system is lower than DORA's. This happens because DORA substitutes the complex heavy-weight lock manager with a much lighter-weight one.

For the TPC-C and TPC-B benchmarks, DORA achieves relatively smaller improvements. This happens for two reasons. First, those transactions do not expose the same degree of contention within the lock manager, and leave little room for improvement.
Second, some of the transactions (NewOrder and Payment of TPC-C, and the TPC-B transaction) impose great pressure on the log manager, which becomes the new bottleneck.

6.4.5 Secondary index accesses

Non-clustered secondary indexes are pervasive in transaction processing, since they are the only means of speeding up transactions that access records through non-primary-key columns. For example, consider a database table with Customers and a transaction which accesses those Customers by their last name, which is not the primary key of this table. It is absolutely necessary to have a secondary index on the last names; otherwise, every Customer retrieval by last name needs to scan the entire heap file. As we already discussed in Section 6.3.1 and Section 6.3.2, secondary index accesses pose several challenges to the DORA design. We explore some of them in the next two subsections, where we break the analysis of secondary index accesses into two cases: when the secondary index is aligned with the partitioning scheme and when it is not.

To investigate the impact of non-clustered secondary index accesses, we conduct an experiment where we modify the GetSubscriberData transaction of the TATP benchmark to perform a range scan on the secondary index with the names of the Subscribers, and we control the number of matched records. In the original version of the transaction only one Subscriber is found. In the modified version, we probe for 1, 10, 100 and 1000 Subscribers, even though index scans for thousands of records are not typical in high-throughput transactional workloads.

Partitioning-aligned range index scans

We first consider the case where a secondary index is aligned with the partitioning scheme, that is, the case where the secondary index columns are a subset of the routing columns. In that case, a secondary index scan may return a large number of matched RIDs (record ids of entries that match the selection criteria) from several partitions. All the executors need to send the probed data to an RVP where an aggregation of the partial results takes place. As the range of the index scans becomes larger (or the selectivity drops), more data need to be sent to the RVPs, potentially causing a bottleneck due to excessive data transfers.

Figure 6.15 compares the performance of Baseline and DORA as an increasing number of clients (on the x-axis) repeatedly submit the transaction with the index scan, and the scanned index is aligned with the partitioning scheme. DORA improves performance by 101%, 82%, 78%, and 43% for ranges of 1, 10, 100 and 1000 records respectively. DORA's improvement gets smaller as the range of the index scan increases for two reasons. First, as a larger number of records are accessed per index scan, the ratio of high-level locks to row-level locks decreases. Consequently, the contention for hot locks in the Baseline system decreases and its performance improves. In addition, the data transfers impose a non-negligible overhead on DORA, as a larger number of records need to be sent to the RVP for aggregation.
Figure 6.15: Performance of Baseline and DORA as the range of a (non-clustered) secondary index scan which is aligned with DORA's partitioning increases. As the range of the index scan increases, more than one DORA partition is accessed and more data need to be sent to the RVP.

Still, as long as the scans of partitioning-aligned secondary indexes are selective and touch a relatively small number of records, DORA provides a significant performance improvement. Transactions that touch tens of thousands of records through index scans are not common in scalable transactional applications.

Non partitioning-aligned range scans

Next, we consider the case where a secondary index is not aligned with the partitioning scheme. We already detailed the drawbacks of this case and how DORA handles it in Section 6.3.2.

Figure 6.16: Performance of Baseline and DORA as the range of a (non-clustered) secondary index scan which is not aligned with DORA's partitioning increases. For the non-aligned index accesses DORA needs to do additional work per record accessed.

Figure 6.16 compares the performance of Baseline and DORA as an increasing number of clients (on the x-axis) repeatedly submit the transaction with the index scan, but this time the index is not aligned with the partitioning scheme. In that case, DORA improves performance by 121% and 13% when 1 and 10 records are accessed respectively. On the other hand, when 100 and 1000 records are accessed, DORA is 17% and 31% slower than the Baseline. We note that DORA exhibits intra-transaction parallelism for non-aligned secondary index accesses, since one thread does the secondary index access and another thread does the record access in the heap file. This is evident in all range sizes, as long as the number of concurrent clients is less than around 30.

Non-aligned secondary index accesses impose significant overhead on DORA. On top of the reasons that close the performance gap between DORA and Baseline, which we discussed in the previous subsection, the extra overhead comes from the extra work needed for the record probes. In particular, each record probe is a two-step process: the secondary index probe is done conventionally by one thread, which then requests that the appropriate executor threads retrieve the selected records. Whereas a conventional system would access the records in the heap file directly through their RIDs right after the secondary index probe, in DORA a set of packets is constructed and sent to the appropriate partition-owning threads, which access the records in the heap file. On top of the extra cycles spent there, in DORA we increase the size of the index by appending to each leaf entry the routing fields of each record. Nevertheless, the benefits of DORA are substantial even in such "non-friendly" workloads with secondary actions, as long as they probe for a limited number of records (tens to low hundreds).
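As a rough illustration of this two-step access path, the sketch below shows only the dispatch step; the types and the owner_of callback are illustrative assumptions, not the actual Shore-MT/DORA interfaces. The thread that probed the secondary index reads the routing fields stored in each leaf entry and forwards a record-fetch request, tagged with the rendezvous point, to the executor that owns the corresponding partition.

```cpp
// Hypothetical sketch of DORA's handling of a secondary index that is not
// aligned with the partitioning: one thread probes the index conventionally,
// then ships the matched RIDs to the partition-owning executors.
#include <cstdint>
#include <functional>
#include <vector>

struct Rid { uint32_t page; uint16_t slot; };     // record id in the heap file

struct SecIndexEntry {
    Rid      rid;          // where the record lives
    uint64_t routing_key;  // routing fields appended to every leaf entry
};

struct FetchRecordAction { Rid rid; int rvp_id; };

struct Executor {
    std::vector<FetchRecordAction> inbox;          // stand-in for an input queue
    void enqueue(FetchRecordAction a) { inbox.push_back(a); }
};

// The probing thread packages the matches into actions; each owning executor
// later fetches its records from the heap file and reports to the RVP.
void dispatch_record_fetches(const std::vector<SecIndexEntry>& matches,
                             int rvp_id,
                             const std::function<Executor&(uint64_t)>& owner_of) {
    for (const SecIndexEntry& e : matches)
        owner_of(e.routing_key).enqueue({e.rid, rvp_id});
}
```

A conventional engine would dereference the RIDs directly after the probe; the extra enqueue/dequeue round trip and the routing fields carried in the leaf entries are exactly the overheads measured in this experiment.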
On the other hand, a DBA can always modify DORA's logical partitioning to eliminate any problematic secondary actions, if necessary.

6.4.6 Transactions with joins

With the next experiment we quantify how DORA performs on transactions with joins. Joins are particularly challenging for DORA because they involve records coming from two tables, which in DORA are accessed by at least two different threads. That means that a possibly significant number of records needs to be transferred from one thread to the other, through an RVP. As the amount of data transferred increases, we expect the performance of DORA to decrease.

To quantify the performance on transactions with joins, we slightly modified the StockLevel transaction from the TPC-C benchmark, which contains a join between Orderlines and Stocks. In the default version of the transaction, 200 Orderlines join with an equal number of Stocks. We modify the transaction to control the number of Orderlines joined, from 20 to 20000, even though transactions joining tens of thousands of records are not common in transactional workloads. The join is executed as a nested-loops join. That is, the Orderlines table is index-scanned for matching records, which are sent to the RVP, and then the Stocks table is probed for joining records.

Figure 6.17: Performance comparison between baseline and DORA on a slightly modified version of the TPC-C StockLevel transaction, which is a transaction with a join, where we regulate the number of records joined. As more records are joined, DORA's performance gets lower.

Figure 6.17 plots the performance of DORA normalized to the performance of baseline for an increasing number of records joined. When only 20 tuples are joined, DORA is faster than baseline by about 25%. As an increasing number of tuples are joined, the performance gap closes. But it is only when tens of thousands of records are joined that baseline outperforms DORA. In particular, when 20000 tuples are joined, baseline is faster than DORA by 8%. Thus, DORA is also useful for transactional workloads that contain joins of a modest number of records. We revisit this experiment in the following chapter (Section 7.7.6), where we show even further improvements in transactions with nested-loops joins because the index probes are faster.

6.4.7 Limited hardware parallelism

As we already discussed in Section 1.4, the focus of this dissertation is not parallelism-constrained hardware. In the last part of the performance analysis section, we study the behavior of DORA on machines with limited hardware parallelism. To do that, we run transactions that exhibit intra-transaction parallelism on a machine where we control the number of available processing cores.

Figure 6.18: Performance of Baseline and DORA running intra-parallel transactions on a machine with limited hardware parallelism. DORA does not perform well at very low core counts, due to context switches and preemptions.

Figure 6.18 compares the performance of baseline and DORA when a single client repeatedly submits TPC-C Payment transactions (left) and the TPC-B transaction (right) on a machine where we control the number of available hardware cores, from 1 to 4.
Those two transactions exhibit intra-transaction parallelism, and in DORA several threads participate in their execution. In particular, Payment (whose transaction flow graph is shown in Figure 6.8) consists of four actions, out of which three execute in parallel in the first phase and one in the second; TPC-B also consists of four actions, all of them executing in parallel in one phase. We see that when the number of available cores at least matches the number of parallel actions (and in compliance with what we saw in Section 6.4.3), DORA achieves lower response times, proportional to the intra-transaction parallelism of each transaction. On the other hand, when the number of available cores is lower we observe different behavior. In the uni-processor case (one core available) DORA's performance is lower but comparable with baseline, whereas in the two-core case DORA under-performs.

There are two main reasons for DORA's drop in performance. The first is the additional work that needs to be done per transaction in DORA in order to execute the transaction as a data flow; additional work for each transaction that wastes the scarce processor cycles. For example, for each transaction DORA creates several actions and sends them to the various participating threads, which need to context switch in, dequeue their action, execute it, and context switch out. The second reason is the much higher number of preemptions, which cause convoys [BGMP79] and show up in the form of involuntary context switches.

Figure 6.19: Number of voluntary and involuntary context switches for Baseline and DORA when they run intra-parallel transactions on a machine with limited hardware parallelism. DORA does around an order of magnitude more context switches, which impact performance. Even worse, a fraction of the context switches are involuntary, which indicates preemptions.

Figure 6.19 plots the number of voluntary and involuntary context switches over the same duration of time for baseline and DORA when they run the same workloads as in Figure 6.18. We see that a very large number of context switches take place in DORA, almost an order of magnitude more than in baseline. Even worse, a significant fraction of them are involuntary context switches, which means that we have preemptions. A preemption may happen while a thread is inside a critical section. If the next scheduled thread also wants to execute the same critical section, it will have to wait idle, forming a convoy.

The aforementioned problems are related both to DORA's design and to a limitation of our prototype, where each executor thread is assigned datasets from a single database table. If we were able to assign datasets from multiple tables to a single executor thread, then we could have as many executor threads as the number of available cores, and there would be no need for context switches. But in that case the problem would shift to each executor thread and the scheduling decisions it would have to make. That is, if each executor thread serves requests from multiple queues, it would have to make "intelligent" decisions on which queue to serve first. The bottom line is that DORA under-performs in the case of limited hardware parallelism.
In some cases, such as when only 2 hardware cores are available, things can get bad because of preemptions and unintelligent scheduling decisions. Some of the problems would have been prevented by a more elaborate prototype implementation. In Section 6.5 we summarize DORA's weaknesses.

Figure 6.20: Anatomy of critical sections for the TATP UpdateLocation transaction. The radical redesign of DORA eliminates almost all the unscalable critical sections at the lock manager (and metadata manager) at the very small expense of increased message passing, which is point-to-point communication and belongs to the fixed type of critical sections.

6.4.8 Anatomy of critical sections

To conclude the performance evaluation section, Figure 6.20 compares the breakdown of critical sections for our running example, the TATP UpdateLocation transaction, for baseline (Shore-MT without the optimizations presented in Chapter 5) and DORA. We can see that the result of DORA's drastic change of the transaction execution model is a dramatic change in the anatomy of critical sections. At the expense of a few message passes, which belong to the fixed contention type, DORA eliminates nearly all the critical sections related to the centralized lock manager and metadata. Overall, DORA reduces the number of unscalable critical sections acquired by more than 75%. We also observe that two of the bigger remaining sources of critical sections are the log manager and the page latches. The critical sections of the log manager are taken care of by the composable log buffer inserts mechanism, presented in Section 5.4, while the page latches are addressed by the design presented in the following chapter (Chapter 7).

6.5 Weaknesses

Even though there is a fairly wide design space where data-oriented transaction execution outperforms conventional transaction processing, there is also a set of cases where DORA is not suitable. Next we summarize some of the limitations of DORA.

Applications that put less pressure on the storage manager

First of all, data-oriented execution is designed for high-performance transaction processing that imposes pressure on the internals of the database storage layer. Thus, certain classes of applications may not benefit from it, or may even get penalized. For example, for most of our evaluation we use the specialized TATP and TPC-B benchmarks instead of the more popular TPC-C. The reason is that with TPC-C the baseline system (Shore-MT) does not encounter any of the issues we try to address, and there is less room for improvement. Another example is business intelligence applications with large file scans or joins. In such workloads DORA may penalize performance, since it may require the transfer of large volumes of data between the participating threads (we showed an example of that in Section 6.4.6). It is common practice, however, to employ dedicated database engines (usually column-stores [SAB+ 05, BZN05]) for processing such business intelligence workloads.

Limited hardware parallelism

As Section 6.4.7 showed, DORA under-performs when the hardware parallelism is limited. The main source of problems comes from the data flow and the need for frequent context switches.
This problem is related both to DORA's design and to a limitation of our prototype, where each executor thread is assigned datasets from a single database table. If we were able to assign datasets from multiple tables to a single executor thread, then we could have as many executor threads as the number of available cores, and there would be no need for context switches. But in that case the problem would shift to each executor thread and the scheduling decisions it would have to make. That is, if each executor thread serves requests from multiple queues, it would have to make "intelligent" decisions on which queue to serve first.

From the graphs presented in Section 6.4.7 we observe that DORA under-performs when the available hardware parallelism is smaller than the number of tables touched in the workload. But the number of tables used in transactional workloads increases at a much lower rate than the rate at which multicore parallelism increases (there is no indication that Moore's Law will slow down in the near future, and the roadmaps of all the major processor lines predict increases in the number of cores per chip). For example, the TPC-E benchmark, the most recent transactional benchmark from TPC, which was introduced in early 2007, uses 33 tables [TPC10]. The TPC-C benchmark, TPC-E's predecessor and the de facto OLTP benchmark that has been in use since 1992, uses 9 tables [TPC07]. That is a 3.6x increase in the number of tables over a period of 15 years, not even close to the rate at which hardware parallelism increases. Therefore, if not now, then in the near future there will hardly be any hardware platform that does not have as much hardware parallelism as the number of tables used in a transactional workload. Thus, we predict that this limitation of DORA will not raise concerns in the future.

Non partitioning-aligned index accesses

DORA partitions each table using range-based partitioning on the keys of a specific subset of the columns of the table. The DBA, however, may have decided to build indexes (usually non-clustered secondary indexes) that do not contain the routing fields, the columns that DORA uses for the partitioning. We analyzed this case in Section 6.3.2 and evaluated how DORA behaves in this case in Section 6.4.5. As Figure 6.16 showed, such non-partitioning-aligned secondary indexes can be burdensome for DORA. To tackle this problem we take both proactive and reactive measures. As a proactive measure, we demonstrated a tool that helps the application developer and the DBA avoid very frequent such index accesses. This tool analyzes the workload and suggests a partitioning scheme (routing fields for each table) that tries to minimize the frequency of secondary actions [PTB+ 11]. As a reactive measure, the resource manager monitors the performance of the system and warns the DBA of sudden increases in the frequency of non-partitioning-aligned index accesses and drops in performance. The DBA can react by modifying the partitioning to eliminate any problematic secondary actions. Since DORA's partitioning is only logical, runtime modifications of the partitioning are possible and lightweight, in contrast with shared-nothing systems where repartitioning is expensive, since it involves the physical movement of data from one database instance to another [CJZM10, PJZ11].
Producing transaction flow graphs

As Section 6.3.1 described, the DORA runtime does not accept transactions in the form of a sequence of SQL queries. Instead, the transactions need to be analyzed and divided into smaller actions based on the data accessed in different parts of the transaction. These actions are represented as a directed graph (transaction flow graph) to capture the transaction flow and the dependencies among the actions. This representation also helps us exploit intra-transaction parallelism for the independent actions. However, it introduces the initial cost of identifying these actions. Analyzing transactions at runtime, even though possible, would increase the response time of the system, which would make the system design less appealing. On the other hand, an OLTP application usually has a limited number of transactions that execute at runtime and are heavily optimized; ad-hoc transactions are not frequent. In [PTB+ 11] we demonstrated a tool for DORA application developers. This tool automatically forms transaction flow graphs given a transaction's SQL statement.

6.6 Related Work

DORA improves the scalability of transaction processing systems by employing a thread-to-data assignment of work policy. The improvement mostly comes from converting the unscalable communication in the centralized lock manager to message passing and decentralized thread-local lock management. Locking overhead is a known problem even for single-threaded systems. Harizopoulos et al. [HAMS08] analyze the behavior of the single-threaded SHORE storage manager [CDF+ 94] running two transactions from the TPC-C benchmark. When executing the Payment transaction, the system spends 25% of its time on code related to logical locking, while with the NewOrder transaction it spends 16%. We corroborate those results and reveal the lurking problem of latch contention that makes the lock manager the system bottleneck when increasing the hardware parallelism.

Rdb/VMS [Jos91] is a parallel database system design optimized for the inter-node communication bottleneck. In order to reduce the cost of nodes exchanging lock requests over the network, Rdb/VMS keeps a logical lock at the node which last used it, until that node returns it to the owning node or a request from another node arrives. Cache Fusion [LSC+ 01], used by Oracle RAC, is designed to allow shared-disk clusters to combine their buffer pools and reduce accesses to the shared disk. Like DORA, Cache Fusion does not physically partition the data but distributes the logical locks. However, neither Rdb/VMS nor Cache Fusion handles the problem of contention. A large number of threads may access the same resource at the same time, leading to poor scalability. DORA ensures that the majority of resources are accessed by a single thread.

A conventional system could potentially achieve DORA's functionality if each transaction-executing thread held an exclusive lock on a region of records. The exclusive lock would be associated with the thread, rather than any transaction, and would be held across multiple transactions. Locks on separator keys [Gra07a] could be used to implement such behavior. Our work on speculative lock inheritance (SLI) [JPA09] (and Section 5.3.1) detects "hot" locks at run-time, and those locks may be held by the transaction-executing threads across transactions. SLI, similar to DORA, reduces the contention on the lock manager. However, it does not reduce the other overheads inside the lock manager.
Reducing lock contention with data-oriented execution has also been studied for data-stream operators [DAAEA09], by making threads delegate the work on some data to the thread that already holds the lock for that data and move on to the next operation in their queues.

Advancements in virtual machine technology [BDGR97] enable the deployment of shared-nothing systems on multicores. In shared-nothing configurations, the database is physically distributed and there is replication of both instructions and data. For transactions that span multiple partitions, a distributed consensus protocol needs to be applied. H-Store [SMA+ 07] takes the shared-nothing approach to the extreme by deploying a set of single-threaded engines that serially execute requests, avoiding concurrency control, while Jones et al. [JAM10] study a "speculative" locking scheme for H-Store for workloads with few multi-partition transactions. The complexity of coordinating distributed transactions [Hel07, DHJ+ 07] and the imbalances caused by skewed data or requests are significant problems for shared-nothing systems. DORA, by being shared-everything, is less sensitive to such problems and can adapt to load changes more readily.

Staged database systems [HSA05] share similarities with DORA. A staged system splits queries into multiple requests which may proceed in parallel. The splitting is operator-centric and designed for pipeline parallelism. Pipeline parallelism, however, has little to offer to typical OLTP workloads. On the other hand, similar to staged systems, DORA exposes work-sharing opportunities by sending related requests to the same queue.

DORA uses intra-transaction parallelism to reduce contention. Intra-transaction parallelism has been a topic of research for more than two decades (e.g. [GMS87, SLSV95]). Colohan et al. [CASM05] use thread-level speculation to execute transactions in parallel. They show the potential of intra-transaction parallelism, achieving up to 75% lower response times than a conventional system. Thread-level speculation, however, is a hardware-based technique not available in today's hardware. DORA also achieves lower response times by exploiting intra-transaction parallelism, but its mechanism requires only fast inter-core communication, which is already available in multicore hardware.

Finally, optimistic concurrency control schemes [KR81, BG83] may improve concurrency by resolving conflicts lazily at commit time instead of eagerly blocking them at the moment of a potential conflict. When conflicts are rare, this allows the system to avoid the overhead of enforcing database locks. On the other hand, if conflicts occur frequently, the performance of the system drops rapidly, since the transaction abort rate is high. There is a great body of work that compares the concurrency control schemes in database systems. Notable is the work by Agrawal et al. [ACL87], while the book of Bernstein et al. [BHG87] and Thomasian's survey [Tho98] are good starting points for the interested reader. The focus of DORA, on the other hand, is on the contention for accessing the locks rather than on the concurrency control scheme used. For example, in recent work, [LBD+ 12] presents a new flavor of lightweight multiversioning concurrency control for main-memory databases. This system applies lessons learned from our study on data-oriented execution and uses a decentralized lock manager.
6.7 Conclusion

The thread-to-transaction assignment of work of conventional transaction processing systems fails to realize the full potential of multicores. The resulting contention within the transaction processing system becomes a burden on scalability (usually expressed as a bottleneck in the lock manager). This chapter shows the potential of thread-to-data assignment to eliminate this bottleneck and improve both performance and scalability. As multicore hardware continues to stress scalability within the storage manager and as DORA matures, the gap with conventional systems will only continue to widen.

Chapter 7

Page Latch-free and Dynamically Balanced Shared-everything OLTP

Developments in transaction processing technology, such as those presented in the previous chapters, remove locking and logging from being scalability bottlenecks in transaction processing systems, leaving page latching as the next potential problem. To tackle the page latching problem, we design a system around physiological partitioning (PLP). PLP employs the data-oriented transaction execution model, maintaining the desired properties of shared-everything designs. In addition, it introduces a multi-rooted B+Tree index structure (MRBTree) that enables partitioning of the accesses at the physical page level. That is, logical partitioning (inherited from data-oriented execution), along with MRBTrees, ensures that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free. We extend the design to make heap page accesses thread-private as well. The elimination of page latching allows us to simplify key code paths in the system, such as B+Tree operations, leading to code that is more efficient and easier to maintain. The combination of data-oriented execution and MRBTrees also offers an infrastructure for quickly detecting load imbalances and easily repartitioning to adapt to load changes. We present one such lightweight dynamic load balancing mechanism (DLB) for PLP systems. Profiling of a prototype PLP system shows that it acquires 85% and 68% fewer contentious critical sections per transaction than an optimized conventional design and one based on logical-only partitioning, respectively. As a result, the PLP prototype improves performance by up to 50% and 25% over the existing systems on two multicore machines, while the dynamic load balancing mechanism provides the system with rapid and robust behavior in both detecting and handling load imbalances. This chapter is based on the work presented in VLDB 2011 [PTJA11] and in [TPJA11].

7.1 Introduction

Due to concerns over power draw and heat dissipation, processor vendors have stopped improving processors' performance by clocking them at higher operating frequencies or by using complicated micro-architectural techniques. Instead, they try to improve the overall chip performance by fitting as many independent processing cores as they can within a single chip's area. The resulting multicore designs shift the pressure to the software side for converting Moore's Law into performance. The software must provide enough execution parallelism to exploit the abundant and rapidly growing hardware parallelism. However, this is not an easy task, especially when there is a lot of resource sharing between the parallel threads of the application, because the accesses to those shared resources need to be coordinated.
On-line transaction processing (OLTP) is an important and particularly complex application with excessive resource sharing, which needs to perform efficiently in modern computing environments. It has been shown that conventional shared-everything OLTP systems may face significant scalability problems on highly parallel hardware [JPH+ 09]. There is increasing evidence that one source of scalability problems arises from the transaction-oriented policy of assigning work to threads in conventional systems [PJHA10]. According to this policy, a worker thread is assigned the execution of a transaction; the transaction, along with the physical distribution of records within the data pages, determines what resources (e.g. records and pages) each thread will access. The random nature of transaction processing requests leads to unpredictable data accesses [SWH+ 04, PJHA10] that complicate resource sharing and concurrency control. Such conventional systems are therefore pessimistic and clutter the transaction's execution path with many lock and latch acquisitions to protect the consistency of the data. These critical sections often lead to contention which limits scalability [JPH+ 09], and in the best case they impose significant overhead on single-thread performance [HAMS08]. In addition, the performance of shared-everything systems is sensitive to the application design due to the possibility of page false sharing effects, where hot but unrelated records happen to reside on the same page. Careful tuning and expensive DBAs are often needed to detect and resolve such issues, for example by padding the hot records to spread them out to different data pages.

Following a different approach, shared-nothing systems deploy a set of independent database instances which collectively serve the workload [Sto86, DGS+ 90]. In shared-nothing designs the contention for shared data resources can be explicitly tuned (the database administrator determines the number of processors assigned to each instance), potentially leading to superior performance. The H-Store [SMA+ 07] and HyPer [KN11] systems take this approach to the extreme, instantiating only a single software thread per database instance and eliminating critical sections altogether. However, shared-nothing systems physically partition the data and deliver poor performance when the workload triggers distributed transactions [Hel07, CJZM10] or when skew causes load imbalance [CJZM10]. Further, repartitioning to rebalance load requires the system to physically move and reorganize all affected data. These weaknesses become especially problematic as partitions become smaller and more numerous in response to the multicore hardware trend.

7.1.1 Multi-rooted B+Trees

To alleviate the difficulties imposed by page latching and repartitioning, we propose a new physical access method, a type of multi-rooted B+Tree called MRBTree. The root of each sub-tree in this structure corresponds to a logical partition of the data, and the mapping of key ranges to sub-tree roots forms a durable part of the index's metadata. Partition sizes are non-uniform, making the tree robust against skewed access patterns, and repartitioning is cheap because it involves very little data movement.
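The key-range mapping at the heart of the MRBTree can be pictured with the following simplified C++ sketch; the types and methods are assumptions for illustration, not the actual MRBTree code. It shows why routing a key is a single ordered-map lookup and why splitting a partition touches only the mapping, not the bulk of the data.

```cpp
// Simplified sketch of the MRBTree partition table: an ordered map from the
// lower bound of each key range to the root page of the sub-tree covering it.
#include <cstdint>
#include <map>

using Key    = uint64_t;   // routing key, simplified to an integer
using PageId = uint32_t;   // root page of a sub-tree

class PartitionTable {
public:
    // Assumes the first partition starts at the minimum possible key.
    void add_partition(Key lower_bound, PageId root) { ranges_[lower_bound] = root; }

    // Route a key to the sub-tree that owns it: one ordered-map lookup.
    PageId root_for(Key k) const {
        auto it = ranges_.upper_bound(k);   // first range starting after k
        --it;                               // hence the range containing k
        return it->second;
    }

    // Splitting a partition at new_bound installs one new entry; the pages
    // of the existing sub-trees stay where they are.
    void split_partition(Key new_bound, PageId new_subtree_root) {
        ranges_[new_bound] = new_subtree_root;
    }

private:
    std::map<Key, PageId> ranges_;  // cached in memory; also stored durably
};
```

A real repartitioning would additionally quiesce the affected worker threads and update the durable routing page, but the mapping change itself stays this small.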
When deployed in a conventional shared-everything system, the MRBTree has the immediate benefit of eliminating latch contention at the tree root, with requesting threads distributed over many partitions (partitions sized so as to equalize traffic), and effectively reducing the height of the tree by one level. Thanks to the tree's fast repartitioning capabilities, the system can respond quickly to changing access patterns. Further, the MRBTree can also potentially benefit systems which use shared-nothing parallelism in a shared-memory environment (e.g. possibly H-Store [SMA+ 07] and HyPer [KN11]).

7.1.2 Dynamically-balanced physiological partitioning

To address the problems of conventional execution while avoiding the weaknesses of shared-nothing approaches, data-oriented execution, presented in the previous chapter, employs logical-only partitioning. Logical-only partitioning assigns each partition to one thread; the latter manages the data locally without the overheads of centralized locking. However, purely logical partitioning does not prevent conflicts due to false sharing, nor does it address the overhead and complexity of page latching protocols and the contention page latching imposes. Ideally, we would like a system with the best properties of both shared-everything and shared-nothing designs: a centralized data store that sidesteps the challenges of moving data during (re)partitioning, and a partitioning scheme that eliminates contention and the need for page latches.

This chapter presents physiological partitioning (PLP), a transaction processing approach that logically partitions the physical data accesses. Briefly, PLP employs data-oriented transaction execution on top of MRBTrees. Under PLP, a partition manager assigns threads to sub-tree roots of MRBTrees and ensures that requests distributed to each thread reference only the corresponding sub-tree. As a result, threads can bypass the partition mapping and their accesses to the sub-tree are entirely latch-free. In addition, PLP can extend the partitioning down into the heap pages where non-clustered records are actually stored, eliminating another class of page latching (similar to shared-nothing systems).

The combination of data-oriented execution (where all the data remain in a single database instance) and MRBTrees enabled the implementation of a lightweight yet effective dynamic load balancing and repartitioning mechanism (called DLB) on top of PLP. Data-oriented execution obviates the need for distributed transactions when repartitioning takes place, while MRBTrees enable fast repartitioning within a single database instance. DLB monitors the request queues of the partitions and employs a simple data structure, called an aging two-level histogram, to collect information about the current access patterns and load of a workload, and uses it to dynamically guide partition maintenance decisions.

7.1.3 Contributions and organization

This chapter introduces the physiological partitioning (PLP) design. The structure and contributions of the remainder of this chapter are as follows:

• Section 7.2 categorizes the communication patterns within a transaction processing system. Analyzing the communication patterns of a software system clearly highlights the latent scalability bottlenecks. Using this categorization we identify page latching as a lurking performance and scalability bottleneck in modern transaction processing systems, whose effect is proportional to the available hardware parallelism.
• Section 7.3 discusses the pros and cons of various deployment strategies and partitioning schemes for efficient transaction processing within a single node: shared-everything vs. physical partitioning (or shared-nothing) vs. logical partitioning (like the one applied by DORA), and concludes that we need a solution that combines the pros of the three.

• Section 7.4 shows that the need for page latching during accesses to both index and heap pages can be eliminated within a shared-everything OLTP system by deploying a design based on physiological partitioning. Physiological partitioning (PLP) extends the idea of data-oriented transaction execution, logically partitioning the physical accesses as well.

• Section 7.5 makes the case for the need of dynamic load balancing capabilities in partitioning-based systems, to protect them against sudden load changes and imbalances. It also analyzes the repartitioning cost for PLP and shows that this cost is much lower than the cost of repartitioning in physically-partitioned (shared-nothing) systems.

• Section 7.6 presents a lightweight yet effective dynamic load balancing mechanism (DLB) for PLP, which is enabled by PLP's low repartitioning cost.

• Section 7.7 presents a thorough evaluation of a prototype implementation of PLP integrated with DLB. PLP acquires 85% and 68% fewer contentious critical sections per transaction than an optimized conventional design and a vanilla data-oriented system that applies logical-only partitioning, respectively. PLP improves scalability and yields up to almost 50% higher performance on multicore machines. In the meantime, the overhead of DLB is minimal during regular processing (at most 8% in the worst case), and it achieves low response times in both detecting and balancing imbalances.

• Finally, Section 7.8 presents related work, and Section 7.9 concludes by promoting PLP as a very promising OLTP system design in the light of the upcoming hardware trends.

7.2 Communication patterns in OLTP

In Section 4.2 we made the important observation that not all forms of communication in a software system pose the same threat to its scalability. We concluded that the only way that leads to scalability lies in converting all unscalable communication to either the fixed or composable type, thus removing the potential for bottlenecks to arise.

Figure 7.1: Breakdown of the critical sections when running the TATP UpdLocation transaction. The PLP variants enter on average almost an order of magnitude fewer unscalable critical sections than state-of-the-art conventional systems.

The two left-most bars of Figure 7.1 compare the number and types of critical sections executed by the conventional systems presented in Chapter 5: baseline Shore-MT ([JPH+ 09] and Section 5.2, labeled "Conventional") and a Shore-MT variant with speculative lock inheritance ([JPA09] and Section 5.3.1) as well as consolidated log buffer inserts ([JPS+ 10] and Section 5.4). The second bar, labeled "SLI & Aether", is essentially a highly optimized conventional system.

Figure 7.2: Page latch breakdown for three popular OLTP benchmarks. The majority of page latches reside in index structures.
The third bar from the left is from a data-oriented transaction execution prototype ([PJHA10] and Chapter 6) built on top of Shore-MT. We label this bar Logical because, by definition, data-oriented execution applies logical-only partitioning of the accesses. Each bar shows the number of critical sections entered during the execution of the TATP UpdateLocation transaction, categorized by the storage manager service that triggers them. Locking and latching form a significant fraction of the total communication in the baseline system. SLI achieves a performance boost by sidestepping the most problematic critical sections associated with the lock manager, but fails to address the remaining (still unscalable) communication in that category. (With SLI we can also cache and transfer across transactions information that is not related to locking, such as metadata; that is why the metadata component of SLI is much lower than in the Baseline.) The data-oriented system with its logical partitioning, in contrast, eliminates nearly all types of locking, replacing both the contention and the overhead of centralized communication with efficient, fixed communication via message-passing queues. Once locking is removed, latching remains by far the largest source of critical sections. There is no predefined limit to the number of threads which might attempt to access a given page simultaneously, so page latching represents an unscalable form of communication which should be either eliminated or converted to a scalable type. The remaining categories represent either fixed communication (e.g. transaction management), composable operations (e.g. logging), or a minor fraction of the total unscalable component.

Examining page latching more closely, Figure 7.2 decomposes the page latches acquired by both the conventional systems and the logical-only partitioning prototype during the execution of three popular OLTP benchmarks (TATP, TPC-B and TPC-C). We categorize the database pages into different types: metadata, index pages, and heap pages. The majority of the page latches, between 60% and 80%, are due to index structures. Heap page latches are another non-negligible component, accounting for nearly all remaining page latches.

7.3 Shared-everything vs. physical vs. logical partitioning

With the preceding characterization of communication patterns in mind, we now return to the question of which configuration for transaction processing is more appropriate for deployment within a single node. Does the traditional shared-everything approach remain optimal, or should system designers explore other approaches, such as physical partitioning (or shared-nothing) and logical partitioning?

Figure 7.3: Comparison of logical and physiological partitioning schemes (bottom) with shared-everything and shared-nothing designs (top).

Shared-everything. The previous chapter (Chapter 6) outlines the scalability problems faced by a conventional shared-everything transaction processing system: assigning transactions to threads means that any worker thread might access any data item at any time, requiring careful concurrency control at both the logical and physical levels of the system. This is illustrated by the "shared-everything" case of Figure 7.3 (top left).

Physical partitioning (shared-nothing). Another possibility is to partition the data physically to reduce contention.
This is the so-called "shared-nothing" approach. As illustrated in Figure 7.3 (top right), shared-nothing approaches physically separate the partitions into a set of independent database instances which collectively serve the workload [Sto86, DGS+ 90]. Shared-nothing deployments are an appealing design even within a single node, because the designer has explicit control over the number of threads and processing cores that participate in each instance and can thus control, or even eliminate, the contention on each component of the system. For example, the H-Store system assigns a single worker thread to each partition [SMA+ 07]. This arrangement naturally produces a thread-to-data assignment of work and achieves the desirable elimination of any contention-prone critical sections. However, such designs give up too much by eliminating all communication within the engine. That is, the physical separation produces at least three undesirable side effects:

• Even the composable and fixed types of critical sections, which do not threaten scalability, become problematic. For example, database logging is not amenable to distribution [JPS+ 10, JPS+ 11], and physically-partitioned systems either use a shared log [LARS92] or eliminate it completely [SMA+ 07].

• Perhaps the biggest challenge of physical partitioning is that transactions which access data in more than one partition must coordinate using some distributed protocol, such as two-phase commit [GR92]. The scalable execution of distributed transactions has been an active field of research for the past three decades, with researchers from both academia and industry persuasively arguing that they are fundamentally not scalable [Bre00, Hel07].

• Furthermore, the performance of shared-nothing systems is very sensitive to imbalances in load arising from skew in either data or requests, where some partitions see high load and others see very little [CJZM10], while non-partition-aligned operations (such as non-clustered secondary indexes) may pose significant barriers to physical partitioning. For example, in the figure, the rightmost partition of the shared-nothing example is accessed by three transactions while its neighbor is accessed by only one. Unfortunately, frequent repartitioning is prohibitively expensive under the shared-nothing discipline because the data involved must be physically migrated to a different location.

Logical partitioning. To achieve the benefits of physical partitioning without the costs that usually accompany it, in the previous chapter we observed that partitioning is effective because it forces single-threaded access to each item in the database. However, physical partitioning is not necessary: any scheme which arranges for a thread-to-data policy, or applies "logical partitioning", should achieve the same regularity and reduced reliance on centralized mechanisms. As illustrated in Figure 7.3 (bottom left), a data-oriented transaction execution system logically partitions the data among worker threads, breaking transactions into smaller actions which access only one logical partition (similar to how shared-nothing systems need to distribute data accesses among physical partitions). As its name suggests, logical partitioning eliminates most unscalable communication at the logical level, namely database locking. However, it has little impact on the remaining communication, which arises in the physical layers of the system and cannot be managed cleanly from the application level. As a result, threads must acquire page latches and potentially perform other unscalable communication, even though there is no communication between requests at the application level.

7.4 Physiological partitioning

We have seen how both logically- and physically-partitioned designs offer desirable properties, but also suffer from weaknesses which threaten their scalability. In this chapter we therefore propose a hybrid of the two approaches, physiological partitioning (or PLP), which combines the best properties of both: like a physically-partitioned system, the majority of physical data accesses occur in a single-threaded environment, which renders page latching unnecessary; like a logically-partitioned system, locking is distributed without resorting to distributed transactions, and load balancing is inexpensive because almost no data movement is necessary.

7.4.1 Design overview

Transactions in a typical OLTP workload access a very small subset of records. The system relies on index structures because sequential scans are prohibitively expensive by comparison. PLP therefore centers around the indexing structures of the database. The left-most graph of Figure 7.4 gives a high-level overview of a physiologically-partitioned system. We adapt the traditional B+Tree [BM70] for PLP by splitting it into multiple sub-trees, each of which covers a contiguous subset of the key space. A partitioning table becomes the new root and maintains the partitioning as well as pointers to the corresponding sub-trees. We call the resulting structure a multi-rooted B+Tree (MRBTree). The MRBTree partitions the data but, unlike a horizontally-partitioned workload (e.g. top right of Figure 7.3), all sub-trees belong to the same database file and can exchange pages easily; the partitioning, though durable, is dynamic and malleable rather than static.

Figure 7.4: Different variations of physiological partitioning (PLP). PLP-Regular logically partitions the index page accesses, while PLP-Partition and PLP-Leaf logically partition the heap page accesses as well.

With the MRBTree in place, the system assigns a single worker thread to each sub-tree, guaranteeing it exclusive access for latch-free execution. A partition manager layer controls all partition tables and makes assignments to worker threads. The worker threads in PLP do not reference partition tables during normal processing, which might otherwise become a bottleneck; instead, the partition manager ensures that all work given to a worker thread involves only data that this thread owns. The transactions are broken down into a directed graph of potentially parallel partition accesses, which are passed to worker threads that collectively assemble a complete transaction. Since in PLP each table is assigned a different set of worker threads, whenever a transaction touches more than one table it becomes a multi-site transaction, according to the terminology of [JAM10]. However, multi-site transactions are not as expensive as in a shared-nothing system, because PLP has a shared-everything setting.

All indexes in the system (primary, secondary, clustered, non-clustered) can be implemented as MRBTrees; data are stored directly in clustered indexes, or in tightly integrated heap file pages referenced by record ID (RID). Secondary (non-clustered) indexes that can align to the partitioning scheme (i.e. contain the fields that are used for the partitioning decision) are managed by the worker thread of the corresponding partition.
On the other hand, secondary indexes that cannot align to the partitioning scheme are accessed as in the conventional system, but each leaf entry contains the associated fields used for the partitioning, so that the result of each probe can be passed to its partition's owning thread for further processing, as we discussed in Section 6.3.2.

7.4.2 Multi-rooted B+Tree

The "root" of an MRBTree is a partition table that identifies the disjoint subsets of the key range assigned to each sub-tree, as well as a pointer to the root of each sub-tree. Because the routing information is cached in memory as a ranges map by the partition manager, the on-disk layout favors simplicity rather than optimal access performance. We therefore employ a standard slotted page format to store key-root pairs. If the partitioning information cannot fit on a single page (for example, if the number of partitions is large or the keys are very long), the routing page is extended as a linked list of routing pages. In our experiments we have never encountered the need to extend the routing page, however, as several dozen mappings fit easily in 8KB, even assuming rather large keys.

Record insertion (deletion) takes place as in regular B+Trees. When the key to insert (delete) is given, the ranges map routes it to the sub-tree that corresponds to the key range the key belongs to, and the insert (delete) operation is performed in that sub-tree as in a regular B+Tree. The other sub-trees, the ranges map, and the routing page are not affected by the insert (delete) operation at all.

When deployed in a conventional shared-everything system, the MRBTree eliminates latch contention at the index root; fewer threads attempt to grab the latch for the same index root at a time. Partitioning also reduces the expected tree height by at least one level, which reduces the index probe time.

7.4.3 Heap page accesses

In PLP a heap file scan is distributed to the partition-owning threads and performed in parallel. Large heap file scans reduce the concurrency of OLTP applications, and PLP has little to offer there. Still, heap page management opens up an additional design option, since we can extend the partitioning of the accesses to the heap pages. That is, when records reside in a heap file rather than in the MRBTree leaf pages, PLP can ensure that accesses to heap pages are partitioned in the same way as index pages. There are three options for how to place and access records in the heap pages, leading to three variations of PLP, depicted in Figure 7.4:

• Keep the existing heap page design, called PLP-Regular.

• Each heap page keeps records of only one logical partition, called PLP-Partition.

• Each heap page is pointed to by only one leaf page of the primary MRBTree, called PLP-Leaf.

PLP-Regular simply keeps the existing heap page operations. Without any modification, the heap pages still need to be latched, because they can be accessed by different threads in parallel. But heap page accesses are not the biggest fraction of the total page accesses in OLTP. According to Figure 7.2, where we categorized the types of pages that are latched in a conventional OLTP system, the heap pages can be as low as only 30% of the pages that are latched. Thus, there is room for significant improvement even if we ignore them. However, allowing heap pages to span partitions prevents the system from responding automatically to false sharing or other sources of heap page contention.
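To connect the pieces of Sections 7.4.1 and 7.4.2, the following sketch shows what the per-partition execution might look like; the queue and action types are hypothetical, not the prototype's code. Because the partition manager enqueues only work that falls within the thread's own sub-tree, the loop body can touch the sub-tree's index pages (and, under PLP-Leaf, its heap pages) without acquiring any latches.

```cpp
// Hypothetical sketch of a PLP worker thread serving one partition.
#include <deque>
#include <functional>

struct Action {
    std::function<void()> run_latch_free;  // touches only this partition's pages
    std::function<void()> notify_rvp;      // report completion to the RVP
};

struct Partition {
    std::deque<Action> input_queue;  // filled only by the partition manager
                                     // (a real queue would be synchronized)
    bool shutting_down = false;
};

void worker_loop(Partition& p) {
    while (!p.shutting_down) {
        if (p.input_queue.empty()) continue;      // real code would block/park
        Action a = std::move(p.input_queue.front());
        p.input_queue.pop_front();
        a.run_latch_free();   // index probe / record access, no page latches
        a.notify_rvp();       // the RVP may enqueue the next phase's actions
    }
}
```

Multi-site transactions simply place actions in several such queues and meet at a rendezvous point, without a distributed commit protocol, since all partitions live in the same shared-everything engine.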
In PLP-Partition and PLP-Leaf the MRBTree and heap operations are modified so that heap page accesses are partitioned as well. The difference between the two is that in the former a heap page can be pointed by many leaf pages as long as they belong to the same partition, while in the latter a heap page is pointed by only one leaf page. Even though those two variations provide latch-free heap page accesses, they also have some disadvantages. Forcing a heap page to contain records that belong to a specific partition results in some fragmentation. In the worst case, each leaf has room for one more entry than fits in the heap page, resulting in nearly double the space requirement (Section 7.7.8 measures this cost). Further, in PLP-Leaf every leaf split must also split one or more heap pages, increasing the overhead of record insertion (deletions are simple because a leaf may point to many heap pages). On the other hand, in PLP-Partition allowing multiple leaf pages from a partition to share a heap page forces the system to reorganize potentially significant numbers of heap pages with every repartitioning, which goes against the philosophy of physiological partitioning. We therefore opt for the PLP-Partition option and favor the PLP-Leaf. The two extensions impose one additional piece of complexity for record inserts. When a typical system inserts a record to a table it follows a straightforward procedure. It first inserts the record in a free slot of a random heap page and then updates any index to that table, by inserting a corresponding entry for the new record along with its record ID, that identifies the heap page and slot. In contrast, when PLP-Partition and PLP-Leaf insert a new record they must first identify the leaf page in the MRBTree that will eventually point to the record, and then find an free slot in an existing heap page (or allocate a new one) according to the partitioning strategy. Because the storage management layer is completely unaware of the partitioning strategy (by design), it must make callbacks into the upper layers of the system to identify an appropriate heap page for each insertion. Similarly, a partition split in PLP-Partition and PLP-Leaf may split heap pages as well, invalidating the record IDs of migrated records. The storage manager, therefore, exposes another callback so the metadata management layer can update indexes and other structures 7.4. PHYSIOLOGICAL PARTITIONING 151 that reference the stale RIDs. We note that when PLP-Leaf splits leaf pages during record insertion, the same kinds of record relocations arise and use the same callbacks. 7.4.4 Page cleaning Page cleaning cannot be performed naively in PLP. Conventionally there is a set of page cleaning threads in the system that are triggered when the system needs to clean dirty pages (for example, when it needs to truncate log entries). Those threads may access arbitrary pages in the buffer pool, which breaks the invariant of PLP where a single thread can access a page at each point of time. To handle the problem of page cleaning in PLP each worker thread does the page cleaning for its logical partition. Each logical partition has an additional input queue which is for system requests, and the page cleaning requests go to that queue. The system queue has higher priority than the queue of completed actions. Their execution won’t be delayed by more than the execution time of one action (typically very short). 
In addition, because page cleaning is a read-only operation, the worker thread can continue to work (and even re-dirty pages) during the write-back I/O.
7.4.5 Benefits of physiological partitioning
Under physiological partitioning, each partition is permanently locked for exclusive physical access by a single thread, which then handles all the requests for that partition. This allows the system to avoid several sources of overhead, as described in the following paragraphs.
Latching contention and overhead. Though page latching is inexpensive compared with acquiring a database lock, the sheer number of page latches acquired imposes some overhead and can serialize B+Tree operations as transactions crab down the tree during a probe. The problem becomes more acute when the lower levels of the tree do not fit in memory, because a thread which fetches a tree node from disk holds a latch on the node's parent until the I/O completes, preventing access to 80-100 other siblings which may well be memory-resident. Section 7.7.3 evaluates a case where latching becomes expensive for B+Tree operations and how PLP can eliminate this problem by allowing latch-free accesses on index pages.
False sharing of heap pages. One significant source of latch contention arises when multiple threads access unrelated records which happen to reside on the same physical database page. In a conventional system false sharing requires padding to force problematic database records to different pages. A PLP design that allows latch-free heap page accesses achieves the same effect automatically (without the need for expensive tuning) as it splits hot pages across multiple partitions. Section 7.7.3 evaluates this case as well.
Serialization of structural modification operations (SMOs). The traditional ARIES/KVL indexes [Moh90] allow only one structural modification operation (SMO), such as a leaf split, to occur at a time, serializing all other accesses until the SMO completes. Partitioning the tree physically with MRBTrees eases the problem by distributing SMOs across sub-trees (whose roots are fixed) without having to apply more complicated protocols, such as those described in [ML92, JSSS06]. The benefits of parallel SMOs are apparent in the case of insert-heavy workloads, which we evaluate in Section 7.7.5.
Repartitioning and load monitoring costs. In PLP, repartitioning can occur at a higher level in the partition manager and therefore can be latch-free as well; the partition manager simply quiesces affected threads until the process completes. Moreover, MRBTrees require very few pointer updates and little data movement in order to map to an updated partitioning assignment. In addition, the partition manager can easily determine if there are any load imbalances in the system, by simply inspecting the incoming request queues of each partition. The combination of those characteristics enables the implementation of a robust and lightweight dynamic load balancing mechanism, which we further discuss in Section 7.5 and Section 7.6 and evaluate in Section 7.7.9.
Code complexity. Finally, with all latching eliminated, the code paths that handle contention and failure cases can be eliminated as well, simplifying the code significantly, to the extent that the index could be substituted with a much simpler implementation.
For example, a huge source of complexity in traditional B+Trees arises due to the sophisticated protocols that maintain consistency during an SMO in spite of concurrent probes from other threads. The simpler code is not only more efficient but also easier to maintain. While building the prototype used in the evaluation section, we did not attempt the code refactoring required to exploit these opportunities, and the performance results we report are therefore conservative. We note that index probes are the most expensive remaining component of PLP. Therefore, we expect significant performance improvements if we substitute the B+Tree implementation of our prototype with, for example, a cache-conscious [RR99, RR00] or prefetching-based B+Tree [CGMV02].
7.5 Need and cost of dynamic repartitioning
Although partitioning is an increasingly popular solution for scaling up the performance of database management systems, it is not a panacea, since there are many challenges associated with it. One of these challenges is the system behavior under skewed and dynamically changing workloads, which are the rule rather than the exception in real settings – consider, for example, the Slashdot effect [Adl05]. This section shows how even mild access skew can severely hurt performance in a statically partitioned database, rendering partitioning useless in many realistic workloads (Section 7.5.1). Then it shows that PLP provides an adequate infrastructure for dynamic repartitioning, mainly because it is based on data-oriented execution and because of its use of MRBTrees (Section 7.5.2). The low repartitioning cost facilitates the implementation of a robust yet lightweight dynamic load balancing mechanism for PLP, which is presented in the following section (Section 7.6).
7.5.1 Static partitioning and skew
This dissertation argues that in order to scale up the performance of transaction processing we need to reduce the level of unscalable communication within the system by employing a form of partitioning, like data-oriented execution or physiological partitioning. In general, one of the disadvantages of partitioning-based transaction processing designs is that those systems are vulnerable to skewed and dynamically changing workloads, in contrast with shared-everything systems, which do not employ any form of partitioning and tend to suffer less. Unfortunately, skewed and dynamically changing workloads are the rule rather than the exception in transaction processing. Thus, it is imperative for partitioning-based designs to alleviate the problem of skewed and dynamically changing accesses.
To show how vulnerable partitioning-based systems are to skew, Figure 7.5 plots the throughput of a non-partitioned (shared-everything) system and a statically partitioned system when all the clients in a TATP database submit the GetSubscriberData read-only transaction [NWMR09]. Initially the distribution of requests is uniform across the entire database. But at time point 10 (sec) the load distribution changes, with 50% of the requests being sent to 30% of the database (see Section 7.7.1 for experimental setup details). As we can see from the graph, initially and as long as the distribution of requests is uniform the performance of the non-partitioned system is around 15% lower than the partitioned one.
After the load change the performance of the non-partitioned system remains pretty much the same (at around 325Ktps), while the performance of the partitioned system drops sharply, by around 35% from its initial 375Ktps.

Figure 7.5: Throughput of a statically partitioned system when the load changes at runtime. Initially the requests are distributed uniformly; at time t=10, 50% of the requests are sent to 30% of the database.

The drop in performance is severe even though the skew is not that extreme; easily a higher fraction of the requests could go to a smaller portion of the database, for example following the 80-20 rule of thumb where 80% of the accesses go to only 20% of the database.
There are two ways to attack the problem of skewed access in partitioning-based transaction processing systems: proactively, by configuring the system with an appropriate initial partitioning scheme; and reactively, by using a dynamic load balancing mechanism. Starting with the appropriate partitioning configuration is key. If the workload characteristics are known a priori, previously proposed techniques [RZML02, CJZM10] can be used to create effective initial configurations. If the workload characteristics are not known, then simpler approaches like round-robin, hash-based, and range-based partitioning can be used [DGS+90]. As time progresses, however, skewed access patterns gradually lead to load imbalance and lower performance, as the initial partitioning configuration eventually becomes useless no matter how carefully it was chosen. Thus, it is far more important and challenging to dynamically balance the load through repartitioning based on the observed, and ever changing, access patterns. A robust dynamic load balancing mechanism should eliminate any bad choices made during initial assignment.
7.5.2 Repartitioning cost
As the previous subsection argued, a dynamic load balancing mechanism would be useless if the cost of repartitioning in a partitioning-based transaction processing system is high. The lower the cost of repartitioning, the more frequently the system can trigger load balancing procedures and the faster it will react to load changes. This subsection models the cost of repartitioning for a physically-partitioned (shared-nothing) system and the three PLP variations to highlight the clear advantage of PLP-Regular and PLP-Leaf. It also describes the way to perform repartitioning for the three PLP designs.
The basic case of repartitioning whose cost we need to calculate is when a partition needs to split into two. Thus, for all the PLP variations and the physically-partitioned case our repartitioning cost model calculates the number of records and index entries that have to be moved, the number of update/insert/delete operations on the indexes, the number of pointer updates on the index pages and the routing page, and the remaining number of read operations that have to be performed when a partition is split into two. We also discuss merging two partitions but do not give as detailed a cost model.
Let's assume that there is a heap file (table) with an index on it, which in the case of PLP is an MRBTree. When a partition needs to be split into two, that means that a sub-tree in the index needs to be split in two as well.
In that case we define: h as the height of the tree; n as the number of entries in an internal B+Tree node; m_i as the number of entries to be moved from the B+Tree at level i; and M as the number of records in the heap file that have to be moved. The number of read operations during a key value search in the B+Tree is omitted since it is the same for all the systems (a binary search at each level from root to leaf).
7.5.3 Splitting non-clustered indexes
The first case we consider is when the heap file that needs to be re-partitioned has a unique non-clustered primary index and a secondary index, and the data are partitioned based on the primary index key values.
PLP-Regular. The cost of repartitioning in PLP-Regular is very low. Only a few index entries need to move from one sub-tree of the MRBTree index(es) to another newly created sub-tree. Algorithm 1 shows the procedure that needs to be executed to split an MRBTree sub-tree. First, we need to find the leaf page where the starting key of the new partition should reside (Lines 4–8 in Algorithm 1). Let's assume that there are m_1 entries that are greater than or equal to the starting key on the leaf page where the slot for this key is found. All that needs to be done is to move these m_1 entries on that leaf page to a newly created (MRBTree) index node page, and this procedure has to be repeated as the tree is traversed from this leaf page to the root (Lines 9–13 in Algorithm 1). It is not necessary to move any entry from the pages that keep the key values greater than the ones in the leaf page containing the starting key. Setting the previous/next pointers of the pages at the boundaries of the old and new partitions is sufficient. Finally, a new entry should be added to the routing page for the new partition. The overall cost is given in the first row of Table 7.1.

Algorithm 1 Splitting an MRBTree sub-tree.
1: {The binary-search routine used below performs binary search to find the key on the page. If an exact match for the key is found, found is returned as true and the function returns the slot for the key on the page. Otherwise, found is false and the function returns the slot on the page where the key should reside.}
2: page = root
3: found = false
4: while page != NULL and !found do
5:   slot = binary-search(page, key, found)
6:   slots.push(slot)
7:   pages.push(page)
8:   page = page[slot].child
9: while pages.size > 0 do
10:   slot = slots.pop()
11:   page = pages.pop()
12:   Create page_new
13:   Move entries starting from slot at page to page_new

Table 7.1: Repartitioning costs for splitting a partition into two
System PLP-Regular PLP-Leaf PLP-Partition Physically-partitioned #Records Moved (M) #Entries Moved h mk k=1 h m1 k=1 mk h−2 h−l−1 h m1 + (n × (mh−l − 1)) k=1 mk m1 + l=0 h−2 (nh−l−1 × (mh−l − 1)) l=0 PLP (Clustered) Shared Nothing (Clustered) m1 + h−2 l=0 m1 (nh−l−1 × (mh−l − 1)) h k=2 - Primary Index Secondary Index #Reads #Pages Read #Pointer Updates Changes Changes 2×h+1 M 1 2×h+1 M updates M updates M M mk 1+ 1+ M −m1 n M −m1 n 2×h+1 - - - 2×h+1 - - - M updates M updates M inserts M deletes - M inserts M deletes M updates M inserts M deletes M inserts M deletes

The cost model in Table 7.1 describes the worst case scenario for PLP-Regular.
If the starting key of the new partition is in one of the internal index node pages, there is no need to move any entries from the pages that are below this page because the moved entries from the internal node page already have pointers to their corresponding child pages; resulting in fewer reads, updates, and moved entries. PLP-Leaf. The partition splitting cost related with the MRBTree index structure is the same as in PLP-Regular. But, as mentioned in Section 7.4.3, in addition to modifying the index structure, when repartitioning in PLP-Leaf, we also have to move records from the 7.5. NEED AND COST OF DYNAMIC REPARTITIONING (a) (b) 157 (c) Figure 7.6: Example of splitting a partition in PLP-Leaf, which is a three-step process. In the worst case, m1 heap pages need to be touched. Algorithm 2 Splitting heap pages in PLP-Leaf and PLP-Partition. 1: leaf = leftmost leaf node 2: Create pagenew 3: while leaf ! = N U LL {Omit for PLP-Leaf } do 4: for all t pointed by leafcurrent do 5: if pagenew does not have space then 6: Create pagenew 7: Move t to pagenew 8: Update pointers at all the secondary indexes 9: leaf = leaf.next {Omit for PLP-Leaf } heap file to new heap pages. Figure 7.6 shows the three-step process for splitting a partition to two in PLP-Leaf. The height of the sub-tree is 2 and the dark slot in Figure 7.6 (a) indicates the slot which contains the leaf entry with the starting key of the new partition. Figure 7.6 (b) shows that a new sub-tree is created as a result of the split. Those two steps are the same with the repartitioning process in PLP-Regular. In PLP-Leaf, however, we also have to move the records at the heap file that belong to the new partition to a new set of heap data pages. Algorithm 2 shows the pseudo code for updating the heap pages upon a partition split in PLP-Leaf (and PLP-Partition). The dark records on the heap pages in Figure 7.6 (b) indicate those records that belong to the new partition (sub-tree) and need to move. Those records are pointed by the m1 leaf page entries that moved to the newly created sub-tree. Thus, in the worst case m1 records have to move (Lines 4-7 in Algorithm 2). Since the index is non-clustered, we have to scan these m1 entries in order to get the RIDs of the records to be moved and spot their heap pages. The result of the split after the records are moved is shown in Figure 7.6 (c). Whenever a record 158 CHAPTER 7. PHYSIOLOGICAL PARTITIONING Figure 7.7: Example of splitting a partition in PLP-Partition. To identify which records from the heap pages need to move to the new partition, the system needs to scan all the heap pages of the old partition, increasing significantly the cost of repartitioning. moves its RID changes. Thus, once all the records are moved, all the indexes (primary and secondary) need to update their entries (Line 8 in Algorithm 2). The cost for repartitioning in PLP-Leaf is given in the second row of Table 7.1. This cost, again, illustrates the worst case scenario. If the starting key of the new partition is found in one of the internal nodes, then no record movement has to be done since there will be no leaf page splits and the constraint of having all heap pages pointed by only one leaf page is already preserved. Moreover, even if the key is found on the leaf page, we might not have to move all the records that are specified by the model above. If all the records on a heap page are pointed only by leaf entries of the new partition, then these records can stay on that heap page. PLP-Partition. 
In PLP-Partition, the process for splitting the index structure is the same as in PLP-Regular and PLP-Leaf. Therefore, it is omitted from Figure 7.7, which shows the rest of the process for splitting a partition into two in PLP-Partition. In the worst case, in PLP-Partition we may have to move records from all the heap pages that belong to the old partition. Those records are indicated with the dark rectangles in the heap pages of Figure 7.7 (a). The number of records be to moved is equal to the number of entries that are on the leaf pages of the new sub-tree. As in PLP-Leaf, the RIDs of the records are retrieved with an index scan of the newly created sub-tree, the records are moved to new heap pages and they get new RIDs, and all the indexes are updated with the new RIDs after the record movement is completed (shown in Lines 3-9 in Algorithm 2). The result of the partitioning is shown in Figure 7.7 (b); while the cost model for PLP-Partition is given in the third row of Table 7.1. Physically-partitioned (shared-nothing). In a physically-partitioned or shared-nothing system, the cost for the record movement is equal to the worst case of PLP-Partition. Be- 7.5. NEED AND COST OF DYNAMIC REPARTITIONING 159 cause, the entire old partition needs to be scanned for records that belong to the new partition. But, in addition to that, the cost of index maintenance may be prohibitively expensive. That is, in a physically-partitioned system each record move across partitions results to a deletion of an index entry (or entries if there are multiple indexes) from the old partition and an insertion of an index entry to the new partition. In contrast with the PLP variant where every record move is a result of a few MRBTree entries updates. The cost of index maintenance when repartitioning physically-partitioned systems sometimes can be prohibitive. In order to avoid the index maintenance, a common technique is to drop and bulkload the index from scratch upon every repartition. For physically-partitioned systems which employ replication, like H-Store [SMA+ 07], this procedure has to be repeated for all the partition replicas. The repartitioning cost for one replica in a physically-partitioned system is given as in the fourth row of Table 7.1. Given how expensive repartitioning can be, physically-partitioned systems are reluctant in frequently triggering repartitioning. 7.5.4 Splitting clustered indexes Let’s consider the case where we have a unique clustered primary index and a secondary index, and the data partitioning is done using the primary index key columns. In this setup, no heap file exists, since the primary index contains the actual data records rather than RIDs, and the three PLP variations are equivalent, because their differences lie on how they treat the records in the heap pages. When the actual records are part of the clustered primary index, the cost of record movement for PLP equals with the number of leaf page entries that need to move. While the cost of the primary index maintenance equals with the entry movements in the internal nodes of the MRBTree index. The cost model is given in the fifth row of Table 7.1. On the other hand, the repartitioning cost for the physically-partitioned system is similar to the non-clustered case. Because there is not a common index structure and data need to move from the index of the one partition to the other. 
The only difference is that there is no need to scan the leaf page entries to get the RIDs of the records to be moved, since the leaf pages contain the actual records. Therefore, the repartitioning cost model for a replica is given in the last row of Table 7.1.
7.5.5 Moving fewer records
With some additional information we can actually move less data during repartitioning, at the cost of an increased number of reads. For example, in PLP-Partition, instead of directly moving all the records that belong to a new partition, we can scan all the index leaf pages that are going to be split and collect information for all the records. With this information, we can determine whether a heap page has more records that belong to the old partition or the new partition and act accordingly. That is, if a heap page has more records that belong to the new partition, we can move out of the page the records that belong to the old partition.

Table 7.2: Repartitioning costs when splitting a partition with 466 MB data in half (U: Updates, D: Deletes, I: Inserts).
System PLP-Regular PLP-Leaf PLP-Partition Physically-partitioned PLP (Clustered) Physically-partitioned (Clustered) Records Moved 8.3KB 233MB 233MB 8.3KB 233MB Primary Index Secondary Entries #Pages #Pointer Changes Index Moved Read Updates Changes 8KB 7 8KB 1 7 85 U 85 U 8KB 14365 7 2.44M U 2.44M U 14365 2.44M I + 2.44M D 2.44M I + 2.44M D 5.3KB 7 85 U - - - 2.44M I + 2.44M D 2.44M I + 2.44M D

The number of reads when scanning the leaf pages can easily become a bottleneck in disk-resident databases, due to the number of I/O operations that have to be performed. On the other hand, in in-memory databases or systems that use flash storage devices, the I/O bottleneck can be prevented [Che09] and the above-mentioned technique can reduce the amount of data movement during repartitioning. This technique, unfortunately, cannot be used in a physically-partitioned system because the pages of the two partitions do not share the same storage space.
7.5.6 Example of repartitioning cost
Table 7.2 gives an example of the repartitioning cost for the different systems under consideration, based on the cost model given in Table 7.1. In this example, a partition which contains 466MB of 100-byte data records in a heap file is split in half. We assume that there is a primary index of height 3 with 170 32-byte entries on each page. The first four rows of the table assume there is a unique non-clustered primary index and a secondary index in the system, whereas for the last two rows there is a unique clustered primary index and a secondary index. The cost for the physically-partitioned system is just for one replica (if we assume that it uses replication for durability). For the PLP variations the number of moved records represents the worst case scenario.
As Table 7.2 shows, the PLP variations, except for PLP-Partition, move very few records compared to the physically-partitioned one. In the worst case, PLP-Partition moves the same number of records as the physically-partitioned system. For the clustered index case, PLP is cheaper to repartition than the physically-partitioned system, both in terms of record movement and index maintenance. When we calculate the corresponding costs for a larger heap file with an index of height 4, the repartitioning cost for the physically-partitioned system (and PLP-Partition) becomes prohibitive.
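To make the worst-case formulas of the cost model concrete, the following small C++ sketch computes the number of index entries and heap records moved for a split under the PLP variations, given the per-level entry counts m_i. The helper names and the concrete values in main() are illustrative assumptions, not numbers taken from the prototype or from Table 7.2.

```cpp
// Worst-case repartitioning cost for a split, following Section 7.5.3.
// Inputs: h = tree height, n = entries per internal node, m[i] = entries
// moved at level i (m[1] is the leaf level). All values are illustrative.
#include <cstdint>
#include <cstdio>
#include <vector>

struct SplitCost {
    uint64_t entries_moved;  // MRBTree entries moved to the new sub-tree
    uint64_t records_moved;  // heap records moved (the M of the cost model)
};

// PLP-Regular and PLP-Leaf move sum_{k=1..h} m_k index entries;
// PLP-Regular moves no records, PLP-Leaf moves the m_1 leaf-level records.
SplitCost plp_leaf_cost(const std::vector<uint64_t>& m) {
    SplitCost c{0, m[1]};
    for (size_t k = 1; k < m.size(); ++k) c.entries_moved += m[k];
    return c;
}

// PLP-Partition / physically-partitioned worst case:
// M = m_1 + sum_{l=0}^{h-2} n^(h-l-1) * (m_{h-l} - 1).
SplitCost plp_partition_cost(const std::vector<uint64_t>& m, uint64_t n, int h) {
    SplitCost c = plp_leaf_cost(m);
    uint64_t records = m[1];
    for (int l = 0; l <= h - 2; ++l) {
        uint64_t fanout = 1;
        for (int p = 0; p < h - l - 1; ++p) fanout *= n;  // n^(h-l-1)
        records += fanout * (m[h - l] > 0 ? m[h - l] - 1 : 0);
    }
    c.records_moved = records;
    return c;
}

int main() {
    int h = 3;
    uint64_t n = 170;                           // entries per internal node (assumed)
    std::vector<uint64_t> m = {0, 100, 50, 2};  // m[1..3]: assumed per-level counts
    SplitCost leaf = plp_leaf_cost(m);
    SplitCost part = plp_partition_cost(m, n, h);
    std::printf("PLP-Leaf: %llu entries, %llu records moved\n",
                (unsigned long long)leaf.entries_moved,
                (unsigned long long)leaf.records_moved);
    std::printf("PLP-Partition (worst case): %llu records moved\n",
                (unsigned long long)part.records_moved);
}
```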
7.5.7 Cost of merging two partitions Another cost related to repartitioning is the cost of merging two partitions to one. For any PLP variation, the merge operation only requires index reorganization and no data movement, again in contrast with the physically-partitioned design. During the index reorganization in PLP, there are three cases to consider depending on the height of the two sub-trees to be merged, which we present in the next paragraphs. After the merging of the two sub-trees is completed, the partitioning table of the MRBTree is updated accordingly. One entry of the two that correspond to the two partitions in the partitioning map is removed while the other is updated with the new key range and possibly a new root page id. Merging two sub-trees that have the same height. When the two sub-trees to be merged have the same height, the entries of Th ’s root are appended at the end of the entries of Tl ’s root. Since the entries of the root page have information about the pointers to the internal nodes, copying the entries of the root page is sufficient for this merge operation. In this case the cost of the merge operation only depends on the number of entries in the root page of Th . If the number of entries destined to the new root exceeds the page capacity, a new root page is created the same way a page split happens after a record insert (through a structure modification operation – SMO). Merging a sub-tree with lower key values (Tl ) which is taller than the other subtree. When Tl is taller than Th , Tl is traversed down to one level higher than the height of Th . Then an entry is inserted at the right-most node of this level that points to Th and has the key value equal to the starting key of the key range of Th . Therefore, the cost of the merge operation is only a tree traversal, which depends on the height difference between the two trees and an insert operation. Merging a sub-tree with higher key values (Th ) which is taller than the other sub-tree. When Th is taller, the merge operation is very similar to the second case and the cost is the same. Th is traversed down to one level higher than the height of Tl and instead of the right-most node, the left-most node gets the entry that points to Tl and has the key value equal to the starting key of the key range of Tl . CHAPTER 7. PHYSIOLOGICAL PARTITIONING 162 Overall the cost of merging two partitions in PLP is quite low. On the other hand, a physically-partitioned system has to copy all the records from one partition to the other and insert the corresponding index entries at the resulting partition. Therefore, in a physicallypartitioned system the cost of the merge operation is proportional to the number of records in a partition and its way higher than the merge cost for any PLP variation. We conclude that, in contrast with physically-partitioned systems, the PLP-Regular and PLP-Leaf designs provide low repartitioning costs which allow frequent repartitioning attempts and facilitate the implementation of responsive and lightweight dynamic load balancing mechanisms. We present one such mechanism in the next section. 7.6 A dynamic load balancing mechanism for PLP At the high level, any dynamic load balancing mechanism performs the same functionality. During normal execution it has to observe the access patterns and detect any skew that causes load imbalance among the partitions. Once the mechanism detects the troublesome imbalance, it triggers a repartition procedure. 
It is very important for the detection mechanism to incur minimal overhead during normal operation and to not trigger repartitioning when it is not really needed. After the mechanism decides to proceed with a repartition, it needs to determine a new partitioning configuration, so that the load is again uniformly distributed. This decision depends on various parameters, such as the recent load of each partition and the available hardware parallelism. Finally, after the new configuration has been determined, the system has to perform the actual repartitioning. The repartitioning should be done in a way that minimizes the drop in performance and the duration of the process. Thus, any dynamic load balancing mechanism that we build on top of PLP (or any partitioning-based system in general) should:
• Perform lightweight monitoring.
• Make robust decisions on the new partition configuration.
• Repartition efficiently, when such a decision is made.
We have already shown in Section 7.5 that PLP provides the infrastructure for efficient repartitioning. In this section, we present techniques for lightweight monitoring and decision making. The overall mechanism is called DLB.

Figure 7.8: A two-level histogram for MRBTrees and the aging algorithm

7.6.1 Monitoring
DLB needs to monitor some indicators of the system behavior and, based on the collected information, decide (a) when to trigger a repartition operation and (b) what the new partitioning configuration should be. Candidate indicators are the overall throughput of the system, the frequency of accesses in each partition, and the amount of work each partition should do. DLB needs to continuously collect information on multiple indicators. For example, let's consider that DLB monitors only the overall throughput of the system and raises flags when changes in throughput are larger than a threshold value. If the initial partitioning configuration of the system was not optimal (for example, with load imbalance among partitions), then its throughput would be low but stable; there would be no fluctuation for a throughput-only monitor to catch, and the monitoring would fail. Or there could be uniform drops or increases in the incoming request traffic, which would trigger unnecessary repartitioning. Thus, DLB needs to maintain additional information about the load of each partition. In addition, the information about the throughput is not useful for the component that decides on the new configuration (presented in Section 7.6.2). Thus, DLB needs to collect and maintain information about the load not only across partitions, but also within each partition.
To that end, DLB uses the length of the request queue of each partition and a two-level histogram structure that employs aging. The histogram structure is depicted on the left side of Figure 7.8. To monitor the differences in the load across partitions, DLB monitors the number of requests waiting at each partition's request queue. To have accurate information about the load distribution within each partition, in addition to the one bucket it maintains for each partition (left side of the figure), the histogram has sub-buckets on ranges within each partition's key range (shown on the right side of the figure). The number of sub-buckets within each partition is tunable and determines the monitoring granularity. DLB frequently checks whether the partition loads are balanced or not.
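The following C++ sketch shows one possible shape for this monitoring state, assuming the two-level layout of Figure 7.8 (one bucket per partition, a tunable number of sub-buckets, and an array of age-buckets per sub-bucket). The type names, the number of age-buckets, and the exact trigger rule are illustrative assumptions rather than the prototype's implementation.

```cpp
// Sketch of DLB's monitoring state: queue lengths per partition plus a
// two-level aging histogram. Names and constants are illustrative.
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int kAgeBuckets = 8;  // assumed number of age-buckets per sub-bucket

struct SubBucket {
    uint64_t age[kAgeBuckets] = {0};  // access counts, one per age interval
};

struct PartitionStats {
    uint64_t queue_length = 0;   // pending requests in the partition's queue
    std::vector<SubBucket> sub;  // sub-buckets over the partition's key range
    explicit PartitionStats(int sub_buckets) : sub(sub_buckets) {}
};

struct DlbMonitor {
    std::vector<PartitionStats> parts;
    int active_age = 0;  // index of the currently active age-bucket

    // Called on every record access: bump the active age-bucket of the
    // sub-bucket that covers the accessed key.
    void record_access(int partition, int sub_bucket) {
        parts[partition].sub[sub_bucket].age[active_age]++;
    }

    // Called at regular intervals: advance the age and reset the new bucket.
    void advance_age() {
        active_age = (active_age + 1) % kAgeBuckets;
        for (auto& p : parts)
            for (auto& s : p.sub) s.age[active_age] = 0;
    }

    // Cheap balance check on queue lengths only; the histograms are analyzed
    // (Section 7.6.2) only when this reports an imbalance.
    bool queues_imbalanced(double tolerance) const {
        uint64_t total = 0, max_q = 0;
        for (const auto& p : parts) {
            total += p.queue_length;
            max_q = std::max(max_q, p.queue_length);
        }
        const double ideal = static_cast<double>(total) / parts.size();
        return max_q > (1.0 + tolerance) * ideal;
    }
};
```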
The load of each partition is calculated based on an aging algorithm. Each bucket in the histogram is implemented as an array of age-buckets, shown on the right side of Figure 7.8. At any point in time there is one active age-bucket. When a record is accessed, the active age-bucket of the sub-bucket whose range the record belongs to is incremented by one. At regular time intervals the age of the histogram increases. Whenever the age of the histogram increases, the next age-bucket is reset and starts to count the accesses. When calculating the load of a sub-bucket in the histogram, the recent age-buckets are given more weight than the older ones. More specifically, if a sub-bucket consists of A age-buckets, the load for the ith age-bucket is l_i, and the current age-bucket is the cth bucket, then we calculate the total load L for the sub-bucket as follows:

L = \sum_{i=c}^{A+c-1} \frac{100 \times l_{i \bmod A}}{i - c + 1}

Figure 7.8 (right) shows an example of the aging algorithm, when the load to a particular sub-bucket increases by 10 for five consecutive time intervals (T1 to T5). W is the weight of each age-bucket and L is the load value of this sub-bucket at each interval, calculated by the formula above.
Because they are both lightweight, DLB very frequently monitors the throughput and the length of the request queues. On the other hand, the histograms are analyzed only whenever an imbalance is observed. The overall monitoring mechanism does not incur much overhead and it also provides adequate information for DLB to decide on the new partitions.
7.6.2 Deciding new partitioning
The algorithm DLB employs for reconfiguring the partition key-ranges is highly dependent on the request queues and the two-level aging-histogram structure discussed previously. First we describe the algorithm that determines the partitioning configuration within a single table, and then we consider the case where we decide the partitioning across all tables.
Deciding the partitioning within a single table
To describe the algorithm, let N be the total number of partitions, and Q_i be the number of requests at the request queue of the ith partition. Then, the ideal number of requests for each partition's queue is:

Q_{ideal} = \frac{\sum_{i=1}^{N} Q_i}{N}

Knowing Q_ideal, we have to decide on the ideal data access load for each partition. Let L_i be the aging load of the ith partition, which can be calculated as the sum of the aging loads of its sub-buckets. We have to calculate the ideal data access load for partition i, L_i^I, based on the ideal request load and how much request load, Q_i, each L_i creates. Therefore, L_i^I is:

L_i^I = Q_{ideal} \times \frac{L_i}{Q_i}

Because the granularity of the load information is determined by the number of sub-buckets in the histogram, it is difficult for DLB to achieve the precise ideal loads. That is why DLB only tries to approximate the ideal value. Algorithm 3 sketches how the new key-ranges are assigned. It iterates over all partitions except the last one.

Algorithm 3 Calculating ideal loads.
1: for i = 1 → N − 1 do
2:   while L_i < L_i^I − t do
3:     Move leftmost sub-bucket range from partition i + 1 to partition i
4:     L_i ⇐ L_i + L_subbucket
5:     if L_i > L_i^I + t then
6:       Distribute sub-bucket range into µ sub-buckets
7:   while L_i > L_i^I + t do
8:     Move rightmost sub-bucket range from partition i to partition i + 1
9:     L_i ⇐ L_i − L_subbucket
10:     if L_i < L_i^I − t then
11:       Distribute sub-bucket range into µ sub-buckets
While the estimated load L_i at a partition is less than L_i^I − t for some t value, it moves the range of the leftmost sub-bucket from the (i+1)th partition to the ith. Similarly, while the load at a partition is larger than L_i^I + t, it moves the range of the rightmost sub-bucket from the ith partition to the (i+1)th. If the moved sub-bucket causes a significant change in the calculated load (more than 2 × t), then this sub-bucket is substituted by a larger number of sub-buckets to observe that range at a finer granularity.

Figure 7.9: Example of how DLB decides on the new partition ranges.

Figure 7.9 shows an example of how Algorithm 3 is applied. In the example, there are three partitions on a table and Figure 7.9 shows the two-level histogram for each partition. The first level of the histogram tracks the number of accesses to a partition's range, which is 40 units in this example. The second level of the histogram, the 4 sub-buckets, keeps the number of accesses to sub-ranges in a partition, which is 10 units in this example. A higher bar in a sub-bucket indicates that the sub-range that corresponds to that sub-bucket has more load. Initially each partition has equal key-ranges, shown in the left part of Figure 7.9. If we assume that each partition has to perform an equal amount of work per request, the loads in this configuration are not balanced among the partitions. In that case, the repartition manager triggers repartitioning. Based on Algorithm 3, the new partitions are decided by moving around the sub-buckets to create almost-equal loads among the partitions. The result is shown on the right part of Figure 7.9; the most loaded regions end up in partitions with smaller ranges, like the second partition in Figure 7.9, and the lightly loaded regions are merged together.
Deciding the number of partitions of each table
The algorithm presented previously is just for one table and assumes that the number of partitions before and after the repartitioning operation does not change. Determining how many partitions a table should have is another issue and requires knowledge of all the tables in the database. Next, we provide a formulation to determine the number of partitions for a table.
In our setting, the number of partitions for a table is initially determined automatically to be equal to the number of hardware contexts supported by the underlying machine. To find what the number of partitions for a table should be dynamically, based on the workload trends, let T be the number of tables, N_total be the upper limit on the total number of partitions for the whole database, QT_i be the total number of requests for table i, N_i be the number of partitions for table i, QT_avg be the average number of requests for all the tables, N_avg be the average number of partitions for a table, and #CTX be the total number of available hardware contexts supported by the machine that executes the transactions run on this database. Based on the initial total number of partitions, we define N_total as:

N_{total} = T \times \#CTX

As a result, N_avg will be:

N_{avg} = \frac{N_{total}}{T} = \#CTX

The QT_i values are known from the request queues and therefore QT_avg can be calculated as:

QT_{avg} = \frac{\sum_{i=1}^{T} QT_i}{T}

The goal is to find the N_i values, which can be derived from the following formula:

\frac{QT_{avg}}{N_{avg}} = \frac{QT_i}{N_i}

Using the formulas and algorithm presented above, DLB efficiently decides on the new partitioning configuration.
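As a compact illustration of these calculations, here is a hedged C++ sketch that computes the aging load of a sub-bucket, the ideal per-partition loads L_i^I, and the number of partitions per table. Function names, the number of age-buckets, and the rounding of N_i to whole partitions are illustrative assumptions, not the prototype's code.

```cpp
// Sketch of DLB's load calculations (Sections 7.6.1-7.6.2). Assumes per
// sub-bucket access counts l[0..A-1] with active age-bucket c, per-partition
// queue lengths Q, per-partition aging loads L, and per-table request counts QT.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kAgeBuckets = 8;  // assumed value of A

// Aging load of one sub-bucket: L = sum_{i=c}^{A+c-1} 100 * l_{i mod A} / (i - c + 1)
double aging_load(const uint64_t (&l)[kAgeBuckets], int c) {
    double L = 0.0;
    for (int i = c; i <= kAgeBuckets + c - 1; ++i)
        L += 100.0 * l[i % kAgeBuckets] / (i - c + 1);
    return L;
}

// Ideal data-access load per partition: L_i^I = Q_ideal * L_i / Q_i,
// where Q_ideal is the average request-queue length across partitions.
std::vector<double> ideal_loads(const std::vector<uint64_t>& Q,
                                const std::vector<double>& L) {
    double q_ideal = 0.0;
    for (uint64_t q : Q) q_ideal += q;
    q_ideal /= Q.size();
    std::vector<double> LI(Q.size());
    for (size_t i = 0; i < Q.size(); ++i)
        LI[i] = (Q[i] > 0) ? q_ideal * L[i] / Q[i] : L[i];  // guard empty queues
    return LI;
}

// Partitions per table: N_i = N_avg * QT_i / QT_avg, with N_avg = #CTX.
std::vector<int> partitions_per_table(const std::vector<uint64_t>& QT, int hw_contexts) {
    double qt_avg = 0.0;
    for (uint64_t q : QT) qt_avg += q;
    qt_avg /= QT.size();
    std::vector<int> N(QT.size(), hw_contexts);
    if (qt_avg == 0.0) return N;  // no requests observed: keep the default
    for (size_t i = 0; i < QT.size(); ++i)
        N[i] = std::max(1, (int)std::lround(hw_contexts * QT[i] / qt_avg));
    return N;
}
```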
7.6.3 Using control theory for load balancing In our prototype implementation, the system immediately tries to adjust to a new configuration, once a target load value is determined for each partition. Thus there is always the danger of over-fitting, especially for the workloads that observe access skew with frequently changing hot-spots. Since repartitioning is not expensive for PLP (except for PLP-Partition), it can repartition again very quickly to alter the bad effects of a previous bad partitioning choice. Rather than directly aiming to reach the target load, a more robust technique would be to employ control theory while converging to the target load [LSD+ 07]. Control theory can increase the robustness of our algorithm, prevent the system from repartitioning unnecessarily and/or resulting with wrong partitions, and reduce the downtime faced by PLP-Partition during repartitioning. Nevertheless, it is orthogonal with the remaining infrastructure, and it could be easily integrated in the current design. The prototype implementation does not employ control theory techniques. But the evaluation, presented next, shows that DLB allows PLP to balance the load effectively. 7.7 Evaluation The evaluation consists of four parts. 1. In the first part we measure how useful PLP can be. In particular. Section 7.7.2 quantifies how different designs impact page latching and critical section frequency; Section 7.7.3 examines how effectively PLP reduces latch contention on index and heap page latches; and Section 7.7.4 shows the performance impact of those changes. 2. In the second part we try to quantify any overheads related to PLP. To do that we measure PLP’s behavior in challenging workloads that seem to not fit well with physiological partitioning, such as transactions with joins (Section 7.7.6) and secondary index 168 CHAPTER 7. PHYSIOLOGICAL PARTITIONING accesses that can be aligned with the partitioning or not (Section 7.7.7). In addition, Section 7.7.8 inspects the fragmentation overhead of the three PLP variations. 3. In the third part (Section 7.7.5) we quantify how useful MRBTrees can be also for nonPLP systems, like in conventional or logically-partitioned systems. 4. In the last part of the evaluation, we measure the overhead and effectiveness of the dynamic load balancing mechanism of PLP (Sections Section 7.7.9–Section 7.7.10). Finally, In Section 7.7.11, we highlight the key conclusions of the whole evaluation. 7.7.1 Experimental setup To ensure reasonable comparisons, all the prototypes are built on top of the same version of the Shore-MT storage manager [JPH+ 09] (and Section 5.2), incorporate the logging optimizations of [JPS+ 10] (and Section 5.4), and share the same driver code. We consider five different designs: • A optimized version of a conventional, non-partitioned system, labeled as “Conventional”. This system employs speculative lock inheritance [JPA09] (and Section 5.3.1) to reduce the contention in the lock manager, and essentially corresponds to the second bar of Figure 7.1. • Logical-only or DORA is a data-oriented transaction processing prototype [PJHA10] (and Chapter 6) that applies logical-only partitioning. • PLP or PLP-Regular prototypes the basic PLP variation. This variation accesses the MRBTree index pages without latching. • PLP-Partition extends PLP-Regular, so that one logical partition “owns” each heap page, allowing latch-free both index and heap page accesses. 
• PLP-Leaf assigns heap pages to leaves of the primary MRBTree index, also allowing latch-free index and heap page accesses. In addition, we experiment with the PLP variations with the dynamic load balancing mechanism integrated. We label those systems with a “-DLB” suffix (PLP-Reg-DLB, PLPPart-DLB, and PLP-Leaf-DLB). All experiments were performed on two machines: an x64 box, with four sockets of quadcore AMD Opteron 8356 processors, clocked at 2.4GHz and running Red Hat Linux 5; and a Sun UltraSPARC T5220 server with a 64-core Sun Niagara II chip clocked at 1.4GHz and running Solaris 10. Due to unavailability of a suitably fast I/O sub-system, all the Thousands Page latches acquired 7.7. EVALUATION 169 800 700 600 500 INDEX 400 HEAP 300 CATALOG / SPACE 200 100 0 Conventional Logical PLP-Regular PLP-Leaf Figure 7.10: Average number of page latches acquired by the different systems when the run the TATP benchmark. The PLP variants by design eliminate the majority of page latching. experiments are with memory-resident databases. But the relative behavior of the systems will be similar with larger databases. 7.7.2 Page latches and critical sections First we measure how PLP reduces the number of page latch acquisitions in the system. Figure 7.10 shows the number and type of page latches acquired by the conventional, the logically-partitioned and two variations of the PLP design, PLP-Regular and PLP-Leaf. Each system executes the same number of transactions from the TATP benchmark. PLP-Regular reduces the amount of page latching per transaction by more than 80%; while PLP-Leaf reduces the total further to roughly 1% of the initial page latching. The remaining latches are associated with metadata and free space management. The two right bars of Figure 7.1 compare total critical section entries of PLP vs. the conventional and logically-partitioned systems. The two PLP variants eliminate the vast majority of lock- and latch-related critical sections, leaving only metadata and space management latching as a small fraction of the critical sections. Transaction management, which is the largest remaining component, mostly employs fixed-contention communication to serialize threads that attempt to modify the transaction object’s state. Similarly, the buffer pool-related critical sections are mostly due to the communication between cleaner threads, which again do not impact scalability. Overall, PLP-Leaf acquires 85% and 65% fewer contentious critical sections than the conventional and logically-partitioned systems respectively. CHAPTER 7. PHYSIOLOGICAL PARTITIONING 160 140 120 100 80 60 40 20 0 Other Heap Latch Cont. Idx Latch Cont. 16 32 48 PLP Logical Conv. PLP Logical Conv. PLP Logical Conv. PLP Logical Latching Conv. Time breakdown (per xct) 170 64 # HW Contexts Figure 7.11: Time breakdown per transaction in an insert/delete-heavy benchmark. 7.7.3 Reducing index and heap page latch contention Having established that PLP effectively reduces the number of page latch acquisitions and critical sections, we next measure what is the impact of that change in the time breakdown. Figure 7.11 shows the impact in the transaction execution time as PLP eliminates the contention on index page latches. The graph gives the time breakdown per transaction for the different designs as an increasing number of threads run an insert/delete-heavy workload on the TATP database. 
In this benchmark, each transaction makes an insertion or a deletion to the CallFwd table, causing page splits and contention for the index pages that lead to the records being inserted/deleted. As Figure 7.11 shows, the conventional and the logicallypartitioned systems experience contention on the index page latches. They both spend 15-20% of their time waiting, while PLP eliminates the contention. We expect PLP to achieve proportional performance improvements. Similarly, Figure 7.12 shows the time breakdown per transaction when 16 and 40 hardware contexts are utilized by the conventional, the logically-partitioned and PLP-Partition systems when they run a slightly modified version of the StockLevel transaction of the TPC-C benchmark. StockLevel contains a join, and in this version, 2000 tuples are joined. We see that the conventional system wastes 20-25% of its time in contention in the lock manager and for page latching. Interestingly enough, the logically-partitioned system eliminates the contention in the lock manager, but this elimination is not translated to performance improvements. Instead the contention is shifted and aggravated to the page latches. On the other hand, PLP eliminates the contention both in the lock manager and for page latches and achieves higher performance. Time breakdown (per xct) 7.7. EVALUATION 171 60 50 Other 40 Btree 30 BPool 20 Latching 10 Locking 0 16 40 Conventional 16 40 Logical 16 40 PLP-Partition #HW Contexts Figure 7.12: Time breakdown per TPC-C StockLevel transaction, when 2000 tuples joined. The PLP variation eliminates the contention related to both locking and latching. Figure 7.13 gives the time breakdown per transaction when we run the TPC-B benchmark [TPC94]. In this experiment we do not pad records to force them onto different pages. Transactions often wait for others because the record(s) they update happen to reside on latched heap pages. The conventional, logically-partitioned, and PLP-Regular all suffer from this false sharing of heap pages. At high utilization this contention wastes more than 50% of execution time. On the other hand, PLP-Leaf is immune, reducing response time by 13-60% and achieving proportional performance improvement. In a way, PLP-Leaf provides automatic and more robust padding for the workloads that require manual padding in the conventional system to reduce contention on the heap pages. 7.7.4 Impact on scalability and performance Since PLP effectively reduces the contention (and the time wasted) to acquire and release index and heap page latches, we next measure its impact on performance and overall system scalability. The four graphs of Figure 7.14 show the throughput of the optimized conventional system, as well as the DORA and PLP prototypes, as we increase hardware utilization of the two multicore machines. On the two left-most graphs the workload consists of clients that repeatedly submit the GetSubscriberData transaction of the TATP benchmark [NWMR09], while on the two right-most graphs the workload consists of clients that repeatedly submit the StockLevel transaction of the TPC-C benchmark [TPC07]. Both transactions are read-only and ideally should impose no contention whatsoever. Those two workloads corresponding to the time breakdowns presented in Figure 7.11 and Figure 7.13 respectively. CHAPTER 7. PHYSIOLOGICAL PARTITIONING 350 300 250 200 150 100 50 0 Useful Heap Latch Cont. Idx Latch Cont. 16 32 48 PLP-Leaf PLP-Reg DORA Conv. PLP-Leaf PLP-Reg DORA Conv. PLP-Leaf PLP-Reg DORA Conv. 
PLP-Leaf PLP-Reg DORA Latching Conv. Time breakdown (per xct) 172 64 # HW Contexts Figure 7.13: Time breakdown per transaction in TPC-B with false sharing on heap pages. As expected, PLP shows superior scalability, evidenced by the widening performance gap with the other two systems as utilization increases. For example, from the right-most graph we see that for StockLevel DORA delivers a 11% speedup over the baseline case in the 4-socket Quad x64 system. In its turn, PLP delivers an additional 26% over DORA, or nearly 50% over the conventional. The corresponding improvements in the Sun machine’s slower but more numerous cores are 13% and 34%. Note that eight cores of the x64 machine match the fully-loaded Sun machine, so the latter does not expose bottlenecks as strongly in spite of its higher parallelism. A significant fraction of the speedup actually comes from the MRBTree probes, which are effectively one level shallower, since threads bypass the “root” partition table node during normal operation. 7.7.5 MRBTrees in non-PLP systems The MRBTree can improve performance even in the case of conventional systems in three ways. First, since it effectively reduces the height of the index by one level, each index probe traverses one fewer node and hence it is faster. Second, any possible delay due to contention on the root index page is also reduced roughly proportionally with the number of sub-trees. We see the effect of those two in Figure 7.15, which highlights the difference in the peak performance of the conventional and the logically-partitioned system when they run with and without MRBTrees. The workload is the TATP benchmark. In both case the improvement in performance is in the order of 10%. Third, MRBTrees allow each sub-tree to have a structure modification operation (SMO) in flight at any time; in contrast with traditional B+Trees that can have only one SMO in 7.7. EVALUATION 173 TATP GetSubscriberData TPCC StockLevel 2.5 700 600 Thousands Throughput (Ktps/sec) 400 350 300 250 200 150 100 50 0 0 16 32 48 64 HW ctxs util. Sun Niagara II 2.0 500 400 1.5 300 1.0 200 100 0 0 4 8 12 16 HW ctxs util. Intel x64 6 5 4 3 2 0.5 1 0.0 0 0 16 32 48 64 HW ctxs utl. Sun Niagara II Opt. Shore-MT DORA PLP-Partition 0 4 8 12 16 HW ctxs util. Intel x64 Figure 7.14: Throughput when the systems run the GetSubscriberData transaction of the TATP benchmark, and the StockLevel transaction of the TPC-C benchmark in two multicore machines. PLP shows superior scalability, as evidenced by the widening performance gap with the other two systems as utilization increases. flight. Consequently, in workloads with high entry insertion (deletion) rates, the MRBTree improves performance by parallelizing the SMOs. Figure 7.16 shows the time breakdown of the conventional system with and without MRBTrees as we run a microbenchmark that consists of either a record probe or insert, and we increase the percentage of inserts. Without MRBTrees, the system spends an increasing amount of time blocked waiting for SMOs to complete as the insertion rate increases. When MRBTrees are used, there is no time wasted waiting for SMOs and performance improves by up to 25%. Overall, there are compelling reasons for systems other than PLP to adopt MRBTrees. 7.7.6 Transactions with joins in PLP Next we turn our attention to workloads that seem to not fit well with physiological partitioning. First, we inspect how PLP behaves on workloads with transactions with join operations. 
To evaluate the performance of PLP on transactions with joins, we slightly modified the StockLevel transaction from the TPC-C benchmark [TPC07] to determine the number of tuples joined. (This experiment is the same with the one presented in Section 6.4.6.) In its un-modified version, StockLevel joins 200 tuples between two tables. We created different versions of the transaction where 20, 200, 2000, 20000, and 200000 tuples are joined. For CHAPTER 7. PHYSIOLOGICAL PARTITIONING Throughput (Ktps/cpu) 174 70 60 50.0 50 55.8 65.3 59.4 40 30 20 10 0 Normal MRBT Normal MRBT Conv. Logical 75 Other Bpool 50 TxMgr Log 25 Locking Latch-smo 0% 20% 40% 60% Percentage of Inserts 80% MRBT Normal MRBT Normal MRBT Normal MRBT Normal MRBT Normal MRBT 0 Normal Time breakdown (per xct) Figure 7.15: Performance of the conventional and the logically-partitioned system in TATP. MRBTree is beneficial also for non-PLP systems. 100% Figure 7.16: Time-breakdown of conventional transactions when parallel SMOs are allowed with MRBTrees. each different number of tuples joined, Figure 7.17 plots the maximum throughput the conventional, the logically-partitioned and the PLP-Partition systems achieved, normalized to the maximum throughput of the conventional. The three systems achieved their maximum throughput when the 4-socket Quad x64 machine was 100% utilized, which means that there were no significant scalability bottlenecks. Figure 7.17 shows that the PLP variation achieves higher performance than the conventional system regardless of the number of tuples joined. When only 20 tuples are joined PLP achieves 2.1x higher performance than conventional, while when 200K tuples are joined PLP achieves 33% higher performance. PLP achieves higher performance because it eliminates the contention for page latches, as Figure 7.12 Normalized Throughput 7.7. EVALUATION 175 2.25 2.00 1.75 1.50 1.25 1.00 0.75 0.50 0.25 0.00 Conventional Logical PLP-Partition 20 200 2000 20000 200000 Tuples Joined Figure 7.17: Maximum throughput when running the TPC-C StockLevel transaction, normalized the throughput of Conventional. illustrates. That is in contrast with the logically-partitioned system (DORA), which for large number of tuples joined performs lower than conventional. 7.7.7 Secondary index accesses Non-clustered secondary indexes are pervasive in transaction processing, since they are the only means to speed up transactions that access records using non-primary key columns. Nevertheless, secondary index accesses pose several challenges to PLP, which we explore in Figure 7.18. We break the analysis of secondary index accesses to two cases: when the secondary index is aligned with the partitioning scheme and when it is not. We conduct an experiment where we modify TATP’s GetSubscriberData transaction to perform a range index scan on the secondary index with built on the names of the Subscribers and we control the number of matched records. In the original version of the transaction only one Subscriber is found. In the modified version, we probe for 10, 100, 1000, and 10000 Subscribers, even though index scans for thousands of records are not typical in high-throughput transactional workloads. This experiment is very similar to the one conducted in Section 6.4.5. If the secondary index columns are a subset of the routing columns, then the secondary index is aligned with the partitioning scheme. 
In that case, a secondary index scan may return a large number of matched RIDs (record ids of entries that match the selection criteria) from several partitions. All the executors need to send the probed data to a coordination point where an aggregation of the partial results takes place. As the range of the index scans become larger (or the selectivity drops), this causes a bottleneck due to excessive data transfers. When CHAPTER 7. PHYSIOLOGICAL PARTITIONING 176 300 50 Range = 10 Throughput (Ktps) 250 40 200 30 150 20 100 50 10 0 0 PLP-Aligned PLP-NonAligned Conventional 0 4 5 Throughput (Ktps) Range=100 8 12 #HW Contexts 0 16 4 0.50 Range = 1000 8 12 16 # HW Contexts Range=10000 4 0.40 3 0.30 2 0.20 PLP-Aligned 1 0.10 PLP-NonAligned 0 0.00 0 4 8 12 # HW Contexts 16 Conventional 0 4 8 12 16 # HW Contexts Figure 7.18: Performance on transactions with aligned and non-aligned secondary index scans. the secondary index is not aligned with the partitioning scheme, then on top of the above mentioned bottleneck there is also an important overhead. This overhead is because each record probe becomes a two step process, where the secondary index probe is done by one thread conventionally and then requests from the appropriate executor threads to retrieve the selected records. Figure 7.18 compares the performance of Conventional system with PLP-Part-Aligned, which performs partitioning aligned secondary index accesses, and PLP-Part-NonAligned, which performs non-partitioning aligned secondary index accesses, as more hardware contexts are utilized in the system. PLP-Part-Aligned improves performance over Conventional by 46%, 14%, 8%, and 1% respectively for ranges 10, 100, 1000, 10000. On the other hand, even though PLP-Part-NonAligned improves performance by 11% when 10 records are scanned, for larger ranges it hinders performance. PLP-Part-Aligned is 3%, 11%, and 38% slower than Conventional for ranges 100, 1000, and 10000, respectively. Normalized # of Heap Pages 7.7. EVALUATION 177 2.00 1.75 1.50 1.25 1.00 0.75 0.50 0.25 0.00 Conventional PLP-Regular PLP-Partition PLP-Leaf 1MB 10MB 100MB 1GB 10GB 1MB 10MB 100MB 1GB 100B 10GB 1000B Record and Database Size Figure 7.19: Space overhead of the PLP variations. As expected, the performance improvement for PLP-Part-Aligned gets smaller as the range of the index scan increases. However, as long as the index scans of partitioning-aligned secondary indexes are selective and touch a relatively small number of records, PLP provides decent performance improvement. For PLP-Part-NonAligned, however, such workloads are very unfriendly, though unless the scan range is over 1000 records it is not disastrous. 7.7.8 Fragmentation overhead PLP-Partition and PLP-Leaf, create some fragmentation on the heap file since they change the regular heap file structure (see Section 7.4.3). Given the increased number of data pages due to fragmentation, we expect the heap file scan times to increase proportionally. Figure 7.19 shows the ratio between the number of pages used in the three PLP variations and the conventional system as we increase the database size. The x-axis shows the total size of the database when each record is 100B (left side of the graph) and 1000B (right side of the graph). The y-axis is the ratio between the number of pages used in each design and the conventional system. The conventional system has one partition, where the PLP variations have 100 and 10 partitions for the cases where record size is 100B and 1000B, respectively. The heap page size is 8KB. 
As expected, PLP-Regular does not create any fragmentation, since it maintains the regular heap file format. For PLP-Partition, the amount of fragmentation becomes negligible as the database size increases for small records. However, PLP-Leaf uses up to 80% more heap pages than the conventional system for the same case, creating visible fragmentation on the heap file. On the other hand, as we increase the record size the fragmentation decreases, because each heap page keeps fewer records and thus the amount of empty space left on each heap page is smaller.

Overall, among the PLP variations, only PLP-Leaf may introduce significant fragmentation when a heap page can keep many database records. As the number of records a heap page can keep decreases, this cost becomes less significant. We also note that PLP is a design optimized for high-performing transactional applications, where entire heap file scans are rare.

7.7.9 Overhead and effectiveness of DLB

In this section we first quantify the overhead of the dynamic load balancing mechanism (DLB) under normal operation. Then we measure how quickly and effectively DLB reacts to skew and load imbalances. All the experiments use the GetSubscriberData transaction from the TATP benchmark.

Overhead in normal operation

Under normal operation, DLB should impose minimal overhead. DLB's monitoring component performs three operations: it maintains the histograms with access information, it continuously monitors the throughput, and it periodically analyzes the request queues of the worker threads for load imbalances. In an optimally configured system (where the load is precisely balanced across partitions), we measure performance as we increase the load (the number of concurrent clients that submit transactions).

Figure 7.20: Overhead of DLB under normal operation (throughput as the number of utilized CPUs increases, with no histogram and with 0, 2, 5, 10, and 20 sub-buckets).

Figure 7.20 shows the overhead caused by updating the aging histogram for each data access. As we utilize more CPUs, the number of threads that try to update the histogram increases, and so does the overhead of updating it. On the other hand, increasing the number of sub-buckets does not have much effect.

Figure 7.21: Example of dynamic load balancing in action. At time t=10, 50% of the requests are sent to 30% of the database.

Overall, we observe that the monitoring component of DLB is fairly lightweight. On average, histogram updates cause a 6% drop in throughput compared to the system running without a histogram, and the maximum drop is 7-8%. Since the transaction we execute is read-only, we actually evaluate the worst-case behavior here. For a transaction with updates, the number of transactions executed per second, and hence the number of data accesses, would be lower. Fewer data accesses would cause fewer histogram updates and therefore less overhead.

Reacting to load imbalances

In order to evaluate how effectively DLB handles load imbalances, we execute the same experiment as the one in Figure 7.5.
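Before walking through that experiment, a minimal sketch of the kind of bookkeeping DLB's monitor performs may be useful. The structure below keeps per-partition access counters split into sub-buckets, decays them periodically so that recent accesses weigh more (the aging step), and flags an imbalance when the hottest and coldest partition loads differ by more than a threshold t. The decay factor, counter layout, and detection rule are our simplifying assumptions; the actual DLB implementation may differ.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Minimal sketch of an aging access histogram for load-imbalance detection.
// Assumptions (ours): one bucket per partition, a fixed number of sub-buckets
// per bucket, exponential decay at every aging step, and an imbalance declared
// when (max - min) / max across partition loads exceeds the threshold t.
struct AgingHistogram {
    std::vector<std::vector<double>> counts;  // counts[partition][sub_bucket]
    double decay;                             // e.g. 0.5: old accesses lose half their weight per step

    AgingHistogram(size_t partitions, size_t sub_buckets, double decay_factor)
        : counts(partitions, std::vector<double>(sub_buckets, 0.0)), decay(decay_factor) {}

    // Called on every data access; in the real system this must be very cheap.
    void record_access(size_t partition, size_t sub_bucket) {
        counts[partition][sub_bucket] += 1.0;
    }

    // Called periodically (e.g. every second) so that recent accesses weigh more.
    void age() {
        for (auto& bucket : counts)
            for (auto& c : bucket) c *= decay;
    }

    double partition_load(size_t p) const {
        double sum = 0.0;
        for (double c : counts[p]) sum += c;
        return sum;
    }

    // True if the load gap between the hottest and coldest partition exceeds t.
    bool imbalanced(double t) const {
        double lo = partition_load(0), hi = lo;
        for (size_t p = 1; p < counts.size(); ++p) {
            double l = partition_load(p);
            lo = std::min(lo, l);
            hi = std::max(hi, l);
        }
        return hi > 0.0 && (hi - lo) / hi > t;
    }
};

int main() {
    AgingHistogram h(64, 5, 0.5);                               // 64 partitions, 5 sub-buckets each
    for (int i = 0; i < 1000; ++i) h.record_access(3, i % 5);   // partition 3 runs hot
    h.age();
    printf("imbalanced at t=10%%: %s\n", h.imbalanced(0.10) ? "yes" : "no");
}
```

The sub-bucket dimension corresponds roughly to the finer second level of the two-level histogram; varying its size is what Figure 7.20 explores.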
The PLP variations (PLP-Regular, PLP-Reg-DLB, PLP-Part-DLB, and PLP-Leaf-DLB) use 64 partitions, apply aging every 1 sec, and the load difference threshold value t is 10%. Initially the requests are distributed uniformly, and at time point 10 (sec) 30% of the database starts to receive 50% of the requests. As Figure 7.21 shows, the change in the access pattern causes a 30% drop in the throughput of PLP-Regular, making its performance worse than that of the non-partitioned Conventional system. On the other hand, the DLB-integrated PLP variations quickly detect the skew and bring the performance back to the pre-skew levels in less than 10 secs. In particular, 2 secs after the change in the access pattern DLB has already decided on the new partitioning configuration, and around 8 secs later it has performed 126 repartition operations (63 splits and 63 merges). The throughput has some spikes for a short time after repartitioning, but in the end settles down.

Figure 7.22: Partitions Before & After the repartitioning.

In PLP-Reg-DLB, very few index entries are updated, leading to a shorter dip in throughput during repartitioning. PLP-Leaf-DLB experiences an almost equally short dip. PLP-Part-DLB suffers a much longer dip. For statically partitioned PLP, Figure 7.21 shows only the results for PLP-Regular, since the drop in throughput is almost the same for the other two statically partitioned PLP variations (PLP-Partition and PLP-Leaf). DLB triggers a global repartitioning process which affects all the partitions in the system. PLP-Regular and PLP-Leaf can handle this process very well. However, such global repartitioning is not suitable for PLP-Partition. PLP-Partition is the closest to a physically-partitioned (shared-nothing) system in terms of repartitioning cost, since it reorganizes a large number of heap pages (see Section 7.5.2). Therefore, its non-optimal behavior with DLB is as expected.

Speeding up accesses to hot spots

When DLB is effective, the "hot" regions end up in narrow partitions. The indexes for these partitions are shallower and provide shorter access times for the "hot" records. In addition, "hot" records that previously could belong to the same partition, due to their key proximity, end up in different partitions. Figure 7.22 graphically illustrates the impact of DLB on the ranges of 10 partitions before and after a repartitioning. The area within the rectangular region highlights the "hot" range; it is 10% of the total area and receives 50% of the total load. Initially, labeled Before, the system has equal-length range partitions. After DLB kicks in and repartitioning completes, labeled After, the "hot" region has shorter-length range partitions while the not-so-loaded regions have larger-length partitions.

Table 7.3: Average index probe times (in microseconds) for a hot record, as skew increases.

Skewed region (%)    Before Skew    After Skew    After Repartitioning
50                   69             67            65
20                   67             66            63
10                   69             66            62
5                    68             64            61
2                    68             64            60

Table 7.4: Average record probes per second for a hot record, as skew increases.

Skewed region (%)    After Skew    After Repartitioning
50                   13            13
20                   7             29
10                   7             73
5                    32            108
2                    63            155

Table 7.3 shows the average index probe time (in microseconds) for a hot record as we increase the skew. For this experiment we use a single table with 640000 records, for a total size of around 1GB. There is an index on this table with 8KB pages, and the primary key is an integer (4B).
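To build intuition for the probe times in Table 7.3, the sketch below estimates the height of a partition's sub-tree from its record count. The per-node capacities are assumptions made purely for illustration (the actual MRBTree node layout and fanout are not restated here), as is the size of the narrowed hot partition.

```cpp
#include <cmath>
#include <cstdio>

// Rough, illustrative estimate of a range partition's B+Tree sub-tree height.
// The per-node capacities below are assumptions, not the MRBTree's parameters.
static int subtree_height(double records, double leaf_cap, double fanout) {
    double nodes = std::ceil(records / leaf_cap);  // number of leaf nodes
    int height = 1;                                // count the leaf level
    while (nodes > 1.0) {                          // add internal levels until a single root remains
        nodes = std::ceil(nodes / fanout);
        ++height;
    }
    return height;
}

int main() {
    const double leaf_cap = 100, fanout = 100;  // assumed entries per 8KB node
    // 640,000 records over 10 equal-range partitions: 64,000 records each.
    printf("equal-range partition: height %d\n", subtree_height(64000, leaf_cap, fanout));
    // A narrowed "hot" partition holding, say, a few thousand records (hypothetical).
    printf("narrow hot partition:  height %d\n", subtree_height(4000, leaf_cap, fanout));
}
```

With these assumed capacities an equal-range partition needs three levels while a partition narrowed to a few thousand records needs only two; one fewer level to traverse per probe is the qualitative effect behind the shorter probe times in Table 7.3, even though the real node capacities differ.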
When there are 10 equal-range partitions, the height of each partition's sub-tree is 3. Each row in the table shows the average access time of a randomly picked record from a "hot" region which gets 50% of all the requests, as the range of the "hot" region decreases (and the skew increases). The first column ("Before Skew") shows the average access time when the requests are uniformly distributed. The second column ("After Skew") shows the average access time when DLB is disabled and the request distribution is skewed. The third column shows the average access time after DLB has kicked in and completed a repartitioning.

As Table 7.3 shows, the access times for the randomly picked record are lower after we introduce the skew. This is probably due to a caching effect, since the record is accessed more frequently when there is skew in the data accesses. However, the access time after repartitioning is the shortest, since the height of the sub-tree in the new "hot" partition is 2 whereas in the old partition it was 3 (the height of the sub-trees for the other partitions remains 3).

Table 7.4 shows the number of finished requests for the "hot" record after the skew and after DLB's repartitioning. Before repartitioning, fewer requests are satisfied for the picked record because its partition is highly loaded with requests for other records in the same "hot" partition range. DLB distributes the "hot" range between multiple shorter-range partitions. Therefore, a single partition can serve more requests for the "hot" record. This results in the small throughput increase observed after repartitioning in Figure 7.21.

Figure 7.23: Overhead of updating secondary indexes during repartitioning, for PLP-Leaf (left) and PLP-Partition (right) with zero to four secondary indexes. At time t=5, 50% of the requests are sent to only 10% of the database, which triggers repartitioning.

7.7.10 Overhead of updating secondary indexes for DLB

In PLP-Leaf and PLP-Partition, whenever a record moves, every non-clustered index of the table needs to be updated with the record's new RID (see Section 7.4.3). In this section, we measure the overhead of updating the secondary indexes during repartitioning. Figure 7.23 shows the effect of repartitioning on throughput as we increase the number of secondary indexes of a table, for PLP-Leaf (left) and PLP-Partition (right). For this experiment we use the Subscribers table of the TATP database. Initially, there are 2 partitions of 320000 records each that receive uniform requests. After 5 seconds, 50% of the requests are sent to only 10% of the table and DLB triggers a repartitioning. We measure the throughput of the system as we increase the number of secondary indexes on the table, from none up to 4.

Figure 7.23 (left) shows that the overhead for PLP-Leaf to update the secondary indexes is relatively low, because few or none of the records need to be moved. On the other hand, the overhead for PLP-Partition is much higher. PLP-Partition has to move more records and update more entries in the secondary indexes. Therefore, repartitioning in PLP-Partition takes longer as we increase the number of secondary indexes of a table.

7.7.11 Summary

As the experimental results show, PLP successfully eliminates two major sources of unscalable critical sections in conventional shared-everything systems: locking and latching.
In addition, it provides a good infrastructure for easy repartitioning and dynamic load balancing. It is important to note that each PLP variation has its drawbacks. For example, PLP-Leaf comes with some fragmentation (Section 7.7.8) and PLP-Partition cannot repartition efficiently (Section 7.7.9 and Section 7.7.10). Considering the long-lasting throughput drops during repartitioning for PLP-Partition, we favor PLP-Leaf for workloads that need dynamic load balancing. If the workload does not heavily suffer from heap page latching, but only from index page latching, then PLP-Regular is a great design choice, because it neither causes fragmentation nor faces long and sharp drops in throughput during repartitioning.

7.8 Related work

The related work for this chapter can be categorized in three areas: analyzing and reducing the critical sections in DBMSs, partitioned B+Trees and concurrency control mechanisms, and dynamic load balancing and repartitioning.

7.8.1 Critical Sections

The complexity and overheads of database management systems are well known. For example, [HAMS08] shows that, even in a single-threaded OLTP system, logging, locking, latching, and buffer pool accesses contribute roughly equal overheads and together account for the majority of machine instructions executed during a transaction. The previous chapters and related work show that these overheads become scalability burdens on multicore hardware [JPH+ 09, PJHA10]. PLP eliminates in its entirety one category of serializations, page latching, along with the corresponding bottlenecks.

In the shared-everything arena, the two techniques presented in the previous chapter, speculative lock inheritance [JPA09] and data-oriented transaction execution [PJHA10], minimize the need for interaction with a centralized lock manager. Whereas speculative lock inheritance allows the system to spread lock operations across multiple transactions to reduce contention, data-oriented systems replace the central lock manager with thread-local lock management. Reducing lock contention with data-oriented execution has also been studied for data-stream operators [DAAEA09].

Other proposals tackle the weakness posed by the centralized log manager, with [JPS+ 10] (and Section 5.4) presenting a scalable log buffer and [Che09] exploiting flash technology to reduce logging latencies. These proposals show that even seemingly pervasive forms of communication can be reduced or sidestepped to great effect. However, none of them addresses physical data accesses involving page latching and the buffer pool, the other two major overheads in the system, which PLP eliminates.

Oracle RAC, with Cache-Fusion [LSC+ 01], allows database instances in a shared-disk cluster to share their buffer pools and avoid accesses to the shared disk. It can also partition the data to reduce both logical and physical contention on a particular portion of the data. However, it does not enforce that each partition is accessed only by a single thread. Therefore, it does not eliminate physical latch contention while accessing pages from the shared cache as much as PLP does.

As discussed previously, shared-nothing systems [Sto86, DGS+ 90, SMA+ 07] have an appealing design that eliminates critical sections altogether. However, they struggle both proactively, to reduce the need to execute distributed transactions through efficient partitioning [CJZM10, PJZ11], and reactively, to reduce overheads when distributed transactions cannot be avoided [JAM10].
On the other hand, PLP, in addition to eliminating a large portion of the unscalable critical sections, offers a less costly way of load balancing and communication for distributed transactions, since partitions share the same memory space.

7.8.2 B+Trees and alternative concurrency control protocols

Alternatives to the traditional B+Tree concurrency control protocol have been studied to allow multiple concurrent SMOs [ML92, JSSS06]. The MRBTree index structure provides an alternative to these techniques, allowing concurrent SMOs with less code complexity. However, these techniques could be implemented alongside MRBTrees to achieve concurrency within a partition, should that be desirable for a conventional system. In addition to what these techniques offer, MRBTrees also allow multiple root split operations in parallel. Several earlier works propose B+Trees with multiple roots to reduce contention due to locking [MOPW00, Gra03]. However, again, none of these proposals targets physical latch contention in the system.

In addition, there are latch-free B+Tree implementations that use alternative synchronization methods. The CO B-Tree [BFGK05] uses load-linked/store-conditional (LL/SC) instead of latching to synchronize operations on a B+Tree. However, it does not eliminate contention on the B+Tree. PALM [SCK+ 11] eliminates both page latching and contention on the B+Tree by using the Bulk Synchronous Parallel model. However, it has to perform B+Tree operations in batches in order to exploit this technique, which might not always be desirable and is harder to integrate within a database management system.

Finally, optimistic and multiversioning concurrency control schemes [KR81, BG83, LBD+ 12] may improve concurrency by resolving conflicts lazily at commit time instead of eagerly blocking at the moment of a potential conflict. When conflicts are rare, this allows the system to avoid the overhead of enforcing database locks. On the other hand, if conflicts occur frequently, the performance of the system drops rapidly, since the transaction abort rate is high. Moreover, there is work that compares the concurrency control schemes in database systems. Notable is the work by Agrawal et al. [ACL87], while the book of Bernstein et al. [BHG87] and Thomasian's survey [Tho98] are good starting points for the interested reader. The focus of PLP, however, is on the contention for latches rather than on the concurrency control scheme used.

We also note that there is a large body of work on cache-conscious index implementations (e.g. [RR99, RR00, CGMV02]). Such indexes are typically not used in transaction processing systems. Instead, they target business intelligence workloads, which lack updates and therefore do not need complicated concurrency control mechanisms. PLP eliminates the need for latching and concurrency control at the index level. Therefore, we expect a significant performance boost if we substitute the index implementation with a cache-friendlier B+Tree alternative, since the B+Tree probes are the most expensive remaining component of PLP.

7.8.3 Load balancing

There is a large body of related work, but most of it focuses on clustered (shared-nothing) environments. For example, [AON96] analyzes and compares different approaches for index reorganization during repartitioning in shared-nothing deployments. Lee et al. [LKO+ 00] propose an index structure similar to the MRBTree, which eases the index reorganization during repartitioning in a shared-nothing system, and Mondal et al.
[MKOT01] extend this design by keeping statistics for each branch pointed to by the root node of a partition's sub-tree. While the structure of [MKOT01] enables the observation of access patterns at a fine granularity, all the accesses have the same weight, no matter how recent or old they are. Our two-level aging-based histogram assigns higher weight to recent accesses. This allows us to have a more accurate view of skewed access patterns and to detect load imbalances quickly.

Shinobi [WM11] uses a cost model to decide whether the benefits of a new partitioning configuration are worth the cost of repartitioning. Shinobi focuses on insert-heavy workloads where data is rarely queried and, when queried, the queries focus on a small region of the most recently inserted records. Its benefits primarily come from not indexing the large, infrequently accessed parts of the database. We consider mainstream transactional workloads where the entire database is accessed and no indexes can be dropped.

The histogram-based technique we use is influenced by previous work on maintaining dynamic histograms of data distributions for accurately estimating the selectivity of query predicates [GMP02, DIR00]. In DLB's case, we are interested in the frequency of accesses to a particular region and in the access pattern, rather than in the data distribution.

Finally, our work is orthogonal to techniques that decide the initial partitioning configuration. For example, Schism [CJZM10] creates partitions that minimize the number of distributed transactions by representing the workload as a graph and using a graph partitioning algorithm. Houdini [PJZ11] uses a Markov model in order to decide the partitioning, while in [RZML02] the query optimizer is used to get suggestions for the initial partitions. These tools only create the initial configuration; if the workload characteristics change over time, however, the initial configuration becomes useless and the system has to re-calculate the partitioning configuration and perform the repartitioning.

7.8.4 PLP and future hardware

As multicore hardware trends evolve, PLP becomes increasingly attractive for several reasons. Conventional OLTP is ill-suited to modern and upcoming hardware for at least three reasons:

• The code of an OLTP system is full of unscalable critical sections [JPH+ 09].
• The access patterns are so unpredictable [SWH+ 04] that even the most advanced prefetchers fail to detect them [SWAF09].
• The majority of the accesses are shared read-write, and hence they under-perform on caches with non-uniform access latency [BW04, HFFA09].

As we have seen, PLP, combined with previous advances in logging, eliminates all three problems. The majority of unscalable critical sections are completely eliminated, access patterns are regularized by the thread assignments, and threads no longer share data to communicate, eliminating the shared read-write problem. This regularity will become increasingly important as hardware continues to make more and more demands of the software. For example, it is almost inevitable that processor cache access latencies will be non-uniform [BW04, HPJ+ 07, HFFA09]. Unfortunately, OLTP will only be able to utilize these new architectures effectively if it can eliminate the majority of accesses that are shared among multiple processors.

Another important trend in hardware design is toward non-coherent many-core processors that are based on message passing, e.g. [V+ 07, H+ 10].
In the area of operating systems, this trend has already been recognized and message-passing system designs, such as Barrelfish [BBD+ 09], have been proposed. PLP by design requires only a small amount of communication between threads. There is no fundamental difficulty in extending its design to a pure message-passing shared-everything transaction processing system in order to fit naturally on such hardware. In short, by eliminating a large class of non-crucial communication, PLP leaves OLTP engines much better poised to take advantage of upcoming hardware, whatever form it may take.

7.9 Conclusions

Unlike conventional systems, which embrace either a fully shared-everything or a shared-nothing philosophy, physiological partitioning takes the best features of both to produce a hybrid system that operates nearly latch- and lock-free, while still retaining the convenience of a common underlying storage pool and log. We achieve this result with a new multi-rooted B+Tree structure and careful assignment of threads to data, adopting the thread-to-data transaction execution principle. This design allows easy repartitioning and enables a lightweight, robust, and efficient dynamic load balancing mechanism.

Chapter 8

Future Direction and Concluding Remarks

8.1 Hardware/data-oriented software co-design

We already argued in Section 7.8.4 that systems built around data-oriented execution are very well-suited for emerging hardware. But we can go beyond that with a hardware/data-oriented software co-design.

8.1.1 Hardware enhancements

The hardware enhancements to data-oriented software can range from something as simple as hardware mechanisms to efficiently pass messages from one thread to another (e.g. [RSV87]) to implementations of entire sub-components. In addition, data-oriented software designs can benefit from various hardware optimizations which do not seem to be very beneficial to mainstream software, and thus have not gained popularity until now. For example, since the memory accesses of the threads in a data-oriented system can be clearly separated, optimistic hardware technologies could potentially work very well with it. Two of them are hardware transactional memory [Her91] and speculative lock elision [RG01]. Both technologies rely on the assumption that only few conflicts actually happen between concurrent threads, in a way similar to optimistic concurrency control in transaction processing [KR81]. Unfortunately, the harsh realization was that commercial software, such as database workloads, did not exhibit such behavior. As a result, those technologies never gained popularity and, for example, in 2009 Sun canceled its $1B Rock processor project, which featured hardware transactional memory [CCE+ 09].1

1 http://bits.blogs.nytimes.com/2009/06/15/sun-is-said-to-cancel-big-chip-project/

8.1.2 Co-design for energy-efficiency

Let us consider the problem of energy-efficient database processing, since one of the biggest challenges for the years to come will be the implementation of energy-efficient and energy-proportional systems [BH07, Ham08]. A recent study showed that if we can modify only the database system configuration, then the most energy-efficient configuration is nearly always the one that has the highest performance [THS10].
On the other hand, recent work in the computer architecture community showed that major improvements in energy-efficiency are achieved with custom hardware designs and appropriate modifications of the application [HQW+ 10]. This particular work implements a 720p HD H.264 encoder which is orders of magnitude more energy-efficient. Achieving something similar for an application as complex as a transaction processing system will be a far more difficult and challenging task, but the dividends will also be much greater. A hardware/data-oriented transaction processing co-design is very appealing. For example, data-oriented software gives the opportunity to drastically reduce the complexity of the transaction processing codepaths (see Section 7.4.5), making large parts of them implementable in hardware.

8.2 Summary and conclusion

The overall goal of this dissertation was to improve the scalability of transaction processing. First, we provided evidence that conventional transaction processing designs will inevitably face significant scalability problems, due to their complexity and the unpredictability of their access patterns, a result of the way they assign work to the concurrent worker threads. Then, we showed that not all points of serialization (also known as critical sections) are threats to the scalability of software systems, even though their sheer number imposes significant overhead on single-thread performance. Based on this categorization, we attacked the biggest lurking scalability problems in a conventional design, providing solutions based on caching data across transactions and downgrading specific critical sections. But no matter how much we optimized, the codepath of the conventional design was still full of critical sections.

To alleviate the problems of conventional execution, we then made the case for data-oriented transaction execution. Data-oriented execution is based on a thread-to-data work assignment policy that results in coordinated accesses. This coordination of accesses allows all sorts of optimizations, breaking the inherent limitations of conventional transaction processing. To prove that, we presented two designs, each of them removing a significant source of unscalable critical sections: those inside the centralized lock manager, and page latching.

Finally, we showed how difficult it is to scale the performance of transaction processing on non-uniform hardware, such as multisocket multicores. Only software systems that distribute the accesses, such as data-oriented systems, can fully exploit such non-uniform or heterogeneous hardware; certainly conventional transaction processing cannot. We project that as hardware parallelism and heterogeneity continue to increase, the gap between conventional and data-oriented transaction execution will only continue to widen.

Bibliography

[A+ 85] Anon. et al. A measure of transaction processing power. Datamation, 31(7), 1985. 2.4.1, 6.4.1 [ACL87] Rakesh Agrawal, Michael J. Carey, and Miron Livny. Concurrency control performance modeling: alternatives and implications. ACM TODS, 12, 1987. 3.2, 6.6, 7.8.2 [ADH01] Anastasia Ailamaki, David J. DeWitt, and Mark D. Hill. Walking four machines by the shore. In CAECW, 2001. 5.2 [ADHW99] Anastasia Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood. DBMSs on a modern processor: Where does time go? In VLDB, 1999. 1.2 [Adl05] Stephen Adler.
The Slashdot effect: An analysis off three internet publications, 2005. Available at: http://hup.hu/old/stuff/slashdotted/SlashDotEffect.html. 7.5 [AFR09] Mohammad Alomari, Alan Fekete, and Uwe Röhm. A robust technique to ensure serializable executions with snapshot isolation dbms. In ICDE, 2009. 2.2.6, 5.5.1 [Amd67] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS, 1967. 1.2, 3.3.3 [And90] Thomas E. Anderson. The performance of spin lock alternatives for sharedmemory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1(1), 1990. 4.3.1 [AON96] Kiran J. Achyutuni, Edward Omiecinski, and Shamkant B. Navathe. Two techniques for on-line index modification in shared nothing parallel databases. In SIGMOD, 1996. 7.8.3 [AVDBF+ 92] Peter Apers, Care Van Den Berg, Jan Flokstra, Paul Grefen, Martin Kersten, and Annita Wilschut. PRISMA/DB: A parallel main memory relational DBMS. IEEE TKDE, 4, 1992. 3.1 [BAC+ 90] Haran Boral, William Alexander, Larry Clay, George P. Copeland, Scott Danforth, Michael J. Franklin, Brian E. Hart, Marc G. Smith, and Patrick Val- 194 BIBLIOGRAPHY duriez. Prototyping Bubba, a highly parallel database system. IEEE Transactions on Knowledge and Data Engineering, 2, 1990. 3.1 [BBD+ 09] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP, 2009. 7.8.4 [BDGR97] Edouard Bugnion, Scott Devine, Kinshuk Govil, and Mendel Rosenblum. DISCO: running commodity operating systems on scalable multiprocessors. ACM TOCS, 15(4), 1997. 6.6 [Bea11] Peter Beaumont. The truth about Twitter, Facebook and the uprisings in the Arab world. The Guardian, 2011. Available at http://www.guardian.co.uk/world/2011/feb/25/twitter-facebook-uprisingsarab-libya. 1.1 [BFGK05] Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Bradley C. Kuszmaul. Concurrent cache-oblivious B-trees. In SPAA, 2005. 7.8.2 [BG83] Philip A. Bernstein and Nathan Goodman. Multiversion concurrency control—theory and algorithms. ACM TODS, 8(4), 1983. 2.2.6, 6.6, 7.8.2 [BGB98] Luiz André Barroso, Kourosh Gharachorloo, and Edouard Bugnion. Memory system characterization of commercial workloads. In ISCA, 1998. 1.2 [BGM+ 00] Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: a scalable architecture based on single-chip multiprocessing. In ISCA, 2000. 1.2 [BGMP79] Mike Blasgen, Jim Gray, Mike Mitoma, and Tom Price. The convoy phenomenon. SIGOPS Oper. Syst. Rev., 13(2), 1979. 4.3.1, 6.4.7 [BH07] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40, 2007. 8.1.2 [BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency control and recovery in database systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1987. 6.6, 7.8.2 [BJB09] Luc Bouganim, Bjön Jónsson, and Philippe Bonnet. uFLIP: Understanding flash IO patterns. In CIDR, 2009. 2.3.1 BIBLIOGRAPHY 195 [BJK+ 97] William Bridge, Ashok Joshi, M. Keihl, Tirthankar Lahiri, Juan Loaiza, and N. MacNaughton. The Oracle universal server buffer. In VLDB, 1997. 2.2.6, 5.5.1 [BM70] Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indices. In SIGFIDET, 1970. 1.7, 2.2.3, 7.4.1 [Bre00] Eric A. Brewer. 
Towards robust distributed systems (abstract). In PODC, 2000. 7.3 [BW04] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In IEEE MICRO, 2004. 6.3.3, 7.8.4 [BWCM+ 10] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability to many cores. In OSDI, 2010. 1.2 [BZN05] Peter Boncz, Marcin Zukowski, and Niels Nes. Monetdb/X100: Hyperpipelining query execution. In VLDB, 2005. 1.4, 6.5 [CAA+ 10] Shimin Chen, Anastasia Ailamaki, Manos Athanassoulis, Phillip B. Gibbons, Ryan Johnson, Ippokratis Pandis, and Radu Stoica. TPC-E vs. TPC-C: Characterizing the new TPC-E benchmark via an I/O comparison study. SIGMOD Record, 39, 2010. 9, 2.4.4 [CASM05] Christopher B. Colohan, Anastasia Ailamaki, J. Gregory Steffan, and Todd C. Mowry. Optimistic intra-transaction parallelism on chip multiprocessors. In VLDB, 2005. 6.6 [CCE+ 09] Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Hakan Zeffer, and Marc Tremblay. Rock: A highperformance sparc cmt processor. IEEE Micro, 29(2), 2009. 8.1.1 [CDF+ 94] Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffrey F. Naughton, Daniel T. Schuh, Marvin H. Solomon, C. K. Tan, Odysseas G. Tsatalos, Seth J. White, and Michael J. Zwilling. Shoring up persistent applications. In SIGMOD, 1994. 3.1, 3.3.1, 4.4.3, 1, 5.2, 6.3.4, 6.6 [CDG+ 06] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, BIBLIOGRAPHY 196 2006. 1.4, 2.4 [CGMV02] Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, and Gary Valentin. Fractal prefetching B+-Trees: optimizing both cache and disk performance. In SIGMOD, 2002. 7.4.5, 7.8.2 [Che09] Shimin Chen. FlashLogging: exploiting flash devices for synchronous logging performance. In SIGMOD, 2009. 2.3.1, 5.4, 7.5.5, 7.8.1 [CJZM10] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3, 2010. 1.7, 6.5, 7.1, 7.3, 7.5.1, 7.8.1, 7.8.3 [Cra93] Travis S. Craig. Building FIFO and priority-queueing spin locks from atomic swap. Technical Report TR 93-02-02, University of Washington, Department of Computer Science, 1993. 4.3.1 [DAAEA09] Sudipto Das, Shyam Antony, Divyakant Agrawal, and Amr El Abbadi. Thread cooperation in multicore architectures for frequency counting over multiple data streams. PVLDB, 2, 2009. 6.6, 7.8.1 [DG92] David J. DeWitt and Jim Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35, 1992. 3.1 [DGS+ 90] David J. Dewitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-i Hsiao, and Rick Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering - TKDE, 2 (1):44–62, 1990. 3.1, 6.1, 7.1, 7.3, 7.5.1, 7.8.1 [DHJ+ 07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6), 2007. 1.4, 2.4, 6.1, 6.6 [DIR00] Donko Donjerkovic, Yannis E. Ioannidis, and Raghu Ramakrishnan. Dynamic histograms: Capturing evolving data sets. In ICDE, page 86, 2000. 7.8.3 [DKO+ 84] David J. DeWitt, Randy H. 
Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. Implementation techniques for main memory database systems. In SIGMOD, 1984. 2.3.1, 5.5.2 [DLMN09] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a commercial hardware transactional memory implementation. In ASP- BIBLIOGRAPHY 197 LOS, 2009. 2 [DLO05] John D. Davis, James Laudon, and Kunle Olukotun. Maximizing CMP throughput with mediocre cores. In PACT, 2005. 1.2, 3.3.1 [FNPS79] Ronald Fagin, Jurg Nievergelt, Nicholas Pippenger, and H. Raymond Strong. Extendible hashing–a fast access method for dynamic files. ACM TODS, 4, 1979. 2.2.3 [FR04] Mikhail Fomitchev and Eric Ruppert. Lock-free linked lists and skip lists. In PODC, 2004. 4.3.2 [GHOS96] Jim Gray, Pat Helland, Patrick O’Neil, and Dennis Shasha. The dangers of replication and a solution. In SIGMOD, 1996. 5.5.2 [GL92] Vibby Gottemukkala and Tobin J. Lehman. Locking and latching in a memoryresident database system. In VLDB, 1992. 3.2 [GMP02] Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. ACM TODS, 27, 2002. 7.8.3 [GMS87] Hector Garcia-Molina and Kenneth Salem. Sagas. SIGMOD Rec., 16(3), 1987. 6.6 [GR92] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992. 1.1, 1.4, 2.1, 1, 5.2, 5.3, 6.1.2, 6.3.1, 7.3 [Gra90] Goetz Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, 1990. 4.2.2 [Gra03] Goetz Graefe. Sorting and indexing with partitioned B-trees. In CIDR, 2003. 7.8.2 [Gra07a] Goetz Graefe. Hierarchical locking in B-tree indexes. In BTW, 2007. 6.3.1, 6.6 [Gra07b] Jim Gray. Tape is dead, disk is tape, flash is disk, RAM locality is king. In CIDR, 2007. 3.1 [GSHS09] Colleen Graham, Bhavish Sood, Hideaki Horiuchi, and Dan Sommer. Market share: Database management system software, worldwide, 2009. See http://www.gartner.com/DisplayDocument?id=1044912. 1.1 [H+ 10] Jason Howard et al. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS. In IEEE ISSCC, 2010. 7.8.4 198 BIBLIOGRAPHY [Ham08] James R. Hamilton. Where does the power go and what to do about it? In HotPower, 2008. 8.1.2 [HAMS08] Stavros Harizopoulos, Daniel J. Abadi, Sam Madden, and Michael Stonebraker. OLTP through the looking glass, and what we found there. In SIGMOD, 2008. 1.3, 5.3, 5.3.2, 5.5.2, 6.6, 7.1, 7.8.1 [Hel07] Pat Helland. Life beyond distributed transactions: an apostate’s opinion. In CIDR, 2007. 1.7, 5.3, 6.6, 7.1, 7.3 [Her91] Maurice Herlihy. Wait-free synchronization. ACM Trans. Program. Lang. Syst., 13(1), 1991. 4.3.2, 8.1.1 [HFFA09] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: near-optimal block placement and replication in distributed caches. In ISCA, 2009. 6.3.3, 7.8.4 [HM93] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. SIGARCH Comput. Archit. News, 21 (2), 1993. 3.2, 4.3.2 [HM08] Mark D. Hill and Michael R. Marty. Amdahl’s law in the multicore era. Computer, 41, 2008. 4.2 [HP02] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. 1.2 [HPJ+ 07] Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastasia Ailamaki, and Babak Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In CIDR, 2007. 
3.1, 7.8.4 [HQW+ 10] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ISCA, 2010. 8.1.2 [HSA05] Stavros Harizopoulos, Vladislav Shkapenyuk, and Anastasia Ailamaki. QPipe: a simultaneously pipelined relational query engine. In SIGMOD, 2005. 6.6 [HSH07] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. Architecture of a database system. Foundations and Trends (R) in Databases, 1(2), 2007. 1 BIBLIOGRAPHY 199 [HSIS05] Bijun He, William N. Scherer III, and Michael L. Scott. Preemption adaptivity in time-published queue-based spin locks. In HiPC, 2005. 4.3.1, 6.2 [HSL+ 89] Pat Helland, Harald Sammer, Jim Lyon, Richard Carr, Phil Garrett, and Andreas Reuter. Group commit timers and high volume transaction systems. In HPTS, 1989. 2.3.1 [HSY04] Danny Hendler, Nir Shavit, and Lena Yerushalmi. A scalable lock-free stack algorithm. In SPAA, 2004. 5.4.1 [IBM11] IBM. IBM DB2 9.5 information center Linux, UNIX, and Windows, 2011. Available http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp. 5.5.1 for at [Int12] Intel. Intel solid-state drive 520 series: Product specification, 2012. Available at http://www.intel.com/content/www/us/en/solid-statedrives/ssd-520-specification.html. 2.3.4, 7 [JAM10] Evan Jones, Daniel J. Abadi, and Samuel Madden. Low overhead concurrency control for partitioned main memory databases. In SIGMOD, 2010. 1.7, 6.6, 7.4.1, 7.8.1 [JASA09] Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. A new look at the roles of spinning and blocking. In DaMoN, 2009. 6.2 [JFRS07] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. Automating the detection of snapshot isolation anomalies. In VLDB, 2007. 2.2.6, 5.3, 5.5.1 [JN07] Tim Johnson and Umesh Nawathe. An 8-core, 64-thread, 64-bit power efficient SPARC SoC (Niagara2). In ISPD, 2007. 1.3, 5.3.2, 6.1.1, 6.4.1 [Jos91] Ashok M. Joshi. Adaptive locking strategies in a multi-node data sharing environment. In VLDB, 1991. 5.5.1, 6.6 [JPA08] Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. Critical sections: Re-emerging scalability concerns for database storage engines. In DaMoN, 2008. 1.3, 3.2, 1 [JPA09] Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. Improving OLTP scalability using speculative lock inheritance. PVLDB, 2(1), 2009. 1, 6.6, 7.2, 7.7.1, 7.8.1 200 BIBLIOGRAPHY [JPH+ 09] Ryan Johnson, Ippokratis Pandis, Nikos Hardavellas, Anastasia Ailamaki, and Babak Falsafi. Shore-MT: a scalable storage manager for the multicore era. In EDBT, 2009. 1.3, 1.9, 3, 2.3.4, 1, 4.2, 4.4.3, 1, 5.1, 6.1, 6.1.1, 6.2, 6.3.4, 7.1, 7.2, 7.7.1, 7.8.1, 7.8.4 [JPS+ 10] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. Aether: a scalable approach to logging. PVLDB, 3, 2010. 1.4, 2.3.1, 4.2, 1, 5.4, 10, 5.4.2, 5.17, 7.2, 7.3, 7.7.1, 7.8.1 [JPS+ 11] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 20, 2011. 2.3.1, 1, 10, 11, 7.3 [JSAM10] Ryan Johnson, Radu Stoica, Anastasia Ailamaki, and Todd C. Mowry. Decoupling contention management from scheduling. SIGPLAN Not., 45(3), 2010. 4.3.1 [JSSS06] Ibrahim Jaluta, Seppo Sippu, and Eljas Soisalon-Soininen. B-tree concurrency control and recovery in page-server database systems. ACM TODS, 31:82–132, 2006. 
7.4.5, 7.8.2 [KAO05] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE MICRO, 25(2), 2005. 3.1, 3.3.1, 4.3.3 [KCK+ 00] Jong Min Kim, Jongmoo Choi, Jesung Kim, Sam H. Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. In OSDI, 2000. 2.2.5 [Kel11] Kate Kelly. How twitter is transforming trading in commodities, 2011. Available at http://www.cnbc.com/id/41948275. 1.1 [KN11] Alfons Kemper and Thomas Neumann. HyPer – a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, 2011. 7.1, 7.1.1 [KR81] H. T. Kung and John T. Robinson. On optimistic methods for concurrency control. ACM TODS, 6, 1981. 6.6, 7.8.2, 8.1.1 [KSSF10] Ron Kalla, Balaram Sinharoy, William J. Starke, and Michael Floyd. Power7: IBM’s next-generation server processor. IEEE MICRO, 30(2), 2010. 1.3 BIBLIOGRAPHY 201 [KSUH93] Orran Krieger, Michael Stumm, Ron Unrau, and Jonathan Hanna. A fair fast scalable reader-writer lock. In ICPP, 1993. 4.3.1, 4.3.3 [LARS92] Dave Lomet, Rick Anderson, T. K. Rengarajan, and Peter Spiro. How the Rdb/VMS data sharing system became fast. Technical Report CRL-92-4, DEC, 1992. 7.3 [LBD+ 12] Per-Ake Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling. High-performance concurrency control mechanisms for main-memory databases. In VLDB, 2012. 6.1, 6.6, 7.8.2 [Lit80] Witold Litwin. Linear hashing: A new tool for file and table addressing. In VLDB, 1980. 2.2.3 [LKO+ 00] Mong-Li Lee, Masaru Kitsuregawa, Beng Chin Ooi, Kian-Lee Tan, and Anirban Mondal. Towards self-tuning data placement in parallel database systems. In SIGMOD, 2000. 7.8.3 [LMP+ 08] Sang-Won Lee, Bongki Moon, Chanik Park, Jae-Myung Kim, and Sang-Woo Kim. A case for flash memory SSD in enterprise database applications. In SIGMOD, 2008. 2.3.1, 5.4, 8 [Lof96] Geoffrey R. Loftus. Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5(6), 1996. 1.1 [LSC+ 01] Tirthankar Lahiri, Vinay Srihari, Wilson Chan, N. MacNaughton, and Sashikanth Chandrasekaran. Cache fusion: Extending shared-disk clusters with shared caches. In VLDB, 2001. 6.6, 7.8.1 [LSD+ 07] Sam Lightstone, Maheswaran Surendra, Yixin Diao, Sujay S. Parekh, Joseph L. Hellerstein, Kevin Rose, Adam J. Storm, and Christian GarciaArellano. Control theory: a foundational technique for self managing databases. In ICDE Workshops, 2007. 7.6.3 [LW92] Monica S. Lam and Robert P. Wilson. Limits of control flow on parallelism. SIGARCH Comput. Archit. News, 20(2), 1992. 1.2 [Mal11] Eric Malinowski. Hoops 2.0: Inside the NBA’s data-driven revolution. Wired, 4, 2011. Available at http://www.wired.com/playbook/2011/04/nba-datarevolution/. 1.1 202 BIBLIOGRAPHY [MCS91a] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1), 1991. 4.3.1 [MCS91b] John M. Mellor-Crummey and Michael L. Scott. Scalable reader-writer synchronization for shared-memory multiprocessors. SIGPLAN Not., 26(7), 1991. 4.3.1 [MDO94] Ann Marie Grizzaffi Maynard, Colette M. Donnelly, and Bret R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. SIGPLAN Not., 29(11), 1994. 1.2 [MHL+ 92] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 
ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM TODS, 17(1), 1992. 2.2.2, 2.3.1, 5, 5.2, 5.4, 5.4.1, 5.5.2, 6.3.4 [Mic02] Maged M. Michael. High performance dynamic lock-free hash tables and listbased sets. In SPAA, 2002. 4.3.2 [MKOT01] Anirban Mondal, Masaru Kitsuregawa, Beng Chin Ooi, and Kian-Lee Tan. Rtree-based data migration and self-tuning strategies in shared-nothing spatial databases. In GIS, 2001. 7.8.3 [ML92] C. Mohan and Frank Levine. ARIES/IM: an efficient and high concurrency index management method using write-ahead logging. In SIGMOD, 1992. 5.5.2, 7.4.5, 7.8.2 [MLH94] Peter S. Magnusson, Anders Landin, and Erik Hagersten. Queue locks on cache coherent multiprocessors. In ISPP, 1994. 4.3.1 [MM03] Nimrod Megiddo and Dharmendra S. Modha. ARC: A self-tuning, low overhead replacement cache. In FAST, 2003. 2.2.5 [MNSS05] Mark Moir, Daniel Nussbaum, Ori Shalev, and Nir Shavit. Using elimination to implement scalable and lock-free FIFO queues. In SPAA, 2005. 1.5, 4.2.2, 5.4.1 [Moh90] C. Mohan. ARIES/KVL: a key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In VLDB, 1990. 5.2, 5.5.2, 7.4.5 BIBLIOGRAPHY 203 [Moo65] Gordon Moore. Cramming more components onto integrated circuits. Electronics, 38(6), 1965. 1.2 [MOPW00] Peter Muth, Patrick O’Neil, Achim Pick, and Gerhard Weikum. The LHAM log-structured history data access method. The VLDB Journal, 8, 2000. 7.8.2 [NWMR09] Simo Neuvonen, Antoni Wolski, Markku Manner, and Vilho Raatikka. Telecom application transaction processing benchmark (TATP), 2009. See http://tatpbenchmark.sourceforge.net/. 2.4.5, 6.1.1, 6.1.1, 6.4.1, 7.5.1, 7.7.4 [ONH+ 96] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang. The case for a single-chip multiprocessor. In ASPLOS-VII, 1996. 1.2 [Ora05] Oracle. Asynchronous commit: Oracle database advanced application developer’s guide, 2005. Available at http://download.oracle.com/docs/cd/B19306 01/appdev.102/b14251/ adfns sqlproc.htm. 2.3.1 [PJHA10] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and Anastasia Ailamaki. Data-oriented transaction execution. PVLDB, 3(1), 2010. 1.3, 2.3.4, 1, 7.1, 7.2, 7.7.1, 7.8.1 [PJZ11] Andrew Pavlo, Evan P. C. Jones, and Stanley Zdonik. On predictive modeling for optimizing transaction execution in parallel oltp systems. PVLDB, 5(2), 2011. 1.7, 6.5, 7.8.1, 7.8.3 [Pos10] PostgreSQL. PostgreSQL 9.0.3 documentation: Asynchronous commit, 2010. Available at http://www.postgresql.org/docs/9.0/static/wal-asynccommit.html. 2.3.1 [Pos11] PostgreSQL. PostgreSQL archives: literature on write-ahead logging, 2011. Available at http://archives.postgresql.org/pgsql-hackers/201106/msg00701.php. 12 [PR01] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In ESA, 2001. 4.4.3 [PTB+ 11] Ippokratis Pandis, Pinar Tözün, Miguel Branco, Dimitris Karampinas, Danica Porobic, Ryan Johnson, and Anastasia Ailamaki. A data-oriented transaction execution engine and supporting tools. In SIGMOD, 2011. 6, 6.3.2, 6.5, 6.5 204 BIBLIOGRAPHY [PTJA11] Ippokratis Pandis, Pinar Tözün, Ryan Johnson, and Anastasia Ailamaki. PLP: page latch-free shared-everything OLTP. PVLDB, 4(10), 2011. 1.3, 2.3.4, 1, 6.3.2, 1 [Raw10] Mazen Rawashdeh. eBay - how one fast growing company is solving its infrastructure and data center challenges, 2010. Keynote at Gartner Data Center Conference. 1.1 [RD89] Abbas Rafii and Donald DuBois. 
Performance tradeoffs of group commit logging. In CMG Conference, 1989. 2.3.1, 2.3.1 [RD01] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, pages 329–350, 2001. 4.2.2 [RG01] Ravi Rajwar and James R. Goodman. Speculative lock elision: enabling highly concurrent multithreaded execution. In IEEE MICRO, 2001. 4.3.4, 8.1.1 [RG02] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based programs. In ASPLOS-X, 2002. 4.3.2 [RG03] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. McGraw-Hill, Inc., New York, NY, USA, 2003. 2.1, 1 [RGAB98] Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, and Luiz André Barroso. Performance of database workloads on shared-memory systems with out-of-order processors. In ASPLOS-VIII, 1998. 1.2 [RK79] David P. Reed and Rajendra K. Kanodia. Synchronization with eventcounts and sequencers. Commun. ACM, 22(2), 1979. 4.3.1 [Rob85] John T. Robinson. A fast general-purpose hardware synchronization mechanism. In SIGMOD, 1985. 3.2 [RR99] Jun Rao and Kenneth A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB, 1999. 7.4.5, 7.8.2 [RR00] Jun Rao and Kenneth A. Ross. Making B+-trees cache conscious in main memory. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000. 7.4.5, 7.8.2 [RS84] Larry Rudolph and Zary Segall. Dynamic decentralized cache schemes for mimd parallel processors. In ISCA, 1984. 3.3.2, 4.3.1 BIBLIOGRAPHY 205 [RSV87] Umakishore Ramachandran, Marvin Solomon, and Mary Vernon. Hardware support for interprocess communication. In ISCA, 1987. 8.1.1 [RZML02] Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy Lohman. Automating physical database design in a parallel database. In SIGMOD, 2002. 7.5.1, 7.8.3 [SAB+ 05] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-store: a column-oriented DBMS. In VLDB, 2005. 1.4, 6.5 [SCK+ 11] Jason Sewall, Jatin Chhugani, Changkyu Kim, Nadathur Satish, and Pradeep Dubey. PALM: Parallel architecture-friendly latch-free modifications to b+trees on many-core processors. PVLDB, 4(11), 2011. 7.8.2 [SKPO88] Michael Stonebraker, Randy H. Katz, David A. Patterson, and John K. Ousterhout. The design of XPRS. In VLDB, 1988. 3.1 [SLSV95] Dennis Shasha, Francois Llirbat, Eric Simon, and Patrick Valduriez. Transaction chopping: algorithms and performance studies. ACM TODS, 20, 1995. 6.6 [SMA+ 07] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era: (it’s time for a complete rewrite). In VLDB, 2007. 1.4, 1.7, 3.1, 5.4.2, 5.5.2, 6.1, 6.6, 7.1, 7.1.1, 7.3, 7.5.3, 7.8.1 [Smi78] Alan Jay Smith. Sequentiality and prefetching in database systems. ACM TODS, 3, 1978. 2.2.5, 5.2 [SMK+ 01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001. 4.2.2 [SR86] Michael Stonebraker and Lawrence A. Rowe. The design of POSTGRES. SIGMOD Rec., 15(2), 1986. 3.1, 3.3.1 [SSY95] Eljas Soisalon-Soininen and Tatu Ylönen. Partial strictness in two-phase locking. In ICDT, 1995. 2.3.1 [ST95] Nir Shavit and Dan Touitou. Software transactional memory. In PODC, 1995. 
4.3.2 BIBLIOGRAPHY 206 [ST97] Nir Shavit and Dan Touitou. Elimination trees and the construction of pools and stacks. Theory of Computing Systems, Special Issue, 30, 1997. 5.4.1, 5.4.1 [STH+ 10] Jinuk Luke Shin, Kenway Tam, Dawei Huang, Bruce Petrick, Ha Pham, Changku Hwang, Hongping Li, Alan Smith, Timothy Johnson, Francis Schumacher, David Greenhill, Ana Sonia Leon, and Allan Strong. A 40nm 16-core 128-thread CMT SPARC SoC processor. In IEEE ISSCC, 2010. 1.3 [Sto86] Michael Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9, 1986. 1.7, 7.1, 7.3, 7.8.1 [SWAF09] Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak Falsafi. Spatio-temporal memory streaming. In ISCA, 2009. 6.3.3, 7.8.4 [SWH+ 04] Stephen Somogyi, Thomas F. Wenisch, Nikolaos Hardavellas, Jangwoo Kim, Anastasia Ailamaki, and Babak Falsafi. Memory coherence activity prediction in commercial workloads. In WMPI, 2004. 1.3, 6.3.3, 7.1, 7.8.4 [TA10] Alexander Thomson and Daniel J. Abadi. database systems. PVLDB, 3, 2010. 5.5.2 The case for determinism in [Tho98] Alexander Thomasian. Concurrency control: methods, performance, and analysis. ACM Comput. Surv., 30, 1998. 6.6, 7.8.2 [THS10] Dimitris Tsirogiannis, Stavros Harizopoulos, and Mehul A. Shah. Analyzing the energy efficiency of a database server. In SIGMOD, 2010. 8.1.2 [TPC94] TPC. TPC benchmark B standard specification, revision 2.0, 1994. Available at http://www.tpc.org/tpcb. 2.4.2, 6.2, 6.4.1, 7.7.3 [TPC06] TPC. TPC benchmark H (decision support) standard specification, revision 2.6.0, 2006. Available at http://www.tpc.org/tpch. 6.4.1 [TPC07] TPC. TPC benchmark C (OLTP) standard specification, revision 5.9, 2007. Available at http://www.tpc.org/tpcc. 2.4.3, 6.1.1, 6.1.1, 6.3.1, 6.4.1, 6.5, 7.7.4, 7.7.6 [TPC10] TPC. TPC benchmark E standard specification, revision 1.12.0, 2010. Available at http://www.tpc.org/tpce. 2.4.4, 6.5 [TPJA11] Pinar Tözün, Ippokratis Pandis, Ryan Johnson, and Anastasia Ailamaki. Scalable and dynamically balanced shared-everything OLTP with physiological partitioning. Technical Report EPFL-REPORT-170525, EPFL, 2011. 6.3.2, 1 BIBLIOGRAPHY 207 [TSJ+ 10] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, 2010. 1.1 [V+ 07] Sriram Vangal et al. An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In IEEE ISSCC, 2007. 7.8.4 [Vog09] Werner Vogels. Eventually consistent. Commun. ACM, 52, 2009. 1.4, 2.4, 5.3 [Vog12] Werner Vogels. Amazon DynamoDB - a fast and scalable NoSQL database service designed for internet scale applications, 2012. See http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html. 1.4 [Wei99] Mark Weiser. The computer for the 21st century. SIGMOBILE Mob. Comput. Commun. Rev., 3, 1999. 1.1 [WM11] Eugene Wu and Samuel Madden. Partitioning techniques for fine-grained indexing. In ICDE, 2011. 7.8.3 [WOT+ 95] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA, 1995. 4.2.2 [ZL11] Paul Zubulake and Sang Lee. The High Frequency Game Changer: How Automated Trading Strategies Have Revolutionized the Markets. Wiley Trading. John Wiley & Sons, 2011. 1.1