Scalable Transaction Processing through Data-oriented Execution

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Ippokratis Pandis
Diploma, Computer Engineering & Informatics, University of Patras, Greece
M.Sc., Information Networking, Carnegie Mellon University

Carnegie Mellon University
Pittsburgh, PA
May 2011

Keywords: Database management systems, transaction processing, multicore and multisocket hardware, scalability, contention, data-oriented execution, physiological partitioning.

Dedicated to my family. Andreas, Niki, Titina and Soula.

Abstract

Data management technology changes the world we live in by providing efficient access to huge volumes of constantly changing data and by enabling sophisticated analysis of those data. Recently there has been an unprecedented increase in the demand for data management services. In parallel, we have witnessed a tremendous shift in the underlying hardware technology toward highly parallel multicore processors. In order to cope with the increased demand and user expectations, data management systems need to fully exploit the abundantly available hardware parallelism. Transaction processing is one of the most important and challenging database workloads, and this dissertation contributes to the quest for scalable transaction processing software.

Our research shows that in a highly parallel multicore landscape, rather than improving single-thread performance, system designers should prioritize reducing the critical sections whose contention grows unbounded as hardware parallelism increases. In addition, this thesis describes solid improvements to conventional transaction processing technology. New transaction processing mechanisms show gains by avoiding the execution of unbounded critical sections in the lock manager through caching, and in the log manager by downgrading the critical sections to composable ones. More importantly, this dissertation shows that conventional transaction processing has inherent scalability limitations due to the unpredictable access patterns caused by the request-oriented execution model it follows. Instead, it proposes adopting a data-oriented execution model, and shows that transaction processing systems designed around data-oriented execution break the inherent limitations of conventional execution. The data-oriented design paves the way for transaction processing systems to maintain scalability for the foreseeable future; as hardware parallelism increases, the benefits only grow. In addition, the principles used to achieve scalability can be applied to other software systems facing similar scalability challenges as the shift to multicore hardware continues.

Acknowledgments

This dissertation wouldn't have been completed without the significant contributions of many people, to whom I owe a lot. Below is a list of people who helped me during the past six years. The list is long but most probably missing some people to whom I am indebted. First and foremost, I would like to thank my academic advisor Natassa Ailamaki. I cannot find words to describe how influential Natassa has been for me. She was an excellent, extremely patient and inspirational advisor as well as a person I could trust and rely on. Her energy and passion for the field will always be an example for me. Natassa, that coffee at Starbucks changed my life. Thank you.
Before asking Goetz Graefe to join my committee as the external member, I knew him only as a prominent member of the database systems research community with very important contributions in the field. Goetz's interest in my thesis significantly improved the overall work, especially this document. Greg Ganger and Christos Faloutsos not only were valuable members of my committee, but also helped me in a rough moment of my PhD. I remained at school and I owe much of that to Greg, Christos and the members of their research groups (PDL and DB@CMU). Babak Falsafi and Stavros Harizopoulos have been excellent collaborators and two people I would seek out for their advice. Jingren Zhou and Shimin Chen both honored me by selecting me to spend a summer working with them and learning a lot at Microsoft Research and Intel Research, respectively. Various collaborators contributed significantly to this work: Ryan Johnson, Nikos Hardavellas, Pınar Tözün, Naju Mancheril, Debabrata Dash, Manos Athanassoulis, Radu Stoica, Miguel Branco, Danica Porobic, and Dimitris Karampinas. Ryan especially has been a key support in most of my work, both directly through joint research and indirectly through innumerable and round-the-clock discussions. Members of labs at CMU and EPFL have provided valuable feedback and support on various papers: Michael Abd El Malek, Kyriaki Levanti, Mike Fredman, Tom Wenisch, Brian Gold, Ioannis Alagiannis and others. In addition, three members of the administrative staff of CMU helped a lot: Joan Digney, Karen Lindenfelser and Charlotte Yano. Joan helped to improve the quality of this dissertation even though it was not her responsibility. Some special people honored me with their friendship and support during the years in Pittsburgh and Lausanne: Kyriaki Levanti, Michael Abd El Malek, Panickos Neofytou, Thodoris Strigkos, Leonidas Georgakopoulos, and Iris Safaka, as well as numerous friends from back home: Angela Paschali, Vassilis Papadatos, Haris Markakis, Dora Kaggeli, and Valentinos Georgiou, to name just a few. Most importantly, I would like to thank my family. They encouraged and supported me during this long and stressful period. They mean the world to me.

Thesis Committee:
Anastasia Ailamaki (CMU & EPFL), Chair
Christos Faloutsos (CMU)
Gregory Ganger (CMU)
Goetz Graefe (HP Labs)

This research has been supported by grants and equipment from Intel and Sun; a Sloan research fellowship; an IBM faculty partnership award; NSF grants CCR-0205544, CCR-0509356, IIS-0133686, and IIS-0713409; an ESF EurYI award; and Swiss National Foundation funds.

Contents

Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Data management and transaction processing
1.2 The emergence of multicore hardware
1.3 Limitations of conventional transaction processing
1.4 Focus of this dissertation
1.5 Not all serial computations are the same
1.6 Improving the scalability of conventional designs
1.7 Data-oriented transaction execution
1.8 Thesis statement and contributions
1.9 Roadmap
I Scalability of transaction processing systems
2 Background: Transaction Processing
2.1 The concept of transaction and transaction processing
2.2 A typical transaction processing engine
2.2.1 Transaction management
2.2.2 Logging and recovery
2.2.3 Access methods
2.2.4 Metadata management
2.2.5 Buffer pool management
2.2.6 Concurrency control
2.3 I/O activities in transaction processing
2.3.1 Logging
2.3.2 On-demand reads & evictions
2.3.3 Dirty page write-backs
2.3.4 Summary and a note about experimental setups
2.4 OLTP Workloads and Benchmarks
2.4.1 TPC-A
2.4.2 TPC-B
2.4.3 TPC-C
2.4.4 TPC-E
2.4.5 TATP
3 Scalability Problems in Database Engines
3.1 Introduction
3.2 Critical sections inside a database engine
3.3 Scalability of existing engines
3.3.1 Experimental setup
3.3.2 Evaluation of performance and scalability
3.3.3 Ramifications
3.4 Conclusion
II Addressing scalability bottlenecks
4 Critical Sections
4.1 Introduction
4.2 Communication patterns and critical sections
4.2.1 Types of communication
4.2.2 Categories of critical sections
4.2.3 How to predict and improve scalability
4.3 Enforcing critical sections
4.3.1 Synchronization primitives
4.3.2 Alternatives to locking
4.3.3 Choosing the right approach
4.3.4 Discussion and open issues
4.4 Handling problematic critical sections
4.4.1 Algorithmic changes
4.4.2 Changing synchronization primitives
4.4.3 Both are needed
4.5 Conclusions
5 Attacking Un-scalable Critical Sections
5.1 Introduction
5.2 Shore-MT: a reliable baseline
5.2.1 Critical section anatomy of Shore-MT
5.3 Avoid un-scalable critical sections with SLI
5.3.1 Speculative lock inheritance
5.3.2 Evaluation of SLI
5.4 Downgrading log buffer insertions
5.4.1 Log buffer designs
5.4.2 Evaluation of log buffer re-design
5.5 Related work
5.5.1 Reducing lock overhead and contention
5.5.2 Handling logging-related overheads
5.6 Conclusion
III Re-architecting transaction processing
6 Data-oriented Transaction Execution
6.1 Introduction
6.1.1 Thread-to-transaction vs. Thread-to-data
6.1.2 When DORA is needed
6.1.3 Contributions and chapter organization
6.2 Contention in the lock manager
6.3 A Data-ORiented Architecture for OLTP
6.3.1 Design overview
6.3.2 Challenges
6.3.3 Improving I/O and microarchitectural behavior
6.3.4 Prototype Implementation
6.4 Performance Evaluation
6.4.1 Experimental Setup and Workloads
6.4.2 Eliminating Contention in the Lock Manager
6.4.3 Intra-transaction Parallelism
6.4.4 Maximizing Throughput
6.4.5 Secondary index accesses
6.4.6 Transactions with joins
6.4.7 Limited hardware parallelism
6.4.8 Anatomy of critical sections
6.5 Weaknesses
6.6 Related Work
6.7 Conclusion
7 Physiological Partitioning
7.1 Introduction
7.1.1 Multi-rooted B+Trees
7.1.2 Dynamically-balanced physiological partitioning
7.1.3 Contributions and organization
7.2 Communication patterns in OLTP
7.3 Shared-everything vs. physical vs. logical partitioning
7.4 Physiological partitioning
7.4.1 Design overview
7.4.2 Multi-rooted B+Tree
7.4.3 Heap page accesses
7.4.4 Page cleaning
7.4.5 Benefits of physiological partitioning
7.5 Need and cost of dynamic repartitioning
7.5.1 Static partitioning and skew
7.5.2 Repartitioning cost
7.5.3 Splitting non-clustered indexes
7.5.4 Splitting clustered indexes
7.5.5 Moving fewer records
7.5.6 Example of repartitioning cost
7.5.7 Cost of merging two partitions
7.6 A dynamic load balancing mechanism for PLP
7.6.1 Monitoring
7.6.2 Deciding new partitioning
7.6.3 Using control theory for load balancing
7.7 Evaluation
7.7.1 Experimental setup
7.7.2 Page latches and critical sections
7.7.3 Reducing index and heap page latch contention
7.7.4 Impact on scalability and performance
7.7.5 MRBTrees in non-PLP systems
7.7.6 Transactions with joins in PLP
7.7.7 Secondary index accesses
7.7.8 Fragmentation overhead
7.7.9 Overhead and effectiveness of DLB
7.7.10 Overhead of updating secondary indexes for DLB
7.7.11 Summary
7.8 Related work
7.8.1 Critical Sections
7.8.2 B+Trees and alternative concurrency control protocols
7.8.3 Load balancing
7.8.4 PLP and future hardware
7.9 Conclusions
8 Future Direction and Concluding Remarks
8.1 Hardware/data-oriented software co-design
8.1.1 Hardware enhancements
8.1.2 Co-design for energy-efficiency
8.2 Summary and conclusion
Bibliography
List of Figures

1.1 Number of hardware contexts per chip
1.2 Conventional and data-oriented access patterns
1.3 Dissertation roadmap based on the number and type of critical sections
2.1 Components of a transaction processing engine
2.2 An OLTP installation and I/O activities
3.1 Scalability of four popular open-source database engines
3.2 Efficiency comparison for several storage engines
3.3 Accuracy of Amdahl's Law
4.1 Communication patterns and types of critical sections
4.2 Mutex sensitivity analysis – contention
4.3 Mutex sensitivity analysis – duration
4.4 Reader-writer lock sensitivity analysis
4.5 Usage space of critical section types
4.6 Algorithmic changes and tuning combine to give best performance
5.1 Shore-MT scalability
5.2 Efficiency on TPC-C transactions
5.3 Breakdown of critical sections of the conventional designs
5.4 Contention and overhead in the lock manager
5.5 SLI in a nutshell
5.6 Example of SLI-induced deadlock
5.7 Breakdown of overhead due to lock manager vs rest of system
5.8 Lock manager bottleneck
5.9 Suitability of locks for use with SLI
5.10 Analysis of SLI-eligible locks
5.11 CPU utilization breakdown with SLI active
5.12 Performance improvement due to SLI, for TATP and TPC-B
5.13 Log buffer designs
5.14 Contention in the baseline log buffer
5.15 Sensitivity analysis of the C-Array
5.16 Sensitivity to the number of slots in C-Array
5.17 Performance improvement by hybrid log buffer design
6.1 Comparison of access patterns
6.2 Throughput per hardware context for baseline and DORA
6.3 Time breakdown of baseline and DORA running as load increases
6.4 Time breakdowns of baseline and DORA on TATP and TPC-C OrderStatus
6.5 Inside the lock manager
6.6 Breakdown of time spent in the lock manager
6.7 DORA as a layer on top of the storage manager
6.8 A transaction flow graph for TPC-C Payment
6.9 Execution example of a transaction in DORA
6.10 Locks acquired, by type, in Baseline and DORA
6.11 Performance of baseline and DORA on TATP, TPC-B and TPC-C OrderStatus
6.12 Single-transaction response times
6.13 Performance on a transaction with high abort rate
6.14 Maximum throughput under perfect admission control
6.15 Performance on aligned secondary index scans
6.16 Performance on non-aligned secondary index scans
6.17 Transactions with joins
6.18 Behavior on limited hardware parallelism
6.19 Context switches on limited hardware parallelism
6.20 Anatomy of critical sections for Baseline and DORA
7.1 Breakdown of critical sections for the PLP variants
7.2 Page latch breakdown for three OLTP benchmarks
7.3 Shared-everything vs. physical- vs. logical-partitioning
7.4 Variations of physiological partitioning
7.5 Throughput of a statically partitioned system
7.6 Splitting a partition in PLP-Leaf
7.7 Splitting a partition in PLP-Partition
7.8 A two-level histogram for MRBTrees and the aging algorithm
7.9 Deciding new partition ranges example
7.10 Average number of page latches acquired
7.11 Time breakdown per transaction in an insert/delete-heavy benchmark
7.12 Time breakdown per TPC-C StockLevel transaction
7.13 Time breakdown per transaction in TPC-B with false sharing on heap pages
7.14 Throughput in two multicore machines
7.15 Impact of MRBTree in non-PLP systems
7.16 Time breakdown with frequent parallel SMOs
7.17 Throughput when running the TPC-C StockLevel transaction
7.18 Performance on transactions with secondary index scans
7.19 Space overhead of the PLP variations
7.20 Overhead of DLB under normal operation
7.21 DLB in action
7.22 Partitions before & after the repartitioning
7.23 Overhead of updating secondary indexes during repartitioning

List of Tables

7.1 Repartitioning costs for splitting a partition into two
7.2 Cost when splitting a partition of 466MB in half
7.3 Average index probe time for a hot record, as skew increases
Average record probes per sec for a hot record, as skew increases. . . . . . . . . . . . . . . . . . . . . . . . . 156 160 181 181 xx LIST OF TABLES 1 Chapter 1 Introduction 1.1 Data management and transaction processing Mark Weiser, the father of ubiquitous computing, in his seminal article “The computer for the 21st century” wrote: “The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.” [Wei99]. Data management is one of those technologies. Data processing and the dissemination of information, enabled and backed by data management technologies, are changing the world we live in. Consider the recent uprisings in the Arab world, which were greatly influenced by the social websites Facebook 1 and Twitter 2 [Bea11]. Or the fact that Wall Street now operates the majority of its trading at high frequency: trades are made automatically in data centers following analysis of large volumes of data with decisions being made almost instantaneously [ZL11]. Across all the world’s activities, we see that data analysis capabilities, enabled by data management systems, have changed centuries-old operations such as the way we practice medicine (e.g. [Lof96]), play sports [Mal11], or trade agricultural products [Kel11], to name just few. The database management market itself sees consistent yearly growth, with revenues of nearly $19 billion in 2008 [GSHS09]. At the same time, there is evidence that in the recent years the amount of data managed in various markets has increased almost exponentially. For example, in 2007, the social website Facebook collected 15TBs of data; 3 years later, in 2010, it collected 700TBs [TSJ+ 10]. The online auction and shopping website eBay 3 daily ingests 50TB of new data to its database [Raw10]. Databases become larger, they are used even more frequently, and increasingly more complex algorithms are employed for data processing. The pressure is on the data management systems which need to perform efficiently and respond to requests in a timely manner. 1 2 3 http://www.facebook.com http://www.twitter.com http://www.ebay.com CHAPTER 1. INTRODUCTION 2 While we experience an explosion in the amounts of data processed and in the number of data-centered applications available, the underlying hardware technologies are also changing tremendously. The rise of multicore hardware and the emergence of non-volatile storage technologies (such as flash-based devices and phase change memories) are fundamentally changing data processing capabilities across all types of applications. A primary target for change on the software side is the need to adjust and prioritize scalability in order to utilize abundantly available hardware parallelism. One of the most challenging database workloads is transaction processing [GR92]. The main characteristic of this type of workload is that it consists of a multitude of concurrent requests which typically touch only a small portion of a multi-gigabyte database in a largely unpredictable way. The concurrent requests need to complete consistently and in isolation from any other, while the changes made need to be durable. Transaction processing systems need to provide both high throughput and low response times. Unfortunately, the transaction processing model has remained largely the same for the past three decades, and that imposes some inherent scalability difficulties. 
This dissertation contributes to the quest for scalable transaction processing software within a multiprocessor node. It studies the interaction between modern hardware and transaction processing workloads; proposes a methodological way to analyze the scalability of software systems; and makes solid improvements in essential transaction processing components, such as locking and logging. More importantly, it shows that the conventional transaction execution model has fundamental scalability problems due to its chaotic data access patterns. It then shows that the data-oriented transaction execution model does not have such limitations and can maintain scalability as hardware parallelism increases.

1.2 The emergence of multicore hardware

Let's begin by looking at the evolution of computer hardware. Since 1965 processor technology has largely followed or outpaced Gordon Moore's prediction that transistor counts within a single chip will double every year or two [Moo65]. In order to translate this biennial increase in the transistor budget into performance, computer architects have followed two avenues:

• Gradually increasing the complexity of the processors by employing aggressive microarchitectural technologies (long execution pipelines, out-of-order execution, sophisticated branch prediction, super-scalar execution, etc. [HP02]).

• Clocking the processors at ever higher frequencies, testing the endurance of hardware materials, such as silicon.

Unfortunately, aggressive microarchitectural optimizations started giving diminishing results on commercial workloads, such as databases [ADHW99, RGAB98, BGB98, MDO94, LW92]. Even worse, sometime around 2005 we reached the limits of material science. Silicon transistors couldn't be clocked at higher frequencies because they would melt, while each processor was (uneconomically) drawing 100 Watts or more of power. Due to the power and thermal caps, processor vendors made a historic change of course. Until around 2005, the focus of each chip design was single-thread performance (trying to accomplish a single task as efficiently as possible using all the resources of the chip). From that point on, processor vendors had to rely on the thread-level parallelism of software systems to improve performance. Vendors began to use growing transistor budgets to exponentially increase the number of processing cores or hardware contexts per chip, rather than making single cores exponentially more complex [ONH+96, BGM+00, DLO05]. Figure 1.1 shows historic evidence of the explosion of on-chip parallelism in every major processor line (note the logarithmic scale on the y-axis).

Figure 1.1: Evolution of the number of hardware contexts per chip for some major processor lines. We observe an exponential increase in on-chip parallelism since circa 2005.

Multicore processors now dominate the hardware landscape and greatly affect software system performance. The pressure is now on the software manufacturers, who can no longer expect the hardware to provide all the performance improvements. According to Amdahl's Law [Amd67], the speedup a software system can achieve on parallel hardware depends on the fraction of its execution that can be parallelized; to achieve high performance on such hardware, software must therefore provide exponentially increasing parallelism. Unfortunately, this is a very difficult task, and software systems typically become bottlenecked long before they manage to saturate the underlying hardware (e.g. [BWCM+10]).
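Stated as a formula, the constraint is the following; the symbols s and N are introduced here only for illustration, and the 8% serial fraction used in the example is the figure reported for the most scalable engine in the Chapter 3 study.

```latex
% Amdahl's Law: speedup on N cores when a fraction s of the execution is serial.
\[
  \mathrm{Speedup}(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,},
  \qquad
  \lim_{N\to\infty} \mathrm{Speedup}(N) \;=\; \frac{1}{s}.
\]
% Example: for s = 0.08 (an 8% serial component), the speedup can never exceed
% 1/0.08 = 12.5x, no matter how many hardware contexts the chip provides.
```

In other words, once the serial fraction dominates, adding cores buys little; shrinking the serial work is what matters.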
1.3 Limitations of conventional transaction processing

Transaction processing systems experience scalability problems with the advent of exponentially increasing hardware parallelism. As we will see in Chapter 3, open-source transaction processing systems were caught off guard by the move to increased parallelism. There we study the scalability of the most popular open-source database engines (throughout the thesis we use the terms "transaction processing system", "database engine" and "storage manager" interchangeably) when we use them to run a perfectly scalable transactional workload. We see that none of the database engines manages to scale its performance on a multicore chip with 32 hardware contexts. Even the most scalable of the database engines has a huge 8% serial component, which allows it to utilize no more than ∼16 cores effectively. In the past this was a significant degree of parallelism, with parallel systems containing a limited number of processors (typically only 4-8). In contrast, emerging multicores may contain 64-512 hardware contexts [JN07, STH+10, KSSF10], with that number projected to continue increasing.

This result is surprising because transaction processing is very well-suited for execution on parallel hardware. Transactional workloads exhibit abundant parallelism at the request level, and over the past three decades the database systems community has done exceptional work: transaction processing systems excel at exploiting concurrency—support for multiple in-progress operations—to interleave the execution of a large number of transactions. Unfortunately, internal bottlenecks prevent the systems from translating their high concurrency into proportionally high execution parallelism [JPH+09].

There are two reasons why transaction processing systems face scalability problems and fail to exhibit unbounded execution parallelism. First, they are exceptionally complex software systems. In order to provide its core services, a typical transaction processing system has tightly coupled components which interact with each other very frequently and sometimes measure thousands or even millions of lines of code. Second, because of the way conventional transaction processing systems assign work to their worker threads, transactional workloads result in totally unpredictable data access patterns [PJHA10, SWH+04]. That is, under conventional execution, each incoming transaction is assigned to a worker thread, a mechanism we refer to as thread-to-transaction assignment.

Figure 1.2: Comparison of the access patterns of conventional and data-oriented execution. On the left are the accesses caused by the conventional thread-to-transaction assignment of work policy. On the right are the accesses caused by the (data-oriented) thread-to-data assignment of work policy.
The access pattern of each transaction, and consequently of each thread, however, is arbitrary and totally uncoordinated. The end result is that concurrent threads read and update data from the entire address space in a random fashion, as shown in Figure 1.2 (left). This figure plots the concurrent thread accesses to the records of a table of a conventional transaction processing system as it runs a standardized transactional benchmark; each access is color-coded to indicate which thread performs it (more details about this figure are in Chapter 6).

To ensure data integrity during shared, uncoordinated accesses, each thread enters a very large number of contentious critical sections in the short lifetime of each transaction it executes. For example, to complete one of the simplest transactions possible, which probes for a Customer and updates her balance, a modern conventional transaction processing system needs to enter more than 70 critical sections or points of serialization (see the left-most bar of Figure 1.3). Even though with huge effort and extremely careful and inspired engineering the complexity of transaction processing systems can be tamed, the unpredictability of the accesses remains. In other words, no matter how well engineered a conventional transaction processing system is, system designers are forced to be overly pessimistic and clutter the transaction processing codepaths with a very large number of points of serialization [JPA08, PTJA11]. This imposes a considerable overhead on single-thread performance [HAMS08]. Even worse, some of these serializations eventually become impediments to scalability. Thus, we argue that because of its inherent scalability problems conventional transaction execution is doomed, and there is a need to fundamentally change the way database engines process transactions.

1.4 Focus of this dissertation

Given the difficulty of improving the scalability of conventional transaction processing systems, recently there has been an emergence of designs which exploit specific application characteristics in order to provide scalable performance. For example, many web applications, like Facebook and Twitter, can tolerate stale data and inconsistencies. Such applications can be served by systems that provide only eventual consistency guarantees [Vog09] or, in general, do not guarantee some of the ACID properties (atomicity, consistency, isolation, and durability) [GR92]. Other applications access data only by using record key identifiers. To serve such applications, several key-value stores [DHJ+07] have been implemented, including BigTable [CDG+06], HBase (http://hbase.apache.org/), CouchDB (http://couchdb.apache.org/), Tokyo Cabinet (http://fallabs.com/tokyocabinet/), Redis (http://redis.io/), Cassandra (http://cassandra.apache.org/) and DynamoDB [Vog12], to name just a few. Those systems provide only a subset of the database functionality, expose a limited get()/put() key-value interface, and are designed for scalability, reliability, high availability, and ease of deployment on clusters of multiple nodes, rather than a single multicore node.

This dissertation focuses on the scalability of "traditional" transaction processing systems within a multicore node. We are seeking transaction processing system designs which can replace existing systems without requiring changes to legacy application code. We are interested in systems that maintain the ACID properties and do not restrict the data management functionality (e.g. the ability to perform joins) or the supported interface (e.g. to key-value accesses only).
Also, we are interested in scaling up transaction processing performance within a single multicore node. Scaling out to a cluster of nodes is a mostly orthogonal problem and outside the scope of this work; for example, one could exploit the results of this dissertation to implement the building blocks of a scale-out solution.

In terms of workloads, we are interested in transactional workloads that consist of multiple concurrent short-running transactions. Such workloads exhibit high concurrency at the application level and put pressure on the transaction processing system. Data analysis workloads, which consist of a few long-running queries, exhibit low concurrency, put pressure on the query execution component, and are of no interest for this work. As a matter of fact, specialized data management systems are increasingly popular for serving such workloads (e.g. column stores [SAB+05, BZN05]). In addition, and in contrast with data analysis workloads, it is realistic to assume that transactional workloads are not I/O-bound [SMA+07, JPS+10]. That is, as main memories become cheaper and larger, the working set of most transactional workloads tends to be memory-resident, with the only I/Os made to provide durability (flushing the log buffer and writing back dirty pages). This is contrary to "big data" analysis applications that operate on terabytes or petabytes of data and are often I/O-bound.

In Part I, after an introduction to transaction processing systems (or database engines), we show that the performance of conventional open-source database engines suffers on highly parallel hardware due to their poor scalability. The rest of the dissertation consists of two main parts. The first discusses improvements to the scalability of conventional transaction processing designs. The second re-architects traditional transaction processing models in order to break the aforementioned inherent limitations.

1.5 Not all serial computations are the same

On our quest for scalable transaction processing, one of the first challenges we met was discovering how to quantify the scalability of a system. It is impractical to start a system redesign based only on performance observations on an available parallel hardware machine. Given the rate at which hardware parallelism increases, once the software system redesign and implementation are completed, a new generation of more parallel processors will be available and new bottlenecks may have emerged. That is, not only do we need to identify the bottlenecks in current multicore hardware, but we need to be able to predict potential problems in future processor generations.

One reliable way to predict the scalability of various transaction processing designs is by profiling the serial computations (or critical sections) executed during a single transaction and categorizing them based on their behavior. Behavior differs based on whether the contention for a specific critical section increases with the number of processing cores (or running threads in the system) or not. Using this criterion we see there are two main types of critical sections: those whose contention remains steady (or fixed) no matter how many processing cores are in the system, and those whose contention grows without bounds as hardware parallelism increases. We refer to the latter type of critical sections as unbounded.

A third, special type of critical section is the composable one. As first observed by Moir et al. [MNSS05], in certain cases multiple threads can combine their operations and enter a critical section only once, where normally the critical section would have been entered by each thread individually. For example, consider a concurrent stack where threads can push() and pop() items. While accessing the stack is a critical section—if two threads concurrently modify the head of the stack, behavior will be unpredictable—a push() and a pop() can combine their requests off the critical path without the need to execute the critical section, as sketched below.
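The following is a minimal, hypothetical sketch of such combining on a concurrent stack. It is not the algorithm of [MNSS05]; a real elimination scheme would use multiple exchange slots, adaptive backoff and careful memory management, but the sketch captures the essential point: a matched push()/pop() pair never enters the lock-protected critical section.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

// Minimal sketch of a combining ("elimination") stack. A push() may hand its
// item directly to a concurrent pop() through a single exchange slot, so the
// pair never enters the critical section protecting the shared stack.
class CombiningStack {
public:
    void push(int32_t v) {
        // Tag the published value with a unique sequence number so that a late
        // retraction cannot accidentally remove someone else's publication.
        uint64_t tagged = (uint64_t(seq_.fetch_add(1) + 1) << 32) | uint32_t(v);
        uint64_t empty = kEmpty;
        if (slot_.compare_exchange_strong(empty, tagged)) {
            for (int i = 0; i < 4096; ++i)
                if (slot_.load() != tagged) return;          // a pop() combined with us
            uint64_t mine = tagged;
            if (!slot_.compare_exchange_strong(mine, kEmpty)) return;  // taken late
        }
        // Slow path: the ordinary critical section on the shared stack.
        std::lock_guard<std::mutex> g(m_);
        data_.push_back(v);
    }

    std::optional<int32_t> pop() {
        // Fast path: grab an item that a concurrent push() has published.
        uint64_t cur = slot_.load();
        if (cur != kEmpty && slot_.compare_exchange_strong(cur, kEmpty))
            return int32_t(uint32_t(cur));                   // combined, no lock taken
        // Slow path: the ordinary critical section.
        std::lock_guard<std::mutex> g(m_);
        if (data_.empty()) return std::nullopt;
        int32_t top = data_.back();
        data_.pop_back();
        return top;
    }

private:
    static constexpr uint64_t kEmpty = 0;
    std::atomic<uint64_t> slot_{kEmpty};   // single exchange slot
    std::atomic<uint32_t> seq_{0};
    std::mutex m_;
    std::vector<int32_t> data_;            // the shared stack itself
};
```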
We refer to the latter type of critical sections as unbounded. A third, special, type of critical sections are composable. As first observed by Moir et al. [MNSS05], in certain cases multiple CHAPTER 1. INTRODUCTION 8 threads can combine their operations and enter a critical only once, whereas normally the critical section would have been entered by each thread individually. For example, consider a concurrent stack where threads can push() and pop() items. While accessing the stack is a critical section—if two threads concurrently modify the head of the stack, behavior will be unpredictable—a push() and a pop() can combine their requests off the critical path without the need to execute the critical section. Of the three types of critical sections it is clear that only the unbounded ones impose threat to the scalability of the system. The other two types, fixed and composable, aggravate only the single-thread performance. Furthermore, employing the wrong synchronization primitive may have severe impact on both performance and scalability. Thus, the keys to scalable software designs are (a) to reduce the number of unbounded serial computations through algorithmic changes, and (b) to enforce the serial computations using appropriate synchronization primitives. The main message of Chapter 4 is that the numerous critical sections in transaction processing impose significant overhead even in single-thread performance. At the same time, not all critical sections are the same; different types impose different threats, if any, to scalability. To achieve a scalable design we need to drastically reduce the number of unbounded serial computations. By the end of this thesis, we will provide evidence that analyzing the number and type of critical sections is a reliable indicator of the scalability of systems. More elaborate (and scalable) designs execute on average fewer unbounded critical sections, as shown in Figure 1.3. 1.6 Improving the scalability of conventional designs There are three ways to reduce the frequency of unbounded critical sections in a software system and improve its performance: • Avoid. The system can avoid executing unbounded critical sections, for example through caching. Section 5.3 of Part II presents Speculative Lock Inheritance (or SLI), an example of avoiding the execution of unbounded critical sections in the lock manager through caching. SLI detects, at run-time, which database locks are “hot” (where there is contention for acquiring and releasing them) and makes sure the transaction executing threads cache those “hot” locks across transactions. The execution model does not change, since each thread in the system still executes the same codepaths and tries to acquire the same database locks from the centralized lock manager. It just happens to find 1.6. IMPROVING THE SCALABILITY OF CONVENTIONAL DESIGNS 80 Uncategorized 70 CSs per Transaction 9 Message passing 60 Xct mgr 50 40 Aether log mgr 30 Log mgr 20 Metadata 10 Bpool 0 Page Latches Chapter 5a Chapter 5b Chapter 6 Chapter 7 Conventional SLI & Aether Data-oriented Physiological Lock mgr Figure 1.3: Comparison of the number and type of critical sections executed for the completion of a very simple transaction from the various designs presented in this dissertation. The unbounded critical sections are the bars with solid fills. The fewer the unbounded critical sections, the more scalable the corresponding design is. 
It just happens to find the "hot" locks stored in a thread-local cache, avoiding the interaction with the centralized lock manager and reducing contention (a sketch of this idea follows the list below).

• Downgrade. Unbounded critical sections can be downgraded into fixed or composable ones. By doing so, single-thread performance is not affected, but scalability improves. Section 5.4 of Part II presents a concrete example of downgrading a class of unbounded critical sections in the log manager. An essential component of any transaction processing system, the log manager records all the changes made in the database and ensures that the system can recover in the event of a crash. If treated naively, however, log buffer inserts can become a bottleneck, since all the concurrent threads need to record their changes in the same main-memory log buffer. But because requests to append entries into a log buffer can be combined to form requests for larger appends, we are able to downgrade these unbounded critical sections to composable ones and achieve better scalability.

• Re-architect. The most drastic measure we can take to improve scalability is to completely eliminate the need to execute contention-prone codepaths (codepaths that enter many unbounded critical sections) by modifying the entire execution model. We follow this direction in Part III, which is the main contribution of this dissertation.
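Returning to the Avoid bullet above, the sketch below illustrates the lock-caching idea behind SLI in heavily simplified form. The LockManager, the heat-based hot-lock detection and every name here are hypothetical stand-ins rather than the Shore-MT interfaces; the real mechanism, including lock modes and how inheritance-induced deadlocks are handled, is described in Chapter 5.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Grossly simplified stand-in for a centralized lock manager: every acquire
// and release is a critical section on one global mutex.
class LockManager {
public:
    // Returns true if the named lock has recently seen many acquisitions
    // (our crude, hypothetical notion of a "hot" lock).
    bool acquire(const std::string& name) {
        std::lock_guard<std::mutex> g(m_);
        return ++heat_[name] > kHotThreshold;
    }
    void release(const std::string& name) {
        std::lock_guard<std::mutex> g(m_);
        if (heat_[name] > 0) --heat_[name];
    }
private:
    static constexpr int kHotThreshold = 8;      // hypothetical tuning knob
    std::mutex m_;
    std::unordered_map<std::string, int> heat_;  // crude contention estimate
};

// Speculative Lock Inheritance, simplified: each worker thread keeps the locks
// that proved hot in a thread-local cache and carries ("inherits") them into
// its next transaction, skipping the centralized acquire/release round trips.
thread_local std::unordered_map<std::string, bool> sli_cache;

void sli_acquire(LockManager& lm, const std::string& name) {
    if (sli_cache.count(name)) return;           // hit: no lock-manager interaction
    bool hot = lm.acquire(name);
    if (hot) sli_cache[name] = true;             // remember as hot for the next xct
}

void sli_release_at_commit(LockManager& lm, const std::string& name) {
    if (sli_cache.count(name)) return;           // keep holding the hot lock
    lm.release(name);                            // cold lock: release normally
}
```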
Part II is dedicated to improving the scalability of conventional transaction processing. We make two solid improvements to essential components of any transaction processing system. However, the second bar of Figure 1.3 suggests that the problem remains: no matter the optimizations, the conventional system still executes a large number of unbounded critical sections (Figure 1.3 shows the execution of over 35 of them), with the danger that some of them become bottlenecks.

1.7 Data-oriented transaction execution

In a highly parallel multicore landscape, we need to approach transaction processing from a different perspective. One radical approach, proposed by Stonebraker et al. [SMA+07], is H-Store. H-Store is a shared-nothing design [Sto86] of single-threaded main-memory database instances within a node, which relies on replication to maintain durability. Since each database instance is accessed by a single thread, H-Store eliminates critical sections altogether. Unfortunately, since shared-nothing systems physically partition the data, H-Store delivers poor performance when the workload triggers distributed transactions [Hel07, JAM10, CJZM10, PJZ11] or when skew causes load imbalance [CJZM10, PJZ11]. Further, repartitioning to rebalance load requires the system to physically move and reorganize all affected data. These weaknesses become especially problematic as partitions become smaller and more numerous in response to multicore hardware. Thus, aggressive shared-nothing designs, such as H-Store, solve the scalability problems of only a limited set of applications, for example applications whose access patterns do not exhibit sudden changes and are easily partitionable.

Because aggressive shared-nothing designs cannot adequately serve all transactional workloads, we need a design that maintains the desired shared-everything properties (e.g. all the data in a single address space, no need to execute distributed transactions), but also allows us to drastically reduce the number of unbounded critical sections. Based on the observation that uncoordinated accesses to data lead to scalability problems in conventional shared-everything designs, we propose a thread-to-data policy for assigning work to threads. Under this policy, each transaction is decomposed into a set of smaller actions according to the data region each action accesses. Then each action is routed to a thread responsible for that data region. Transactions flow from one thread to another as they access different data. In essence, instead of pulling data (database records) to the computation (transaction), the thread-to-data policy distributes the computation to wherever the data is mapped; it pushes the computation to the data, as sketched below.
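The following sketch shows the thread-to-data assignment in its simplest possible form. The names and structures are hypothetical simplifications; Chapter 6 describes the actual design (DORA), including the transaction flow graphs that drive routing and the proper synchronization of the inbound queues, which is omitted here.

```cpp
#include <deque>
#include <functional>
#include <vector>

// One "action" of a decomposed transaction: a piece of work that touches only
// records whose routing key falls inside a single logical partition.
struct Action {
    long key;                     // routing key (e.g., a customer id)
    std::function<void()> work;   // the reads/writes against that partition
};

// One executor thread per logical partition. Because only this thread ever
// touches the partition's records, most record-level critical sections
// disappear; only the inbound queue needs synchronization (omitted here).
struct PartitionExecutor {
    long lo, hi;                  // logical key range [lo, hi) owned by this thread
    std::deque<Action> inbox;     // actions routed to this partition
};

class Router {
public:
    explicit Router(std::vector<PartitionExecutor>* execs) : execs_(execs) {}

    // Thread-to-data assignment: push the computation to the data. A transaction
    // flows from executor to executor as its actions touch different partitions.
    void dispatch(const Action& a) const {
        for (auto& ex : *execs_) {
            if (a.key >= ex.lo && a.key < ex.hi) {
                ex.inbox.push_back(a);   // enqueue; the owning thread will run it
                return;
            }
        }
    }

private:
    std::vector<PartitionExecutor>* execs_;
};
```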
This simple change in the execution model breaks the limitations of conventional processing. Once the system is guaranteed that a single thread will access a specific region of the database during a period of time, it can be overly optimistic and avoid executing all the critical sections it normally would. Physical separation of the data is also avoided, since the partitioning is only logical. Figure 1.2 clearly visualizes the difference between the conventional thread-to-transaction and the thread-to-data execution models; the former results in chaotic, uncoordinated accesses, while the latter's accesses are coordinated and easy to predict.

In Part III we present two designs that eliminate significant sources of unbounded critical sections. Both optimizations are enabled once we adopt the thread-to-data execution model and exploit the resulting coordinated accesses. In particular, Chapter 6 shows how we can distribute the formerly centralized lock management service and make it thread-local (see the third bar in Figure 1.3). Chapter 7 extends the data-oriented design with modifications to the physical layout of the database, so that physical accesses map to the logical partitioning. The physiologically partitioned or PLP design eliminates the need to employ unbounded critical sections for logical operations (in the lock manager) as well as for physical operations, such as page latching. To eliminate the need to acquire page latches, PLP employs a new access method, which we call MRBTree. The MRBTree access method consists of multiple independent sub-trees (regular B+trees [BM70]) connected to a "root" page that maps the logical partitioning of the data-oriented system. Overall, PLP executes almost an order of magnitude fewer unbounded critical sections than the most scalable conventional design, as shown in the right-most bar of Figure 1.3. In addition, because data-oriented execution is based on logical-only partitioning, systems that adopt it can react relatively quickly and balance load in response to changes in the access patterns. In Chapter 7 we also show how a system that follows the thread-to-data policy can dynamically and efficiently adapt to load changes, a big advantage over shared-nothing designs, which employ physical partitioning.

1.8 Thesis statement and contributions

This dissertation contributes to the quest for scalable transaction processing. The thesis statement is simple and reads as follows:

THESIS STATEMENT: To break the inherent scalability limitations of conventional transaction processing, systems should depart from the traditional thread-to-transaction execution model and adopt a data-oriented one. As hardware parallelism increases, data-oriented design paves the way for transaction processing systems to maintain scalability. The principles used to achieve scalability can also be applied to other software systems facing similar scalability challenges as the shift to multicore hardware continues.

This thesis makes the following main contributions:

• We show that in a highly parallel multicore landscape, system designers should primarily focus on reducing the number of unbounded critical sections in their systems, rather than on improving single-thread performance.

• We make two solid improvements in conventional transaction processing technology by avoiding the execution of certain unbounded critical sections in the lock manager through caching, and by downgrading log buffer inserts from unbounded to composable critical sections.

• We show that conventional transaction processing has inherent scalability limitations due to the unpredictable access patterns caused by the request-oriented execution model it follows. Instead, we suggest a data-oriented execution model and show that it breaks the inherent limitations of conventional processing. The final design eliminates the need to execute critical sections related to both logical operations (such as locking) and physical operations (such as page latching).

1.9 Roadmap

The overall goal of this thesis is to improve the scalability of transaction processing systems. Most chapters attack specific problems and are therefore fairly self-contained. The thesis is divided into three parts. In Part I we introduce background information about transaction processing systems and provide evidence that conventional transaction processing systems face significant scalability problems. These are due to the complexity and the unpredictability of access patterns inherent to the conventional transaction processing execution model. Readers familiar with transaction processing may skim through Chapter 2.

Part II is dedicated to improving the scalability of conventional transaction processing. We first describe the various types of critical sections and underline the need to enforce the appropriate synchronization primitive at each critical section (Chapter 4). Next we attack specific scalability problems of conventional database engines. Chapter 5 presents two concrete examples of how to remove bottlenecks from within two significant database engine components: the centralized lock manager, which enforces concurrency control, and the log manager, which is responsible for the recovery of the system in case of crashes. All prototyping and evaluations are done using the Shore-MT storage manager [JPH+09], a multithreaded storage manager we implemented for the needs of our research.

The main part of this thesis is Part III (Chapters 6–7), which makes the case for data-oriented transaction execution. We first argue that conventional transaction processing and its thread-to-transaction assignment of work policy have inherent scalability limitations. We then make a case in favor of a thread-to-data work assignment policy and design a data-oriented system that eliminates a major source of unbounded critical sections: those related to the centralized lock manager. Chapter 7 extends the design to eliminate page latching—the biggest remaining component of unbounded critical sections (see the third bar in Figure 1.3). Chapter 7 also shows that systems based on the data-oriented transaction execution model can easily re-partition at run time and adapt to load changes, a big advantage over designs that apply physical partitioning to the data.
Part I
Scalability of transaction processing systems

Chapter 2
Transaction processing: properties, workloads and typical system design

This chapter presents background information about transaction processing. It defines the concept of a transaction, briefly describes the structure of a typical transaction processing system and the various I/O activities that take place when processing transactions, and discusses the requirements of transaction processing workloads, presenting five representative transaction processing benchmarks. (This chapter draws material from various sources, most notably [RG03, GR92, HSH07].)

2.1 The concept of transaction and transaction processing

In general, transaction processing refers to the database operations corresponding to a business transaction. These operations range from tiny to fairly complex, are often fixed, and execute concurrently with many other requests. A database transaction is the basic unit of change in a database. According to Ramakrishnan and Gehrke [RG03]: "A transaction is any one execution of a user program in a DBMS. (Executing the same program several times will generate several transactions.) This is the basic unit of change as seen by the DBMS: Partial transactions are not allowed, and the effect of a group of transactions is equivalent to some serial execution of all transactions."

A transaction processing system is a software system that serves database transactions. Such a system is expected to maintain four important properties of database transactions, known by the acronym ACID [GR92]. When ACID is maintained, every database transaction obeys the following properties:

1. Atomicity. Either all the effects (modifications) of a transaction remain once it completes, or none of them do. The atomicity requirement has "all or nothing" semantics: to the outside world, from the application down to the concurrently running transactions, a committed transaction appears indivisible and atomic, while an aborted transaction appears as if it never happened, no matter what the transaction did before the abort.

2. Consistency. Every transaction leaves the database in a consistent state. (It is the responsibility of the application that issues the transaction to ensure that the transaction itself is correct.) The execution of a transaction transforms the database from one consistent state to another, while aborted transactions do not change the state.

3. Isolation. Transactions cannot interfere with each other. Moreover, depending on the level of isolation the application has requested, the effects of an incomplete transaction may or may not be visible to other transactions. Providing isolation is the main goal of concurrency control.

4. Durability. The effects (modifications) of successfully completed transactions must persist.

The concurrency of requests from the application and the need to maintain the ACID properties complicate the design of transaction processing systems.
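To illustrate the "all or nothing" contract from the application's point of view, the toy example below drives a hypothetical transactional interface; the Engine class and its method names are invented for this sketch and are not the API of any particular system.

```cpp
#include <stdexcept>
#include <unordered_map>

// Toy single-threaded "engine" illustrating the ACID contract described above.
// Undo information is kept so that abort() restores the pre-transaction state.
class Engine {
public:
    void begin()  { undo_.clear(); active_ = true; }
    void commit() { undo_.clear(); active_ = false; }        // changes stay
    void abort() {                                            // changes vanish
        for (const auto& [id, old] : undo_) balances_[id] = old;
        undo_.clear();
        active_ = false;
    }
    long read_balance(long id) { return balances_[id]; }
    void write_balance(long id, long v) {
        if (!active_) throw std::logic_error("no active transaction");
        undo_.emplace(id, balances_[id]);   // remember the first pre-image only
        balances_[id] = v;
    }
private:
    std::unordered_map<long, long> balances_;
    std::unordered_map<long, long> undo_;
    bool active_ = false;
};

// Either both balance updates survive, or neither does ("all or nothing").
bool transfer(Engine& db, long from, long to, long amount) {
    db.begin();
    try {
        long src = db.read_balance(from);
        if (src < amount) { db.abort(); return false; }       // as if nothing happened
        db.write_balance(from, src - amount);
        db.write_balance(to, db.read_balance(to) + amount);
        db.commit();
        return true;
    } catch (...) {
        db.abort();                                            // roll back partial work
        return false;
    }
}
```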
• Logging and recovery: Make sure that the database can recover to a consistent state in the event of a crash, discarding partially-performed work without side effects.
• Buffer pool management: Give the illusion that the system has infinite memory at its disposal.
These services, in turn, access even lower-level services. In general, the transaction processing engine is the most complicated part of any database system. It consists of many sub-components which are very tightly coupled with each other and often measure thousands of lines of code. Figure 2.1 shows the major components of a typical transaction processing system. The following subsections highlight these components, briefly explaining their functionality and how they are usually implemented.
Figure 2.1: Components of a transaction processing engine (transaction management, lock manager, log manager, metadata manager, access methods, free space management, memory management, buffer pool, latching), sitting between the application and the storage.
2.2.1 Transaction management
The transaction processing engine maintains information about all active transactions, especially the newest and oldest in the system, in order to coordinate services such as checkpointing and recovery. In addition, it allows threads to attach to and detach from transaction contexts and does all the related bookkeeping (e.g. a transaction cannot commit if more than two threads are attached to it). Checkpointing allows the log manager to discard old log entries, saving space and shortening recovery time. However, no transactions may begin or end during some phases of checkpoint generation, producing a potential bottleneck unless checkpoints are very fast.
2.2.2 Logging and recovery
The log manager records all the operations performed in the database into the database log. Logging modifications ensures that they are not lost if the system fails before the buffer pool flushes those changes to disk. The log also allows the database to roll back modifications in the event of a transaction abort. Most transaction processing engines follow the ARIES scheme [MHL+ 92], which weaves logging, buffer pool management, and concurrency control into a comprehensive recovery scheme. The log usually consists of two parts: the persistently stored log file and one or more main-memory log buffers. Transactions log their modifications in the main-memory log buffer(s), which are flushed to disk regularly. All the log entries of a transaction need to be flushed to disk before that transaction commits.
2.2.3 Access methods
One of the most important functions of the transaction processing engine is to maintain the database's various data structures on disk and in memory. The transaction processing engine must manage disk space efficiently across many insertions and deletions, in the same way that malloc() and free() manage memory. It is especially important that pages which are scanned regularly by transactions be allocated sequentially to improve disk access times; table reorganizations are occasionally necessary in order to improve data layout on disk. In general, there are three types of data to be managed: (a) heap files, (b) index structures, and (c) metadata.
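To make the log manager interface of Section 2.2.2 more concrete, the sketch below shows a minimal write-ahead log buffer. It is only an illustration under simplifying assumptions, not Shore-MT code: all names (LogManager, append, commit_wait, flush) are hypothetical, the buffer never wraps, and the actual write to stable storage is omitted. It captures the two rules stated above: records are first appended to a main-memory buffer, and a transaction may not commit before its log records are durable.

    #include <cstddef>
    #include <cstdint>
    #include <condition_variable>
    #include <mutex>
    #include <vector>

    // Minimal write-ahead logging sketch (hypothetical names, not Shore-MT code).
    // Transactions append records to a main-memory buffer; a commit may not
    // return before every record up to the transaction's commit LSN is durable.
    class LogManager {
    public:
        // Append a log record and return the LSN just past it.
        uint64_t append(const void* rec, size_t len) {
            std::lock_guard<std::mutex> guard(mutex_);   // serializes all inserts
            const char* bytes = static_cast<const char*>(rec);
            buffer_.insert(buffer_.end(), bytes, bytes + len);
            next_lsn_ += len;
            return next_lsn_;
        }

        // Block a committing transaction until its commit record is on disk.
        void commit_wait(uint64_t commit_lsn) {
            std::unique_lock<std::mutex> guard(mutex_);
            flushed_cv_.wait(guard, [&] { return durable_lsn_ >= commit_lsn; });
        }

        // Called by a flusher thread: harden the buffer and wake up committers.
        void flush() {
            std::unique_lock<std::mutex> guard(mutex_);
            // ... write buffer_ to the log file and fsync() it (omitted) ...
            durable_lsn_ = next_lsn_;
            buffer_.clear();
            flushed_cv_.notify_all();
        }

    private:
        std::mutex mutex_;                  // protects everything below
        std::condition_variable flushed_cv_;
        std::vector<char> buffer_;          // main-memory log buffer
        uint64_t next_lsn_ = 0;             // next byte offset to assign
        uint64_t durable_lsn_ = 0;          // everything below this is on disk
    };

Note that the single mutex serializing every append in this sketch is precisely the kind of critical section whose contention grows with hardware parallelism; Chapter 5 shows how log buffer inserts can be downgraded to composable critical sections.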
A proper selection of indexes (part of physical design) and an effective optimizer can reduce query execution times by factors of a thousand or more by avoiding unnecessary disk scans when only one value is needed.
Heap data. Normal, unordered database records are stored in heap files. These provide sequential access to unknown (sets of) records, or random access to records whose location is known through other means (such as the result of an index probe).
Index data. Indexes provide key-based access to data, either based on a candidate key (unique) or on a key attribute which may map to many tuples. The most common index types are B+trees [BM70] and hash-based structures [FNPS79, Lit80]. The former give O(log n) access time to ordered data, while the latter give O(1) access time to unordered data. B+trees can also be used for range scans, whereas hash-based structures cannot; as a result, B+trees are the most frequently used indexing technique.
Metadata. The transaction processing engine maintains metadata about the physical layout of data on disk, especially as it relates to space management. This metadata is similar to file system metadata.
2.2.4 Metadata management
The transaction processing engine stores meta-information about the different objects (heap files, index structures) that store the database. Applications make heavy use of this metadata. From the perspective of the transaction processing engine, database metadata (such as the data dictionary or catalog) is just another type of data. The transaction processing engine ensures that changes to metadata and free space do not corrupt running transactions, while also servicing a high volume of requests, especially for metadata. Part of the metadata is updated very infrequently: typically, the set of objects (heap files and index structures) in use in a database, along with their structure (number of columns, their order, and data types), changes very rarely. Thus, systems are usually optimistic and cache the metadata throughout the connection session.
2.2.5 Buffer pool management
The buffer pool manager presents the rest of the system with the illusion that the entire database resides in main memory, similar to an operating system's virtual memory manager. The buffer pool is a set of "frames," each of which can hold one page of data from disk. When an application requests a database page not currently in memory, it must wait while the buffer pool manager fetches it from disk. If there is no free frame for the newly fetched page, the buffer pool needs to evict another page, following a replacement policy (e.g. Least-Recently Used, CLOCK [Smi78], 2Q [KCK+ 00], or ARC [MM03]). Transactions "pin" in-use pages in the buffer pool to prevent them from being evicted, and unpin them when finished. As part of the logging and recovery protocol, the buffer pool manager and log manager are responsible for ensuring that modified pages are flushed to disk (preferably in the background) so that changes to in-memory data become durable. In order to quickly find any requested database page, buffer pools are typically implemented as large hash tables. Operations within the hash table must be protected from concurrent structural changes caused by evictions, usually with per-bucket mutex locks.
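The following sketch illustrates this organization: a hash table with one mutex per bucket, and pin counts that keep in-use frames from being evicted. It is a minimal illustration under simplifying assumptions (eviction, disk I/O, and page latching are omitted), and all names are hypothetical rather than taken from any particular engine.

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <mutex>

    // Sketch of a buffer-pool hash table with one mutex per bucket (hypothetical
    // names; eviction, disk I/O, and page latching are omitted).
    struct Frame {
        uint64_t page_id = 0;
        std::atomic<int> pin_count{0};       // pinned frames must not be evicted
        // ... the page latch and the page image itself would live here ...
    };

    class BufferPool {
    public:
        static constexpr size_t kBuckets = 1024;

        // Look up a page and pin it. The critical section protects only the
        // bucket's chain, not the page contents.
        Frame* fix(uint64_t page_id) {
            Bucket& b = buckets_[page_id % kBuckets];
            std::lock_guard<std::mutex> guard(b.mutex);
            for (Frame* f : b.chain) {
                if (f->page_id == page_id) {
                    f->pin_count.fetch_add(1);   // pin before leaving the bucket
                    return f;                    // caller latches the page next
                }
            }
            return nullptr;   // miss: caller must pick a victim and read from disk
        }

        void unfix(Frame* f) {
            f->pin_count.fetch_sub(1);           // frame becomes evictable again
        }

    private:
        struct Bucket {
            std::mutex mutex;                    // per-bucket critical section
            std::list<Frame*> chain;             // frames hashing to this bucket
        };
        std::array<Bucket, kBuckets> buckets_;
    };

Note that fix() pins the frame while still holding the bucket mutex, so that the frame cannot be recycled between the lookup and the subsequent latch acquisition.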
Hash collisions and hot pages can cause contention among threads for the hash buckets; growing memory capacities and hardware context counts increase the frequency of page requests, and hence the pressure the buffer pool must deal with. Finally, the buffer pool must flush dirty pages and identify suitable candidates for eviction without negatively impacting application requests (either by evicting the wrong candidates or by preventing applications from accessing pages in memory).
2.2.6 Concurrency control
Database engines must enforce logical consistency at the transaction level, ensuring that transactions do not interfere with the correctness of other concurrent transactions. One of the most intuitive (and restrictive) consistency models is "two-phase locking" (2PL), which dictates that a transaction may not acquire any new locks once it has released any. This scheme is sufficient to ensure that all transactions appear to execute in some serial order, though it can also restrict concurrency. An even more restrictive alternative is "strict two-phase locking" (strict 2PL), which in addition to 2PL dictates that all locks (shared and exclusive) are released only after the transaction commits or aborts. In order to balance the overhead of locking with concurrency, the transaction processing engine also provides hierarchical locks. For example, to modify a single row a transaction acquires a database lock, table lock, and row lock; meanwhile, transactions which access a large fraction of a table may reduce overhead by "escalating" to coarser-grained locking at the table level. An increasing number of database engines also support an alternative to pessimistic locking known as multiversioned buffer management [BG83], which provides the application with snapshot isolation [BJK+ 97]. These schemes allow writers to update copies of the data rather than waiting for readers to finish. Copying avoids the need for most low-level locking and latching because older versions remain available to readers. Multiversioning is highly effective for long queries which would otherwise conflict with many short-running update transactions, but performs poorly under contention. In addition, many transactions update only a few bytes per record accessed, and multiversioning imposes the cost of copying an entire database page per record. Finally, snapshot isolation suffers from certain non-intuitive isolation anomalies that have only partly been addressed to date [JFRS07, AFR09].
Figure 2.2: An OLTP installation. There are four main components: a (possibly multicore) processor, ample main memory, a storage subsystem that stores the database, and a storage subsystem that maintains the database log. There are three main types of I/O activity: (1) logging, (2) write-backs of dirty pages, and (3) ad-hoc reads of random pages.
2.3 I/O activities in transaction processing
Transaction processing systems need to handle efficiently the various I/O activities that occur when processing transactions. In fact, handling I/O efficiently has been one of the biggest concerns of transaction processing system designers over the past decades, especially when database servers were mostly uniprocessors and mechanical hard drives were very slow.
But, the emergence of multicore processors and other technologies, such as flashbased storage devices, had brought up other issues, such as the scalability on highly parallel hardware, which is the main topic of this dissertation. In this section we analyze the possible I/O activities during transaction processing and make the case that for a significant range of transactional applications, our study is valid even though it is performed on machines that do not have very efficient (and expensive) I/O subsystems. In the following subsections, we identify the different I/O activities and categorize their resulting I/O patterns. The three main types of I/O activities during transaction processing, depicted in Figure 2.2, are the transactional logging, the ad-hoc reads and evictions of random pages, and the write-backs of dirty pages. 2.3.1 Logging The first type of I/O activity is the log writing. The transaction processing system maintains a log to ensure the ACID properties over crashes or hardware failures. The log is a file of modifications done to the database. Before a transaction commits, the redo log records generated by the transaction must be flushed to stable media [MHL+ 92]. The log is typically implemented as a circular buffer of constant size. When the log needs to wrap around (the stable storage allocated for log becomes full) the oldest log records will be truncated to make space for new records. For databases whose working set fits in main-memory, which is a common case in modern database servers with large main memories, the log flushing is the only I/O activity that needs to take place synchronously during the execution of a transaction. The system must ensure that a transaction’s log records reach non-volatile storage before committing. With access times in the order of milliseconds, a log flush to magnetic media can easily become the longest part of a transaction. Further, log flush delays become serial if the log device is overloaded by multiple small requests. Fortunately, The I/O pattern of the log writing are essentially sequential appends, which can be easily handled by modern solidstate drives. Thus, log flush I/O times become less important as fast solid-state drives gain popularity [BJB09, LMP+ 08, Che09], and when using techniques such as group commit [HSL+ 89, RD89]. 24 CHAPTER 2. BACKGROUND: TRANSACTION PROCESSING Even when the log storage device can sustain the write rate needed by the system, transactional log flushing causes at least two problems: (a) the actual I/O wait time during which all the locks are held possibly reducing concurrency, and (b) the context switches required to block and unblock the thread at either end of the flush. To that end, our work on Aether logging [JPS+ 10, JPS+ 11] presents two techniques that remove those potential problems. We briefly discuss those techniques in the next two paragraphs. For more details about those two mechanism and performance evaluation, the interested reader is referred to [JPS+ 10] and [JPS+ 11]. Early Lock Release. To handle the problem of long waits for locks held during log flushes, DeWitt et al. [DKO+ 84] observe that a transaction’s locks can be released before its commit record is written to disk, as long as it does not return results to the client before becoming durable. Other transactions which read data updated by a pre-committed transaction become dependant on it and must not be allowed to return results to the user until both their own and their predecessor’s log records have reached the disk. 
Serial log implementations preserve this property naturally, because the dependant transaction’s log records must always reach the log later than those of the pre-committed transaction and will therefore become durable later also. Formally, as shown in [SSY95], the system must meet two conditions for early lock release to preserve recover-ability: (a) Every dependant transaction’s commit log record is written to the disk after the corresponding log record of pre-committed transaction; and (b) when a pre-committed transaction is aborted all dependant transactions must also be aborted. Most systems meet this condition trivially; they do no work after inserting the commit record, except to release locks, and therefore can only abort during recovery when all uncommitted transactions roll back. Early Lock Release (ELR) removes log flush latency from the critical path by ensuring that only the committing transaction must wait for its commit operation to complete; having released all held database locks, others can acquire these locks immediately and continue executing. In spite of its potential benefits modern database engines do not implement ELR and to our knowledge this is the first paper to analyze empirically ELR’s performance. We hypothesize that this is largely due to the effectiveness of asynchronous commit [Ora05, Pos10], which obviates ELR and which nearly all major systems do provide. However, systems which do not sacrifice durability can benefit strongly from ELR under workloads which exhibit lock contention and/or long log flush times. Flush Pipelining. Optimizations such as group commit [RD89] focus on improving I/O wait time without addressing thread scheduling. On the other hand, ELR decreases the 2.3. I/O ACTIVITIES IN TRANSACTION PROCESSING 25 wait time of other transactions that wait on locks held by the transaction that does the log flush. Still the requesting transaction must still block for its log flush I/O and be rescheduled as the I/O completes. Unlike I/O wait time, which the OS can overlap with other work, each scheduling decision consumes several microseconds of CPU time which cannot be overlapped. To eliminate the scheduling bottleneck (and thereby increase CPU utilization and throughput), the database engine must decouple the transaction commit from thread scheduling. Flush Pipelining is a technique which allows agent threads to detach from transactions during log flush in order to execute other work, resuming the transaction once the flush is completed. Flush Pipelining operates as follows. First, agent threads commit transactions asynchronously (without waiting for the log flush to complete). However, unlike asynchronous commit they do not return immediately to the client but instead detach from the transaction, encode its state at the log and continue executing other transactions. A daemon thread triggers log flushes using policies similar to those used in group commit (e.g. “flush every X transactions, L bytes logged, or T time elapsed, whichever comes first”). After each I/O completion, the daemon notifies the agent threads of newly-hardened transactions, which eventually reattach to each transaction, finish the commit process and return results to the client. Transactions which abort after generating log records must also be hardened before rolling back. The agent threads handle this case as relatively rare under traditional (non-optimistic) concurrency control and do not pass the transaction to the flush daemon. 
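The sketch below shows, under simplifying assumptions, the control flow just described: agent threads hand off committed-but-not-yet-durable transactions to a flush daemon and immediately continue with other work, while the daemon applies a group-commit-style flush policy and notifies completions. All names (FlushPipeline, submit, daemon_loop, and the helper functions) are hypothetical; this is not the Aether or Shore-MT implementation, and client notification and re-attachment are reduced to stubs.

    #include <chrono>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>

    // Sketch of the flush-pipelining control flow (hypothetical names): agent
    // threads hand committed-but-not-yet-durable transactions to a daemon and
    // continue with other work; the daemon flushes the log and signals completion.
    struct PendingCommit {
        uint64_t commit_lsn;   // LSN of the transaction's commit record
        uint64_t xct_id;       // used to re-attach and reply to the client later
    };

    class FlushPipeline {
    public:
        // Agent thread: enqueue and return immediately to run other transactions.
        void submit(PendingCommit pc) {
            std::lock_guard<std::mutex> g(mutex_);
            pending_.push(pc);
            cv_.notify_one();
        }

        // Daemon thread: group-commit-style policy ("flush every X transactions,
        // L bytes logged, or T time elapsed, whichever comes first").
        void daemon_loop() {
            using namespace std::chrono_literals;
            for (;;) {
                std::unique_lock<std::mutex> g(mutex_);
                cv_.wait_for(g, 5ms, [&] { return pending_.size() >= 32; });
                if (pending_.empty()) continue;
                const uint64_t target = pending_.back().commit_lsn;
                g.unlock();
                flush_log_up_to(target);               // one I/O covers the group
                g.lock();
                while (!pending_.empty() && pending_.front().commit_lsn <= target) {
                    notify_hardened(pending_.front().xct_id);  // agent re-attaches
                    pending_.pop();                            // and replies
                }
            }
        }

    private:
        void flush_log_up_to(uint64_t /*lsn*/) { /* write and fsync the log */ }
        void notify_hardened(uint64_t /*xct_id*/) { /* wake the owning agent */ }

        std::mutex mutex_;
        std::condition_variable cv_;
        std::queue<PendingCommit> pending_;
    };

The key point is that no agent thread ever blocks on the log I/O itself; the only waiting happens inside the daemon.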
(Note that most transaction rollbacks that are not due to deadlocks arise because of invalid inputs; these usually abort before generating any log records and do not have to be considered.) When combined with ELR, Flush Pipelining provides the same throughput as asynchronous commit without sacrificing any safety. In summary, fast solid-state drives and techniques such as Early Lock Release and Flush Pipelining handle the performance problems related to log flushing, which may be the only source of I/O for transactional databases that fit in main memory.
2.3.2 On-demand reads & evictions
The second type of I/O activity is the on-demand reads of random pages. If the OLTP database is larger than the buffer pool, the database pages to be accessed by a transaction may be missing from the buffer pool. In such a situation, the system issues read I/O requests on demand to retrieve the required pages into main memory for processing. The execution of the transaction blocks until the required pages are brought into memory. The resulting I/Os are small random reads, equal in size to the database page size. While the transaction is blocked, another transaction starts running on the CPU, keeping it utilized. As long as there are many concurrent requests that can be served, the CPUs of the system will be fully utilized. Solid-state drives provide two orders of magnitude higher random read bandwidth than mechanical disk drives; consequently, to keep a machine with solid-state drives fully utilized, the number of outstanding transactions needs to be two orders of magnitude smaller than if the system had mechanical hard drives. If context switches become a significant fraction of the execution time, a system can bypass this problem by employing asynchronous non-blocking I/Os. There is also the case where the buffer pool is full and decides to evict a dirty page, which needs to be written back. In that case the resulting I/O is again a random write. Such cases are less frequent than the random reads because the buffer pool manager will typically prefer to replace a clean rather than a dirty page. The reason is simple: in order to replace a clean page the system has only to copy the new page into the frame occupied by the old (clean) page, whereas in order to replace a dirty page the system first has to write it back to stable storage, doubling the I/Os performed to bring a new page into the buffer pool. Since we are mostly interested in servers with ample main memory, we expect the evictions of dirty pages to be extremely rare.
2.3.3 Dirty page write-backs
The final type of I/O activity is the dirty page write-backs forced by a checkpoint or log truncation. A database page in the buffer pool becomes dirty if it is modified by a transaction. Dirty pages are not written back to stable storage synchronously during transaction execution. Instead, page-cleaning threads perform the write-backs asynchronously in the background. In this way, multiple transactions may make modifications to (different parts of) the same page in the buffer pool, reducing the number of write-back I/Os. Additionally, those I/Os do not add to the response time of the transaction, which should be kept as low as possible. A dirty page, however, will eventually have to be written back if its associated redo log records are to be truncated because of a log wrap-around or a checkpoint call.
A checkpoint is forced when the log exceeds the size from which the database could recover within the time defined by the application as the acceptable recovery interval. (In case of a failure, recovery starts from the last checkpoint [MHL+ 92].) Hence, the mean time to recovery, specified by the application, determines the size of the log and the checkpoint frequency. The higher the checkpoint frequency (or the smaller the log size), the larger the pressure on the underlying storage system and the larger the probability of the system becoming I/O-bound. On the other hand, the smaller the mean time to recovery, the better for the application. Therefore the system should be able to apply frequent checkpoints without detriment to performance. Dirty pages are flushed to the same location on stable storage they came from. Since the dirty pages may be distributed across the entire database, the resulting I/O activity is typically a set of small random writes. The page cleaners try to find consecutive dirty pages so that they can write them to disk in larger blocks, but their approach is opportunistic and offers no guarantees.
2.3.4 Summary and a note about experimental setups
To summarize, during OLTP we encounter three basic types of I/O activity: logging, which consists of sequential appends; random on-demand page reads on buffer pool misses and possible buffer pool evictions; and write-backs of dirty pages, which page cleaners try to coalesce into larger writes of consecutive pages. Software techniques, such as Early Lock Release and Flush Pipelining, in combination with fast solid-state drives which provide enough sequential write bandwidth, prevent I/O flush wait times from being the bottleneck in transactional workloads. For example, in our experiments the maximum log write rate we encountered was around 180MB/sec (in Section 5.4.2). This was achieved when all 64 hardware contexts of a multicore machine were fully utilized running an update-heavy transactional workload. (Most workloads also contain read-only transactions that reduce the pressure for log I/O.) Such a sequential write bandwidth can be easily handled by solid-state drives. For example, one recently announced solid-state drive is reported to provide up to 520MB/sec sequential write bandwidth [Int12]. Also, throughout our experimentation we did not observe a case where the page cleaners could not keep up with the log writing rate. At the same time, solid-state drives provide low latency and a large number of random I/O operations per second (for example, the Intel SSD 520 is reported to provide 80K IOPS [Int12]), so the number of concurrent transactions needed to keep a multicore server fully utilized is relatively small, two orders of magnitude smaller than on machines with magnetic hard drives. With the previous analysis in mind, we believe that I/O will not prevent transaction processing systems from being CPU-bound. That is why this dissertation focuses on the performance and scalability of transaction processing systems when multicore servers are fully utilized. The majority of the combinations of hardware and transaction processing systems we use throughout this dissertation are capable of delivering high performance on the benchmarks described in the next section, as long as the I/O sub-system allows. The demand on the I/O sub-system scales with throughput.
To by-pass this problem, and yet have meaningful analysis we put the database files in a file system in main-memory. Threads that need to perform and I/O still have to context switch, but we ensure that I/Os are not the bottleneck. In some other experiments with systems that we have access to the source code (such as Shore-MT [JPH+ 09], DORA [PJHA10] and PLP [PTJA11]), we modify the system to impose a 6 msec penalty for each I/O operation. The artificial delay simulates a high-end disk array having many spindles, such that all requests can proceed in parallel but must each still pay the cost of a disk seek. This arrangement is somewhat pessimistic because it assumes every access requires a full seek even if there is some sequential component to the access pattern, but it ensures that all aspects of the transaction processing system are exercised. We observed that the quantitative analysis did not change regardless if we were keeping the data in main-memory or imposing the artificial delay. 2.4 Transactional processing workloads and benchmarks The concurrency of requests and the need to maintain the ACID requirements are the main characteristics of transactional workloads. Recently, due to the high increase in the need for data management services, and because some applications can tolerate it, systems relax ACID requirements (e.g. [Vog09]) or drop “traditional” data processing capabilities of database management systems (e.g. the key-value stores [CDG+ 06, DHJ+ 07]). Transaction processing benchmarks are the gold standards for performance, and they are used for marketing purposes. The following subsections describe several important database transaction processing benchmarks mentioned and used throughout this dissertation. 2.4.1 TPC-A The first widely-accepted database benchmark was formalized in 1985 [A+ 85]. That specification included three workloads, of which the “DebitCredit” stressed the database engine. The DebitCredit benchmark was an instant success, soon database and hardware vendors took to reporting extraordinary results, often achieved by removing key constraints from the specification. Therefore, in 1988 a consortium of analysts and hardware, operating system, and database system vendors formed the Transaction Processing Performance Council (TPC) in order to enforce some order in database benchmarking. Its first benchmark specification, TPC-A, essentially formalized the DebitCredit benchmark. TPC-A is very simple. It models deposits and withdrawals on random bank accounts, with the associated double-entry accounting on a database that contains 10k Branches, 100k Tellers, and 10M Accounts. It also captures the entire system, including terminals and 2.4. OLTP WORKLOADS AND BENCHMARKS 29 network. Transactions usually originate from their “home” Branch, but can go anywhere; conflicts are also possible, requiring the system to recover occasionally from failed transactions. An important aspect of this benchmark is its scaling rule: for a result to be valid, the database size must be proportional to the reported throughput. Simple though it was, the DebitCredit benchmark highlighted the importance of quantifying the performance and correctness of different systems – early benchmarking showed vast performance differences between different vendors (400x), as well as exposing serious bugs which had lurked, undiscovered, for many years in mature products. 
2.4.2 TPC-B TPC’s second benchmark, TPC-B [TPC94], was also very similar to DebitCredit, but cut out the network and terminal handling to create a database engine stress test. Like DebitCredit, the TPC-B database contains four tables: Branch, Teller, Account, and History, which are accessed in double-entry accounting style as customers make deposits and withdrawals from various tellers. The benchmark consists of a single transaction AccountUpdate and stresses the transaction processing engine heavily, especially logging and concurrency control. 2.4.3 TPC-C For its third benchmark specification, TPC-C [TPC07], the TPC moved away from banking to commerce. TPC-C models an online transaction processing database for a wholesale supplier. It consists of five transactions which follow customer orders from initial creation to final delivery and payment. Below we briefly describe the five transactions and in the parenthesis we show the frequency of each transaction as specified by the benchmark. • New Order (45%). The NewOrder inserts a new sales order into the database. It is a medium-weight transaction with 1% failure rate due to invalid inputs. • Payment (43%). The Payment is a short transaction, very similar to the transaction of TPC-B, which makes a payment on an existing order. • Order Status (4%). The OrderStatus is a read-only transaction which computes the shipping status and the line items of an order. • Delivery (4%). The Delivery is the largest update transaction and also the most contentious. It selects the oldest undelivered orders for each warehouse and marks them as delivered. • Stock Level (4%). The StockLevel is also a read-only transaction. It joins on average 200 order line items with their corresponding stock entries in order to produce a report. CHAPTER 2. BACKGROUND: TRANSACTION PROCESSING 30 The benchmark combines the five transactions listed above at their specified frequencies. The specification lays out strict requirements about response time, consistency, and recoverability in the system, and returned to testing an end-to-end system that includes network and terminal handling. Like the transactions, the database schema is more complex, consisting of nine tables instead of four; where prior benchmark schemas could be represented as a tree the TPC-C schema is a directed acyclic graph. TPC-C stresses the entire stack (database system, operating system and hardware) in several ways. First, it mixes together short and long, read-only and update-intensive transactions, exercising a wider variety of features and situations than previous benchmarks. In addition, the benchmark has significant hotspots, partly from the way transactions access the Warehous table, and partly from the way the Delivery transaction is designed. The resulting contention and deadlocks stress the system’s concurrency control mechanisms. Finally, the database grows throughout the benchmark run, stressing code paths which previous benchmarks had not touched. TPC-C is the most popular OLTP benchmark for over twenty years. All database vendors have published results in TPC’s website, and in several occasions it has been used for marketing purposes.8 2.4.4 TPC-E 9 The goal of TPC with its latest OLTP benchmark, TPC-E, was to make a more realistic than TPC-C [TPC10], which is getting rather old. TPC-E incorporates several features that are found in real-world transaction processing applications but missing in TPC-C, such as check constraints and referential integrity. 
In addition, the TPC-E databases are populated with pseudo-real data based on the year 2000 U.S. and Canada census data and on actual listings on the NYSE and NASDAQ stock exchanges. In this way, TPC-E reflects natural data skews in the real world, addressing the complaint that TPC-C uses random data that do not reflect real-world data distributions. TPC-E models a financial brokerage house. There are three components: customers, brokerage house, and stock exchange. TPC-E's focus is the database system supporting the brokerage house, while the customers and the stock exchange are simulated to drive transactions at the brokerage house. (Examples of TPC-C results used for marketing purposes: http://www.oracle.com/us/solutions/performance-scalability/t3-4-tpc-c-12210-bmark-190934.html and http://www-03.ibm.com/press/us/en/pressrelease/32328.wss. Some of the material of this subsection is presented in [CAA+ 10].) There are 33 tables in TPC-E, over three times as many as in TPC-C. Of the 33 TPC-E tables, 9 record customer account information, 9 record broker and trade information, 11 are market-related, and 4 are dimension tables for addresses and fixed information such as zip codes. 19 of the 33 tables scale with the number of customers, 5 tables grow during TPC-E runs, and the remaining 9 tables are static. TPC-E has 6 read-only and 4 read-write transaction types, compared to 2 read-only and 3 read-write transaction types in TPC-C. 76.9% of the generated transactions in TPC-E are read-only, while only 8% of TPC-C transactions are read-only. This suggests that TPC-E is more read-intensive than TPC-C. TPC-E's test setup is complicated, requiring the development of customer and market drivers. The complexity of TPC-E and the lack of in-depth understanding of it have so far led to its slow adoption by both industry and academia. For example, only one database vendor has posted results of this benchmark on TPC's website. In addition, TPC-E imposes far less stress on the database engine than TPC-C or TPC-B [CAA+ 10]. Since the goal of this dissertation is to improve the scalability of database engines when their internals are under stress, we have elected not to use it.
2.4.5 TATP
The only benchmark we use in this dissertation that is not specified by the Transaction Processing Performance Council is the Telecommunications Application Transaction Processing benchmark, or TATP [NWMR09], also known as "Telecom One" or "TM-1", and as the "Network Database Benchmark" or "NDBB". TATP was originally developed by Nokia as an in-house test to verify the suitability of various database, operating system and hardware offerings for use with Nokia's telecommunications business. TATP consists of seven transactions, operating on four database tables, which implement various Home Location Register operations executed by mobile networks during cell phone calls, including cell tower hand-offs and call forwarding. The transactions are extremely short, usually accessing only 1-4 database rows, and must execute with very low latency even under extreme load. The benchmark is unusual in that many transactions fail, either due to invalid inputs or because they probe for non-existent entries. The very high failure rate, 25% on average, stresses the logging and recovery components of the system and may cause deadlocks. Three of the transactions are read-only while the other four perform updates (execution frequencies in parentheses):
• Get Subscriber Data (35%). The GetSubscriberData is a read-only transaction which retrieves information about the location of a subscriber, accessing a single table.
It is one of the two transactions of the benchmark that are not expected to fail.
• Get New Destination (10%). The GetNewDest is a read-only transaction which retrieves the current call forwarding destination for a subscriber, if any, touching two tables. There is a 75% probability that the subscriber does not have a current call forwarding destination, in which case the transaction fails.
• Get Access Data (35%). The GetAccData is a read-only transaction which returns the subscriber's access validation data by probing a single table. There is a 37.5% probability that this transaction fails.
• Update Subscriber Data (2%). The UpdSubData transaction updates a subscriber's profile, touching two tables, with a 37.5% failure rate.
• Update Location (14%). The UpdLocation transaction updates the current location of a subscriber. It touches a single table and never fails.
• Insert Call Forwarding (2%). The InsCallFwd adds a call forwarding destination. It touches three tables (making it the most complicated transaction of the benchmark) and has a 68.75% failure rate.
• Delete Call Forwarding (2%). The DelCallFwd transaction removes a call forwarding destination. It touches two tables and has a 68.75% failure rate.
Reflecting its origins in cell phone call processing, the benchmark focuses on throughput, very low and predictable response times, and high availability. The description of the transactions shows how stressful this benchmark is for the database engine. All the transactions consist of one or a few index probes and record accesses, with very limited application logic. As a result, during the execution of this benchmark the system spends most of its time inside the database engine. For example, compared with TPC-C, TATP's dataset tends to be memory-resident and the transactions are much shorter, with vastly higher abort rates. TATP stresses the transaction processing engine heavily because the short transactions tend to expose overheads that longer transactions would mask.
Chapter 3
Scalability Problems in Database Engines
Database engines have long been able to efficiently handle multiple concurrent requests. Until recently, however, a computer contained only a few single-core CPUs, and therefore only a few transactions could simultaneously access the database engine's internal structures. This allowed database engines to get away with using non-scalable approaches without any severe penalty. With the arrival of multicore chips, however, this situation is rapidly changing. More and more threads can run in parallel, stressing the internal scalability of the database engine. Systems optimized for high performance at a limited number of cores are not assured similarly high performance at a higher core count, because unanticipated scalability obstacles arise. In this chapter, we first present the major components of a typical database engine, making clear that the codepaths in transaction execution are full of points of serialization. Then, we benchmark four popular open-source database engines (SHORE, BerkeleyDB, MySQL, and PostgreSQL) on a modern multicore machine. We find that all of them suffer in terms of scalability. (This chapter highlights some of our findings, which appeared in EDBT 2009 [JPH+ 09].)
3.1 Introduction
Most database engine designs date back to the 1970's or 1980's, when disk I/O was the predominant bottleneck. Machines typically featured 1-8 (uni)processors (up to 64 at the very high end) and limited RAM.
Single-thread speed, minimal RAM footprint, and I/O subsystem efficiency determined the overall performance of a storage manager. Efficiently multiplexing concurrent transactions to hide disk latency was the key to high throughput. Thus, research focused on efficient buffer pool management, fine-grain concurrency control, and sophisticated caching and logging schemes. Today's database systems face a different environment. Main memories are on the order of several tens of gigabytes, and play the role of disk for many applications whose working set fits in memory [Gra07b, SMA+ 07]. Modern CPU designs all feature multiple processor cores per chip, often with each core providing some flavor of hardware multithreading. For the foreseeable future we can expect single-thread performance to remain the same or increase slowly while the number of available hardware contexts grows exponentially. As a result, the database engine must be able to utilize the dozens of hardware contexts that will soon be available to it. However, as Section 3.2 discusses, the codepath of a typical database engine is full of points of serialization which may hamper scalability. Unfortunately, the internal scalability of database engines has not been tested under such rigorous demands before. To determine how well existing database engines scale, we experiment with four popular open-source systems: SHORE [CDF+ 94], BerkeleyDB (http://www.oracle.com/technology/products/berkeley-db/index.html), MySQL (http://www.mysql.com), and PostgreSQL [SR86]. The latter three engines are all widely deployed in commercial systems. We ran our experiments on a Sun T2000 (Niagara) server [KAO05], a "lean" multicore processor design [HPJ+ 07], featuring eight cores and four hardware thread contexts per core for a total of 32 OS-visible "processors." Our first experiment consists of a micro-benchmark (a small, tightly-controlled application) where each client in the system creates a private table and repeatedly inserts records into it (see Section 3.3.1 for details). The setup ensures there is no contention for database locks or latches, and that there is no I/O on the critical path.
Figure 3.1: Scalability of four popular open-source storage managers (MySQL, BDB, PostgreSQL, SHORE), with normalized throughput on the y-axis and the number of concurrent clients/transactions (up to 32) on the x-axis. A perfectly scalable system at 32 clients (the right-most point of the graph) would achieve 32x the performance of 1 client, more than three times higher than the highest-performing system under test.
Figure 3.1 shows the results of executing this micro-benchmark on each of the four storage managers. The number of concurrent threads varies along the x-axis, with the corresponding throughput for each engine on the y-axis. As the number of concurrent threads grows from 1 to 32, throughput in a perfectly scalable system should increase linearly. However, none of the four systems scales well, and their behavior varies from arriving at a plateau (PostgreSQL and SHORE) to a significant drop in throughput (BerkeleyDB and MySQL). These results suggest that, with core counts doubling every two years, none of these transaction processing systems is ready for the multicore era; even though database systems have utilized parallel hardware for decades (e.g.
[DG92, SKPO88, DGS+ 90, BAC+ 90, AVDBF+ 92]). It should be noted that Figure 3.1 summarizes the situation ca. 2009. The situation has improved markedly since we first ran these experiments. Developers of the various engines have focused on improving their scalability, sometimes reporting that they used techniques presented in this dissertation. In retrospect these results are understandable because when these engines were developed, internal scalability was not a bottleneck and designers did not foresee the coming shift to multicore hardware. At the time, it would have been difficult to justify spending considerable effort in this area. Today, however, internal scalability of database engines is the key to performance as core counts continue increasing. The rest of this chapter is structured as follows. In Section 3.2, we briefly overview the major components of a database engine and list the kinds of critical sections they use. Then, in Section 3.3, we measure the scalability of popular database engines, and we conclude in Section 3.4. 3.2 Critical sections inside a database engine Database engines purposefully serialize transaction threads in three ways. Database locks enforce consistency and isolation between transactions by preventing other transactions from accessing the lock holder’s data. Locks are a form of logical protection and can be held for long duration (potentially several disk I/O times). Latches protect the physical integrity of database pages in the buffer pool, allowing multiple threads to read them simultaneously, or a single thread to update them. Transactions acquire latches just long enough to perform physical operations (at most one disk I/O), depending on locks to protect that data until transaction commit time. Locks and latches have been studied extensively [ACL87, GL92]. Database locks are especially expensive to manage, prompting proposals for hardware acceleration [Rob85]. Critical sections form the third source of serialization. Database engines employ many complex, shared data structures; critical sections (usually enforced with semaphores or mutex 36 CHAPTER 3. SCALABILITY PROBLEMS IN DATABASE ENGINES locks) protect the physical integrity of these data structures in the same way that latches protect page integrity. Unlike latches and locks, critical sections have short and predictable duration’s because they seldom span I/O requests or complex algorithms [JPA08, HM93]; often the thread only needs to read or update a handful of memory locations. For example, a critical section might protect traversal of a linked list. Critical sections abound throughout the codepath of typical database engines. For example, in Shore-MT (the storage manager we present next in Chapter 5), we estimate that a TPC-C Payment transaction – which only touches 4-6 database records (see Section 2.4.3) – enters roughly one hundred critical sections before committing. Under these circumstances, even uncontended critical sections are important because the accumulated overhead can contribute a significant fraction of overall cost. The rest of this section presents an overview of major storage manager components and lists the kinds of critical sections they use. Buffer pool manager. The buffer pool manager maintains a pool for in-memory copies of in-use and recently-used database pages and ensures that the pages on disk and in memory are consistent with each other. The buffer pool consists of a fixed number of frames which hold copies of disk pages and provide latches to protect page data. 
The buffer pool uses a hash table that maps page IDs to frames for fast access, and a critical section protects the list of pages at each hash bucket. Whenever a transaction accesses a persistent value (data or metadata) it must locate the frame for that page, pin it, then latch it. Pinning prevents the pool manager from evicting the page while a thread acquires the latch. Once the page access is complete, the thread unlatches and unpins the page, allowing the buffer pool to recycle its frame for other pages if necessary. Page misses require a search of the buffer pool for a suitable page to evict, adding yet another critical section. Overall, acquiring and releasing a single page latch requires at least 3-4 critical sections, and more if the page gets read from disk. Lock manager. Database locks preserve isolation and consistency properties between transactions. Database locks are hierarchical, meaning that a transaction wishing to lock one row of a table must first lock the database and the table in an appropriate intent mode. Hierarchical locks allow transactions to balance granularity with overhead: fine-grained locks allow high concurrency but are expensive to acquire in large numbers. A transaction which plans to read many records of a table can avoid the cost of acquiring row locks by escalating to a single table lock instead. However, other transactions which attempt to modify unrelated rows in the same table would then be forced to wait. The number of possible locks 3.3. SCALABILITY OF EXISTING ENGINES 37 scales with the size of the database, so the storage engine maintains a lock pool very similar to the buffer pool. The lock pool features critical sections that protect the lock object free list and the linked list at each hash bucket. Each lock object also has a critical section to “pin” it and prevent recycling while it is in use, and another to protect its internal state. This means that to acquire a row lock, a thread enters at least three critical sections for each of the database, table, and row lock. Log manager. The log manager ensures that modified pages in memory are not lost in the event of a failure: all changes to pages are logged before the actual change is made, allowing the page’s latest state to be reconstructed during recovery. Every log insert requires a critical section to serialize log entries and another to coordinate with log flushes. An update to a given database record often involves several log entries due to index and metadata updates that go with it. Free space management. The storage manager maintains metadata which tracks disk page allocation and utilization. This information allows the storage manager to allocate unused pages to tables efficiently. Each record insert (or update that increases record size) requires entering several critical sections to determine whether the current page has space and to allocate new pages as necessary. Note that the transaction must also latch the free space manager’s metadata pages and log any updates. Transaction management. The system maintains a total order of transactions in order to resolve lock conflicts and maintain proper transaction isolation. Whenever a transaction begins or ends this global state must be updated. In addition, no transaction may commit during a log checkpoint operation, in order to ensure that the resulting checkpoint is consistent. Finally, multi-threaded transactions must serialize the threads within a transaction in order to update per-transaction state such as lock caches. 
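To make the lock manager's hierarchy more concrete, the sketch below shows the database-table-row descent described above. It is a deliberately simplified illustration with hypothetical names: there is no waiting, no lock modes beyond an enum, and the single mutex stands in for the several critical sections (free list, hash bucket, pin, and lock state) that a real lock manager enters per acquisition.

    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>

    // Sketch of hierarchical (intent) lock acquisition with hypothetical names.
    // A real lock manager also tracks requests per transaction, supports waiting
    // and upgrades, performs lock escalation, and detects deadlocks.
    enum class LockMode { IS, IX, S, X };   // intent-shared, intent-exclusive, ...

    class LockManager {
    public:
        // In a real engine each acquire() is itself several critical sections:
        // find-or-create the lock object, pin it, then update its request queue.
        void acquire(const std::string& resource, LockMode mode) {
            std::lock_guard<std::mutex> g(mutex_);
            locks_[resource] = mode;        // grossly simplified: no waiting
        }

        void release_all() {
            std::lock_guard<std::mutex> g(mutex_);
            locks_.clear();                 // strict 2PL: release at commit/abort
        }

    private:
        std::mutex mutex_;
        std::map<std::string, LockMode> locks_;
    };

    // To update one row, a transaction descends the hierarchy:
    inline void lock_row_for_update(LockManager& lm, uint64_t row_id) {
        lm.acquire("db", LockMode::IX);                                   // database
        lm.acquire("db/accounts", LockMode::IX);                          // table
        lm.acquire("db/accounts/" + std::to_string(row_id), LockMode::X); // row
    }

Counting the calls in lock_row_for_update makes it easy to see why acquiring a single row lock translates into roughly nine critical sections in a conventional engine.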
3.3 Scalability of existing engines Obviously the actual number and behavior of critical sections differs depending on the specific implementation of the database engine under test. That’s why in this section we measure the internal scalability of various database engines. We begin by describing the experimental environment and then we proceed to the actual evaluation. 38 3.3.1 CHAPTER 3. SCALABILITY PROBLEMS IN DATABASE ENGINES Experimental setup All experiments were conducted using a Sun T2000 (Niagara) server [KAO05, DLO05] running Solaris 10. The Niagara chip has an aggressive multi-core architecture with 8 cores clocked at 1GHz; each core supports 4 thread contexts, for a total of 32 OS-visible “processors.” The 8 cores share a common 3MB L2 cache and each of them is clocked at 1GHz. The machine is configured with 16GB of RAM and its I/O subsystem consists of a RAID-0 disk array with 11 15kRPM disks. We relied heavily on the Sun Studio development suite, which integrates compiler, debugger, and performance analysis tools. Unless otherwise stated every system is compiled using version 5.9 of Sun’s CC. All profiler results were obtained using the ‘collect’ utility, which performs sample-based profiling on unmodified executables and imposes very low overhead (< 5%). We evaluate four open-source database engines: PostgreSQL [SR86], MySQL, BerkeleyDB, and SHORE [CDF+ 94]. PostgreSQL v8.1.4. PostgreSQL is an open source database management system providing a powerful optimizer and many advanced features. We used a Sun distribution of PostgreSQL optimized specifically for the T2000. We configured PostgreSQL with a 3.5GB buffer pool, the largest allowed for a 32-bit binary.4 The client drivers make extensive use of SQL prepared statements. MySQL v5.1.22-rc. MySQL is a very popular open-source database server recently acquired by Sun. We configured and compiled MySQL from sources using InnoDB as the underlying transactional storage engine. InnoDB is a full transactional database engine (unlike the default, MyISAM). Client drivers use dynamic SQL syntax calling stored procedures because we found they provided significantly better performance than prepared statements. BerkeleyDB v4.6.21. BerkeleyDB is an open source, embedded database engine currently developed by Oracle and optimized for C/C++ applications running known workloads. It provides full database engine capabilities but client drivers link against the database library and make calls directly into it through the C++ API, avoiding the overhead of a SQL front end. BerkeleyDB is fully reentrant but depends on the client application to provide multithreaded execution. We note that BerkeleyDB is the only storage engine without rowlevel locking; its page-level locks can severely limit concurrency in transactional workloads. 4 The release notes mention sub-par 64-bit performance 3.3. SCALABILITY OF EXISTING ENGINES 39 SHORE v5.0.1. SHORE was developed at the University of Wisconsin in the early 1990’s and provides features that all modern DBMS use: full concurrency control and recovery with two-phase row-level locking and write-ahead logging, along with a robust implementation of B+Tree indexes. The SHORE database engine is designed to be either an embedded database or the back end for a “value-added server” implementing more advanced operations. Client driver code links directly to the database engine and calls into it using the API provided for value-added servers. Client code must use the threading library that SHORE provides. 
For comparison and validation of the results, we also present measurements from a commercial database manager (DBMS "X"). (Licensing restrictions prevent us from disclosing the vendor.) All database data resides on the RAID-0 array, with log files sent to an in-memory file system. The goal of our experiments is to exercise all the components of the database engine (including I/O, locking and logging), but without imposing I/O bottlenecks. Unless otherwise noted, all database engines were configured with 4GB buffer pools. We are interested in two metrics: throughput (e.g. transactions per second) and scalability (how throughput varies with the number of active threads). Ideally an engine would be both fast and scalable. Unfortunately, as we will see, database engines tend to be either fast or scalable, but not both. Our micro-benchmark repeatedly inserts records into a database table backed by a B-tree index. Each client uses a private table; there is no logical contention and no I/O on the critical path. (All the engines use asynchronous page cleaning and generated more than 40MB/sec of disk traffic during the tests.) Transactions commit every 1000 records, with one exception: we observed a severe bottleneck in log flushes for MySQL/InnoDB and modified its version of the benchmark to commit every 10000 records in order to allow a meaningful comparison against the other engines. Record insertion stresses primarily the free space manager, buffer pool, and log manager. In order to extract the highest possible performance from each storage manager, we customized our benchmarks to interface with each storage manager directly through its respective C API. Client code executed on the same machine as the database server, but we found the overhead of clients to be negligible (< 5%).
3.3.2 Evaluation of performance and scalability
We begin by benchmarking each database engine under test and highlight the most significant factors that limit its scalability. Due to lock contention in the transactional benchmarks, the internals of the engines do not face the kind of pressure they do on the insert-only benchmark. Thus we use the latter to expose the scalability bottlenecks at high core counts and to highlight the expected behavior of the transactional benchmarks as the number of hardware contexts per chip continues to increase.
Figure 3.2: Comparison of the efficiency (throughput per client, in tps/client, on a log scale) with which several storage engines (DBMS "X", PostgreSQL, MySQL, BDB, SHORE) execute transactions as the number of concurrent clients increases up to 32. Ideally, per-thread throughput remains steady as more threads (or utilization) join the system.
Figure 3.2 compares the scalability of the various engines when we run the insert-only micro-benchmark. This figure shows the efficiency of the system, measured in transactions per second per thread, plotted on a log-y axis. Higher is better, and a perfectly scalable system would maintain the same efficiency as thread counts increase. We use a log-y scale on the graphs because it shows scalability clearly without masking absolute performance. A linear y-axis is misleading because two systems with the same scalability will have differently-sloped lines, making the faster one appear less scalable than it really is. In contrast, a log-y graph gives the same slope to curves having the same scalability.
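For concreteness, the sketch below shows the shape of the per-client driver loop for the insert-only micro-benchmark described above. The storage-manager interface (open_table, begin_xct, insert, commit_xct) is hypothetical and reduced to stubs; in the actual experiments each engine was driven through its own C API, and the commit interval was 1000 records (10000 for MySQL/InnoDB).

    #include <cstdint>
    #include <string>
    #include <thread>
    #include <vector>

    // Sketch of the per-client driver for the insert-only micro-benchmark.
    // The storage-manager interface below is a hypothetical stand-in; the actual
    // experiments drove each engine through its own C API.
    struct StorageManager {
        void* open_table(const std::string& /*name*/) { return nullptr; }  // stub
        void  begin_xct() {}                                               // stub
        void  insert(void* /*table*/, uint64_t /*key*/,
                     const std::string& /*payload*/) {}                    // stub
        void  commit_xct() {}                                              // stub
    };

    // Each client inserts into its own private table, so there is no logical
    // contention: any serialization happens inside the engine's internals.
    void client_thread(StorageManager& sm, int client_id, uint64_t num_records,
                       uint64_t commit_every /* 1000; 10000 for InnoDB */) {
        void* table = sm.open_table("private_table_" + std::to_string(client_id));
        sm.begin_xct();
        for (uint64_t i = 0; i < num_records; ++i) {
            sm.insert(table, i, "payload");
            if ((i + 1) % commit_every == 0) {   // commit every N records
                sm.commit_xct();
                sm.begin_xct();
            }
        }
        sm.commit_xct();
    }

    void run_benchmark(StorageManager& sm, int num_clients) {
        std::vector<std::thread> clients;
        for (int c = 0; c < num_clients; ++c)
            clients.emplace_back([&sm, c] { client_thread(sm, c, 100000, 1000); });
        for (auto& t : clients) t.join();
        // Throughput = total commits / elapsed time; efficiency = throughput/client.
    }

Because every client works on its own table, any loss of scalability observed in this loop comes from serialization inside the engine, not from logical contention.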
To gain better insight into what is going on, we profile runs with multiple concurrent clients (16 or 24) stressing the storage engine. We then collect the results and interpret the call stacks to identify the operations where each system spends its time.
PostgreSQL. PostgreSQL suffers a loss of parallelism due to three main factors. First, contention for log inserts causes threads to block (XLogInsert). Second, calls to malloc add more serialization during transaction creation and deletion (CreateExecutorState and ExecutorEnd). Finally, transactions block while trying to lock index metadata (ExecOpenIndices), even though no two transactions ever access the same table. Together these bottlenecks account for only 10-15% of total thread time, but that is enough to limit scalability.
MySQL. MySQL/InnoDB is bottlenecked in two spots. The first is the interface to InnoDB: threads remain blocked in a function called srv_conc_enter_innodb for around 39% of the total execution time. The second is log flushing: in another function, labeled log_preflush_pool_modified_pages, the system again experiences blocking time equal to 20% of the total execution time (even after increasing the transaction length to 10K inserts). We also observe that MySQL spends a non-trivial fraction of its time in two malloc-related functions, take_deferred_signal and mutex_lock_internal. This suggests a potential for improvement by avoiding excessive use of malloc (trash stacks, object re-use, thread-local malloc libraries, etc.).
BerkeleyDB. BDB spends the majority of its time either testing the availability of a mutex or trying to acquire it: the system spends over 80% of its processing time in two functions named db_tas_lock and lock_try. Presumably the former is a spinning test-and-set lock while the latter is the test of the same lock. Together these likely form a test-and-test-and-set mutex primitive [RS84], which is supposed to scale better than the simple test-and-set. The excessive use of test-and-test-and-set (TATAS) locking explains the high performance of BDB in low-contention cases, since TATAS locks impose very little overhead under low contention, but they fail miserably under high contention. BerkeleyDB employs coarse-grained page-level locking, which by itself imposes scalability problems. The two callers for lock acquisition are bam_search and bam_get_root, so under high contention BDB spends most of its time trying to acquire the latches for tree probes. Additionally, we see that the system spends a significant amount of time blocked waiting in pthread_mutex_lock and cond_wait, most probably because the pthread mutexes are used as a fallback for acquiring the highly contended locks (i.e. spin-then-block).
DBMS "X". Unfortunately, the commercial database engine is significantly harder to profile, lacking debug symbols and making all system calls in assembly code rather than relying on standard libraries. However, we suspect from the system's I/O behavior and CPU utilization that log flush times are the main barrier to scalability.
SHORE. SHORE suffers multiple scalability bottlenecks, which is unsurprising since it was designed to multiplex all user-level threads over a single kernel thread.
This section highlights the bottlenecks we observed in existing database engines, bottlenecks which developers cannot ignore if the goal is a truly scalable system on emerging many-core hardware.
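As a reference for the primitive named above, the following sketch shows what a test-and-test-and-set (TATAS) spinlock typically looks like. This is a generic illustration, not BerkeleyDB's actual code: threads spin on a plain read until the lock looks free and only then attempt the atomic exchange, which keeps the overhead negligible under low contention but leads to the collapse observed above once many threads hammer the same lock.

    #include <atomic>

    // Generic test-and-test-and-set (TATAS) spinlock sketch. Waiters spin on a
    // plain load (the "test") and attempt the atomic exchange (the
    // "test-and-set") only when the lock appears free, which keeps the cache
    // line shared while waiting.
    class TatasLock {
    public:
        void lock() {
            for (;;) {
                // test: spin locally while the lock appears to be held
                while (locked_.load(std::memory_order_relaxed)) { /* spin */ }
                // test-and-set: try to grab it; retry if another thread won
                if (!locked_.exchange(true, std::memory_order_acquire))
                    return;
            }
        }

        void unlock() {
            locked_.store(false, std::memory_order_release);
        }

    private:
        std::atomic<bool> locked_{false};
    };

A production-quality version usually adds exponential backoff or a spin-then-block fallback onto a pthread mutex, matching the behavior we observed in the BDB profile.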
[Figure 3.3: Measured bottleneck sizes and scalability vs. predictions by Amdahl's Law. The plot shows scale-up for P=32 as a function of the degree of serialization s, with the measured points for PostgreSQL (8%), MySQL (59%), and BerkeleyDB (80%) overlaid on the predicted curve. As the PostgreSQL case illustrates, what appears to be a small bottleneck can still hamper scalability as the number of concurrent clients increases.]

3.3.3 Ramifications

We presented the major components of a typical database engine and made clear that the codepaths in transaction execution are full of critical sections. As a result, and as can be seen in the preceding Section 3.3.2, all the major open source database engines suffer scalability bottlenecks which prevent them from exploiting multicore hardware. Some of the bottlenecks (such as in PostgreSQL and DBMS "X") do not appear so large. However, Amdahl's Law [Amd67] captures just how difficult it can be to extract scalable performance from parallel hardware. To illustrate, Figure 3.3 overlays Amdahl's Law with a scatter plot of the scalability and bottleneck sizes for three of the database engines we evaluated in the previous section (we could not measure the size of the bottleneck for DBMS "X", and SHORE's bottleneck is absurdly large). As can be seen, the predicted and measured impacts of the bottlenecks are very close, verifying that small bottlenecks have a disproportionate impact on scalability. Even PostgreSQL, which suffers only an 8% bottleneck, utilizes fewer than 10 cores. Indeed, with a serial fraction of s = 0.08, Amdahl's Law bounds the speedup on P = 32 processors to 1/(0.08 + 0.92/32), or roughly 9.2x, which matches the measured behavior.

3.4 Conclusion

This chapter shows that the codepath of database engines is cluttered with a large number of critical sections, and underscores the importance of focusing on them since they hamper scalability. The rest of the dissertation investigates ways to boost the scalability of database engines. We will show that this is doable, even when starting from such unpromising results as the ones shown here.

Part II
Addressing scalability bottlenecks

Chapter 4
Critical sections in transaction processing: categorization and implementation

As discussed in the previous part, the serial computations, or "critical sections", are what determine the scalability of database engines. In this chapter, we make the key observation that not all critical sections constitute equal threats to the scalability of the system. The most worrisome are those whose contention increases along with the hardware parallelism. System designers should make the removal of those unscalable critical sections their top priority. Furthermore, we observe that, in practice, critical sections are so numerous and so short that enforcing them contributes a significant or even dominating fraction of their total cost, and tuning them directly improves the system's performance. In general, in order to ameliorate the impact of critical sections, we should both make algorithmic changes and employ proper synchronization primitives. (This chapter is based on material presented at [PTJA11] and [JPA08].)

4.1 Introduction

Ideally, a database engine would scale perfectly, with throughput remaining (nearly) proportional to the number of clients, even for a large number of clients, until the machine is fully utilized. In practice, several factors limit database engine scalability. Disk and compute capacities often limit the amount of work that can be done in a given system, and badly-behaved applications generate high levels of lock contention and limit concurrency.
However, these bottlenecks are all largely external to the database engine; within the storage manager itself, threads share many internal data structures. Whenever a thread accesses a shared data structure, it must prevent other threads from making concurrent modifications, or data races and corruption will result. These protected accesses are known as critical sections, and they can reduce scalability, especially in the absence of other, external bottlenecks. As Chapter 3 discussed, the codepaths of typical database engines are cluttered with critical sections. Out of this large number of critical sections, only those whose contention increases with the hardware parallelism pose a threat to scalability. On the other hand, given their sheer number, even uncontended critical sections are important because their combined overhead can contribute a significant fraction of the overall transaction cost. The literature abounds with synchronization approaches and primitives which could be used to enforce critical sections, each with its own strengths and weaknesses. The database system developer must choose the most appropriate approach for each type of critical section encountered during the tuning process or risk lowering performance significantly. To our knowledge there is only limited prior work that addresses the scalability and performance impact of critical sections, leaving developers to learn which primitives are most useful by trial and error. As the database developer optimizes the system for scalability, algorithmic changes are required to reduce the number of threads contending for a particular critical section. Additionally, we find that the method by which existing critical sections are enforced is a crucial factor in overall performance and, to some extent, scalability. Database code exhibits extremely short critical sections, such that the overhead of enforcing them is a significant or even dominating fraction of their total cost. Reducing the overhead of enforcing critical sections directly impacts performance and can even take critical sections off the critical path without the need for costly changes to algorithms. The contributions of this chapter are the following: 1. We distinguish between the different categories of critical sections, which pose different threats to the scalability of the system. The most worrisome are those whose contention increases along with the hardware parallelism. 2. We make a thorough performance comparison of the various synchronization primitives in a software system developer's toolbox and highlight the best ones for practical use. 3. Using a prototype database engine as a test-bed, we show that the combination of employing the appropriate synchronization primitives and making algorithmic changes can drastically improve scalability. The rest of this chapter is structured as follows. In Section 4.2 we make the key observation that not all points of synchronization constitute equal threats to the scalability of the system. Then, Section 4.3 presents and evaluates the most common types of synchronization approaches and identifies the most useful ones for enforcing the types of critical sections found in database code. Finally, Section 4.4 discusses how one can potentially improve critical section capacity, and Section 4.5 concludes.
4.2 Communication patterns and critical sections

Traditional transaction processing systems excel at providing high concurrency, or the ability to interleave multiple concurrent requests or transactions over limited hardware resources. However, as chip manufacturers continue to stamp out as many processing cores as possible onto each chip, performance increasingly depends on execution parallelism, or the ability for multiple requests to make forward progress simultaneously in different execution contexts. Even the smallest serialization on the software side therefore impacts scalability and performance [HM08]. Unfortunately, recent studies show that high concurrency in transaction processing systems does not necessarily translate into sufficient execution parallelism [JPH+09, JPS+10], due to the high degree of irregular and fine-grained communication these systems exhibit. In this section we categorize the types of communication that can occur in an OLTP system. Communication matters because, in order to be performed correctly, it imposes some kind of serialization: the system needs to execute critical sections. Critical sections, in turn, fall into different categories depending on the nature of the communication they protect and the contention they tend to trigger in the system. We will use this categorization in later chapters to analyze the execution of a shared-everything system as we evolve its design (e.g., Chapter 5, Chapter 6, and Chapter 7).

4.2.1 Types of communication

Transaction processing systems employ several different types of communication and synchronization. Database locking operates at the logical (application) level to enforce isolation and atomicity between transactions. Page latching operates at the physical (database page) level to enforce the consistency of the physical data stored on disk in the face of concurrent updates from multiple transactions. Finally, at the lowest level, critical sections protect various code paths which must execute serially to protect the consistency of the system's internal state. Critical sections are traditionally enforced by mutex locks, atomic instructions, etc. We note that locks and latches, which form a crucial part of the system's internal state, are themselves protected by critical sections, so analyzing the behavior of critical sections captures nearly all forms of communication in the DBMS.

4.2.2 Categories of critical sections

Because a transaction processing system cannot always eliminate communication entirely without giving up important features, we must find ways to achieve scalability while still allowing some communication. In order to guide this search, we break communication patterns into three types: unbounded, fixed, and cooperative.

[Figure 4.1: We classify the communication, and the resulting critical sections, into three patterns: unbounded or unscalable (e.g. locking, latching), fixed or point-to-point (e.g. operator pipelining), and cooperative or composable (e.g. logging). Only unbounded communication poses scalability problems; the other two mostly add overhead to single-threaded execution.]

We illustrate the three patterns in Figure 4.1 and briefly describe them in the following paragraphs.
Unbounded or unscalable This type of pattern, shown on the left side of Figure 4.1, arises when the number of threads in a point of communication is roughly proportional to the degree of parallelism in the system. Unbounded or unscalable communication has the highly undesirable tendency to affect every thread in the system. As hardware parallelism increases the degree of contention for the corresponding critical sections that coordinate unbounded communication also increases without bounds. No matter how efficient or infrequent the communication, exponentiallyincreasing parallelism will eventually expose it as a bottleneck. In other words, making these critical sections shorter or less frequent provides a little slack but does not fundamentally improve scalability. Globally shared data structures, which multiple threads update concurrently, fall directly to this category. Thus, in a naive implementation (or in an implementation not optimized for high hardware parallelism, like the systems presented in Chapter 3), unbounded communication can easily dominate. Fixed The fixed communication pattern, shown in the middle of Figure 4.1, resides at the other extreme of the spectrum, and involves a constant or near-constant number of threads regardless of the degree of parallelism. The pattern itself limits the amount of contention which 4.2. COMMUNICATION PATTERNS AND CRITICAL SECTIONS 51 can arise for the corresponding critical sections, because contention is independent of the underlying hardware and depends only on the (fixed) number of threads which communicate. Grid-based simulations in scientific computing (including several from the SPLASH-2 benchmark suite [WOT+ 95]) exemplify this type of communication, with each simulated object communicating only with its nearest neighbors in the grid. Peer-to-peer networks (e.g. [SMK+ 01, RD01]) also employ fixed or near-fixed communication patterns. In data management applications an example of fixed communication are producerconsumer pairs. Producer-consumer pairs are frequent in business intelligence workloads which execute long-running queries consisting of multiple database operators and exhibiting intra-query parallelism. For example, each operator in a query produces data consumed by the operator right above in the query execution plan. The execution of the producer operator (and the sub-tree below it in the execution plan) and of the consumer (and everything above) can be parallelized using the exchange operator [Gra90]. Such operator pipelining does not cause contention because of the fixed number of threads communicating. On the other hand, the parallelism in transaction processing systems comes from the concurrent execution of different requests (transactions), rather than from a single request, as the intraquery parallelism in business intelligence workloads. Transactions are typically executed by a single thread since they have very narrow execution plans and touch few data. Each thread in the system acting on behalf of an independent transaction competes with the other threads to access shared data structures, rather than to pass data to other threads. Thus, in transaction execution fixed communication in not as frequent. Cooperative or composable A third kind of communication pattern, which we call cooperative and show on the right side of Figure 4.1, arises when threads which wait for some resource can cooperate with each other to reduce contention. 
A canonical example of cooperative communication arises in the context of a parallel LIFO queue where threads push() and pop() items. While accessing the head of the queue is a critical section (if two threads modify it concurrently, the result is unpredictable behavior or corruption), pairs of push() and pop() requests which encounter delays can cooperate by combining their requests and eliminating each other directly, without competing further for the underlying data structure [MNSS05]. Cooperative communication results in critical sections that are highly resistant to contention because threads take advantage of queuing delays to combine their requests. Requests which combine drop out, making the communication self-regulating: adding more threads to the system gives more opportunity for threads to combine rather than competing directly for the critical section.

Examining these three types of communication suggests that unbounded communication is the main threat to scalability. Neither of the other two types allows contention to grow without bound, even though they sometimes add significant overhead to single-thread performance. Therefore the designers of transaction processing systems (or of any type of software system in general) should focus on eliminating unbounded communication.

4.2.3 How to predict and improve scalability

Since scalability should be the most important goal of modern software systems, the two questions that arise are how to predict and how to improve the scalability of software systems.

Predict. Of the three types of critical sections we discussed, it is clear that only the un-scalable ones pose a threat to the scalability of the system. The other two types, fixed and composable, degrade only single-thread performance. Thus, a reliable way to measure and predict the scalability of a system is to count and categorize the critical sections it normally executes.

Improve. The real key to scalability lies in two directions.
• First, any communication which is not necessary should be eliminated. For example, in the storage manager we use for the rest of this work (see Section 5.2), some transaction-related statistics were kept in a shared memory area accessed by all the threads in the system; it did not take long before maintaining those statistics became an obstacle to scalability (see the sketch below).
• Second, all unbounded communication should be eliminated or converted to either the fixed or the composable type, thus removing the potential for bottlenecks to arise.

While we worry about scalability, we should not completely ignore single-thread performance. There are many different synchronization primitives which can be used to enforce a critical section, and depending on the type of communication, the expected contention, and the size (or length) of the critical section, different implementations are more appropriate. We discuss this issue next.

4.3 Enforcing critical sections

The literature abounds with different synchronization primitives and approaches, each with different overhead (the cost to enter an uncontended critical section) and scalability (whether, and by how much, that overhead increases under contention). Unfortunately, efficiency and scalability tend to be inversely related: the cheapest primitives are unscalable, and the most scalable ones impose high overhead; yet both metrics impact the performance of a database engine.
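As an illustration of the first direction above (eliminating unnecessary communication), the shared transaction-statistics counters mentioned in Section 4.2.3 can be replaced by per-thread counters that are aggregated only when the statistics are actually read. The sketch below is purely illustrative; the class and names are hypothetical and not taken from any particular storage manager.

    #include <array>
    #include <atomic>
    #include <cstdint>

    // Hypothetical sketch: each worker updates only its own slot, so the common
    // path involves no inter-thread communication; readers pay the (rare) cost
    // of summing all slots when statistics are reported.
    class PerThreadCounter {
        static constexpr int kMaxThreads = 64;
        struct alignas(64) Slot { std::atomic<uint64_t> v{0}; };   // padded to avoid false sharing
        std::array<Slot, kMaxThreads> slots_;
    public:
        void add(int thread_id, uint64_t n) {
            slots_[thread_id].v.fetch_add(n, std::memory_order_relaxed);
        }
        uint64_t read() const {                      // infrequent: e.g. when dumping statistics
            uint64_t sum = 0;
            for (const Slot& s : slots_) sum += s.v.load(std::memory_order_relaxed);
            return sum;
        }
    };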
Next we present a brief overview of the types of primitives available to the designer. 4.3.1 Synchronization primitives The most common approach to synchronization is to use a synchronization primitive to enforce the critical section. There is a wide variety of primitives to choose from, all more or less interchangeable with respect to correctness. Blocking mutex. All operating systems provide heavyweight blocking mutex implementations. Under contention these primitives deschedule waiting threads until the holding thread releases the mutex. These primitives are fairly easy to use and understand, in addition to being portable. Unfortunately, due to the cost of context switching and their close association with the kernel scheduler, they are not particularly cheap or scalable for the short critical sections we are interested in. Test-and-set spinlocks. Test-and-set (TAS) spinlocks are the simplest mutex implementation. Acquiring threads use an atomic operation such as a compare-and-swap to simultaneously lock the primitive and determine if it was already locked by another thread, repeating until they lock the mutex. A thread releases a TAS spinlock using a single store. Because of their simplicity TAS spinlocks are extremely efficient. Unfortunately, they are also among the least-scalable synchronization approaches because they impose a heavy burden on the memory subsystem. Variants such as test-and-test-and-set (TATAS) [RS84], exponential back-off [And90], and ticket-based [RK79] approaches reduce the problem somewhat, but do not solve it completely. Backoff schemes, in particular, are hardware-dependent and difficult to tune. Queue-based spinlocks. Queue-based spinlocks organize contending threads into a linked list queue where each thread spins on a different memory location. The thread at the head of the queue holds the lock, handing off to a successor when it completes. Threads compete only long enough to append themselves to the tail of the queue. The two best-known queuing spinlocks are MCS [MCS91a] and CLH [Cra93, MLH94], which differ mainly in how they manage their queues. MCS queue links point toward the tail, while CLH links point toward the head. Queuing improves on test-and-set by eliminating the burden on the memory system and also by decoupling lock contention from lock hand-off. Unfortunately, each thread is responsible to allocate and maintain a queue node for each lock it acquires. CHAPTER 4. CRITICAL SECTIONS 54 Memory management can quickly become cumbersome in complex code, especially for CLH locks, which require heap-allocated state. Reader-writer locks. In certain situations, threads enter a critical section only to prevent other threads from changing the data to be read. Reader-writer locks allow either multiple readers or one writer to enter the critical section simultaneously, but not both. While operating systems typically provide a reader-writer lock, we find that the pthreads implementation suffers from extremely high overhead and poor scalability, making it useless in practice. The most straightforward reader-writer locks use a normal mutex to protect their internal state; more sophisticated approaches extend queuing locks to support reader-writer semantics [MCS91b, KSUH93]. A note about convoys. Some synchronization primitives, such as blocking mutex and queue-based spinlocks, are vulnerable to forming stable quasi-deadlocks known as convoys [BGMP79]. Convoys occur when the lock passes to a thread that has been descheduled while waiting its turn. 
Other threads must then wait for that thread to be rescheduled, increasing the chances of further preemptions. The result is that the lock sits nearly idle even under heavy contention. Recent work [HSIS05] has provided a preemption-resistant form of queuing lock, at the cost of additional overhead which can put medium-contention critical sections squarely on the critical path. However, as [JSAM10] shows, proper scheduling can eliminate the problem of convoys due to lock preemptions.

4.3.2 Alternatives to locking

Under certain circumstances critical sections can be enforced without resorting to locks. For example, independent reads and writes to a single machine word are already atomic and need no further protection. Other, more sophisticated approaches such as optimistic concurrency control and lock-free data structures allow larger critical sections as well.

Optimistic concurrency control. Many data structures feature read-mostly critical sections, where updates occur rarely and often come from a single writer. The readers' critical sections are often extremely short, and overhead dominates their overall cost. Under these circumstances, optimistic concurrency control (OCC) schemes can improve performance dramatically by assuming no writer will interfere during the operation. The reader performs the operation without enforcing any critical section, then afterward verifies that no writer interfered (e.g. by checking a version stamp). In the rare event that the assumption did not hold, the reader blocks or retries. The main drawbacks of OCC are that it cannot be applied to all critical sections (since side effects are unsafe until the read is verified), and that unexpectedly high writer activity can lead to livelock as readers endlessly block or abort and retry.

Lock-free data structures. Much current research focuses on lock-free data structures [Her91] as a way to avoid the problems that come with mutual exclusion (e.g. [Mic02, FR04]). These schemes usually combine optimistic concurrency control and atomic operations to produce data structures that can be accessed concurrently without enforcing critical sections. Unfortunately, there is no known general approach to designing lock-free data structures; each must be conceived and developed separately, so database engine designers have a limited library to choose from. In addition, lock-free approaches can suffer from livelock unless they are also wait-free, and they may or may not be faster than lock-based approaches under low and medium contention (many papers provide only asymptotic performance analysis rather than benchmark results).

Transactional memory. Transactional memory approaches enforce critical sections using database-style "transactions" which complete atomically or not at all. This approach eases many of the difficulties of lock-based programming and has been widely researched. Unfortunately, software-based approaches [ST95] impose too much overhead for the tiny critical sections we are interested in, while hardware approaches [HM93, RG02] generally suffer from complexity, lack of generality, or both, and have not been adopted (it is no surprise that one of the most high-profile attempts to release a transactional memory processor, Sun ROCK [DLMN09], was abandoned just before going to market). Finally, we note that transactions do not inherently remove contention; at best, transactional memory can serialize critical sections with very little overhead.
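To make the version-stamp validation described above concrete, the following is a minimal, illustrative sketch of an optimistic read; the structure and names are invented for this example and real engines differ in the details. The writer makes the version odd while it updates, so a reader retries if it observes an odd or changed version.

    #include <atomic>
    #include <cstdint>

    struct VersionedSlot {
        std::atomic<uint64_t> version{0};   // even = stable, odd = writer in progress
        std::atomic<uint64_t> payload{0};   // atomic so the speculative read is race-free
    };

    // Optimistic read: no latch is taken; the reader retries if a writer interfered.
    uint64_t occ_read(const VersionedSlot& s) {
        for (;;) {
            uint64_t v1 = s.version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                                    // writer active: retry
            uint64_t value = s.payload.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            uint64_t v2 = s.version.load(std::memory_order_relaxed);
            if (v1 == v2) return value;                              // nobody interfered: success
        }                                                            // retries are rare if writes are rare
    }

    // Writer side: bracket the update with version increments (assumes a single
    // writer, or an external mutex among writers).
    void occ_write(VersionedSlot& s, uint64_t v) {
        s.version.fetch_add(1, std::memory_order_acq_rel);   // version becomes odd
        s.payload.store(v, std::memory_order_relaxed);
        s.version.fetch_add(1, std::memory_order_release);   // version becomes even again
    }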
4.3.3 Choosing the right approach

This subsection evaluates the different synchronization approaches using a series of microbenchmarks that replicate the kinds of critical sections found in the code of a transaction processing system. We present the performance of the various approaches as we vary three parameters: contended vs. uncontended accesses, short vs. long duration, and read-mostly vs. mutex critical sections. We then use the results to identify the primitives which work best in each situation. Each microbenchmark creates N threads which compete for a lock in a tight loop over a one-second measurement interval (typically 1-10M iterations). The metric of interest is the cost per iteration per thread, measured in nanoseconds of wall-clock time. Each iteration begins with a delay of To ns to represent time spent outside the critical section, followed by an acquire operation. Once the thread has entered the critical section, it delays for Ti ns to represent the work performed inside the critical section, then performs a release operation. All delays are measured to 4 ns accuracy using the machine's cycle count register (high-resolution time counters); we avoid unnecessary memory accesses to prevent unpredictable cache misses or contention for hardware resources. For each scenario we compute an ideal cost by examining the time required to serialize Ti plus the overhead of a memory barrier, which is always required for correctness. Experiments involving readers and writers are set up exactly the same way, except that readers are assumed to perform their memory barrier in parallel, and threads use a pre-computed array of random numbers to determine whether they should perform a read or a write operation. All of our experiments were performed using a Sun T2000 machine, which contains one Sun Niagara I processor, running Solaris 10. The Sun Niagara I chip [KAO05] is a multi-core architecture with 8 cores; each core provides 4 hardware contexts for a total of 32 OS-visible "processors". Cores communicate through a shared 3MB L2 cache.

Contention

[Figure 4.2: Performance of mutex locks (pthread, ppmcs, mcs, tatas, ideal) as the contention varies. Lower cost is better.]

Figure 4.2 compares the behavior of four mutex implementations as the number of threads in the system varies along the x-axis. The y-axis gives the cost of one iteration as seen by one thread. In order to maximize contention, we set both To and Ti to zero; threads spend all their time acquiring and releasing the mutex. TATAS is a test-and-set spinlock variant. MCS and ppMCS are the original and preemption-resistant MCS locks, respectively, while pthread is the native pthread mutex. Finally, "ideal" represents the lowest achievable cost per iteration, assuming that the only overhead of enforcing the critical section comes from the memory barriers which must be present for correctness.

[Figure 4.3: Performance of mutex locks as the duration of the critical section varies. Lower cost is better.]

As the degree of contention for a particular critical section changes, different synchronization primitives become more appealing.
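For reference, the TATAS variant is conceptually a one-word lock that spins on a plain read and attempts the atomic exchange only when the lock looks free. The sketch below is illustrative, written with C++ atomics; the lock actually used in the experiments differs in details such as padding and backoff.

    #include <atomic>

    // Test-and-test-and-set (TATAS) spinlock: spin on the cached value first, and
    // only attempt the atomic exchange when the lock appears free, which reduces
    // the burden on the memory subsystem compared to a plain test-and-set.
    class TATASLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            for (;;) {
                while (locked_.load(std::memory_order_relaxed)) { /* local spin */ }
                if (!locked_.exchange(true, std::memory_order_acquire))
                    return;                                   // won the race: lock acquired
            }
        }
        void unlock() {
            locked_.store(false, std::memory_order_release);  // a single store releases it
        }
    };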
The native pthread mutex is both expensive and unscalable, making it unattractive. TATAS is by far the cheapest for a single thread, but quickly falls behind as contention increases. We also note that all test-and-set variants are extremely unfair, as the thread which most recently released the lock is likely to re-acquire it before other threads can respond. In contrast, the queue-based locks give each thread equal attention.

Duration

Another factor of interest is the performance of the various synchronization primitives as the duration of the critical section varies (under medium contention) from extremely short to merely short. We assume that a long, heavily-contended critical section is a design flaw which must be addressed algorithmically. Figure 4.3 shows the cost of each iteration as 16 threads compete for each mutex. The inner and outer delays both vary by the amount shown along the x-axis (keeping contention steady). We see the same trends as before, with the main change being the increase in the ideal cost (due to the critical section's contents). As the critical section increases in length, the overhead of each primitive matters less; however, ppMCS and TATAS still impose 10% higher cost than MCS, while pthread more than doubles the cost.

Reader-writer ratio

[Figure 4.4: Performance of reader-writer locks (tatas (R/W), tatas, MCS (R/W), mcs, occ, ideal) as contention (left) and reader-writer ratio (right) vary.]

The last parameter we study is the ratio between readers and writers. Figure 4.4 (left) characterizes the performance of several reader-writer locks when subjected to 7 reads for every write and with To and Ti both set to 100 ns. The cost per iteration is shown on the y-axis as the number of competing threads varies along the x-axis. The TATAS mutex and the MCS mutex apply mutual exclusion to both readers and writers. The TATAS rwlock extends a normal TATAS mutex to use a read/write counter instead of a single "locked" flag. The MCS rwlock comes from the literature [KSUH93]. OCC lets readers increment a simple counter as long as no writers are around; if a writer arrives, all threads (readers and writers) serialize through an MCS lock instead. We observe that reader-writer locks are significantly more expensive than their mutex counterparts, due to the extra complexity they impose. For very short critical sections and low reader ratios, a mutex actually outperforms the rwlock; even for the 100 ns case shown here, the MCS lock is a usable alternative. Figure 4.4 (right) fixes the number of threads at 16 and varies the reader ratio from 0 (all writes) to 127 (mostly reads) with the same delays as before. As we can see, the MCS rwlock performs well for high reader ratios, but the OCC approach dominates it, especially for low reader ratios. For the lowest read ratios, the MCS mutex performs the best – the probability of multiple concurrent reads is too low to justify the overhead of a rwlock.

[Figure 4.5: The space of critical section types. Each corner of the cube is marked with the appropriate synchronization primitive to use for that type of critical section.]

4.3.4 Discussion and open issues

The microbenchmarks from the previous section illustrate the wide range in performance and scalability among the different primitives.
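For reference, a simplified sketch of an MCS-style queue lock, the scalable alternative in these comparisons, is shown below. It is illustrative only, written with C++ atomics; the per-acquire queue node must stay alive until release, which is precisely the memory-management burden noted in Section 4.3.1, and real implementations also handle preemption (as ppMCS does).

    #include <atomic>

    // MCS queue lock: contenders enqueue a node and each spins on its own flag,
    // so lock hand-off touches only the successor's cache line.
    struct MCSNode {
        std::atomic<MCSNode*> next{nullptr};
        std::atomic<bool>     waiting{false};
    };

    class MCSLock {
        std::atomic<MCSNode*> tail_{nullptr};
    public:
        void lock(MCSNode& me) {
            me.next.store(nullptr, std::memory_order_relaxed);
            me.waiting.store(true, std::memory_order_relaxed);
            MCSNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
            if (prev) {                                           // queue was non-empty
                prev->next.store(&me, std::memory_order_release); // link behind predecessor
                while (me.waiting.load(std::memory_order_acquire)) { /* local spin */ }
            }
        }
        void unlock(MCSNode& me) {
            MCSNode* succ = me.next.load(std::memory_order_acquire);
            if (!succ) {
                MCSNode* expected = &me;                          // try to mark the queue empty
                if (tail_.compare_exchange_strong(expected, nullptr,
                                                  std::memory_order_acq_rel))
                    return;
                while (!(succ = me.next.load(std::memory_order_acquire))) { /* wait for link */ }
            }
            succ->waiting.store(false, std::memory_order_release); // hand off to successor
        }
    };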
From the contention experiment we see that the TATAS lock performs best under low contention due to having the lowest overhead; for high contention, the MCS lock is superior due to its scalability. The experiment also highlights how expensive it is to enforce critical sections. The ideal case (memory barrier alone) costs 50 ns, and even TATAS costs twice that. The other alternatives cost 250 ns or more. By comparison a store costs roughly 10 ns, meaning critical sections which update only a handful of values suffer more than 80% overhead. As the duration experiment shows, pthread and TATAS are undesirable even for longer critical sections that amortize the cost somewhat. Finally, the reader-writer experiment demonstrates the extremely high cost of reader-writer synchronization; a mutex outperforms rwlocks at low read ratios by virtue of its simplicity, while optimistic concurrency control wins at high ratios. Figure 4.5 summarizes the results of the experiments, showing which of the three synchronization primitives to use under what circumstances. We note that, given a suitable algorithm, the lock free approach might be best. The results also suggest that there is much room for improvement in the synchronization primitives that protect small critical sections. Hardware-assisted approaches (e.g. [RG01]) and implementable transactional memory might be worth exploring further in order to reduce CHAPTER 4. CRITICAL SECTIONS 60 overhead and improve scalability. Reader-writer primitives, especially, do not perform well as threads must still serialize long enough to identify each other as readers and check for writers. All the knowledge collected from the experiments of this section can be used for improving the scalability of database engines, whose codepath is cluttered with critical sections. 4.4 Handling problematic critical sections By definition, critical sections limit scalability by serializing the threads which compete for them. Each critical section is simply one more limited resource in the system that supports some maximum throughput. Database engine designers can potentially improve critical section capacity (i.e. peak throughput) by changing how they are enforced or by altering algorithms and data structures. 4.4.1 Algorithmic changes Algorithmic changes can address bottleneck critical sections in three ways: 1. By reducing how often threads enter them. Ideally problematic critical sections would never be executed. For example, in Section 5.3 we are going to see how we remove a significant obstacle to scalability by avoiding interacting with the lock manager through caching. 2. By downgrading them to a category less threatening to scalability. According to the discussion of Section 4.2, a fixed contention or composable critical section is more desired than an unscalable one. For example, in Section 5.4 we are going to see how we improve the scalability by downgrading the critical section for inserting entries to the log buffer from unscalable to composable. 3. By breaking them into several “smaller” ones in a way that it both reduces the length of it and distributes contending threads as well (ideally, each thread can expect an uncontended critical section). For example, buffer pool managers typically distribute critical sections by hash bucket so that only probes for pages in the same bucket must be serialized. In theory, algorithmic changes are the superior approach for addressing critical sections because they can remove or distribute critical sections to ease contention. 
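As an illustration of the third technique in the list above (breaking a critical section into smaller, distributed ones), the sketch below partitions a page table into hash buckets, each protected by its own short latch, so that only probes to the same bucket serialize. The names and structure are hypothetical and are not the actual Shore-MT buffer pool code.

    #include <array>
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>

    // Illustrative sketch of distributing one large critical section over hash
    // buckets, in the spirit of the buffer-pool example above.
    class PartitionedPageTable {
        static constexpr size_t kBuckets = 256;
        struct Bucket {
            std::mutex latch;                              // one short critical section per bucket
            std::unordered_map<uint64_t, void*> frames;    // page id -> frame
        };
        std::array<Bucket, kBuckets> buckets_;
        Bucket& bucket_for(uint64_t pid) { return buckets_[pid % kBuckets]; }
    public:
        void* find(uint64_t pid) {
            Bucket& b = bucket_for(pid);
            std::lock_guard<std::mutex> guard(b.latch);    // only same-bucket probes serialize
            auto it = b.frames.find(pid);
            return it == b.frames.end() ? nullptr : it->second;
        }
        void insert(uint64_t pid, void* frame) {
            Bucket& b = bucket_for(pid);
            std::lock_guard<std::mutex> guard(b.latch);
            b.frames[pid] = frame;
        }
    };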
Unfortunately, developing new algorithms is challenging and time consuming, with no guarantee of a breakthrough for a given amount of effort. In addition, even the best-designed algorithms will 4.4. HANDLING PROBLEMATIC CRITICAL SECTIONS Throughput (ktps) 100 61 Algorithm (bpool) Algorithm (lock, log) Tuning (MCS) Tuning (TATAS) 10 Algorithm (bpool) Baseline 1 1 Concurrent Threads 10 100 Figure 4.6: Algorithmic changes and tuning combine to give best performance. eventually become bottlenecks again if the number of threads increases enough, or if nonuniform access patterns cause hotspots. 4.4.2 Changing synchronization primitives The other approach for improving critical section throughput is, as we saw in Section 4.3.3, by altering how they are enforced. Because the critical sections we are interested in are so short, the cost of enforcing them is a significant – or even dominating – fraction of their overall cost. Reducing the cost of enforcing a bottleneck critical section can improve performance a surprising amount. Also, critical sections tend to be encapsulated by their surrounding data structures, so the developer can change how they are enforced simply by replacing the existing synchronization primitive with a different one. These characteristics make critical section tuning attractive if it can avoid or delay the need for costly algorithmic changes. 4.4.3 Both are needed Figure 4.6 illustrates how algorithmic changes and synchronization tuning combined give the best performance. It presents the performance of several stages of tuning a modern storage manager, with throughput plotted against varying thread count in log-log scale. These numbers came from the experience of converting SHORE [CDF+ 94] to Shore-MT [JPH+ 09] (see Section 5.2). The process began with a thread-safe but very slow version of SHORE and repeatedly addressed critical sections until internal scalability bottlenecks had all been removed. The changes involved algorithmic and synchronization changes in all the major components of the storage manager, including logging, locking, and buffer pool management. CHAPTER 4. CRITICAL SECTIONS 62 The figure shows the performance and scalability of Shore-MT at various stages of tuning. Each thread repeatedly runs transactions which insert records into a private table. These transactions exhibit no logical contention with each other but tend to expose many internal bottlenecks. Note that, in order to show the wide range of performance the y-axis of the figure is log-scale; the final version of Shore-MT scales nearly as well as running each thread in an independent copy of Shore-MT. The “Baseline” curve at the bottom represents the thread-safe but unoptimized SHORE; the first optimization (bpool) was algorithmic and replaced the central buffer pool mutex with one mutex per hash bucket. As a result, scalability improved from one thread to nearly four, but single-thread performance did not change. The second, tuning, optimization (TATAS) replaced the expensive pthread mutex protecting buffer pool buckets with a fast test and set mutex (see Section 4.3.1 for details about synchronization primitives), doubling throughput for a single thread. The third, tuning, optimization (MCS) replaced the test-and-set mutex with a more scalable MCS mutex, allowing the doubled throughput to persist until other bottlenecks asserted themselves at four threads. 
The next line (lock, log) represents the performance of Shore-MT after algorithmic changes to the lock and log management code, at which point the buffer pool again became a bottleneck. Because the critical sections involved were already as efficient as possible, another algorithmic change was required (bpool2). This time the open-chained hash table was replaced with a cuckoo hash table [PR01] to further reduce contention for hash buckets, improving scalability from 8 to 16 threads and beyond. This example illustrates how both proper algorithms and proper synchronization are required to achieve the highest performance. In general, tuning primitives improves performance significantly, and sometimes scalability as well; algorithmic changes improve scalability and might help or hurt performance (more scalable algorithms tend to be more expensive). Finally, we note that the two tuning optimizations each required only a few minutes to apply, while each of the algorithmic changes required days or weeks to design, implement, and debug. The performance impact and ease of reducing critical section overhead make tuning an important part of the optimization process. 4.5 Conclusions This chapter focused on the critical sections which determine the scalability of any software system. In order to reliably predict the behavior of a system in high parallelism, one needs to not only count but also categorize the critical sections that are executed. Different types 4.5. CONCLUSIONS 63 of critical sections impose different threat to scalability. The definitely unwanted are the unscalable critical sections for which the contention increases with the hardware parallelism. The other two categories, fixed and composable, mainly lower single-thread performance. At the same time, the choice of synchronization primitives significantly affects performance as a large part of the execution is computation-bound. We observe that even uncontended critical sections sap performance because of the overhead they impose and we identify a small set of especially useful synchronization primitives. Database system developers can then utilize this knowledge to select the proper synchronization tool for each critical section and maximize performance. The bottomline is that critical sections impose obstacles to scalability and high overhead to single-thread (uncontended) execution. To ameliorate the impact of critical sections we should both provide algorithmic changes and employ proper synchronization primitives. 64 CHAPTER 4. CRITICAL SECTIONS 65 Chapter 5 Attacking un-scalable critical sections in a conventional design In Chapter 3, we showed that multicore hardware has caught database engines off guard, since their codepaths are full of, mainly un-scalable, critical sections. This chapter presents two mechanisms that boost the scalability of conventional transaction processing by reducing the number of un-scalable critical sections. In particular, we show how we can avoid a large number of un-scalable critical sections in the lock manager using Speculative Lock Inheritance, a mechanism that detects database locks which experience contention at run-time and caches them across transactions. We also show how a simple observation allows us to downgrade the log buffer inserts from being un-scalable critical sections to composable. For all the prototyping and evaluation we use Shore-MT, a multithreaded version of SHORE. 
ShoreMT constitutes a reliable baseline for our work, since compared to other database engines, it exhibits superior scalability and 2-4 times higher absolute throughput.1 5.1 Introduction Multicore hardware poses a unique challenge, providing exponentially growing parallelism which software must exploit in order to benefit from hardware advances. In the past, a primary use of concurrency in database engines was to overlap delays due to I/O and logical conflicts. Today, however, the high number of threads which can enter the database simultaneously puts new pressure on the internal scalability of the database engine. Unfortunately, none of the existing database engines provides the kind of scalability required for modern multicore hardware; Chapter 3 highlights how current offerings top out at 8-12 cores, failing to utilize the 32 or more hardware contexts which modern hardware makes available. While most database applications are inherently parallel, current database engines are designed under the assumption that only a limited number of threads will access 1 This chapter highlights findings which we presented in EDBT 2009 [JPH+ 09], VLDB 2009 [JPA09] , VLDB 2010 [JPS+ 10] and VLDBJ 2011 [JPS+ 11]. 66 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS their internal data structures at any given instant. Even when the application is executing parallelizable tasks, and even if disk I/O is off the critical path, serializing accesses to these internal data structures impedes scalability. The main problem is that the code paths of conventional transaction processing are full of critical sections most of them being un-scalable, according to the categorization of Section 4.2.2. In order to ameliorate the scalability of databases systems, the system designers need to study the codepaths and attack the sources of un-scalable critical sections. In this chapter, we attack two significant sources of un-scalable critical sections in conventional transaction processing. In particular, we show how we can avoid a large number of un-scalable critical sections in the lock manager using Speculative Lock Inheritance, a mechanism that detects database locks which experience contention at run-time and caches them across transactions. In addition, we show how a simple observation allows us to downgrade the log buffer inserts from being un-scalable critical sections to composable, and present the corresponding log buffer implementation. All the prototyping and evaluation are done using the Shore-MT storage manager [JPH+ 09]. Before we present the two mechanisms that reduce the number of un-scalable critical sections, we show that Shore-MT constitutes a reliable baseline system for our work, since compared to other database engines, it exhibits superior scalability and 2-4 times higher absolute throughput. After we integrate each of the two mechanisms to Shore-MT, we show the breakdown of critical sections for the execution of a simple transaction and the performance on a highly parallel multicore machine. As we move to more elaborate and scalable designs the number of un-scalable critical sections drops and performance increases. That observation validates our claim in Section 4.2.3 that one can measure and predict the scalability of a transaction processing system by analyzing the number and type of the critical sections at its codepath. The contributions of this chapter are three-fold: 1. 
We briefly present Shore-MT, a multithreaded version of the SHORE storage manager [CDF+94], which we use as a reliable baseline for the rest of our work and which we make available to the research community (at http://diaswww.epfl.ch/shore-mt/). 2. We present Speculative Lock Inheritance, or SLI. SLI boosts the scalability of transaction processing by detecting database locks that encounter contention at run time and passing those "hot" locks across transactions, thus avoiding the execution of a significant portion of unscalable critical sections. 3. We present a solution for the contention encountered on insertions into the main-memory log buffer of transaction processing systems, by presenting a log buffer implementation based on consolidation of requests. This technique improves scalability by converting the unscalable critical sections for log buffer inserts into composable ones.

The rest of this chapter is structured as follows. Section 5.2 briefly presents Shore-MT. Section 5.3 shows how SLI helps us avoid executing un-scalable critical sections inside the centralized lock manager. Section 5.4 shows how a log buffer implementation based on consolidation of requests allows us to downgrade the un-scalable critical sections for inserting records into the main-memory log buffer to composable ones; and Section 5.6 concludes.

5.2 Shore-MT: a reliable baseline

Since Chapter 3 showed that none of the available open source database engines manages to scale its performance on highly parallel multicore hardware, we set out to create one of our own based on the SHORE storage manager [CDF+94]. We selected SHORE as our target for optimization for two reasons. First, SHORE supports all the major features of modern database engines: full transaction isolation, hierarchical and row-level locking [GR92], a CLOCK buffer pool with replacement and prefetch hints [Smi78], B-tree indexes with key-value locking [Moh90], and ARIES-style logging and recovery [MHL+92]. Additionally, SHORE has previously been shown to behave like commercial engines at the instruction level [ADH01], making it a good open-source platform for comparing against closed-source engines. This exercise resulted in Shore-MT, which scales far better than its open source peers while also achieving superior single-thread performance. During the process of implementing Shore-MT, we completely ignored single-thread performance and focused only on removing the bottlenecks of SHORE. In order to compare the scalability of Shore-MT against its open- and closed-source peers, we use the microbenchmark described in Section 3.3.1, as well as the Payment and NewOrder transactions from the TPC-C benchmark (see Section 2.4.3). Shore-MT scales commensurately with the hardware we make available to it, setting an example for other systems to follow. In Figure 5.1, we plot the results of the microbenchmark from Figure 3.2 in Chapter 3, but this time also showing results for Shore-MT. While single-threaded SHORE did not scale at all, Shore-MT exhibits excellent scaling. Moreover, at 32 clients, it scales better than DBMS "X", a popular commercial DBMS. While our original goal was only to achieve high scalability, we also achieved nearly a 3x speedup in single-thread performance over SHORE. Shore-MT attains a healthy performance
lead over the other engines (BerkeleyDB outperforms the other systems at first, but its performance drops precipitously for more than four clients). We attribute the performance improvement to the fact that database engines spend so much time in critical sections: the process of shortening or eliminating critical sections, and reducing synchronization overheads, also had the side effect of shortening the single-thread code path.

[Figure 5.1: Scalability of Shore-MT vs. several open-source database engines (Shore-MT, DBMS "X", PostgreSQL, MySQL, BDB, SHORE; throughput per client vs. concurrent clients). A scalable system maintains steady per-thread performance as load increases.]

As a further comparison, Figure 5.2 shows the performance of the three fastest database engines running the New Order (left) and Payment (right) transactions of TPC-C. Again, Shore-MT achieves the highest performance while scaling as well as the commercial system for New Order. (Some of the performance advantage of Shore-MT is likely due to its clients being directly embedded in the engine, while the other two engines communicate with clients using local socket connections. We would be surprised, however, if a local socket connection imposed 100% overhead per transaction.) In New Order, all three systems encounter logical contention for the STOCK and ITEM tables, causing a significant dip in scalability across the board around 16 clients. Payment, in contrast, imposes no application-level contention, allowing Shore-MT to scale all the way to 32 threads. With Shore-MT, we have a database engine that performs better than its open source peers and whose performance is limited only by the underlying hardware parallelism. That is, on the Niagara I multicore processor, with its 8 physical cores and 32 hardware contexts, Shore-MT's performance scales almost optimally and there are no indications of any potential bottlenecks.

[Figure 5.2: Efficiency of Shore-MT, DBMS "X" and PostgreSQL for varying thread counts, when executing TPC-C NewOrder (left) and Payment (right) transactions; note the logarithmic scale on the y-axis. High, steady performance is better.]

5.2.1 Critical section anatomy of Shore-MT

However, the un-scalable critical sections in Shore-MT's codepaths are still numerous. The left-most bar of Figure 5.3 shows the breakdown of the critical sections executed on average by Shore-MT for the completion of the very simple UpdLocation transaction of the TATP benchmark (see Section 2.4.5); we execute the transaction 1000 times and instrument each critical section to obtain this breakdown. When Shore-MT executes this simple transaction, it still enters over 70 critical sections, with nearly 60 of them being un-scalable. Thus, even though on the Niagara I multicore processor Shore-MT's performance is limited by the hardware and it seems that we cannot further optimize its design, Figure 5.3 shows that there are lurking bottlenecks. These bottlenecks are immediately exposed when we move our experimentation to an even more parallel machine, such as the second-generation Niagara chip, which contains 64 hardware contexts. The component which contributes the majority of the critical sections is the lock manager, while another significant source of critical sections is the log manager.
In the following two sections, we present two mechanisms that handle those lurking bottlenecks.

[Figure 5.3: Comparison of the number and type of critical sections (lock manager, page latches, buffer pool, metadata, log manager, Aether log manager, transaction manager, uncategorized) executed on average for the completion of a simple transaction, TATP UpdateLocation, in Shore-MT, with SLI, and with SLI & Aether. The un-scalable critical sections are the bars with solid fills. Shore-MT's lock manager and log manager codepaths are the sources of a large fraction of the un-scalable critical sections.]

5.3 Avoiding un-scalable critical sections in the lock manager with SLI

Virtually all database engines use some form of hierarchical locking [GR92] to allow applications to trade off concurrency and overhead. For example, requests which access large amounts of data can acquire coarse-grained locks to reduce overhead at the risk of reduced concurrency. At the same time, small requests can lock precisely the data (e.g. records) which they access and maximize concurrency with respect to other, independent requests. The lock hierarchy is crucial for application scalability because it allows efficient fine-grained concurrency control at the logical level. Ironically, however, hierarchical database locking causes a new scalability problem while addressing the first one: in order to access individual objects, all transactions must acquire intention locks in the upper levels of the hierarchy, contending with each other to update the internal lock state of each such database lock. The increased hardware concurrency leads to bottlenecks in the centralized lock manager, especially as hierarchical locking forces many threads to repeatedly update the state of a few hot locks. Physical contention causes locking-related bottlenecks even for scalable database applications which cause few logical conflicts. Because of the inherent behavior of hierarchical locking, we expect that every system will eventually encounter this kind of contention within the lock manager, if it has not done so already.

[Figure 5.4: Lock manager overheads (LM contention, other contention, LM overhead, computation) as system load varies from 2% to 98%. Contention consumes CPU time without increasing performance.]

Figure 5.4 highlights how contention for database locks impacts performance as we increase the load on a multicore system running the TATP benchmark (see Section 2.4.5 for the benchmark description and Section 5.3.2 for the experimental setup). The x-axis varies the load on the system from very light (left) to very heavy (right), while the y-axis shows the fraction of CPU time each transaction spends in the lock manager (not counting time spent blocked on I/O or on true lock conflicts). This figure, and those that follow in this subsection, define overhead and contention as the useful and useless work, respectively, performed by the system when processing transactions. We can make two observations from Figure 5.4. First, under light load the useful work due to the lock manager is around 10-15% (a relatively small fraction of the total), corroborating other studies, such as [HAMS08].
Second, nearly all contention in the system arises within the lock manager, and that contention component grows rapidly, eventually accounting for nearly 75% of the transaction's CPU time under heavy load. Figure 5.4, as well as the left-most bar of Figure 5.3, suggests that to improve scalability we must focus on eliminating contention within the lock manager. Though database designers are often willing to sacrifice consistency or other properties if it improves performance [Hel07, JFRS07, Vog09], our goal is to design a mechanism that does not change transaction consistency semantics or introduce other anomalies. It must be transparent and automatic, and it must impose minimal performance penalty under light load or when there is no contention in the lock manager. Next we present speculative lock inheritance, a technique that reduces contention within the lock manager and achieves the aforementioned goals.

[Figure 5.5: Agent threads which detect contention at the lock manager retain hot locks beyond transaction commit, passing them directly to the transactions which follow.]

5.3.1 Speculative lock inheritance

The key to reducing contention within the lock manager with speculative lock inheritance is the observation that virtually all transactions request high-level locks in compatible modes; even requests for exclusive access to particular rows or pages in the database generate compatible intention locks higher up, and transactions which require coarse-grained exclusive access are extremely rare in scalable workloads. Further, in the absence of intervening updates, it makes no semantic difference whether a shared-mode (SH) or intention-shared-mode (IS) lock is released and re-acquired or simply held continuously. Either way a transaction will see the same unchanged object, and other transactions are free to interleave their reads of the object as well. Speculative lock inheritance, or SLI, exploits the lack of logical contention for hot, shared database locks to reduce physical contention for their internal lock state. As Figure 5.5 shows, SLI allows a completing transaction to pass on some of the locks it acquired to the transactions which follow and are going to be executed by the same worker (agent) thread. This avoids a pair of release and acquire calls to the lock manager for each such lock. During the lock release phase of transaction commit, the transaction's agent thread identifies promising candidate locks and places them in a thread-local lock list instead of releasing them. It then initializes the next transaction's lock list with these previously acquired locks, hoping that the new transaction will use some of them. Successful speculation improves performance in two ways. First, a transaction which inherits useful locks makes fewer lock requests, with correspondingly lower overhead and better response time; short transactions amortize the cost of the lock acquire over many row accesses instead of just one. Second, other transactions which do request the lock will face less contention in the lock manager. In the following subsections we elaborate on the details of the speculative lock inheritance mechanism.
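As a rough, illustrative sketch only (the names are invented, and most of the machinery described in the following subsections, such as the recursive parent check and the invalidation protocol, is omitted), the commit-time decision and the fast reclaim path might look like this:

    #include <atomic>

    // Hypothetical lock-request states used by the sketch; the subsections below
    // describe the criteria and the invalidation protocol this approximates.
    enum class ReqState { GRANTED, INHERITED, INVALID };

    struct LockRequest {
        std::atomic<ReqState> state{ReqState::GRANTED};
        bool is_coarse;        // page-level or higher in the hierarchy
        bool is_shared_mode;   // S, IS, IX, ...
        bool is_hot;           // latch contention observed recently
        bool has_waiters;      // another transaction waits (e.g. for exclusive access)
    };

    // Commit path: keep promising locks instead of releasing them through the
    // lock manager (an approximation of the candidate criteria listed later).
    bool keep_for_inheritance(LockRequest& r) {
        if (r.is_coarse && r.is_hot && r.is_shared_mode && !r.has_waiters) {
            r.state.store(ReqState::INHERITED, std::memory_order_release);
            return true;       // moved to the agent thread's private list
        }
        return false;          // released normally
    }

    // Next transaction: reclaim an inherited lock with a single CAS, without
    // entering the lock manager at all; failure means the lock was invalidated.
    bool try_reclaim(LockRequest& r) {
        ReqState expected = ReqState::INHERITED;
        return r.state.compare_exchange_strong(expected, ReqState::GRANTED,
                                               std::memory_order_acq_rel);
    }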
Instead, it changes the request status from granted to inherited and moves it from the transaction’s private list to a different private list owned by the transaction’s agent thread. When the agent thread executes its next transaction, it pre-populates the new transaction’s lock cache with the inherited locks. The speculation succeeds if the new transaction attempts to request an inherited lock: it will find the request already in its cache, update its status from inherited back to granted, and add it to its lock list as if it had just acquired it. The status update uses an atomic compare-and-swap operation and does not require calling into the lock manager, allocating requests, or updating latch-protected lock state. Inheritance fails harmlessly if the transaction does not make use of the lock(s) it inherited: they do not cause overhead during transaction execution and the transaction simply releases them at commit time along with the locks it did use. If another transaction encounters an inconveniently inherited lock request and an atomic compare-andswap to invalid state succeeds, it simply unlinks the request from the queue and continues. Future attempts to reclaim the lock will fail, and the next time the owning agent completes a transaction it will deallocate any invalid requests it finds. Lock inheritance is a very lightweight operation regardless of whether it eventually succeeds or not. In the worst case a transaction does not use the lock it inherited, and pays the cost of releasing the lock which the previous transaction avoided. Both invalidations and garbage collection are performed only when a transaction is already traversing the queue and add only minimal overhead. In the best case the lock manager will be completely relieved of requests for hot locks, with a corresponding boost to performance. Criteria for inheriting locks The speculative lock inheritance mechanism uses five criteria to identify candidate locks which are likely to benefit subsequent transactions with minimal risk of reducing concurrency: 1. The lock is page-level or higher in the hierarchy (no record-level locks). 2. The lock is “hot” (i.e. we observed contention for the latch protecting it). 3. The lock is held in a shared or intention mode (e.g. S, IS, IX). 4. No other transaction is waiting on the lock (e.g. to lock it exclusively). 74 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS 5. The previous conditions recursively hold for the lock’s parent, if any. The first two criteria favor locks which are likely to be reused by a subsequent transaction. Very fine-granularity locks such as row locks are so numerous that the overhead of tracking them outweighs the benefits, while a lock which has only one outstanding request at release time is unlikely to have another request arrive in the near future. We detect a “hot” lock by tracking what fraction of the most recent several acquires encountered latch contention and enabling SLI when the ratio crosses a tunable threshold. 6 Criteria 3 and 4 ensure SLI does not hurt performance and concurrency or lead to starvation. The last criterion ensures that SLI maintains the hierarchical locking protocol. Ensuring correctness SLI preserves consistency semantics by only passing shared- or intention-mode locks from one transaction to another. Assuming the first transaction acquired its locks in a consistent way, the new transaction will inherit consistent locks. 
In addition, such lock modes ensure that the previous transaction did not change the corresponding data objects. From the perspective of a new transaction, an inherited lock request looks just like any other request that happened to be granted with no intervening updates since it was last released. Two-phase locking semantics are preserved because the inheritance is not finalized until the new transaction actually requests the lock. If an exclusive request arrives before then it invalidates the inheritance and the inheriting transaction must make a normal request. Therefore, from a semantic perspective an inherited lock was released and reacquired; only the underlying implementation has changed. From the perspective of both the inheriting and any competing transactions which arrive after the original transaction completes, the request was granted in the same order it would have been had SLI not intervened. A mixture of inherited and non-inherited locks is consistent and serializable for the same reasons. SLI preserves the hierarchical locking protocol by only inheriting locks whose parents are also eligible. Any inherited lock “orphaned” when its parent is invalidated will also be invalidated before any transaction tries to use it, thus avoiding the case where a low-level lock is held without appropriate locks on its ancestors. A transaction could also potentially acquire locks in a different order than expected if it inherits locks which it would have requested later than the beginning of the transaction. For example, Figure 5.6 shows how SLI could potentially induce new types of deadlocks between transactions that are otherwise well-behaved. During normal execution (left), transaction agents T1 and T2 both acquire lock L2 followed by L1. Whichever agent requests the 6 To achieve that we have to slightly modify the latch implementation. 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 75 Figure 5.6: Example of SLI-induced deadlock. lock second has to wait until the other commits its current transaction, but no deadlock is possible. However, enabling SLI (right) allows T1 to inherit L1 from a previous transaction. If agents could not invalidate inherited but not-yet-used locks, T1 would have effectively acquired its locks in reverse order and could deadlock with T2. Fortunately, SLI-induced deadlocks can easily be avoided because the transaction must still reclaim the lock before accessing the data; if any exclusive request arrives for an inherited lock before the inheriting transaction first requests it, the lock manager invalidates the inheritance. Once the request has been reclaimed the transaction has effectively acquired the lock in its natural order and conflicting requests will have the same risk of deadlock as in the unmodified system. Non-uniform locking patterns One concern with SLI is that for real-world workloads there might exist locking patterns which interfere with normal operations, preventing it from achieving its full potential. Next, we discuss two potential patterns, the moving hotspot and the bimodal workload, and show that they do not prevent SLI from being effective. Many workloads do not access data uniformly over time. Instead, the object of interest shifts, and contention with it. A common example is a table (such as a history or log) with heavy append traffic. For a given page of the table, for instance, high contention will disappear as soon as the page fills and transactions begin inserting records in a different page. 
This moving target presents two potential difficulties for SLI. First, old unnecessarily inherited locks might pollute transaction caches and lock lists, and waste space in the lock manager’s hash table. Second, newly hot locks will not be inherited at first, leading to 76 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS contention. Fortunately, neither problem occurs in practice because SLI has a short memory: if transactions do not use inherited locks their agent thread will release them quickly; if new sources of contention appear, SLI will quickly begin inheriting the problematic locks. A bimodal workload consists of two groups of transactions which access different sets of locks. If the distribution of transactions to agent threads is random, a high fraction of transactions will not utilize the locks they inherited from the previous, possibly different, transaction type. Given the short memory and inheritance criteria, the lock manager may stop inheritance even though it would be beneficial to continue. There are several potential ways to make SLI resistant to this sort of workload: • Identify groups of transaction types which acquire similar locks, and bias the assignment of transactions to agent threads so that similar transactions execute with the same agents most of the time. This approach would require either application developer assistance or some form of cluster identification based on observing which high-level locks each transaction type tends to acquire. • Apply a small hysteresis or momentum which prevents the lock manager from dropping inheritance just because one transaction did not use the lock. This approach is straightforward and inexpensive to implement using only local knowledge, but would tend to increase the number of useless locks which pass between transactions. • Do nothing. The fewer locks in common different transaction types acquire, the less contention their requests will cause and the less opportunity SLI has in the first place. Additionally, because contention tends to grow quadratically, 7 even a minor reduction in the number of threads competing for a lock request provides a significant improvement. In our experimentation we find that the third approach works well in practice, though as the number of cores per chip continues to increase, contention may grow to the point that only a subset of the total threads are required to cause significant contention. Passing information with agent threads While implementing SLI, we noticed that SLI’s technique to pass information across transactions executed by the same agent thread is very handy and can be used to remove other sources of un-scalable critical sections. In our prototype we employ this technique to avoid accessing the metadata database pages frequently and practically eliminate the critical sections associated with the metadata component. 7 If N threads all contend for the same object, each can expect to wait for N/2 threads, for O(N 2 ) total time wasted blocking or spinning. 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 77 In a naive design and for transactions that do few operations per object accessed, accessing metadata data pages may constitute a non-negligible source of critical sections. For example in our baseline implementation, each transaction maintains a data structure which caches access information about the various database objects the transaction had to access so far, so that in case it has to re-access them to do that quickly. 
This metadata information can be the page id of the root of an index or the page id of the first page of heap file, and is being stored in regular database pages. 8 When an agent thread accesses a metadata database page it has to enter a critical section. In the extreme case, if a transaction does a single record probe per table (e.g. a transaction that probes for a single customer), the critical section for accessing the metadata information constitutes a significant fraction of the critical sections for that object. Obviously the more operations per object accessed, the smaller that component. To address the problem of critical sections for accessing metadata, each agent thread populates the transaction-local metadata data structure as usual, but at transaction completion it does not destroy that data structure. Instead, the agent thread passes the populated data structure to the next transaction it will serve, a la SLI. The next transaction finds the metadata data structure pre-populated. At the infrequent case where the metadata information of a database object is stale (e.g. an object has been destroyed) the access will return with an error. The agent thread needs to refresh its metadata information and re-attempt before aborting. As we will see in the evaluation (Section 5.3.2), this technique removes a significant source of un-scalable critical sections for short transactions, while its overhead is negligible since metadata information changes very infrequently. 5.3.2 Evaluation of SLI We evaluate several individual transactions and transaction mixes on a multicore machine to identify both the opportunity for and the effectiveness of speculative lock inheritance. We make use of four metrics to determine the effectiveness of SLI. First, we consider the numbers and types of locks which are responsible for contention, using software counters. Second, we use profiling tools to identify bottlenecks (or lack of them) through time breakdowns. Third, we consider the resulting anatomy of critical sections, and, finally, we measure system throughput to quantify the performance impact of SLI. We perform all experiments on a Sun Niagara II machine running Solaris 10. The Niagara II chip [JN07] contains 8 cores each supporting 8 hardware contexts, for a total of 64 OS8 That is, a database page belongs either to a heap file, an index or contains metadata. 78 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS CPU utilization (out of 64) 64 48 32 LM contention Other contention LM overhead Computation 16 0 Figure 5.7: Execution time breakdown for the baseline system running transactions from the TATP, TPC-B, and TPC-C benchmark, each at the load giving peak performance. High bars with low contention are best. visible “CPUs.”. That is twice as many as the Niagara I machine which we used for the evaluation of baseline Shore-MT in Section 5.2. Opportunity analysis To demonstrate the potential for SLI, we perform a two-fold opportunity analysis. We first profile the baseline system under high load and produce a breakdown of overheads arising out of the lock manager, giving an upper bound on the performance improvement SLI could achieve. We also examine the number and types of locks acquired by transactions to verify the basic underlying idea behind SLI, which is to exploit shared and intention locks to reduce contention. Lock manager overhead and contention. 
To identify the magnitude of contention within the lock manager, we profile transactions and transaction mixes from the TATP, TPC-B and TPC-C benchmarks. First, we find the load (number of concurrent clients) which maximizes performance for each workload and, then, for that load we plot the time breakdown. Figure 5.7 shows the normalized work breakdown extracted from the profiler output. The height of each bar shows the number of hardware contexts utilized at peak performance. Note that many transactions peak long before utilizing all 64 available contexts. Each column in the graph shows the fraction of CPU time a transaction spent in both work and contention, both inside and outside the lock manager. The results confirm that the lock manager is a large bottleneck in the system, especially for the smaller transactions, such as those from TATP. As expected, the largest TPC-C transactions do not suffer from the lock manager bot- 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 79 Throughput (ktps) TATP 60 50 TPC-C Payment 40 TPC-B 30 20 10 0 0 16 32 48 Hardware contexts utilized 64 Figure 5.8: Impact of lock manager bottleneck as load varies. Performance should be increasing monotonically for utilization under 64 hardware contexts. tleneck: Stock queries a large amount of data and thus amortizes the cost of acquiring high-level locks; Delivery is not only large but also introduces true lock contention which blocks transactions so they do not compete for the lock manager. The measured (useful) lock manager overheads range between 10-20%, corroborating the results in [HAMS08]. The profiler results also indicate that the lock manager bottleneck is smaller for mixes of transactions which access a wider variety of tables, like the TATP and TPC-C Mix), even though transaction size has a far stronger effect. Mixing different transactions together reduces the bottleneck for two reasons: different access patterns spread contention over more types of locks and agents running long transactions spend less time in the lock manager, easing pressure. For the workloads with small transactions we expect the bottleneck to grow over time as more cores per chip allow multiple different hotspots in the lock manager at the same time. Distributing hotspots over multiple tables will not eliminate contention in the long run because, even if the number of heavily-accessed tables in a workload grows over time, we do not expect it to grow uniformly or nearly as fast as core counts. Figure 5.8 illustrates the impact in performance of the lock manager bottleneck as we increase the load on the system along the x-axis from near idle to saturated. Each data series shows the throughput achieved at different CPU utilizations for the TATP and TPC-B benchmarks, as well as for TPC-C Payment transactions. For small numbers of hardware contexts utilized, we see throughput increasing nearly linearly and the system scales well. However, as the number of hardware contexts increases past 32 contexts the lock manager bottleneck begins to impact performance, and by 48 contexts the bottleneck becomes severe enough CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS 80 Breakdown by lock type 120% 6 7 5 10 6 100% 80% 60% 40% 16 8 6 9 6 19 19 37 117 274 1099 68 114 Hot, X-High Hot, X-Row Hot, S-High Hot, S-Row Cold High Cold Row 20% 0% Figure 5.9: Breakdown of SLI-related characteristics for locks acquired by each workload. SLI works with shared- or intention-mode locks, both types are grouped together as S-High and S-Row. 
that throughput starts to drop – the system is unable to utilize effectively the additional processing power available to it. Opportunity for lock inheritance. Two conditions must hold in order for SLI to improve system scalability: most contention in the system should center around the internal state of hot locks, and those locks must be inherit-able, meeting the SLI’s criteria, for passing them from one transaction to the next. The previous section illustrates that, for short transactions, the lock manager is indeed the primary source of contention in the system. We now analyze lock access patterns to evaluate the opportunity for SLI to reduce that contention (and verify its hypothesis). The analysis considers three characteristics: hot vs. cold lock, shared vs. exclusive requests and row-level locks vs. those higher in the hierarchy. SLI targets hot, shared- or intention-mode, high-level locks. We are not interested in cold locks because they do not cause contention within the lock manager, SLI cannot work with exclusive lock modes because it would impact concurrency, and we hypothesize that row-level locks which are hot and in shared or intention mode are too rare to be worth considering. Therefore, SLI will have the most potential to improve performance if a large fraction of locks meet the inheritance criteria and if most remaining locks are cold. We note that it is entirely possible for many transactions to wait on “cold” locks, especially in badly-behaved workloads. However, true lock contention serializes transactions, and the resulting low concurrency reduces contention for the lock’s internal state, making SLI unnecessary. 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI Breakdown by lock type 120% 4 5 3 6 100% 4 9 5 4 5 4 12 81 12 12 49 30 345 31 40 Not Inherited 80% Discarded 60% Invalidated 40% Upgraded 20% Used 0% Figure 5.10: Breakdown of outcomes for locks which SLI could choose to pass between transactions. Figure 5.9 shows a breakdown of the types of locks acquired by each transaction or transaction mix. The number at the top of each column is the average number of locks acquired per transaction. SLI targets locks which are both hot and inherit-able. Hot locks which remain cannot be handled by SLI and ideally contribute only a small fraction of the total. As expected, the smallest transactions acquire few locks but most of those locks are inherit-able and many are hot. As transactions acquire more and more locks the number of hot and inherit-able locks does not increase as quickly, indicating lower contention in the lock manager and less opportunity for SLI. We observe that, for the workloads analyzed here, there are very few, if any, hot non-inheritable locks and that transactions with the most hot and inherit-able locks also experience the highest contention in the lock manager according to Figure 5.8. Together these indicate that indeed SLI has the potential to reduce or eliminate the lock manager bottleneck. We note that, though there are relatively few hot and inheritable locks in the breakdown, transactions inheriting them will have a disproportionate impact in reducing contention. Row-level locks, though numerous, are not usually hot, and even less often both hot and inherit-able. Effectiveness of lock inheritance We first examine the effectiveness of SLI in passing locks between transactions. When inheritance is effective most hot locks in the system are inherited and used by succeeding transactions. 
SLI will not eliminate fully the lock manager bottleneck if hot locks are not inherited, or remain unused and are discarded, or are invalidated before transactions reclaim them. 82 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS CPU utilization (out of 64) 64 48 32 16 SLI contention LM contention Other contention SLI overhead LM overhead Computation 0 Figure 5.11: Breakdown of CPU utilization for each workload when SLI active and the machine is saturated. Lower contention is better. Figure 5.10 shows the breakdown of outcomes for only the hot locks in the system for each transaction and mix. SLI is selective, passing only hot locks between transactions. For shorter transactions most locks are hot, though a significant fraction of them are invalidated and cannot be used. The longest transactions have virtually no hot locks because they acquire so many that relatively little time per transaction goes to any one request. We also note that mixing multiple transaction types increases the number of locks which are invalidated, and also increases the number of useless locks which transactions eventually discard, potentially due to bimodal patterns discussed previously. However, as we will see in the next section, the locks which are successfully inherited are also the ones responsible for most of the lock manager bottleneck. Time breakdowns with SLI We expect SLI to work best when there is a heavy load of many small and usually nonconflicting transactions. Workloads under low load or with transactions which are large or conflicting, will not benefit nearly as much. Figure 5.11 shows the work breakdown of transactions when SLI is active. Significantly, none of the transactions has a large contribution from lock manager contention any more. This indicates that SLI is effective in identifying and passing the locks which cause most lock manager contention. We also note that SLI has low overhead. Even in the worst case it adds only 5% overhead usually with a corresponding decrease in lock manager overhead. For example, locks which are inherited but never used must still be released, and that overhead counts toward SLI, not the lock manager. In most cases, contention in the lock manager is replaced by useful work, 5.3. AVOID UN-SCALABLE CRITICAL SECTIONS WITH SLI 83 suggesting a significant performance improvement. However, the NewOrder transaction sees a shift of contention from the lock manager to other areas, mostly the space manager. As expected from Figure 5.7, the two large TPC-C transactions are virtually unchanged by SLI because they did not have a significant lock manager component to begin with. Ideally, SLI would also reduce transaction overhead by avoiding calls to the lock manager. However, we observe this effect to be negligible (< 4%) for even the shortest transactions, given the large fraction of locks which are not inherited. 9 Overall, when SLI is active transactions spend 75% or more of the time doing useful work even though the system is fully loaded, in contrast with Figure 5.7. For example, in TATP SLI exhibits lower contention at 95% machine utilization than baseline does at 60% utilization. Thus, SLI gives large speedup for this workload because it not only eliminates contention for existing load, but allows load to increase without the contention returning. Anatomy of critical sections and performance impact with SLI The primary goal of SLI is to reduce contention within the lock manager so it does not impede scalability. 
SLI achieves this goal by avoiding to interact with the lock manager of the acquisition/release of some “hot” locks. The second bar of Figure 5.3 shows the anatomy of critical sections for the simple TATP UpdateLocation transaction, when SLI is active. We observe that the number of critical sections entered on average is significantly reduced. Two are the reasons for this reduction. First, SLI is efficient in picking the right locks to inherit, so transactions need to interact with the lock manger less frequently. Also, as we discussed in Section 5.3.1, the metadata information is propagated from one transaction to the next, reducing the interaction with the metadata manager as well. The final result is that the number of un-scalable critical sections drops by 30% (from 60 to 40). Given the significant reduction of the number of un-scalable critical sections, we expect a significant boost in overall performance. Figure 5.12 compares performance of the baseline system with SLI on two benchmarks, TATP and TPC-B, which consist of short running transactions that put pressure on the lock manager component. As expected, TATP and TPC-B with their short transactions benefit significantly from SLI. Baseline system stops scaling somewhere between 32 and 48 hardware contexts. With SLI enabled, the system’s performance in TATP increases almost linearly up to 64 hardware contexts, as many as the machine has. In TPC-B performance increases again up to 64 hardware contexts, but this 9 Chapter 6 presents a technique that not only eliminates contention on the lock manager, but also significantly reduces locking overheads. 84 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS Throughput (ktps) 100 TATP TATP (SLI) 80 TPC-B 60 TPC-B (SLI) 40 20 0 0 16 32 48 Hardware Contexts Utilized 64 Figure 5.12: Performance improvement due to SLI, for TATP and TPC-B. time the increase is not linear because logging becomes the bottleneck. In the next section (Section 5.4) we focus on the logging-related problem(s). 5.4 Downgrading log buffer insertions to composable critical sections The log manager, an essential component of any transaction processing system which ensures the system ability to recover from crashes [MHL+ 92], is another potential source of bottlenecks. 10 The logging-related problem we are focusing on in this section is the log record inserts to the main memory log buffer. The log buffer inserts belong to the un-scalable type of critical section. As hardware parallelism increases, a large number of threads simultaneously attempt to insert to a centralized log buffer and the contention becomes a significant and growing fraction of total execution time. Continuing the analysis of the critical sections of the TATP UpdateLocation transaction, the second bar of Figure 5.3 shows that when SLI is enabled the log buffer inserts constitute around the 20% of the un-scalable critical sections. Where current hardware trends generally reduce other logging-related bottlenecks (e.g. solid state drives reduce I/O latencies [LMP+ 08, Che09, JPS+ 10]), each successive processor generation aggravates contention for log buffer inserts. We therefore consider the log buffer inserts as the most challenging logging-related problem with respect to future scalability. 10 See [JPS+ 10] and [JPS+ 11] for a detailed discussion on logging-related bottlenecks on multicore and multisocket hardware. 5.4. 
DOWNGRADING LOG BUFFER INSERTIONS 5.4.1 85 Log buffer designs Most database engines use some variant of ARIES [MHL+ 92], which assigns each log record a unique log sequence number (LSN). The LSN encodes a record’s disk address, acts as a timestamp for data pages written to disk, and serves as a pointer to log records both in memory and on disk. It is also convenient for LSN to serve as addresses in the log buffer, so that generating an LSN also reserves buffer space. In order to keep the database consistent in spite of repeated failures, ARIES imposes strict ordering constraints on LSN generation. While a total ordering is not technically required for correctness, valid partial orders tend to be too complex and interdependent to be worth pursuing as a performance optimization. 11 . Because of its serial nature, LSN generation and the accompanying log inserts impose serious limitation on parallelism in the system. In this section we attack the problem at its root, developing techniques which allow LSN generation to proceed in parallel. We achieve parallelism by adapting the concept of “elimination”, introduced at [ST97], to allow the system to generate sequence numbers in groups. An especially desirable effect of this grouping is that increased load leads to larger groups rather than causing contention. We also explore the performance trade-offs that come from decoupling the LSN generation process from the actual log buffer insert operation. We begin by considering the basic log insertion algorithm, which consists of three distinct phases: 1. LSN generation and log buffer acquire. The thread first claims the space it will eventually fill with the intended log record 2. Log record insertion. The thread copies the log record in the buffer space it has claimed. 3. Log buffer release. The transaction releases the buffer space, which allows the log manager to write the record to disk. Baseline implementation A straightforward log insert implementation acquires a central mutex before performing all three phases and the mutex is released at the same time as the buffer. That is, in the straightforward implementation there is a single un-scalable critical section which every thread needs to execute for every log insert it makes. This approach is attractive 12 for its simplicity: log inserts are relatively inexpensive, and in the monolithic case buffer release is simplified to a mutex release. Further, even though 11 12 We explore this option at [JPS+ 11] . There are indications that popular systems employ this simple design. One of them is PostgreSQL [Pos11] . 86 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS Figure 5.13: Illustrations of several log buffer designs. The baseline system can be optimized for shorter critical path (D), fewer threads attempting log inserts (C), or both (CD). LSN generation is fully serial, it is also short and predictable (barring exceptional situations such as buffer wraparound or full log buffer, which are comparatively rare). The monolithic log insert suffers a major weakness because it serializes buffer fill operations, even though buffer regions never overlap, adding their cost directly to the critical path. In addition, log record sizes vary significantly, making copying costs unpredictable. Figure 5.13 (B) illustrates how a single large log record can impose long delays on later threads. 
This situation arises frequently in our system because the distribution of log records has two strong peaks at 40B and 264B (a 6x difference) and the largest log records can occupy several KB each. To permanently eliminate contention for the log buffer, we seek to make the cost of accessing the log independent of both the sizes of the log records being inserted and the number of threads inserting them. The following subsections explore both approaches and propose a hybrid solution which combines them. Consolidating buffer allocation A log record consists of a standard header followed by an arbitrary payload. Log buffer allocation is composable in the sense that two successive requests also begin with a log header and end with an arbitrary payload. We exploit this composability by allowing threads to combine their requests into groups, carve up and fill the group’s buffer space off the critical path, and finally release it back to the log as a unit. To this end we extend the idea of elimination-based backoff [HSY04, MNSS05], a hybrid approach combining elimination trees 5.4. DOWNGRADING LOG BUFFER INSERTIONS 87 [ST97] with backoff. Threads which encounter contention back off, but instead of sleeping or counting cycles they congregate at an elimination array, a set of auxiliary locations where they attempt to combine their requests with those of others. When elimination is successful threads satisfy their requests without returning to the shared resource at all, making the backoff very effective. For example, stacks are amenable to elimination because push() and pop() requests which encounter each other while backing off can cancel each other directly via the elimination array and leave. Similarly, threads which encounter contention for log inserts back off to a consolidation array and combine their requests before reattempting the log buffer. We use the term “consolidation” instead of “elimination” because, unlike with a stack or counter, threads must still cooperate after combining their requests so that the last to finish can release the group’s buffer space. Like an elimination array, any number of threads can consolidate into a single request, effectively bounding contention at the log buffer to the number of array entries protecting the log buffer, rather than the number of threads in the system. The net effect of consolidation is that only the first thread from each group competes to acquire buffer space from the log, and only the last thread to leave must wait to release it. Figure 5.13 (C) depicts the effect of consolidation; the first thread to arrive is joined by two others while it waits on the log mutex and all three proceed in parallel once the mutex acquire succeeds. However, as the figure also shows, consolidation leaves significant wait times because only buffer fill operations within a group proceed in parallel; operations between groups are still serialized. Given enough threads in the system, at least one thread of each group is likely to insert a large log record, delaying later groups. Decoupling buffer fill and delegating release Because buffer fill operations are not inherently serial (records never overlap) and have variable costs, they are highly attractive targets to move off the critical path. All threads which have acquired buffer regions can safely fill those regions in any order as long as they release their regions in LSN order. We therefore modify the original algorithm so that threads release the mutex immediately after acquiring buffer space. 
Buffer fill operations thus become pipelined, with a new buffer fill starting as soon as the next thread can acquire its own buffer region. Decoupling log inserts from holding locks results in a non-trivial buffer release operation which becomes a second critical section. Like LSN generation, buffer release must be serialized to avoid creating gaps in the log. Log records must be written to disk in LSN order because recovery must stop at the first gap it encounters; in the event of a crash any com- 88 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS mitted transactions beyond a gap would be lost. No mutex is required, but before releasing its own buffer region, each thread must wait until the previous buffer has been released. With pipelining in place, arriving threads can overlap their buffer fills with that of a large log record, without waiting for it to finish first. Figure 5.13 (D) illustrates the improved concurrency that results, with significantly reduced wait times at the buffer acquire phase. Under most circumstances, log record sizes do not vary enough that threads wait for previous ones to release the buffer, but high skew in the record size distribution will limit scalability because a very large record will force small ones which follow to wait for it to complete. A further optimization (not shown in the figure) allows threads to delegate their buffer release to a predecessor which has still not completed. To summarize the delegated buffer release protocol, threads which would normally have to wait for a predecessor instead attempt to mark their buffer as abandoned using an atomic compare-and-swap operation. Threads which succeed in abandoning their buffer before the predecessor notifies them are free to leave, forcing the predecessor to release all buffers that would have waited for it. In addition to making the system much less sensitive to large log inserts, it also improves performance because a single thread releases groups of buffers in a tight loop rather than communicating the releases with other threads. Putting it all together: a hybrid log buffer In the previous two subsections we outlined (a) a consolidation array which reduces the number of threads entering the log insert critical section, and (b) a decoupled buffer fill which allows threads to pipeline buffer fills outside the critical section. Neither approach eliminates all contention by itself, but the two are orthogonal and can be combined easily. Consolidating groups of threads limits log contention to a constant that does not depend on the number threads in the system, while providing a degree of buffer insert pipelining (within groups but not between them). Decoupling buffer fill operations allows pipelining between groups and reduces the log critical section length by moving buffer outside, thus making performance relatively insensitive to log record sizes. The resulting design, shown in Figure 5.13 (CD), achieves bounded contention for threads in the buffer acquire stage and maximum pipelining of all operations. As we will see in the evaluation section, the hybrid version consistently outperforms the other configurations by combining their best features. 5.4.2 Evaluation of log buffer re-design This section details the sensitivity of the consolidation array based techniques to various parameters. 5.4. 
DOWNGRADING LOG BUFFER INSERTIONS 89 Experimental setup To isolate the log buffer inserts from any other logging-related bottlenecks we are using a modified version of Shore-MT where we integrated the optimizations described in [JPS+ 10], namely Early Lock Release and Flush Pipelining. In addition, to eliminate contention in the lock manager and focus on logging, we employ SLI (see previous section). We run the TATP, TPC-B and TPC-C benchmarks as well as a log insert microbenchmark. For that microbenchmark, we extract a subset of Shore-MT’s log manager as an executable which supports only log insertions without flushes to disk or performing other work, thereby isolating the log buffer performance. We then vary the number of threads, the log record size and distribution, and the timing of inserts. All results report the average of 10 30-second runs unless stated otherwise. We do not report variance because all measurements were within 2% of the mean. Measurements come from timers in the benchmark driver as well as Sun’s profiling tools. Profiling is highly effective at identifying software bottlenecks even in the early stages before they begin to impact performance, because problematic functions can be seen to shift their position in the timing breakdowns. All experiments were performed on a Sun Niagara II machine with 64 hardware contexts and 64GB of main memory running Solaris 10. Because our focus is on the logging subsystem, and because modern transaction processing workloads are largely memory resident [SMA+ 07], we use memory-resident data sets, while disk still provides durability. Log buffer contention First, to set the stage, we measure log buffer contention. Already from the second bar of Figure 5.3 we expect the log buffer inserts to be a potential scalability bottleneck. Figure 5.14 shows the time breakdown for Shore-MT using its baseline log buffer implementation as an increasing number of clients submit the UpdateLocation transaction from TATP. As the load increases, the time each transaction spends contenting for the log buffer increases at a point which the log buffer contention becomes the bottleneck taking more than 35% of the execution time. This problem will only grow as processor vendors release more parallel multi-core hardware. Impact of log buffer optimizations (microbenchmarks) A database log manager should be able to sustain any number of threads regardless of the size of the log records they insert, limited only by memory and compute bandwidth. Next, through a series of microbenchmarks we determine how well the log buffer designs proposed in Section 5.4.1 meet these goals. In each experiment we compare the baseline CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS 90 CPU time (secs) 100% 80% Log mgr. contention 60% Other contention 40% Log mgr. work 20% Useful work 0% 2% 13% 25% 38% 50% 63% 75% 88% 97% Load Figure 5.14: Breakdown of the execution time of Shore-MT with two log optimizations (ELR and flush pipelining) enabled, running TATP UpdateLocation transactions as load increases. The log buffer inserts become the bottleneck. implementation with the consolidation array (C), decoupled buffer insert (D), and the hybrid solution combining the two optimizations (CD). We examine scalability with respect to both thread counts and log record sizes and we analyze how the consolidation array’s size impacts its performance. Further experiments explore the impact of skew in the record size distribution and of changing the number of slots in the slot array. 
Scalability with respect to thread count. The most important metric of a log buffer is how many insertions it can sustain per unit time, or the bandwidth which the log can sustain at a given average log insert size. It is important because core counts grow exponentially while log record sizes are application- and DBMS-dependent and are fixed. The average record size in our workloads is about 120 bytes and a high-performance application generates between 100 and 200MBps of log, or between 800K and 1.6M log insertions per second. Figure 5.15 (left) shows the performance of the log insertion microbenchmark for records of an average size of 120B as the number of threads varies along the x-axis. Each data series shows one of the log variants. We can see that the baseline implementation quickly becomes saturated, peaking at roughly 140MB/s and falling slowly as contention increases further. 13 Due to its complexity, the consolidation array starts out with lower throughput than the baseline. But once contention increases, the threads combine their requests and performance scales linearly. In contrast, decoupled insertions avoid the initial performance penalty and perform better, but eventually the growing contention degrades performance and perform worst than the consolidation array. Finally, the hybrid approach combines the 13 Notice that even such bandwidth would saturate any mechanical disk drive 5.4. DOWNGRADING LOG BUFFER INSERTIONS 91 Throughput (GB/s) 100 CD in L1 CD 10 C D 1 Baseline 0.1 0.01 1 4 Thread count 16 64 12 120 1200 12000 Log record size (bytes) Figure 5.15: Sensitivity analysis of the consolidation array with respect to thread count and log record size. The hybrid design combines the benefits of both optimizations. best properties of both optimizations, eliminating most of the startup cost from (C) while limiting the contention which (D) suffers. The drop in scalability near the end is a hardware limitation, as described in Section 5.4.2. Overall, we see that while both consolidation and decoupling are effective at reducing contention, both have limitations which we overcome by combining the two, achieving near-linear scalability. Scalability with respect to log record size. In addition to thread counts, log record sizes also have a strong influence on the performance of the log buffer. In the case of the baseline and consolidated variants, larger record sizes increase the critical section length; in all cases, however, larger record sizes decrease the number of log inserts one thread can perform because it must copy an increasing amount of data per insertion. Figure 5.15 (right) shows the impact of these two factors, plotting sustained bandwidth achieved by 64 threads as they insert log records ranging between 48B and 12KB (the largest record size in Shore-MT). As log records grow the baseline performs better, but there is always enough contention that makes all other approaches more attractive. The consolidated variant (C) performs better at small records sizes as it can handle contention much better than the decoupled record insert (D). But once the records size is over 1KB, contention becomes low and the decoupled insert variant fares better as more log inserts can be pipelined at the same time. The hybrid variant again significantly outperforms its base components across the whole range, but in the end all three become bandwidth-limited as they saturate the machine’s memory system. 92 CHAPTER 5. 
ATTACKING UN-SCALABLE CRITICAL SECTIONS Figure 5.16: Sensitivity to the number of slots and thread count in the consolidation array. Lighter colors indicate higher bandwidth. Finally, we modify the microbenchmark so that threads insert their log records repeatedly into the same thread-local storage, which is L1 cache resident. With the memory bandwidth limitation removed, the hybrid variant continues to scale linearly with record sizes until it becomes CPU-limited at roughly 21GBps (nearly 20x higher throughput than modern systems can reach). Sensitivity to slot array size. Our last microbenchmark analyzes whether (and by how much) the performance of the consolidation array is affected by the number of available slots. Ideally the performance should depend only on the hardware and be stable as thread counts vary. Figure 5.16 shows a contour map of the space of slot sizes and thread counts, where the height of each data point is its sustained bandwidth. Lighter colors indicate higher bandwidth, with contour lines marking specific throughput levels. We achieve peak performance with 3-4 slots, with lower thread counts peaking with fewer and high thread counts requiring a somewhat larger array. The optimal slot number corresponds closely with the number of threads required to saturate the baseline log which the consolidation array protects. Based on these results we fix the consolidation array size at four slots to favor high thread counts; at low thread counts the log is not on the critical path of the system and its peak performance therefore matters much less than at high thread counts. Anatomy of critical sections and impact in overall performance To complete the experimental analysis, we measure the impact of the log buffer optimization in the overall system performance. The right-most bar of Figure 5.3 shows the anatomy of 5.4. DOWNGRADING LOG BUFFER INSERTIONS 93 TATP-UpdateLocation TPC-B Throughput (Ktps) 175 150 125 100 75 50 25 0 Throughput (Ktps) Hybrid (CD) Opt. Baseline Baseline Hybrid (CD) Opt. Baseline Baseline 80 60 40 20 0 0 10 20 30 40 #CPU Utilized 50 60 0 10 20 30 40 50 60 #CPUs utilized Figure 5.17: Overall performance improvement provided by the hybrid log buffer design when the systems run TATP UpdateLocation transactions (left) and the TPC-B benchmark (right). The optimized baseline contains the ELR and Flush Pipelining optimizations [JPS+ 10]. The hybrid log buffer achieves the highest performance and displays no lurking bottleneck. critical sections for the simple TATP UpdateLocation transaction, when, on top of SLI, we employ the hybrid log buffer design (in this graph we refer the hybrid design as “Aether”). The only difference with the second bar is that now the majority of the critical sections related to the log manager are composable, instead of un-scalable. Hence, we expect better scalability. Figure 5.17 captures the scalability of Shore-MT running TATP UpdateLocation transactions (left) and the TPC-B benchmark (right). We plot throughput as the number of client threads varies along the x-axis. The hybrid (consolidated) log buffer design improves performance by 7% and 15% respectively by eliminating log contention. The performance improvements seem to be modest. This happens simply because the peak transaction execution rate (which is achieved when the machine is saturated) does not generate enough log bandwidth for the hybrid design to significantly outperform the baseline, which Figure 5.15 shows that it can sustain approximately 140MBps. 
Nevertheless, by converting the log buffer inserts from un-scalable critical sections to composable, the hybrid log buffer displays no lurking logging-related bottlenecks and our microbenchmarks suggest that it has significant headroom to accept additional log traffic as systems scale in the future. 94 5.5 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS Related work There is a broad set of literature on scaling the performance of database systems in general, and transaction processing system in particular. Up until recently, however, the majority of the studies focused on scaling out the performance rather than scaling up. Since the two techniques we presented in this section (speculative lock inheritance and consolidated log buffer inserts) affect the lock and the log manager, in the following two subsections we briefly describe work related with those two significant transaction processing components. 5.5.1 Reducing lock overhead and contention The guiding concept of speculative lock inheritance – not releasing locks between transactions – appears in Rdb/VMS [Jos91] as a way to reduce network communication costs. Locks in this distributed database physically migrate to nodes whose transactions acquire them. The authors highlight very briefly a “lock carry-over” optimization which allows a node to avoid the overhead of returning the lock to its home node when transactions complete by caching it locally, as long as no conflicting lock requests have arrived. Each carry-over saves at least one round trip over the network in the event the lock is reused by a later transaction, improving the performance of a two-node system by over 60%. In this chapter, we apply the concept of lock carry-over to the single-node Shore-MT engine to solve the problem of contention for lock state, which did not exist with the high network overheads and low node counts (1-3 in the evaluation) experienced by Rdb/VMS. We also detail an implementation designed for modern database engines running on multicore hardware with shared memory and caches, and where transactions, not nodes, hold locks. SLI allows a centralized lock manager to distribute requests among the many threads that would otherwise contend with each other. IBM’s DB2 provides a performance tuning registry variable, DB2 KEEPTABLELOCK [IBM11], which allows transactions or even connections to retain read-mode table locks between uses, again exploiting the idea of not releasing locks unless necessary. However, transactions only benefit from the setting if they repeatedly release and reacquire the same locks, and the documentation notes that retaining table locks for the life of a connection leads to “poor concurrency” because other transactions cannot make updates until the connection closes. The setting is disabled by default. Multiversioned buffer pools [BJK+ 97] allow writers to update copies of pages rather than waiting for readers to finish. Copying avoids the need for low-level locking because older versions remain available to readers, but it does not remove the need for hierarchical locks or the corresponding contention which SLI addresses. In addition, for the common case where a transaction updates only a few bytes per record accessed, multiversioning imposes 5.6. CONCLUSION 95 the cost of copying an entire database page per record. Finally, multiversioning provides “snapshot isolation,” which suffers from certain non-intuitive update anomalies that are only partly addressed to date [JFRS07, AFR09]. 
5.5.2 Handling logging-related overheads Logging is one of the most important components of a database system, but also is one of the most complicated. Even in a single-threaded database engine the overhead of logging is significant. For example, Harizopoulos et al. [HAMS08] report that in a single-threaded database engine logging accounts for roughly 12% of the total time in a typical OLTP workload. Virtually all database engines employ some variant of ARIES [MHL+ 92], a sophisticated write-ahead logging system which integrates concurrency control with transaction rollback and disaster recovery, and allows the system to recover fully even if recovery is interrupted repeatedly by new crashes. To achieve its high robustness with good performance, ARIES couples tightly with the rest of the system, particularly the lock and buffer pool managers, and has a strong influence on the design of access methods such as B+Tree indexes [Moh90, ML92]. Main-memory database engines [DKO+ 84] impose a special challenge for log implementations because the log is the only I/O operation of a given transaction. Not only is the I/O time responsible for a large fraction of total response time, but short transactions also lead to high concurrency and contention for the log buffer. Some proposals go so far as to eliminate the log (and its overheads) altogether [SMA+ 07], replicating each transaction to multiple database instances and relying on hot fail-over to maintain durability. However, replication has its own large set of challenges [GHOS96], and it is a field of active research [TA10]. 5.6 Conclusion In this chapter we detailed two mechanisms which address scalability bottlenecks in two essential components of any transaction processing system, the lock manager and the log manager. Both mechanisms provide significant improvements in the performance of the baseline system, and reduce the number of un-scalable critical sections. Unfortunately, no matter the optimizations, the transaction execution codepath still contains a large number of critical sections. Figure 5.3 shows that the optimal design still enters 35 un-scalable critical sections. With hardware parallelism doubling each processor generation, eventually some of those critical sections will hamper scalability. This suggests that for embarrassingly 96 CHAPTER 5. ATTACKING UN-SCALABLE CRITICAL SECTIONS parallel execution, we need to depart from the conventional execution model and investigate more radical approaches, which is the topic of the next part (Part III). Part III Re-architecting transaction processing 97 99 Chapter 6 Data-oriented Transaction Execution While hardware technology has undergone major advancements over the past decade, transaction processing systems have remained largely unchanged. The number of cores on a chip grows exponentially, following Moore’s Law, allowing for an ever-increasing number of transactions to execute in parallel. As the number of concurrently-executing transactions increases, contended critical sections become scalability burdens. In typical transaction processing systems the centralized lock manager is often the first contended component and scalability bottleneck. In this chapter, we take a more rigorous approach against scalability bottlenecks of conventional transaction processing. We identify the conventional thread-to-transaction assignment policy as the primary cause of contention. 
Then, we design DORA, a system that decomposes each transaction to smaller actions and assigns actions to threads based on which data each action is about to access. DORA’s design allows each thread to mostly access thread-local data structures, minimizing interaction with the contention-prone centralized lock manager. Built on top of a conventional storage engine, DORA maintains all the ACID properties. Evaluation of a prototype implementation of DORA on a multicore system demonstrates that DORA eliminates any contention related to the lock manager and attains up to 4.8x higher throughput than the state-of-the-art storage engine when running a variety of synthetic and real-world OLTP workloads. 1 6.1 Introduction The diminishing returns of increasing on-chip clock frequency coupled with power and thermal limitations have led hardware vendors to place multiple cores on a single die and rely on thread-level parallelism for improved performance. Today’s multicore processors feature 64 hardware contexts on a single chip equipped with 8 cores2 , while multicores targeting spe1 2 This chapter highlights material presented at VLDB 2010 [PJHA10]. Modern cores support multiple hardware contexts, interleaving their instruction streams to improve CPU utilization. CHAPTER 6. DATA-ORIENTED TRANSACTION EXECUTION 100 DISTRICTS Thread-to-transaction (Conventional) Thread-to-data (DORA) 100 100 80 80 60 60 40 40 20 20 0 0 0.2 0.4 0.6 Time (secs) 0.8 0.2 0.4 0.6 Time (secs) 0.8 Figure 6.1: Comparison of the trace of the record accesses by the threads of a system that applies the conventional thread-to-transaction assignment of work policy (left) and a system that applies the thread-to-data policy (right). The data accesses of the conventional threadto-transaction system are uncoordinated and complex. On the other hand, the data accesses of the thread-to-data system are coordinated and show regularity. cialized domains find market viability at even larger scales. With experts in both industry and academia forecasting that the number of cores on a chip will follow Moore’s Law, an exponentially-growing number of cores will be available with each new process generation. As the number of hardware contexts on a chip increases exponentially, an unprecedented number of threads execute concurrently, contending for access to shared resources. Threadparallel applications running on multicores suffer of increasing delays in heavily-contended critical sections, with detrimental performance effects [JPH+ 09]. To tap the increasing computational power of multicores, software systems must alleviate such contention bottlenecks and allow performance to scale commensurately with the number of cores. Online transaction processing (OLTP) is an indispensable operation in most enterprises. In the past decades, transaction processing systems have evolved into sophisticated software systems with code bases measuring in the millions of lines. Several fundamental design principles, however, have remained largely unchanged since their inception. The execution of transaction processing systems is full of critical sections [JPH+ 09]. Consequently, these systems encounter significant performance and scalability problems on highly-parallel hardware. To cope with the scalability problems of transaction processing systems, researchers have suggested employing shared-nothing configurations [DGS+ 90] on a single chip [SMA+ 07] and/or dropping some of the ACID properties [DHJ+ 07, LBD+ 12]. 6.1. 
6.1.1 Thread-to-transaction vs. Thread-to-data

In this chapter, we argue that the primary cause of the contention problem is the uncoordinated data accesses that are characteristic of conventional transaction processing systems. These systems assign each transaction to a worker thread, a mechanism we refer to as thread-to-transaction assignment. Because each transaction runs on a separate thread, threads contend with each other on every single shared data access, entering a very large number of un-scalable critical sections.

The chaotic, uncoordinated access pattern of the thread-to-transaction (i.e. conventional) assignment policy becomes easily apparent with visual inspection. Figure 6.1 (left) depicts the accesses issued by each worker thread of a conventional transaction processing system to each one of the records of the District table in a TPC-C database with 10 Warehouses (a TPC-C database of scaling factor 10). The system is configured with 10 worker threads and the workload consists of 20 clients repeatedly submitting Payment transactions from the TPC-C benchmark [TPC07], while we trace only 0.7 seconds of execution; the system and workload configuration are kept small to enhance the graph's visibility. The access patterns of each transaction, and consequently of each thread, are arbitrary and totally uncoordinated.

Figure 6.1: Comparison of the trace of the record accesses by the threads of a system that applies the conventional thread-to-transaction assignment of work policy (left) and a system that applies the thread-to-data policy (right). The data accesses of the conventional thread-to-transaction system are uncoordinated and complex. On the other hand, the data accesses of the thread-to-data system are coordinated and show regularity.

To ensure data integrity under those uncoordinated accesses, each thread enters a large number of critical sections in the short lifetime of each transaction it executes. Critical sections, however, incur latch acquisitions and releases, whose overhead increases with the number of concurrent threads. Even more worrisome is that some of those critical sections are un-scalable: critical sections whose contention increases with the number of concurrent threads.

To assess the performance overhead of critical section contention, Figure 6.2 depicts the throughput attained by a state-of-the-art storage manager (Shore-MT [JPH+09] and Section 5.2) as the machine utilization increases. The workload consists of clients repeatedly submitting GetSubscriberData transactions from the TATP benchmark [NWMR09] (methodology detailed in Section 6.4.1). As the machine utilization increases, the performance per CPU utilization drops. When utilizing all 64 hardware contexts, the per-hardware-context performance drops by more than 80%. Figure 6.3 (left) shows that the contention within the lock manager quickly dominates the execution of the conventional system. At 64 hardware contexts the system spends more than 85% of its execution time on threads waiting to execute critical sections inside the lock manager.

Figure 6.2: Throughput per hardware context achieved by a conventional system and DORA when they execute the TATP GetSubscriberData transaction. Ideally, per-thread performance does not depend on the number of threads in the system.

Based on the observation that uncoordinated accesses to data lead to high levels of contention, we propose a data-oriented architecture (DORA) to alleviate contention. Rather than coupling each thread with a transaction, DORA couples each thread with a disjoint subset of the database. Transactions flow from one thread to the other as they access different data, a mechanism we call thread-to-data assignment.
DORA decomposes the transactions into smaller actions according to the data they access, and routes them to the corresponding threads for execution. In essence, instead of pulling data (database records) to the computation (transaction), DORA distributes the computation to wherever the data is mapped.

Figure 6.1 (right) illustrates the effect of the data-oriented assignment of work on data accesses. It plots the data access patterns issued by a prototype DORA system, which employs the thread-to-data assignment. The accesses in DORA are coordinated and show regularity. A system adopting thread-to-data assignment can exploit the regular pattern of data accesses, reducing the pressure on contended components. In DORA, each thread coordinates accesses to its subset of data using a private locking mechanism. By limiting thread interactions with the centralized lock manager, DORA eliminates the contention in it (Figure 6.3 (right)) and provides better scalability (Figure 6.2).

Figure 6.3: Time breakdown of the conventional (left) and DORA (right) systems executing TATP GetSubscriberData transactions. Large and/or growing contention indicates poor scalability and performance.

DORA exploits the low-latency, high-bandwidth inter-core communication of multicore systems. Transactions flow from one thread to the other with minimal overhead, as each thread accesses different parts of the database. Figure 6.4 compares the time breakdown of a conventional transaction processing system and a prototype DORA implementation when all 64 hardware contexts of a Sun Niagara II machine [JN07] are utilized running Nokia's TATP benchmark [NWMR09] and OrderStatus transactions from the TPC-C benchmark [TPC07]. From the breakdowns of TATP (left) we see that the DORA prototype eliminates the contention on the lock manager. Also, from the breakdowns of TPC-C OrderStatus (right), we see that DORA substitutes the heavy-weight centralized lock management with a much lighter-weight thread-local locking mechanism.

6.1.2 When DORA is needed

DORA is a novel transaction processing design which is useful for transactional workloads with very high execution rates that put pressure on the components of the transaction processing engine, such as the lock manager, when running on a multicore node. DORA makes no compromises in the consistency level it offers; neither does it require any modifications to the application layer, even though, as we will see, an application that is aware of DORA's partitioning is expected to perform better. Thus, DORA provides a solution for the scalability of "traditional" transaction processing within a single multicore node.

A DORA system can replace existing transaction processing systems without requiring changes in the legacy application code. In addition, it maintains the ACID properties [GR92] and makes no compromises in the data management functionality it provides (e.g. the ability to perform joins) or the supported interface (e.g. not only key-value accesses). If the database cannot fit in a single node and a scale-out solution is needed, one can easily employ the DORA design as the building block of that scale-out solution.
For applications that can tolerate relaxed consistency requirements or limited data management functionality, other solutions (possibly following the popular "NewSQL" or "NoSQL" approach) may also be suitable.

Figure 6.4: Time breakdowns of the conventional and DORA systems when the machine is fully utilized running the TATP workload (left) and TPC-C OrderStatus transactions (right). A large 'work' component indicates high throughput.

6.1.3 Contributions and chapter organization

This chapter, which is the centerpiece of the entire dissertation, makes three contributions.

1. We demonstrate that the conventional thread-to-transaction assignment results in contention at the lock manager that severely limits performance and scalability on multicores.

2. We propose DORA, a data-oriented architecture that exhibits predictable access patterns and allows us to substitute the heavyweight centralized lock manager with a lightweight thread-local locking mechanism. The result is a shared-everything system that scales to high core counts without weakening the ACID properties.

3. We evaluate a prototype DORA transaction execution engine and show that it attains up to 82% higher peak throughput than a state-of-the-art storage manager. Without admission control, the performance benefits for DORA can be up to 4.8x. Additionally, when unsaturated, DORA achieves up to 60% lower response times because it exploits the intra-transaction parallelism inherent in many transactions.

The rest of the chapter is organized as follows. Section 6.2 explains why a conventional transaction processing system may suffer from contention in its lock manager. Section 6.3 presents DORA, an architecture based on the thread-to-data assignment, and Section 6.4 evaluates the performance of a prototype DORA OLTP engine. Section 6.5 discusses weaknesses of DORA. Finally, Section 6.6 presents related work and Section 6.7 concludes.

6.2 Contention in the lock manager

In this section, we explain why in typical OLTP workloads the lock manager of conventional systems is often the first contended component and the obstacle to scalability. A typical OLTP workload consists of a large number of concurrent, short-lived transactions, each accessing a small fraction (ones to tens of records) of a large dataset. Each transaction independently executes on a separate thread. To guarantee data integrity, transactions enter a large number of critical sections to coordinate accesses to shared resources. One of those shared resources is the logical locks. (We use the term "logical locking" instead of the more popular "locking" to emphasize its difference from latching. Latching protects the physical consistency of main-memory data structures; logical locking protects the logical consistency of database resources, such as records and tables.) The lock manager is responsible for maintaining isolation between concurrently-executing transactions, providing an interface for transactions to request, upgrade, and release locks. Behind the scenes it also ensures that transactions acquire proper intention locks, and performs deadlock prevention and detection.
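To make the role of intention locks concrete, the following stand-alone C++ sketch shows the classic compatibility and supremum (least-upper-bound) tables for hierarchical lock modes, the kind of bookkeeping a lock manager consults on every request and release. This is an illustrative example under the textbook IS/IX/S/SIX/X hierarchy; the names and helpers are ours and do not reflect Shore-MT's actual data structures or API.

// Illustrative sketch of hierarchical lock modes, their compatibility, and
// the supremum (least upper bound) of two modes. Stand-alone example; not
// Shore-MT's actual data structures or API.
#include <cassert>
#include <cstdio>

enum class Mode { NL, IS, IX, S, SIX, X };   // NL = not locked

// Classic compatibility matrix for hierarchical (intention) locking.
constexpr bool kCompat[6][6] = {
    //            NL     IS     IX     S      SIX    X
    /* NL  */ { true,  true,  true,  true,  true,  true  },
    /* IS  */ { true,  true,  true,  true,  true,  false },
    /* IX  */ { true,  true,  true,  false, false, false },
    /* S   */ { true,  true,  false, true,  false, false },
    /* SIX */ { true,  true,  false, false, false, false },
    /* X   */ { true,  false, false, false, false, false },
};

bool compatible(Mode a, Mode b) {
    return kCompat[static_cast<int>(a)][static_cast<int>(b)];
}

// Weakest mode that covers both inputs; a lock manager recomputes this
// "supremum" over all granted requests when a request is released.
Mode supremum(Mode a, Mode b) {
    static constexpr Mode kLub[6][6] = {
        { Mode::NL,  Mode::IS,  Mode::IX,  Mode::S,   Mode::SIX, Mode::X },
        { Mode::IS,  Mode::IS,  Mode::IX,  Mode::S,   Mode::SIX, Mode::X },
        { Mode::IX,  Mode::IX,  Mode::IX,  Mode::SIX, Mode::SIX, Mode::X },
        { Mode::S,   Mode::S,   Mode::SIX, Mode::S,   Mode::SIX, Mode::X },
        { Mode::SIX, Mode::SIX, Mode::SIX, Mode::SIX, Mode::SIX, Mode::X },
        { Mode::X,   Mode::X,   Mode::X,   Mode::X,   Mode::X,   Mode::X },
    };
    return kLub[static_cast<int>(a)][static_cast<int>(b)];
}

int main() {
    // A reader holds IS on a table and S on a record; a writer's IX on the
    // same table is compatible with the reader's IS, so both proceed.
    assert(compatible(Mode::IS, Mode::IX));
    // A full-table S lock, however, blocks any IX requester.
    assert(!compatible(Mode::S, Mode::IX));
    assert(supremum(Mode::IS, Mode::IX) == Mode::IX);
    std::puts("lock-mode checks passed");
    return 0;
}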
Next, we describe the lock manager of the Shore-MT storage engine [JPH+09] (for a more detailed discussion of database locking and Shore-MT's lock manager see Section 2.2). Although the implementation details of commercial systems' lock managers are largely unknown, we expect their implementations to be similar. A possibly varying aspect is that of latches. Shore-MT uses a preemption-resistant variation of the MCS queue-based spin-lock [HSIS05]. On the Sun Niagara II machine, our test bed, and for the CPU loads we use in this study (< 130%), spinning-based implementations outperform any known solution involving blocking [JASA09].

The lock manager of Shore-MT is depicted on the left side of Figure 6.5. In Shore-MT every logical lock is a data structure that contains the lock's mode, the head of a linked list of lock requests (granted or pending), and a latch. When a transaction attempts to acquire a lock, the lock manager first ensures the transaction holds higher-level intention locks, requesting them automatically if needed. If an appropriate coarser-grain lock is found, the request is granted immediately. Otherwise, the manager probes a hash table to find the desired lock. Once the lock is located, it is latched and the new request is appended to the request list. If the request is incompatible with the lock's mode, the transaction must block. Finally, the lock is unlatched and the request returns.

Figure 6.5: Overview of a lock manager, with the inset depicting a lock release. Each lock head stores the lock's mode and a list of granted and pending requests; a release recomputes the new lock mode (supremum), processes upgrades, and grants new requests.

Each transaction maintains a list of all its lock requests, in the order that it acquired them. At transaction completion, the transaction releases the locks one by one, starting from the youngest. To release a lock (shown on the right side of Figure 6.5), the lock manager latches the lock and unlinks the corresponding request from the list. Before unlatching the lock, it traverses the request list to compute the new lock mode and to find any pending requests which may now be granted. Due to longer lists of lock requests, the effort required to grant or release a lock grows with the number of active transactions. Frequently-accessed locks, such as table locks, will have many requests in progress at any given point. Deadlock detection imposes additional lock request list traversals.
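As a rough illustration of why the cost of a release grows with the length of the request list, the following C++ sketch mimics the release path just described: latch the lock head, unlink the releasing request, recompute the group mode, and grant pending compatible requests in FIFO order. It is a simplified, hypothetical rendering under stated assumptions (a std::mutex stands in for Shore-MT's preemption-resistant MCS latch, only S and X modes, no actual waking of blocked threads), not the engine's code.

// Simplified sketch of a lock release: latch the lock head, unlink the
// releasing request, recompute the group mode over the granted requests, and
// grant pending requests that are now compatible. Illustrative only.
#include <cstdio>
#include <list>
#include <mutex>

enum class Mode { NL, S, X };

bool compatible(Mode a, Mode b) {
    return a == Mode::NL || b == Mode::NL || (a == Mode::S && b == Mode::S);
}

Mode supremum(Mode a, Mode b) {
    if (a == Mode::X || b == Mode::X) return Mode::X;
    if (a == Mode::S || b == Mode::S) return Mode::S;
    return Mode::NL;
}

struct Request { int txn; Mode mode; bool granted; };

struct LockHead {
    std::mutex         latch;              // protects mode and request list
    Mode               group = Mode::NL;   // supremum of all granted modes
    std::list<Request> requests;           // granted requests, then waiters
};

// Cost is linear in the request-list length, which is why hot locks such as
// table locks become more expensive to release as concurrency grows.
void release(LockHead& lock, int txn) {
    std::lock_guard<std::mutex> hold(lock.latch);
    lock.requests.remove_if([&](const Request& r) { return r.txn == txn; });

    lock.group = Mode::NL;
    for (const Request& r : lock.requests)
        if (r.granted) lock.group = supremum(lock.group, r.mode);

    for (Request& r : lock.requests) {
        if (r.granted) continue;
        if (!compatible(lock.group, r.mode)) break;   // FIFO: stop at first conflict
        r.granted = true;                             // a real manager wakes the waiter here
        lock.group = supremum(lock.group, r.mode);
    }
}

int main() {
    LockHead l;
    l.group = Mode::X;
    l.requests = {{1, Mode::X, true}, {2, Mode::S, false}, {3, Mode::S, false}};
    release(l, 1);   // txn 1 releases its X lock; both S waiters are granted
    for (const Request& r : l.requests)
        std::printf("txn %d granted=%d\n", r.txn, r.granted ? 1 : 0);
    return 0;
}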
The combination of longer lists of lock requests with the increased number of threads executing transactions and contending for locks leads to detrimental results. Figure 6.6 shows where the time is spent inside the lock manager of Shore-MT when it runs the TPC-B benchmark [TPC94] as the system utilization increases on the x-axis. The breakdown covers the time it takes to acquire the locks, the time to release them, and the corresponding contention of each operation. When the system is lightly loaded, it spends more than 85% of the time inside the lock manager on useful work. As the load increases, however, the contention dominates. At 100% CPU utilization, more than 85% of the time inside the lock manager is contention (spinning on latches).

Figure 6.6: Breakdown of time spent in the lock manager when baseline Shore-MT runs the TPC-B benchmark. High contention leads to poor performance.

6.3 A Data-ORiented Architecture for OLTP

In this section, we present the design of an OLTP system which employs a thread-to-data assignment policy. We exploit the coordinated access patterns of this assignment policy to eliminate interactions with the contention-prone centralized lock manager. At the same time, we maintain the ACID properties and do not physically partition the data. We call the architecture data-oriented architecture, or DORA.

6.3.1 Design overview

DORA is implemented as a layer on top of a mostly traditional storage manager, as depicted in Figure 6.7. Its functionality includes three basic operations:

• It binds worker threads to disjoint subsets of the database.
• It distributes the work of each transaction across transaction-executing threads according to the data accessed by the transaction.
• It avoids interactions with the centralized lock manager as much as possible during request execution.

Figure 6.7: DORA is implemented as a layer on top of a storage manager. Its three main components are (a) a resource manager, (b) a dispatcher of actions, and (c) a set of worker threads that execute actions.

Next we describe each operation in detail. We use the execution of the Payment transaction of the TPC-C benchmark as our running example. The Payment transaction updates a Customer's balance, reflects the payment on the District and Warehouse sales statistics, and records it in a History log [TPC07].

Binding threads to data

DORA couples worker threads with data by setting a routing rule for each table in the database. A routing rule is a mapping of sets of records, or datasets, to worker threads, called executors. Each dataset is assigned to one executor, and an executor can be assigned multiple datasets from a single table. The only requirement for the routing rule is that each possible record of the table maps to a unique dataset. With the routing rules, each table is logically decomposed into disjoint sets of records. All data resides in the same buffer pool and the rules imply no physical separation or data movement.

A table's routing rule may use any combination of the fields of the table. The columns used by the routing rule are called the routing fields. The columns of the primary or candidate key do not necessarily have to be the routing fields; any column can be. In practice, however, we have seen them work well as routing fields. For example, the primary key of the Customers table of the TPC-C database consists of the Warehouse id (C_W_ID), the District id (C_D_ID), and the Customer id (C_C_ID). The routing fields may be all those fields or any subset of them. In the Payment transaction example, we assume the Warehouse id is the routing field in each of the four accessed tables.

The routing rules are maintained at runtime by the DORA resource manager. Periodically, the resource manager updates the routing rules to balance load. The resource manager varies the number of executors per table depending on the size of the table, the number of requests for that table, and the available hardware resources. In the following chapter, we discuss the load balancing mechanism in more detail (Section 7.6).
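As a toy illustration of what a routing rule looks like, the C++ sketch below maps a single integer routing field (the Warehouse id of the running example) to executors with a simple range rule. The range policy and all names are assumptions made for illustration only; the prototype's rules are set and rebalanced at runtime by the resource manager and need not be range-based.

// Illustrative sketch of a DORA-style routing rule, assuming a single integer
// routing field (e.g. the Warehouse id) whose domain is range-partitioned
// across the executors assigned to a table.
#include <cstdio>

struct RoutingRule {
    int domain_size;     // number of distinct routing-field values (e.g. warehouses)
    int num_executors;   // executors currently assigned to this table

    // Map a routing-field value to the executor owning its dataset.
    // Every possible value maps to exactly one dataset, as required.
    int executor_for(int routing_value) const {
        int per_executor = (domain_size + num_executors - 1) / num_executors;
        return routing_value / per_executor;
    }
};

int main() {
    // 10 warehouses spread over 4 executors: datasets {0-2}, {3-5}, {6-8}, {9}.
    RoutingRule warehouse_rule{10, 4};
    for (int wh = 0; wh < 10; ++wh)
        std::printf("warehouse %d -> executor %d\n", wh, warehouse_rule.executor_for(wh));
    return 0;
}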
Transaction flow graphs

In order to distribute the work of each transaction to the appropriate executors, DORA translates each transaction into a transaction flow graph. A transaction flow graph is a graph of actions on datasets. An action is a subset of a transaction's code which involves access to a single record or a small set of records from the same table. The identifier of an action identifies the set of records this action intends to access. Depending on the type of the access, the identifier can be a set of values for the routing fields or the empty set. Two consecutive actions can be merged if they have the same identifier (refer to the same set).

Figure 6.8: A possible transaction flow graph for TPC-C Payment.

The more specific the identifier of an action is, the easier it is for DORA to route the action to its corresponding executor. That is, actions whose identifier contains values for all the routing fields are directed to their executor by consulting the routing rule of the table. Actions whose identifier is a subset of the routing field set may map to multiple datasets. In that case, the action is broken into a set of smaller actions, each of them resized to correspond to one dataset. Secondary index accesses typically fall into this category. Finally, actions that do not contain any of the routing fields have the empty set as their identifier. For these secondary actions, the system cannot decide which executor is responsible for them. In Section 6.3.2 we discuss how DORA handles secondary actions, while in Section 6.4.5 we evaluate DORA's performance on transactions with secondary actions.

To control the distributed execution of the transaction and to transfer data between actions with data dependencies, DORA uses objects shared across actions of the same transaction. Those shared objects are called rendezvous points, or RVPs. If there is a data dependency between two actions, an RVP is placed between them. The RVPs separate the execution of the transaction into different phases. The system cannot concurrently execute actions from the same transaction that belong to different phases. Each RVP has a counter initially set to the number of actions that need to report to it. Every executor which finishes the execution of an action decrements the corresponding RVP counter by one. When an RVP's counter becomes zero, the next phase starts. The executor which zeroes a particular RVP initiates the next phase by enqueueing all the actions of that phase to their corresponding executors. The executor which zeroes the last RVP in the transaction flow graph calls for the transaction commit. On the other hand, any executor can abort the transaction at any time and hand it to recovery.

A transaction flow graph for the Payment transaction is shown in Figure 6.8. Each Payment transaction probes a Warehouse and a District record and updates them. In each case, both actions (record retrieval and update) have the same identifier and can be merged. The Customer record, on the other hand, is probed through a secondary index 60% of the time and then updated. That secondary index contains the Warehouse id, the District id, and the Customer's last name. If the routing rule on the Customer table uses only the Warehouse id and/or the District id fields, then the system knows which executor is responsible for this secondary index access.
If the routing rule also uses the Customer id field of the primary key, then the secondary index access needs to be broken into smaller actions that cover all the possible values for the Customer id. If the routing rule uses only the Customer id, then the system cannot decide which executor is responsible for the execution, and this secondary index access becomes a secondary action. In our example, we assume that the routing field is the Warehouse id. Hence, the secondary index probe and the consequent record update have the same identifier and can be merged. Finally, an RVP separates the Payment transaction into two phases, because of the data dependency between the record insert on the History table and the other three record probes. (We should note that in [PTB+11] we present a tool that automatically generates the transaction flow graph of any arbitrary transaction, based on its SQL.)

Payment's specification requires the Customer to be randomly selected from a remote Warehouse 15% of the time. In that case, a shared-nothing system that partitions the database on the Warehouse will execute a distributed transaction with all the involved overheads. DORA, on the other hand, handles such transactions gracefully by simply routing the Customer action to a different executor. Hence, its performance is not affected by the percentage of "remote" transactions.

Executing requests

DORA routes all the actions that intend to operate on the same dataset to one executor. The executor is responsible for maintaining isolation and ordering across conflicting actions. In this sub-section, we describe how DORA executes transactions while avoiding centralized locking and maintaining transaction isolation; a detailed example of the execution of a transaction in DORA is given in the following sub-section (Section 6.3.1).

To maintain isolation and ordering across actions, each executor has three data structures associated with it: a queue of incoming actions, a queue of completed actions, and a thread-local lock table. The actions are processed in the order they enter the incoming queue. To detect conflicting actions the executor uses the local lock table. The conflict resolution happens at the action-identifier level. That is, the inputs to the local lock table are action identifiers. The local locks have only two modes, shared and exclusive. Since the action identifiers may cover only a subset of the routing fields, the locking scheme employed is similar to that of key-prefix locks [Gra07a]. Once an action acquires the local lock, it can proceed without centralized concurrency control.

In regular transaction processing under strict two-phase locking, each transaction releases every lock it acquired after the commit (or abort) log record has been flushed to disk. Similarly, in DORA each transaction holds the local locks it acquired (through the actions it enqueued on every executor) until the transaction commits (or aborts) globally. That is, at the terminal RVP, each transaction first waits for a response from the underlying storage manager that the log flush has completed, which means that the commit (or abort) has completed as well. Then, it enqueues all the actions that participated in the transaction to the completion queues of their executors. Each executor removes entries from its local lock table as actions complete, and serially executes any blocked actions which can now proceed.
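The C++ sketch below pulls the pieces just described together for a single executor: an incoming queue of actions, a thread-local lock table keyed by action identifiers with shared/exclusive modes, a list of blocked actions, and an RVP counter that is decremented as actions complete. It is a single-threaded, hypothetical simulation under simplifying assumptions (no storage-manager calls, no commit/abort path, one local-lock holder per identifier); it is not the prototype's implementation, only an illustration of the mechanism.

// Minimal single-executor simulation of the machinery described above:
// incoming queue, thread-local lock table keyed by action identifiers,
// blocked list, and RVP counters. Illustrative assumptions throughout; the
// real executors run on their own threads and call into the storage manager.
#include <atomic>
#include <cstdio>
#include <deque>
#include <map>
#include <string>

enum class LockMode { Shared, Exclusive };

struct RVP {
    std::atomic<int> pending;                // actions still to report
    explicit RVP(int n) : pending(n) {}
};

struct Action {
    std::string key;                         // action identifier (routing-field values)
    LockMode    mode;
    RVP*        rvp;                         // RVP of this action's phase
};

class Executor {
    std::deque<Action>              incoming;
    std::deque<Action>              blocked;       // waiting on a local lock
    std::map<std::string, LockMode> local_locks;   // thread-local lock table

    bool conflicts(const Action& a) const {
        auto it = local_locks.find(a.key);
        if (it == local_locks.end()) return false;
        return it->second == LockMode::Exclusive || a.mode == LockMode::Exclusive;
    }

public:
    void enqueue(const Action& a) { incoming.push_back(a); }

    // Serve actions in FIFO order; conflicting actions are parked until the
    // holding transaction commits and its local locks are released.
    void run_once() {
        while (!incoming.empty()) {
            Action a = incoming.front();
            incoming.pop_front();
            if (conflicts(a)) { blocked.push_back(a); continue; }
            local_locks[a.key] = a.mode;      // acquire the local lock
            // ... perform the record accesses without centralized locking ...
            if (a.rvp->pending.fetch_sub(1) == 1) {
                // Last action of the phase: this executor would now enqueue the
                // next phase's actions, or call for commit at the terminal RVP.
                std::puts("phase complete");
            }
        }
    }

    // Invoked via the completion queue after the transaction commits or
    // aborts globally: drop the local lock and retry blocked actions.
    void release(const std::string& key) {
        local_locks.erase(key);
        incoming.insert(incoming.end(), blocked.begin(), blocked.end());
        blocked.clear();
        run_once();
    }
};

int main() {
    RVP rvp_a(1), rvp_b(1);                   // one single-action phase per transaction
    Executor districts;                       // executor owning a District dataset
    districts.enqueue({"WH1.D3", LockMode::Exclusive, &rvp_a});   // action of txn A
    districts.enqueue({"WH1.D3", LockMode::Exclusive, &rvp_b});   // txn B, conflicts with A
    districts.run_once();                     // A's action runs, B's action blocks
    districts.release("WH1.D3");              // A commits globally; B's action now runs
    return 0;
}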
Each executor implicitly holds an intent exclusive (IX) lock for the whole table, and does not have to interface with the centralized lock manager in order to re-acquire it for every transaction. Transactions that intend to modify large data ranges which span multiple datasets or cover the entire table (e.g. a table scan, or an index or table drop) enqueue an action to every executor operating on that particular table. Once all the actions are granted access, the "multi-partition" transaction can proceed. In transaction processing workloads such operations already hamper concurrency, and therefore occur rarely in scalable applications.

In addition, in order to avoid frequent interaction with the metadata manager, each executor caches the metadata information that it needs for accessing the datasets it has been assigned. Caching the metadata information on the executor thread has a similar effect to passing information via agent threads, which we described in Section 5.3.1.

Detailed transaction execution example

In this sub-section we describe in detail the execution of one TPC-C Payment transaction, our running example, whose transaction flow graph is shown in Figure 6.8. Figure 6.9 shows the execution flow in DORA. Each circle is color-coded to depict the worker thread (executor or dispatcher) which executes that step. In total there are 12 steps for executing this transaction:

Figure 6.9: Execution example of the TPC-C Payment transaction in DORA.

Step 1. The execution of the transaction starts from the thread that receives the request (e.g. from the network). That thread enqueues the actions of the first phase of the transaction to the corresponding executors. As we see from Figure 6.8, the first phase of Payment consists of three actions that are enqueued to the corresponding Warehouse, District, and Customer executors.

Step 2. The executor consumes actions enqueued to its incoming queue in first-come-first-served order. Once an action reaches the head of the queue, it is picked up by the executor.

Step 3. Each executor probes its local lock table to determine whether it can process the action it is currently serving. If there is a logical lock conflict with a previous action, the action is added to a list of blocked actions. Its execution will resume once the transaction whose action blocks this particular action finishes. Otherwise, the executor executes the action without system-wide concurrency control.

Step 4. Once the action is completed (with a set of operations in the underlying storage manager, without system-wide concurrency control), the executor decrements the counter of the RVP of the first phase (RVP1).

Step 5. If it is the last action to report to the RVP, the executor of the action that zeroed the RVP initiates the next phase by enqueueing the corresponding (single) action to the History table executor.

Step 6. The History table executor follows the same routine, picking the action from the head of its incoming queue.

Step 7. The History table executor probes its local lock table.

Step 8. The Payment transaction inserts a record into the History table and, for a reason we explain in Section 6.3.2, the execution of that action needs to interface with the system-wide (centralized) lock manager.

Step 9. Once the action is completed, the History executor updates the terminal RVP and calls for the transaction commit.
Step 10. When the underlying storage engine returns from the system-wide commit (with the log flush and the release of any centralized locks), the History executor enqueues the identifiers of all the actions back to their executors.

Step 11. The executors pick up the committed action identifiers.

Step 12. The executors remove the entries from their local lock tables, and search the list of pending actions for actions which may now proceed.

The detailed execution example, and especially steps 9-12, shows that the commit operation in DORA is similar to the two-phase commit protocol [GR92], in the sense that the thread that calls the commit (the "coordinator" in 2PC) also sends messages to the various executors (the "participants" in 2PC) to release the local locks. The main difference from traditional two-phase commit is that the messaging happens asynchronously and that the participants do not have to vote. Since all the modifications are logged under the same transaction identifier, there is no need for additional messages and log inserts (the separate "Prepare" and "Commit" messages and records of 2PC). That is, the commit is a one-off operation in terms of logging, but it still involves the asynchronous exchange of a message from the coordinator to the participants for the thread-local locking.

This example shows how DORA converts the execution of each transaction into a collective effort of multiple threads. Also, it shows how DORA minimizes the interaction with the contention-prone centralized lock manager, at the expense of additional inter-core communication bandwidth.

6.3.2 Challenges

In this section we describe three challenges in the DORA design. Namely, we describe how DORA handles record inserts and deletes, how it executes secondary actions, and how it avoids deadlocks.

Record inserts and deletes

Record probes and updates in DORA require only the local locking mechanism of each executor. However, there is still a need for centralized coordination across concurrent record inserts and deletions (executed by different executors) for their accesses to specific page slots. That is, it is safe to delete a record without centralized concurrency control with respect to any reads of this record, because all the probes will be executed serially by the executor responsible for that dataset. But there is a problem with record inserts by other executors. The following interleaving of operations by transaction T1, executed by executor E1, and transaction T2, executed by executor E2, can cause a problem: T1 deletes record R1. T2 probes the page where record R1 used to be and finds its slot free. T2 inserts its record. T1 then aborts. The rollback fails because it is unable to reclaim the slot which T2 now uses. This is a physical conflict (T1 and T2 do not intend to access the same data) which row-level locks would normally prevent and which DORA must address.

To avoid this problem, the insert and delete record operations lock the record id (RID), and along with it the accompanying slot, through the centralized lock manager. Although the centralized lock manager can be a source of contention, typically the row-level locks that need to be acquired due to record insertions and deletions are not contended, and they make up only a fraction of the total number of locks a conventional system would acquire. For example, when executing the Payment transaction DORA needs to acquire only one such lock (for inserting the History record), out of the 19 a conventional system normally acquires.
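A compact way to see the rule is the C++ sketch below: probes and updates run under the executor's local locks alone, while deletes (and inserts that reuse a freed slot) additionally take a centralized row-level lock on the RID, so a slot freed by an uncommitted delete cannot be reused. The stub lock manager and all names are illustrative assumptions, not the storage manager's interface.

// Sketch of the insert/delete rule: probes and updates need only the
// executor's local locks, while deletes and inserts also lock the RID (and
// with it the page slot) through the centralized lock manager, so an
// uncommitted delete can always reclaim its slot on rollback.
#include <cstdio>
#include <set>
#include <string>

struct CentralLockManager {                       // stand-in, not the real interface
    std::set<std::string> held;                   // RIDs locked by in-flight transactions
    bool try_lock(const std::string& rid) { return held.insert(rid).second; }
    void unlock(const std::string& rid)   { held.erase(rid); }
};

struct Executor {
    CentralLockManager* lm;

    void update(const std::string& rid) {
        // Probes and in-place updates are serialized by this executor's local
        // locks; no centralized concurrency control is involved.
        std::printf("update %s under local locking only\n", rid.c_str());
    }

    bool remove(const std::string& rid) {
        // Deletes lock the RID centrally so the freed slot stays reserved
        // until the deleting transaction commits or rolls back.
        if (!lm->try_lock(rid)) return false;
        std::printf("delete %s holding a centralized RID lock\n", rid.c_str());
        return true;
    }

    bool insert_into_slot(const std::string& rid) {
        // Inserts reusing a slot must also take the RID lock; if the slot was
        // freed by a still-uncommitted delete, the insert must wait or pick
        // another slot instead of creating a physical conflict.
        if (!lm->try_lock(rid)) return false;
        std::printf("insert into %s holding a centralized RID lock\n", rid.c_str());
        return true;
    }
};

int main() {
    CentralLockManager lm;
    Executor e1{&lm}, e2{&lm};                        // two executors of the same table
    e1.update("CUST.1.42");                           // no centralized lock needed
    e1.remove("CUST.1.7");                            // T1 deletes, slot stays locked
    if (!e2.insert_into_slot("CUST.1.7"))             // T2 cannot reuse the slot yet
        std::puts("slot still reserved by the uncommitted delete");
    lm.unlock("CUST.1.7");                            // T1 commits (or finishes rollback)
    return 0;
}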
This challenge with inserts and deletes arises because DORA employs partitioning only at the logical level, so some physical conflicts are unavoidable. In the next chapter (Chapter 7), we extend DORA's design to the physical layer. One of the benefits of extending the DORA design to the physical layer is that we eliminate the possibility of physical conflicts and do not have to acquire even those few centralized locks.

Secondary actions

The problem with secondary actions (Section 6.3.1) is that the system does not know which executor is responsible for their execution. To resolve this difficulty, the indexes whose accesses cannot be mapped to executors store the RID as well as all the routing fields at each leaf entry. The RVP-executing thread of the previous phase executes those secondary actions and uses the additional information to determine which executor should perform the access of the record in the heap file.

For example, consider a non-clustered secondary index on the last name of a Customers table and a DORA partitioning that does not use any of the fields of that index as the routing fields. In that case, all the accesses to that secondary index are secondary actions. To access a Customer through that index, DORA follows these steps: (a) Any thread that completed the execution of the previous RVP (or the dispatcher thread) probes the secondary index under normal centralized concurrency control; (b) The probing thread retrieves the RIDs and the routing fields of the records that match the probing criteria; (c) The probing thread groups the matched RIDs according to the routing table and enqueues an action of a special type, carrying the list of RIDs to be accessed, to every partition that contains at least one involved RID; (d) Finally, when the executors of each involved partition dequeue such a special action, they consult their local lock table and proceed directly to the heap file to access the selected records.

Under this scheme uncommitted record inserts and updates are properly serialized by the executor, but deletes still pose a risk of violating isolation. Consider the interleaving of operations by transactions T1 and T2 using a primary index Idx1 and a secondary index Idx2 which is accessed by any thread. T1 deletes Rec1 through Idx1. T1 deletes the entry from Idx2. T2 probes Idx2 and returns not-found. T1 rolls back, causing Rec1 to reappear in Idx2. At this point T2 has lost isolation, because it saw the uncommitted (and eventually rolled back) delete performed by T1. To overcome this danger, we can add a 'deleted' flag to the entries of Idx2. When a transaction deletes a record, it does not remove the entry from the index; any transaction which attempts to access the record will go through its owning executor and find that it was, or is being, deleted. Once the deleting transaction commits, it goes back and sets the flag for each index entry of a deleted record, outside of any transaction. Transactions accessing secondary indexes ignore any entries with a deleted flag, and may safely re-insert a new record with the same primary key.
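The C++ sketch below ties steps (a)-(d) together with the deleted-flag handling for a non-aligned secondary index: probe the index under conventional concurrency control, skip flagged entries, group the matching RIDs by their routing-field value, and hand each group to the executor that owns the corresponding dataset. The leaf layout, the routing rule, and every name here are illustrative assumptions, not the prototype's code.

// Sketch of non-aligned secondary index handling: probe conventionally,
// ignore entries flagged as deleted, group matching RIDs per owning executor,
// and enqueue one special action per involved partition.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct LeafEntry {
    std::string last_name;   // secondary key
    long        rid;         // record id in the heap file
    int         warehouse;   // routing field stored in the leaf entry
    bool        deleted;     // set after the deleting transaction commits
};

struct RoutingRule {                        // value of routing field -> executor id
    int executor_for(int warehouse) const { return warehouse % 4; }
};

// Steps (a)+(b): probe the index and collect the RIDs and routing fields of
// matching, non-deleted entries. Step (c): group them per owning executor.
std::map<int, std::vector<long>>
probe_and_group(const std::vector<LeafEntry>& index,
                const std::string& key, const RoutingRule& rule) {
    std::map<int, std::vector<long>> per_executor;
    for (const LeafEntry& e : index)
        if (e.last_name == key && !e.deleted)
            per_executor[rule.executor_for(e.warehouse)].push_back(e.rid);
    return per_executor;
}

int main() {
    std::vector<LeafEntry> idx = {
        {"SMITH", 101, 1, false}, {"SMITH", 102, 6, false},
        {"SMITH", 103, 6, true},  {"JONES", 104, 2, false},
    };
    // Step (d): each involved executor receives one special action holding
    // the list of RIDs it must fetch from the heap file under its local locks.
    for (const auto& [executor, rids] : probe_and_group(idx, "SMITH", RoutingRule{}))
        std::printf("enqueue %zu RID(s) to executor %d\n", rids.size(), executor);
    return 0;
}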
Because deleted secondary index entries will tend to accumulate over time, we can modify the B-Tree's leaf-split algorithm to first garbage collect any deleted records before deciding whether a split is necessary. For growing or update-intensive workloads, this approach will avoid wasting excessive space on deleted records. If updates are very rare, there will be little potential wasted space in the first place.

In Section 6.4.5 we evaluate the performance of DORA with secondary actions. It is expected that the performance of DORA in workloads with very frequent secondary actions will not be optimal. However, DORA partitioning is only logical, and a DBA or the application designer can easily modify the partitioning to reduce the frequency of secondary actions. For example, in the scenario with the secondary index on the last name of Customers, a simple solution would be to add the last-name field to the set of routing fields. Adding an additional field to the routing fields would only increase the number of datasets (partitions), nothing more; at the same time, it would eliminate any secondary actions. In [PTJA11] and [TPJA11] we show that the cost of repartitioning in DORA is very small, while in [PTB+11] we present a tool that monitors the accesses of the database and alerts when it observes a high frequency of secondary actions.

Deadlock detection

DORA transactions can block on local lock tables. Hence, the storage manager must provide an interface for executors to propagate this information to the deadlock detector. DORA proactively reduces the probability of deadlocks. Whenever a thread is about to submit the actions of a transaction phase, it latches the incoming queues of all the executors it plans to submit to, so that the action submission appears to happen atomically. (There is a strict ordering between executors; threads acquire the latches in that order, avoiding deadlocks on the latches of the incoming queues of executors.) This ensures that transactions with the same transaction flow graph will never deadlock with each other if they are on the same phase. That is, two transactions with the same transaction flow graph which are on the same phase would deadlock only if their conflicting requests were processed in reverse order. But that is impossible, because the submission of the actions appears to happen atomically, the executors serve actions in FIFO order, and the local locks are held until the transaction commits. The transaction which enqueues its actions first will finish before the other.

6.3.3 Improving I/O and microarchitectural behavior

We exploit the regularity and predictability of the accesses in DORA only in order (a) to reduce the interaction with the centralized lock manager and hence reduce the number of expensive latch acquisitions and releases, and (b) to improve single-thread performance by replacing the execution of the expensive lock manager code with a much lighter-weight thread-local locking mechanism. But the potential of the DORA execution does not stop there. Potentially, DORA's predictable access patterns can be exploited to improve both the I/O and the microarchitectural behavior of OLTP.

In particular, the I/O executed during conventional OLTP is random and low-performing. (As evidence, the performance of conventional OLTP systems improves significantly with the use of Flash-based storage technologies, which exhibit high random-access bandwidth [LMP+08].) The DORA executors can buffer I/O requests and issue them in batches, since those I/Os are expected to target pages that are physically close to each other, improving the I/O behavior.
Furthermore, the main characteristic of the micro-architectural behavior of conventional OLTP systems is the very large volume of shared read-modify accesses by multiple processing cores [BW04], accesses which, unfortunately, are also highly unpredictable [SWH+04]. Due to these two reasons, emerging hardware technologies such as reactive distributed on-chip caches (e.g. [HFFA09, BW04]) and/or the most advanced hardware prefetchers (e.g. [SWAF09]) fail to significantly improve the performance of conventional OLTP. Since DORA's design is based on the premise that the majority of the accesses to a specific data region come from a specific thread, we expect a friendlier behavior which can realize the full potential of the latest hardware developments by providing more private and predictable memory accesses.

6.3.4 Prototype Implementation

In order to evaluate the DORA design, we implemented a prototype DORA OLTP engine over our baseline system, the Shore-MT storage manager [JPH+09] (see Section 5.2). Shore-MT is a modified version of the SHORE storage manager [CDF+94] with a multi-threaded kernel. SHORE supports all the major features of modern database engines: full transaction isolation, hierarchical locking, a CLOCK buffer pool with replacement and prefetch hints, B-Tree indexes, and ARIES-style logging and recovery [MHL+92]. We use Shore-MT because it has been shown to scale better than any other open-source storage engine.

Our prototype does not have an optimizer which transforms regular transaction code into transaction flow graphs. Thus, all transactions are partially hard-coded. The database metadata and back-end processing are schema-agnostic and general-purpose, but the code is schema-aware. This arrangement is similar to the statically compiled stored procedures that commercial engines support, converting annotated C code into a compiled object that is bound to the database and directly executed. For example, for maximum performance, DB2 allows developers to generate compiled "external routines" in a shared library for the engine to dlopen and execute directly within the engine's core (see http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp).

The prototype is implemented as a layer over Shore-MT. Shore-MT's sources are linked directly to the code of the prototype. Modifications to Shore-MT were minimal. We added an additional parameter to the functions which read or update records, and to the index and table scan iterators; this flag instructs Shore-MT not to use concurrency control. Shore-MT already has a built-in option to access some resources without concurrency control. In the case of record inserts and deletes, another flag instructs Shore-MT to acquire only the row-level lock and to avoid acquiring the whole lock hierarchy.

Even though, according to DORA's design, a single executor (worker thread) can be assigned multiple datasets from various tables, in the prototype we assign only a single dataset (from a single table) to each executor. Thus, if a transaction, like TPC-C Payment, accesses four tables, it is going to be handled by at least four different executor threads.
That impacts the performance of DORA at low core counts, as we will see in Section 6.4.7 where we evaluate the performance of DORA on a machine with limited hardware parallelism.

6.4 Performance Evaluation

For the evaluation we use one of the most parallel multicore machines available and we compare against Shore-MT, which we label as Baseline. Shore-MT's current performance and scalability make it one of the first systems to face the contention problem on commodity chip multicores. As hardware parallelism increases and transaction processing systems solve other scalability problems, they are expected to similarly face the problem of contention in the lock manager. Our evaluation covers several areas:

• We measure how effectively DORA reduces the interaction with the centralized lock manager and what the impact on performance is (Section 6.4.2).
• We quantify how DORA exploits the intra-transaction parallelism of transactions (Section 6.4.3).
• We compare the peak performance Shore-MT and DORA achieve if a perfect admission control mechanism is used (Section 6.4.4).
• We evaluate the performance of DORA on secondary index accesses that can be either aligned with the partitioning scheme or not (Section 6.4.5).
• We evaluate the performance on complicated transactions with joins (Section 6.4.6).
• We compare how the two systems behave on hardware with limited parallelism (Section 6.4.7).
• We compare the anatomy of critical sections for baseline Shore-MT and DORA (Section 6.4.8).

6.4.1 Experimental Setup and Workloads

Hardware. We perform all our experiments on a Sun T5220 "Niagara II" box configured with 32GB of RAM and running Sun Solaris 10. The Niagara II chip [JN07] contains 8 cores, each capable of supporting 8 hardware contexts, for a total of 64 "OS-visible" CPUs. Each core has two execution pipelines, allowing it to simultaneously process instructions from any two threads. Thus, the chip can process up to 16 instructions per machine cycle, using the many available contexts to overlap delays in any one thread.

I/O Subsystem. When running OLTP workloads on the Sun Niagara II machine, both the baseline Shore-MT system and the DORA prototype are capable of high performance. The demand on the I/O subsystem scales with throughput due to dirty page flushes and log writes. For the random I/O generated, hundreds or even thousands of disks may be necessary to meet the demand. (As an example, consider the top results on the TPC-C OLTP benchmark, at http://www.tpc.org/tpcc/; all of them use I/O subsystems worth hundreds of thousands or millions of dollars.) Given our limited budget and our interest in the behavior of the systems when a large number of hardware contexts are utilized, we store the database and the log on an in-memory file system. This setup exercises all the codepaths of the storage manager yet allows us to saturate the CPU. In addition, preliminary experimentation using high-performing Flash drives indicates that the relative behavior remains the same.

Workloads. We use transactions from three OLTP benchmarks: Nokia's Network Database Benchmark or TATP [NWMR09] (formerly known as TM1), TPC-C [TPC07], and TPC-B [TPC94, A+85]. Business intelligence workloads, such as the TPC-H benchmark [TPC06], spend a large fraction of their time on computations outside the storage engine, imposing little pressure on the transaction processing system components, such as the lock manager. Hence, they are not an interesting workload for this study.

The TATP benchmark consists of seven transactions, operating on four tables, implementing various operations executed by mobile networks.
Three of the transactions are read-only while the other four perform updates. The transactions are extremely short, yet exercise all the codepaths in typical transaction processing. Each transaction accesses only 1-4 records, and must execute with low latency even under heavy load. We use a database of 5M subscribers (∼7.5GB).

The TPC-C benchmark models an OLTP database for a retailer. It consists of five transactions that follow customer orders from creation to final delivery and payment. We set the buffer pool to 4GB and use a TPC-C database of scaling factor 150, a database with 150 Warehouses, which occupies around 20GB on disk. 150 Warehouses can support enough concurrent requests to saturate the machine, but the database is still small enough to fit in the in-memory file system.

The TPC-B benchmark models a bank where customers deposit to and withdraw from their accounts. We use a TPC-B database of scaling factor 100, a database with 100 Branches, which occupies 2GB on disk and fits entirely in the buffer pool.

For each run, the driver code spawns a certain number of clients and the clients start submitting transactions. Although the clients run on the same machine as the rest of the system, they add only a small overhead (<3%). We repeat the measurements multiple times, and the measured relative standard deviation is less than 5%. We compile the sources using the highest level of optimization options of Sun's CC v5.10 compiler (see http://developers.sun.com/sunstudio/documentation/ss12u1/mr/READMEs/c++.html). For measurements that needed profiling, we used tools from the Sun Studio 12 suite (see http://download.oracle.com/docs/cd/E19205-01/821-0304/). The profiling tools impose a certain overhead (∼15%) but the relative behavior between the two systems remains the same.

6.4.2 Eliminating Contention in the Lock Manager

First, we examine the impact of contention on the lock manager for the Baseline system and DORA as they utilize an increasing number of hardware resources. The workload for this experiment consists of clients repeatedly submitting GetSubscriberData transactions of the TATP benchmark. The results are shown in Figure 6.2 and Figure 6.3. Figure 6.2 shows the throughput per CPU utilization of the two systems on the y-axis as the CPU utilization increases. Figure 6.3 shows the time breakdown for each of the two systems. We can see that the contention in the lock manager becomes the bottleneck for the Baseline system, growing to more than 85% of the total execution time. In contrast, for DORA the contention on the lock manager is eliminated. We can also observe that the overhead of the DORA mechanism is small: much smaller than the centralized lock manager operations it eliminates, even when those are uncontended.

It is worth mentioning that GetSubscriberData is a read-only transaction. Yet the Baseline system suffers from contention in the lock manager. That is because threads contend even if they want to acquire the same lock in a compatible mode; acquiring any database lock, even in a compatible mode, needs synchronization.

Next, we quantify how effectively DORA reduces the interaction with the centralized lock manager and the impact on performance. We measure the number of locks acquired by the Baseline and DORA. We instrument the code to report the number and the type of the acquired locks.
Figure 6.10 shows the number of locks acquired per 100 transactions when the two systems execute transactions from the TATP and TPC-B benchmarks, as well as TPC-C OrderStatus transactions. The locks are categorized into three types: the record-level (row-level) locks, the locks of the centralized lock manager that are not at the record level (labeled higher-level), and the thread-local locks DORA uses.

Figure 6.10: Absolute number of locks acquired, categorized by type, when Baseline and DORA execute 100 transactions from various workloads.

In typical OLTP workloads the contention for the row-level locks is limited, because there is a very large number of randomly accessed records. But as we go up in the hierarchy of locks, we expect the contention to increase. For example, every transaction needs to acquire intention locks on the tables it is going to access. Figure 6.10 shows that DORA has only minimal interaction with the centralized lock manager.

Figure 6.10 also gives an idea of how those three workloads behave. TATP consists of extremely short-running transactions. For their execution the conventional system acquires as many higher-level locks as row-level locks. In TPC-B, the ratio of row-level to higher-level locks acquired is 2:1. Consequently, we expect the contention on the lock manager of the conventional system to be smaller when it executes the TPC-B benchmark than TATP. The conventional system is expected to scale even better when it executes TPC-C OrderStatus transactions, which have an even larger ratio of row-level to higher-level locks.

Figure 6.11 confirms our expectations. We plot the performance of both systems on the three workloads. The x-axis is the offered CPU load. We calculate the offered CPU load by adding to the measured CPU utilization the time the threads spend in the runnable queue waiting for a processor to run. We see that the Baseline system experiences scalability problems, most pronounced in the case of TATP. DORA, on the other hand, scales its performance as far as the hardware resources allow. When the offered CPU load exceeds 100%, the performance of the conventional system collapses in all three workloads. This happens because the operating system needs to preempt threads, and in some cases it happens to preempt threads that are in the middle of contended critical sections. The performance of DORA, on the other hand, remains high; further proof that DORA reduces the number of contended critical sections.

Figure 6.11: Performance of Baseline and DORA for the TATP and TPC-B benchmarks, as well as for TPC-C OrderStatus transactions, as load increases along the x-axis, with throughput shown on the y-axis. DORA consistently achieves higher throughput, exhibiting almost linear scalability up to the 64 available hardware contexts.

Figure 6.4 shows the detailed time breakdown for the two systems at 100% CPU utilization for the TATP benchmark and the TPC-C OrderStatus transactions.
DORA outperforms the Baseline system in OLTP workloads independently of whether the lock manager of the Baseline system is contended or not.

6.4.3 Intra-transaction Parallelism

DORA exploits intra-transaction parallelism not only as a mechanism for reducing the pressure on the contended centralized lock manager, but also for improving response times when the workload does not saturate the available hardware. Exploiting intra-transaction parallelism is useful in several cases. For example, it can be useful for applications that exhibit limited concurrency due to heavy contention for logical locks, or for organizations that simply do not utilize their available processing power. A transactional application can exhibit limited concurrency either because it is poorly written (e.g. all transactions update the same few records) or because the database is large enough that the system is I/O-bound. If the system is I/O-bound, exploiting intra-transaction parallelism improves performance because the system issues multiple requests in parallel, which can improve I/O bandwidth.

Figure 6.12: Single-transaction response times. DORA exploits the intra-transaction parallelism, inherent in many workloads, to achieve faster responses.

Response times of intra-parallel transactions

In the experiment shown in Figure 6.12 we compare the average response time per request achieved by the Baseline system and DORA, when a single client submits intra-parallel transactions from the three workloads and the log resides in an in-memory file system. DORA exploits the available intra-transaction parallelism of the transactions and achieves lower response times. For example, TPC-C NewOrder transactions are executed 2.1x faster under DORA. In badly designed applications where some records are extremely "hot", DORA's ability to exploit intra-transaction parallelism will immediately provide a significant performance boost.

Intra-transaction parallelism with aborts

One challenge with intra-transaction parallelism for DORA is transactions with non-negligible abort rates. For example, one of the characteristics of the TATP benchmark is that a large fraction of transactions (around 25%) need to abort due to invalid inputs. In such workloads, DORA may end up executing actions from already-aborted transactions, wasting useful cycles, extending the critical path, and eventually performing poorly.

There are two execution strategies DORA can follow for such intra-parallel transactions with high abort rates. The first strategy is to execute such transactions in parallel and to check frequently for aborts. The second is to serialize their execution. That is, even though there is an opportunity to execute actions from such transactions in parallel, DORA can be pessimistic and execute them serially. This strategy ensures that if an action aborts, no work is wasted by the execution of any other parallel action.

Figure 6.13: Performance when executing the UpdateSubscriberData transaction of TATP, a transaction with a high abort rate. In such workloads DORA should be pessimistic and prefer serial transaction flow graphs.
Figure 6.13 compares the throughput of the Baseline system and two variations of DORA, one with a parallel and one with a serial execution strategy, when an increasing number of clients repeatedly submit UpdateSubscriberData transactions from the TATP benchmark. This transaction, whose parallel and serial transaction flow graphs are depicted on the right side of the figure, consists of two independent actions. One action attempts to update a Subscriber and always succeeds. The other action attempts to update a corresponding SpecialFacility entry and succeeds only 62.5% of the time, failing the rest of the time due to wrong input. The parallel execution is labeled DORA-P, while the serial execution, which first attempts to update the SpecialFacility and only if that succeeds tries to update the Subscriber, is labeled DORA-S. As we can see, the parallel plan is a bad choice for this workload. DORA-P achieves lower performance than even the Baseline, whereas DORA-S scales almost linearly, as expected.

The DORA resource manager monitors the abort rates of entire transactions and of individual actions in each executor, and can adapt to them. For example, when the abort rates are high, DORA can switch to serial execution plans. A simple way to convert an intra-parallel execution plan to a serial one is to insert empty rendezvous points between actions of the same phase of the parallel plan. The higher the abort rate of a specific action, the sooner it should be executed in the serial plan.

Figure 6.14: Maximum throughput achieved by Baseline and DORA for various workloads when a perfect admission control mechanism is applied. The number above each bar is the CPU utilization where the peak occurs. The light bars, labeled Baseline-Ideal and DORA-Ideal, show the projected throughput if the machine were fully utilized and the system kept scaling at the same rate as when it achieved its peak throughput.

6.4.4 Maximizing Throughput

Admission control can limit the number of outstanding transactions and, in turn, limit contention within the lock manager of the system. Properly tuned, admission control allows the system to achieve the highest possible throughput, even if it means leaving the machine underutilized. In Figure 6.14 we compare the maximum throughput Baseline and DORA achieve if the systems employ perfect admission control. For each system and workload we report the CPU utilization at which this peak throughput is achieved. DORA achieves higher peak throughput for all the transactions we study, and this peak is achieved closer to the hardware limits. With the light bars, labeled Baseline-Ideal and DORA-Ideal, we plot the ideal projected throughput if the machine were fully utilized and each system kept scaling at the same rate as when it achieved its peak throughput. In most cases the projected ideal performance of the Baseline system is lower than DORA's. This happens because DORA substitutes the complex heavy-weight lock manager with a much lighter-weight one.

For the TPC-C and TPC-B benchmarks, DORA achieves relatively smaller improvements. This happens for two reasons. First, those transactions do not expose the same degree of contention within the lock manager, and leave little room for improvement.
Second, some of the transactions (NewOrder and Payment of TPC-C, and the TPC-B transaction) impose great pressure on the log manager, which becomes the new bottleneck.

6.4.5 Secondary index accesses

Non-clustered secondary indexes are pervasive in transaction processing, since they are the only means of speeding up transactions that access records through non-primary-key columns. For example, consider a database table with Customers and a transaction which accesses those Customers by their last name, which is not the primary key of this table. It is absolutely necessary to have a secondary index on the last names; otherwise, every Customer retrieval by last name needs to scan the entire heap file. As we already discussed in Section 6.3.1 and Section 6.3.2, secondary index accesses pose several challenges to the DORA design. We explore some of them in the next two subsections, where we break the analysis of secondary index accesses into two cases: when the secondary index is aligned with the partitioning scheme and when it is not.

To investigate the impact of non-clustered secondary index accesses, we conduct an experiment where we modify the GetSubscriberData transaction of the TATP benchmark to perform a range scan on the secondary index with the names of the Subscribers, and we control the number of matched records. In the original version of the transaction only one Subscriber is found. In the modified version, we probe for 1, 10, 100 and 1000 Subscribers, even though index scans for thousands of records are not typical in high-throughput transactional workloads.

Partitioning-aligned range index scans

We first consider the case where a secondary index is aligned with the partitioning scheme, that is, the case where the secondary index columns are a subset of the routing columns. In that case, a secondary index scan may return a large number of matched RIDs (record ids of entries that match the selection criteria) from several partitions. All the executors need to send the probed data to an RVP where an aggregation of the partial results takes place. As the range of the index scans becomes larger (or the selectivity drops), more data need to be sent to the RVPs, potentially causing a bottleneck due to excessive data transfers.

Figure 6.15 compares the performance of Baseline and DORA as an increasing number of clients (on the x-axis) repeatedly submit the transaction with the index scan, and the scanned index is aligned with the partitioning scheme. DORA improves performance by 101%, 82%, 78%, and 43% for ranges of 1, 10, 100 and 1000 records respectively. DORA's improvement gets smaller as the range of the index scan increases for two reasons. First, as a larger number of records are accessed per index scan, the ratio of high-level locks to row-level locks decreases. Consequently, the contention for hot locks in the Baseline system decreases and its performance improves. In addition, the data transfers impose a non-negligible overhead on DORA, as a larger number of records need to be sent to the RVP for aggregation.
Figure 6.15: Performance of Baseline and DORA as the range of a (non-clustered) secondary index scan which is aligned with DORA's partitioning increases. As the range of the index scan increases, more than one DORA partition is accessed and more data need to be sent to the RVP.

Still, as long as the scans of partitioning-aligned secondary indexes are selective and touch a relatively small number of records, DORA provides a significant performance improvement. Transactions that touch tens of thousands of records through index scans are not common in scalable transactional applications.

Non partitioning-aligned range scans

Next, we consider the case where a secondary index is not aligned with the partitioning scheme. We already detailed the drawbacks of this case and how DORA handles it in Section 6.3.2.

Figure 6.16: Performance of Baseline and DORA as the range of a (non-clustered) secondary index scan which is not aligned with DORA's partitioning increases. For the non-aligned index accesses DORA needs to do additional work per record accessed.

Figure 6.16 compares the performance of Baseline and DORA as an increasing number of clients (on the x-axis) repeatedly submit the transaction with the index scan, but this time the index is not aligned with the partitioning scheme. In that case, DORA improves performance by 121% and 13% when 1 and 10 records are accessed respectively. On the other hand, when 100 and 1000 records are accessed, DORA is 17% and 31% slower than the Baseline. We note that DORA exhibits intra-transaction parallelism for non-aligned secondary index accesses, since one thread does the secondary index access and another thread does the record access in the heap file. This is evident in all range sizes, as long as the number of concurrent clients is less than around 30.

Non-aligned secondary index accesses impose significant overhead on DORA. On top of the reasons that close the performance gap between DORA and Baseline, which we discussed in the previous subsection, the extra overhead comes from the extra work needed for the record probes. In particular, each record probe is a two-step process: the secondary index probe is done conventionally by one thread, which then requests that the appropriate executor threads retrieve the selected records. Whereas a conventional system would access the records in the heap file directly through their RIDs right after the secondary index probe, in DORA a set of packets is constructed and sent to the appropriate partition-owning threads, which access the records in the heap file. On top of the extra cycles spent there, in DORA we increase the size of the index by appending to each leaf entry the routing fields of each record. Nevertheless, the benefits of DORA are substantial even in such "non-friendly" workloads with secondary actions, as long as they probe for a limited number of records (tens to low hundreds).
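As a rough illustration of this two-step access path, the sketch below shows only the dispatch step; the types and the owner_of callback are illustrative assumptions, not the actual Shore-MT/DORA interfaces. The thread that probed the secondary index reads the routing fields stored in each leaf entry and forwards a record-fetch request, tagged with the rendezvous point, to the executor that owns the corresponding partition.

```cpp
// Hypothetical sketch of DORA's handling of a secondary index that is not
// aligned with the partitioning: one thread probes the index conventionally,
// then ships the matched RIDs to the partition-owning executors.
#include <cstdint>
#include <functional>
#include <vector>

struct Rid { uint32_t page; uint16_t slot; };     // record id in the heap file

struct SecIndexEntry {
    Rid      rid;          // where the record lives
    uint64_t routing_key;  // routing fields appended to every leaf entry
};

struct FetchRecordAction { Rid rid; int rvp_id; };

struct Executor {
    std::vector<FetchRecordAction> inbox;          // stand-in for an input queue
    void enqueue(FetchRecordAction a) { inbox.push_back(a); }
};

// The probing thread packages the matches into actions; each owning executor
// later fetches its records from the heap file and reports to the RVP.
void dispatch_record_fetches(const std::vector<SecIndexEntry>& matches,
                             int rvp_id,
                             const std::function<Executor&(uint64_t)>& owner_of) {
    for (const SecIndexEntry& e : matches)
        owner_of(e.routing_key).enqueue({e.rid, rvp_id});
}
```

A conventional engine would dereference the RIDs directly after the probe; the extra enqueue/dequeue round trip and the routing fields carried in the leaf entries are exactly the overheads measured in this experiment.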
On the other hand, a DBA can always modify DORA's logical partitioning to eliminate any problematic secondary actions, if necessary.

6.4.6 Transactions with joins

With the next experiment we quantify how DORA performs on transactions with joins. Joins are particularly challenging for DORA because they involve records coming from two tables, which in DORA are accessed by at least two different threads. That means that a possibly significant number of records needs to be transferred from one thread to the other, through an RVP. As the amount of data transferred increases, we expect the performance of DORA to decrease.

To quantify the performance on transactions with joins, we slightly modified the StockLevel transaction from the TPC-C benchmark, which contains a join between Orderlines and Stocks. In the default version of the transaction, 200 Orderlines join with an equal number of Stocks. We modify the transaction to control the number of Orderlines joined, from 20 to 20000, even though transactions joining tens of thousands of records are not common in transactional workloads. The join is executed as a nested-loops join. That is, the Orderlines table is index-scanned for matching records, which are sent to the RVP, and then the Stocks table is probed for joining records.

Figure 6.17: Performance comparison between baseline and DORA on a slightly modified version of the TPC-C StockLevel transaction, which is a transaction with a join, where we regulate the number of records joined. As more records are joined, DORA's performance gets lower.

Figure 6.17 plots the performance of DORA normalized to the performance of baseline for an increasing number of records joined. When only 20 tuples are joined, DORA is faster than baseline by about 25%. As an increasing number of tuples are joined, the performance gap closes. But it is only when tens of thousands of records are joined that baseline outperforms DORA. In particular, when 20000 tuples are joined, baseline is faster than DORA by 8%. Thus, DORA is also useful for transactional workloads that contain joins of a modest number of records. We revisit this experiment in the following chapter (Section 7.7.6), where we show even further improvements in transactions with nested-loops joins because the index probes are faster.

6.4.7 Limited hardware parallelism

As we already discussed in Section 1.4, the focus of this dissertation is not parallelism-constrained hardware. In the last part of the performance analysis section, we study the behavior of DORA on machines with limited hardware parallelism. To do that, we run transactions that exhibit intra-transaction parallelism on a machine where we control the number of available processing cores.

Figure 6.18: Performance of Baseline and DORA running intra-parallel transactions on a machine with limited hardware parallelism. DORA does not perform well at very low core counts, due to context switches and preemptions.

Figure 6.18 compares the performance of baseline and DORA when a single client repeatedly submits TPC-C Payment transactions (left) and the TPC-B transaction (right) on a machine where we control the number of available hardware cores, from 1 to 4.
Those two transactions exhibit intra-transaction parallelism, and in DORA several threads participate in their execution. In particular, Payment (whose transaction flow graph is shown in Figure 6.8) consists of four actions, out of which three execute in parallel in the first phase and one in the second; TPC-B also consists of four actions, all of them executing in parallel in one phase. We see that when the number of available cores at least matches the number of parallel actions (and in compliance with what we saw in Section 6.4.3), DORA achieves lower response times, proportional to the intra-transaction parallelism of each transaction. On the other hand, when the number of available cores is lower we observe different behavior. In the uni-processor case (one core available) DORA's performance is lower but comparable with baseline, whereas in the two-core case DORA under-performs.

There are two main reasons for DORA's drop in performance. The first is the additional work that needs to be done per transaction in DORA in order to execute the transaction as a data flow; additional work for each transaction that wastes the scarce processor cycles. For example, for each transaction DORA creates several actions and sends them to the various participating threads, which need to context switch in, dequeue their action, execute it, and context switch out. The second reason is the much higher number of preemptions, which cause convoys [BGMP79] and show up in the form of involuntary context switches.

Figure 6.19: Number of voluntary and involuntary context switches for Baseline and DORA when they run intra-parallel transactions on a machine with limited hardware parallelism. DORA does around an order of magnitude more context switches, which impact performance. Even worse, a fraction of the context switches are involuntary, which indicates preemptions.

Figure 6.19 plots the number of voluntary and involuntary context switches over the same duration of time for baseline and DORA when they run the same workloads as in Figure 6.18. We see that a very large number of context switches take place in DORA, almost an order of magnitude more than in baseline. Even worse, a significant fraction of them are involuntary context switches, which means that we have preemptions. A preemption may happen while a thread is inside a critical section. If the next scheduled thread also wants to execute the same critical section, it will have to wait idle, forming a convoy.

The aforementioned problems are related both to DORA's design and to a limitation of our prototype, where each executor thread is assigned datasets from a single database table. If we were able to assign datasets from multiple tables to a single executor thread, then we could have as many executor threads as the number of available cores, and there would be no need for context switches. But in that case the problem would shift to each executor thread and the scheduling decisions it would have to make. That is, if each executor thread serves requests from multiple queues, it would have to make "intelligent" decisions on which queue to serve first. The bottom line is that DORA under-performs in the case of limited hardware parallelism.
In some cases, such as when only 2 hardware cores are available, things can get bad because of preemptions and unintelligent scheduling decisions. Some of the problems would have been prevented by a more elaborate prototype implementation. In Section 6.5 we summarize DORA's weaknesses.

Figure 6.20: Anatomy of critical sections for the TATP UpdateLocation transaction. The radical redesign of DORA eliminates almost all the unscalable critical sections at the lock manager (and metadata manager) at the very small expense of increased message passing, which is point-to-point communication and belongs to the fixed type of critical sections.

6.4.8 Anatomy of critical sections

To conclude the performance evaluation section, Figure 6.20 compares the breakdown of critical sections for our running example, the TATP UpdateLocation transaction, for baseline (Shore-MT without the optimizations presented in Chapter 5) and DORA. We can see that the result of DORA's drastic change of the transaction execution model is a dramatic change in the anatomy of critical sections. At the expense of a few message passes, which belong to the fixed contention type, DORA eliminates nearly all the critical sections related to the centralized lock manager and metadata. Overall, DORA reduces the number of unscalable critical sections acquired by more than 75%. We also observe that two of the bigger remaining sources of critical sections are the log manager and the page latches. The critical sections of the log manager are taken care of by the composable log buffer inserts mechanism, presented in Section 5.4, while the page latches are addressed by the design presented in the following chapter (Chapter 7).

6.5 Weaknesses

Even though there is a fairly wide design space where data-oriented transaction execution outperforms conventional transaction processing, there is also a set of cases where DORA is not suitable. Next we summarize some of the limitations of DORA.

Applications that put less pressure on the storage manager

First of all, data-oriented execution is designed for high-performance transaction processing that imposes pressure on the internals of the database storage layer. Thus, certain classes of applications may not benefit from it, or may even get penalized. For example, for most of our evaluation we use the specialized TATP and TPC-B benchmarks instead of the more popular TPC-C. The reason is that with TPC-C the baseline system (Shore-MT) does not encounter any of the issues we try to address, and there is less room for improvement. Another example is business intelligence applications with large file scans or joins. In such workloads DORA may penalize performance, since it may require the transfer of large volumes of data between the participating threads (we showed an example of that in Section 6.4.6). It is common practice, however, to employ dedicated database engines (usually column-stores [SAB+ 05, BZN05]) for processing such business intelligence workloads.

Limited hardware parallelism

As Section 6.4.7 showed, DORA under-performs when the hardware parallelism is limited. The main source of problems comes from the data flow and the need for frequent context switches.
This problem is related both to DORA's design and to a limitation of our prototype, where each executor thread is assigned datasets from a single database table. If we were able to assign datasets from multiple tables to a single executor thread, then we could have as many executor threads as the number of available cores, and there would be no need for context switches. But in that case the problem would shift to each executor thread and the scheduling decisions it would have to make. That is, if each executor thread serves requests from multiple queues, it would have to make "intelligent" decisions on which queue to serve first.

From the graphs presented in Section 6.4.7 we observe that DORA under-performs when the available hardware parallelism is smaller than the number of tables touched in the workload. But the number of tables used in transactional workloads increases at a much lower rate than the rate at which multicore parallelism increases (there is no indication that Moore's Law will slow down in the near future, and the roadmaps of all the major processor lines predict increases in the number of cores per chip). For example, the TPC-E benchmark, the most recent transactional benchmark from TPC, which was introduced in early 2007, uses 33 tables [TPC10]. The TPC-C benchmark, TPC-E's predecessor and the de facto OLTP benchmark that has been in use since 1992, uses 9 tables [TPC07]. That is a 3.6x increase in the number of tables over a period of 15 years, not even close to the rate at which hardware parallelism increases. Therefore, if not now, then in the near future there will hardly be any hardware platform that does not have as much hardware parallelism as the number of tables used in a transactional workload. Thus, we predict that this limitation of DORA will not raise concerns in the future.

Non partitioning-aligned index accesses

DORA partitions each table using range-based partitioning on the keys of a specific subset of the columns of the table. The DBA, however, may have decided to build indexes (usually non-clustered secondary indexes) that do not contain the routing fields, the columns that DORA uses for the partitioning. We analyzed this case in Section 6.3.2 and evaluated how DORA behaves in this case in Section 6.4.5. As Figure 6.16 showed, such non-partitioning-aligned secondary indexes can be burdensome for DORA. To tackle this problem we take both proactive and reactive measures. As a proactive measure, we demonstrated a tool that helps the application developer and the DBA avoid very frequent such index accesses. This tool analyzes the workload and suggests a partitioning scheme (routing fields for each table) that tries to minimize the frequency of secondary actions [PTB+ 11]. As a reactive measure, the resource manager monitors the performance of the system and warns the DBA of sudden increases in the frequency of non-partitioning-aligned index accesses and drops in performance. The DBA can react by modifying the partitioning to eliminate any problematic secondary actions. Since DORA's partitioning is only logical, runtime modifications of the partitioning are possible and lightweight, in contrast with shared-nothing systems where repartitioning is expensive, since it involves the physical movement of data from one database instance to another [CJZM10, PJZ11].
Producing transaction flow graphs

As Section 6.3.1 described, the DORA runtime does not accept transactions in the form of a sequence of SQL queries. Instead, the transactions need to be analyzed and divided into smaller actions based on the data accessed in different parts of the transaction. These actions are represented as a directed graph (transaction flow graph) to capture the transaction flow and the dependencies among the actions. This representation also helps us exploit intra-transaction parallelism for the independent actions. However, it introduces the initial cost of identifying these actions. Analyzing transactions at runtime, even though possible, would increase the response time of the system, which would make the system design less appealing. On the other hand, an OLTP application usually has a limited number of transactions that execute at runtime and are heavily optimized; ad-hoc transactions are not frequent. In [PTB+ 11] we demonstrated a tool for DORA application developers. This tool automatically forms transaction flow graphs given a transaction's SQL statement.

6.6 Related Work

DORA improves the scalability of transaction processing systems by employing a thread-to-data assignment of work policy. The improvement mostly comes from converting the unscalable communication in the centralized lock manager to message passing and decentralized thread-local lock management. Locking overhead is a known problem even for single-threaded systems. Harizopoulos et al. [HAMS08] analyze the behavior of the single-threaded SHORE storage manager [CDF+ 94] running two transactions from the TPC-C benchmark. When executing the Payment transaction, the system spends 25% of its time on code related to logical locking, while with the NewOrder transaction it spends 16%. We corroborate those results and reveal the lurking problem of latch contention that makes the lock manager the system bottleneck when increasing the hardware parallelism.

Rdb/VMS [Jos91] is a parallel database system design optimized for the inter-node communication bottleneck. In order to reduce the cost of nodes exchanging lock requests over the network, Rdb/VMS keeps a logical lock at the node which last used it, until that node returns it to the owning node or a request from another node arrives. Cache Fusion [LSC+ 01], used by Oracle RAC, is designed to allow shared-disk clusters to combine their buffer pools and reduce accesses to the shared disk. Like DORA, Cache Fusion does not physically partition the data but distributes the logical locks. However, neither Rdb/VMS nor Cache Fusion handles the problem of contention. A large number of threads may access the same resource at the same time, leading to poor scalability. DORA ensures that the majority of resources are accessed by a single thread.

A conventional system could potentially achieve DORA's functionality if each transaction-executing thread held an exclusive lock on a region of records. The exclusive lock would be associated with the thread, rather than any transaction, and would be held across multiple transactions. Locks on separator keys [Gra07a] could be used to implement such behavior. Our work on speculative lock inheritance (SLI) [JPA09] (and Section 5.3.1) detects "hot" locks at run-time, and those locks may be held by the transaction-executing threads across transactions. SLI, similar to DORA, reduces the contention on the lock manager. However, it does not reduce the other overheads inside the lock manager.
Reducing lock contention with data-oriented execution has also been studied for data-stream operators [DAAEA09], by making threads delegate the work on some data to the thread that already holds the lock for that data and move on to the next operation in their queues.

Advancements in virtual machine technology [BDGR97] enable the deployment of shared-nothing systems on multicores. In shared-nothing configurations, the database is physically distributed and there is replication of both instructions and data. For transactions that span multiple partitions, a distributed consensus protocol needs to be applied. H-Store [SMA+ 07] takes the shared-nothing approach to the extreme by deploying a set of single-threaded engines that serially execute requests, avoiding concurrency control, while Jones et al. [JAM10] study a "speculative" locking scheme for H-Store for workloads with few multi-partition transactions. The complexity of coordinating distributed transactions [Hel07, DHJ+ 07] and the imbalances caused by skewed data or requests are significant problems for shared-nothing systems. DORA, by being shared-everything, is less sensitive to such problems and can adapt to load changes more readily.

Staged database systems [HSA05] share similarities with DORA. A staged system splits queries into multiple requests which may proceed in parallel. The splitting is operator-centric and designed for pipeline parallelism. Pipeline parallelism, however, has little to offer to typical OLTP workloads. On the other hand, similar to staged systems, DORA exposes work-sharing opportunities by sending related requests to the same queue.

DORA uses intra-transaction parallelism to reduce contention. Intra-transaction parallelism has been a topic of research for more than two decades (e.g. [GMS87, SLSV95]). Colohan et al. [CASM05] use thread-level speculation to execute transactions in parallel. They show the potential of intra-transaction parallelism, achieving up to 75% lower response times than a conventional system. Thread-level speculation, however, is a hardware-based technique not available in today's hardware. DORA also achieves lower response times by exploiting intra-transaction parallelism, but its mechanism requires only fast inter-core communication, which is already available in multicore hardware.

Finally, optimistic concurrency control schemes [KR81, BG83] may improve concurrency by resolving conflicts lazily at commit time instead of eagerly blocking them at the moment of a potential conflict. When conflicts are rare, this allows the system to avoid the overhead of enforcing database locks. On the other hand, if conflicts occur frequently, the performance of the system drops rapidly, since the transaction abort rate is high. There is a great body of work that compares the concurrency control schemes in database systems. Notable is the work by Agrawal et al. [ACL87], while the book of Bernstein et al. [BHG87] and Thomasian's survey [Tho98] are good starting points for the interested reader. The focus of DORA, on the other hand, is on the contention for accessing the locks rather than on the concurrency control scheme used. For example, in recent work, [LBD+ 12] presents a new flavor of lightweight multiversioning concurrency control for main-memory databases. This system applies lessons learned from our study on data-oriented execution and uses a decentralized lock manager.
6.7 Conclusion

The thread-to-transaction assignment of work of conventional transaction processing systems fails to realize the full potential of multicores. The resulting contention within the transaction processing system becomes a burden on scalability (usually expressed as a bottleneck in the lock manager). This chapter shows the potential of thread-to-data assignment to eliminate this bottleneck and improve both performance and scalability. As multicore hardware continues to stress scalability within the storage manager and as DORA matures, the gap with conventional systems will only continue to widen.

Chapter 7

Page Latch-free and Dynamically Balanced Shared-everything OLTP

Developments in transaction processing technology, such as those presented in the previous chapters, remove locking and logging from being scalability bottlenecks in transaction processing systems, leaving page latching as the next potential problem. To tackle the page latching problem, we design a system around physiological partitioning (PLP). PLP employs the data-oriented transaction execution model, maintaining the desired properties of shared-everything designs. In addition, it introduces a multi-rooted B+Tree index structure (MRBTree) that enables partitioning of the accesses at the physical page level. That is, logical partitioning (inherited from data-oriented execution), along with MRBTrees, ensures that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free. We extend the design to make heap page accesses thread-private as well. The elimination of page latching allows us to simplify key code paths in the system, such as B+Tree operations, leading to code that is more efficient and easier to maintain. The combination of data-oriented execution and MRBTrees also offers an infrastructure for quickly detecting load imbalances and easily repartitioning to adapt to load changes. We present one such lightweight dynamic load balancing mechanism (DLB) for PLP systems. Profiling of a prototype PLP system shows that it acquires 85% and 68% fewer contentious critical sections per transaction than an optimized conventional design and one based on logical-only partitioning, respectively. As a result, the PLP prototype improves performance by up to 50% and 25% over the existing systems on two multicore machines, while the dynamic load balancing mechanism provides the system with rapid and robust behavior in both detecting and handling load imbalances. This chapter is based on the work presented in VLDB 2011 [PTJA11] and in [TPJA11].

7.1 Introduction

Due to concerns over power draw and heat dissipation, processor vendors have stopped improving processors' performance by clocking them at higher operating frequencies or by using complicated micro-architectural techniques. Instead, they try to improve the overall chip performance by fitting as many independent processing cores as they can within a single chip's area. The resulting multicore designs shift the pressure to the software side for converting Moore's Law into performance. The software must provide enough execution parallelism to exploit the abundant and rapidly growing hardware parallelism. However, this is not an easy task, especially when there is a lot of resource sharing between the parallel threads of the application, because the accesses to those shared resources need to be coordinated.
On-line transaction processing (OLTP) is an important and particularly complex application with excessive resource sharing, which needs to perform efficiently in modern computing environments. It has been shown that conventional shared-everything OLTP systems may face significant scalability problems on highly parallel hardware [JPH+ 09]. There is increasing evidence that one source of scalability problems arises from the transaction-oriented policy of assigning work to threads in conventional systems [PJHA10]. According to this policy, a worker thread is assigned the execution of a transaction; the transaction, along with the physical distribution of records within the data pages, determines what resources (e.g. records and pages) each thread will access. The random nature of transaction processing requests leads to unpredictable data accesses [SWH+ 04, PJHA10] that complicate resource sharing and concurrency control. Such conventional systems are therefore pessimistic and clutter the transaction's execution path with many lock and latch acquisitions to protect the consistency of the data. These critical sections often lead to contention which limits scalability [JPH+ 09], and in the best case they impose significant overhead on single-thread performance [HAMS08]. In addition, the performance of shared-everything systems is sensitive to the application design due to the possibility of page false sharing effects, where hot but unrelated records happen to reside on the same page. Careful tuning and expensive DBAs are often needed to detect and resolve such issues, for example by padding the hot records to spread them out to different data pages.

Following a different approach, shared-nothing systems deploy a set of independent database instances which collectively serve the workload [Sto86, DGS+ 90]. In shared-nothing designs the contention for shared data resources can be explicitly tuned (the database administrator determines the number of processors assigned to each instance), potentially leading to superior performance. The H-Store [SMA+ 07] and HyPer [KN11] systems take this approach to the extreme, instantiating only a single software thread per database instance and eliminating critical sections altogether. However, shared-nothing systems physically partition the data and deliver poor performance when the workload triggers distributed transactions [Hel07, CJZM10] or when skew causes load imbalance [CJZM10]. Further, repartitioning to rebalance load requires the system to physically move and reorganize all affected data. These weaknesses become especially problematic as partitions become smaller and more numerous in response to the multicore hardware trend.

7.1.1 Multi-rooted B+Trees

To alleviate the difficulties imposed by page latching and repartitioning, we propose a new physical access method, a type of multi-rooted B+Tree called MRBTree. The root of each sub-tree in this structure corresponds to a logical partition of the data, and the mapping of key ranges to sub-tree roots forms a durable part of the index's metadata. Partition sizes are non-uniform, making the tree robust against skewed access patterns, and repartitioning is cheap because it involves very little data movement.
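The key-range mapping at the heart of the MRBTree can be pictured with the following simplified C++ sketch; the types and methods are assumptions for illustration, not the actual MRBTree code. It shows why routing a key is a single ordered-map lookup and why splitting a partition touches only the mapping, not the bulk of the data.

```cpp
// Simplified sketch of the MRBTree partition table: an ordered map from the
// lower bound of each key range to the root page of the sub-tree covering it.
#include <cstdint>
#include <map>

using Key    = uint64_t;   // routing key, simplified to an integer
using PageId = uint32_t;   // root page of a sub-tree

class PartitionTable {
public:
    // Assumes the first partition starts at the minimum possible key.
    void add_partition(Key lower_bound, PageId root) { ranges_[lower_bound] = root; }

    // Route a key to the sub-tree that owns it: one ordered-map lookup.
    PageId root_for(Key k) const {
        auto it = ranges_.upper_bound(k);   // first range starting after k
        --it;                               // hence the range containing k
        return it->second;
    }

    // Splitting a partition at new_bound installs one new entry; the pages
    // of the existing sub-trees stay where they are.
    void split_partition(Key new_bound, PageId new_subtree_root) {
        ranges_[new_bound] = new_subtree_root;
    }

private:
    std::map<Key, PageId> ranges_;  // cached in memory; also stored durably
};
```

A real repartitioning would additionally quiesce the affected worker threads and update the durable routing page, but the mapping change itself stays this small.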
When deployed in a conventional shared-everything system, the MRBTree has the immediate benefit of eliminating latch contention at the tree root, with requesting threads distributed over many partitions (partitions sized so as to equalize traffic), and effectively reducing the height of the tree by one level. Thanks to the tree's fast repartitioning capabilities, the system can respond quickly to changing access patterns. Further, the MRBTree can also potentially benefit systems which use shared-nothing parallelism in a shared-memory environment (e.g. possibly H-Store [SMA+ 07] and HyPer [KN11]).

7.1.2 Dynamically-balanced physiological partitioning

To address the problems of conventional execution while avoiding the weaknesses of shared-nothing approaches, data-oriented execution, presented in the previous chapter, employs logical-only partitioning. Logical-only partitioning assigns each partition to one thread; the latter manages the data locally without the overheads of centralized locking. However, purely logical partitioning does not prevent conflicts due to false sharing, nor does it address the overhead and complexity of page latching protocols and the contention page latching imposes. Ideally, we would like a system with the best properties of both shared-everything and shared-nothing designs: a centralized data store that sidesteps the challenges of moving data during (re)partitioning, and a partitioning scheme that eliminates contention and the need for page latches.

This chapter presents physiological partitioning (PLP), a transaction processing approach that logically partitions the physical data accesses. Briefly, PLP employs data-oriented transaction execution on top of MRBTrees. Under PLP, a partition manager assigns threads to sub-tree roots of MRBTrees and ensures that requests distributed to each thread reference only the corresponding sub-tree. As a result, threads can bypass the partition mapping and their accesses to the sub-tree are entirely latch-free. In addition, PLP can extend the partitioning down into the heap pages where non-clustered records are actually stored, eliminating another class of page latching (similar to shared-nothing systems).

The combination of data-oriented execution (where all the data remain in a single database instance) and MRBTrees enabled the implementation of a lightweight yet effective dynamic load balancing and repartitioning mechanism (called DLB) on top of PLP. Data-oriented execution obviates the need for distributed transactions when repartitioning takes place, while MRBTrees enable fast repartitioning within a single database instance. DLB monitors the request queues of the partitions and employs a simple data structure, called an aging two-level histogram, to collect information about the current access patterns and load of a workload, and uses it to dynamically guide partition maintenance decisions.

7.1.3 Contributions and organization

This chapter introduces the physiological partitioning (PLP) design. The structure and contributions of the remainder of this chapter are as follows:

• Section 7.2 categorizes the communication patterns within a transaction processing system. Analyzing the communication patterns of a software system clearly highlights the latent scalability bottlenecks. Using this categorization we identify page latching as a lurking performance and scalability bottleneck in modern transaction processing systems, whose effect is proportional to the available hardware parallelism.
• Section 7.3 discusses the pros and cons of various deployment strategies and partitioning schemes for efficient transaction processing within a single node: shared-everything vs. physical partitioning (or shared-nothing) vs. logical partitioning (like the one applied by DORA), and concludes that we need a solution that combines the pros of the three.

• Section 7.4 shows that the need for page latching during accesses to both index and heap pages can be eliminated within a shared-everything OLTP system by deploying a design based on physiological partitioning. Physiological partitioning (PLP) extends the idea of data-oriented transaction execution, logically partitioning the physical accesses as well.

• Section 7.5 makes the case for the need of dynamic load balancing capabilities in partitioning-based systems, to protect them against sudden load changes and imbalances. It also analyzes the repartitioning cost for PLP and shows that this cost is much lower than the cost of repartitioning in physically-partitioned (shared-nothing) systems.

• Section 7.6 presents a lightweight yet effective dynamic load balancing mechanism (DLB) for PLP, which is enabled by PLP's low repartitioning cost.

• Section 7.7 presents a thorough evaluation of a prototype implementation of PLP integrated with DLB. PLP acquires 85% and 68% fewer contentious critical sections per transaction than an optimized conventional design and a vanilla data-oriented system that applies logical-only partitioning, respectively. PLP improves scalability and yields up to almost 50% higher performance on multicore machines. In the meantime, the overhead of DLB is minimal during regular processing (at most 8% in the worst case), and it achieves low response times in both detecting and balancing imbalances.

• Finally, Section 7.8 presents related work, and Section 7.9 concludes by promoting PLP as a very promising OLTP system design in the light of the upcoming hardware trends.

7.2 Communication patterns in OLTP

In Section 4.2 we made the important observation that not all forms of communication in a software system pose the same threat to its scalability. We concluded that the only way that leads to scalability lies in converting all unscalable communication to either the fixed or composable type, thus removing the potential for bottlenecks to arise.

Figure 7.1: Breakdown of the critical sections when running the TATP UpdLocation transaction. The PLP variants enter on average almost an order of magnitude fewer unscalable critical sections than state-of-the-art conventional systems.

The two left-most bars of Figure 7.1 compare the number and types of critical sections executed by the conventional systems presented in Chapter 5: baseline Shore-MT ([JPH+ 09] and Section 5.2, labeled "Conventional") and a Shore-MT variant with speculative lock inheritance ([JPA09] and Section 5.3.1) as well as consolidated log buffer inserts ([JPS+ 10] and Section 5.4). The second bar, labeled "SLI & Aether", is essentially a highly optimized conventional system.

Figure 7.2: Page latch breakdown for three popular OLTP benchmarks. The majority of page latches reside in index structures.
The third bar from the left is from a data-oriented transaction execution prototype ([PJHA10] and Chapter 6) built on top of Shore-MT. We label this bar Logical because, by definition, data-oriented execution applies logical-only partitioning of the accesses. Each bar shows the number of critical sections entered during the execution of the TATP UpdateLocation transaction, categorized by the storage manager service that triggers them. Locking and latching form a significant fraction of the total communication in the baseline system. SLI achieves a performance boost by sidestepping the most problematic critical sections associated with the lock manager, but fails to address the remaining (still unscalable) communication in that category. (With SLI we can also cache and transfer across transactions information that is not related to locking, such as metadata; that is why the metadata component of SLI is much lower than in the Baseline.) The data-oriented system with its logical partitioning, in contrast, eliminates nearly all types of locking, replacing both the contention and the overhead of centralized communication with efficient, fixed communication via message-passing queues. Once locking is removed, latching remains by far the largest source of critical sections. There is no predefined limit to the number of threads which might attempt to access a given page simultaneously, so page latching represents an unscalable form of communication which should be either eliminated or converted to a scalable type. The remaining categories represent either fixed communication (e.g. transaction management), composable operations (e.g. logging), or a minor fraction of the total unscalable component.

Examining page latching more closely, Figure 7.2 decomposes the page latches acquired by both the conventional systems and the logical-only partitioning prototype during the execution of three popular OLTP benchmarks (TATP, TPC-B and TPC-C). We categorize the database pages into different types: metadata, index pages, and heap pages. The majority of the page latches, between 60% and 80%, are due to index structures. Heap page latches are another non-negligible component, accounting for nearly all remaining page latches.

7.3 Shared-everything vs. physical vs. logical partitioning

With the preceding characterization of communication patterns in mind, we now return to the question of which configuration for transaction processing is more appropriate for deployment within a single node. Does the traditional shared-everything approach remain optimal, or should system designers explore other approaches, such as physical partitioning (or shared-nothing) and logical partitioning?

Figure 7.3: Comparison of logical and physiological partitioning schemes (bottom) with shared-everything and shared-nothing designs (top).

Shared-everything. The previous chapter (Chapter 6) outlines the scalability problems faced by a conventional shared-everything transaction processing system: assigning transactions to threads means that any worker thread might access any data item at any time, requiring careful concurrency control at both the logical and physical levels of the system. This is illustrated by the "shared-everything" case of Figure 7.3 (top left).

Physical partitioning (shared-nothing). Another possibility is to partition the data physically to reduce contention.
This is the so-called "shared-nothing" approach. As illustrated in Figure 7.3 (top right), shared-nothing approaches physically separate the partitions into a set of independent database instances which collectively serve the workload [Sto86, DGS+ 90]. Shared-nothing deployments are an appealing design even within a single node, because the designer has explicit control over the number of threads and processing cores that participate in each instance and can thus control, or even eliminate, the contention on each component of the system. For example, the H-Store system assigns a single worker thread to each partition [SMA+ 07]. This arrangement naturally produces a thread-to-data assignment of work and achieves the desirable elimination of any contention-prone critical sections. However, such designs give up too much by eliminating all communication within the engine. That is, the physical separation produces at least three undesirable side effects:

• Even the composable and fixed types of critical sections, which do not threaten scalability, become problematic. For example, database logging is not amenable to distribution [JPS+ 10, JPS+ 11], and physically-partitioned systems either use a shared log [LARS92] or eliminate it completely [SMA+ 07].

• Perhaps the biggest challenge of physical partitioning is that transactions which access data in more than one partition must coordinate using some distributed protocol, such as two-phase commit [GR92]. The scalable execution of distributed transactions has been an active field of research for the past three decades, with researchers from both academia and industry persuasively arguing that they are fundamentally not scalable [Bre00, Hel07].

• Furthermore, the performance of shared-nothing systems is very sensitive to imbalances in load arising from skew in either data or requests, where some partitions see high load and others see very little [CJZM10], while non-partition-aligned operations (such as non-clustered secondary indexes) may pose significant barriers to physical partitioning. For example, in the figure, the rightmost partition of the shared-nothing example is accessed by three transactions while its neighbor is accessed by only one. Unfortunately, frequent repartitioning is prohibitively expensive under the shared-nothing discipline because the data involved must be physically migrated to a different location.

Logical partitioning. To achieve the benefits of physical partitioning without the costs that usually accompany it, in the previous chapter we observed that partitioning is effective because it forces single-threaded access to each item in the database. However, physical partitioning is not necessary: any scheme which arranges for a thread-to-data policy, or applies "logical partitioning", should achieve the same regularity and reduced reliance on centralized mechanisms. As illustrated in Figure 7.3 (bottom left), a data-oriented transaction execution system logically partitions the data among worker threads, breaking transactions into smaller actions which access only one logical partition (similar to how shared-nothing systems need to distribute data accesses among physical partitions). As its name suggests, logical partitioning eliminates most unscalable communication at the logical level, namely database locking. However, it has little impact on the remaining communication, which arises in the physical layers of the system and cannot be managed cleanly from the application level. As a result, threads must acquire page latches and potentially perform other unscalable communication, even though there is no communication between requests at the application level.

7.4 Physiological partitioning

We have seen how both logically- and physically-partitioned designs offer desirable properties, but also suffer from weaknesses which threaten their scalability. In this chapter we therefore propose a hybrid of the two approaches, physiological partitioning (or PLP), which combines the best properties of both: like a physically-partitioned system, the majority of physical data accesses occur in a single-threaded environment, which renders page latching unnecessary; like a logically-partitioned system, locking is distributed without resorting to distributed transactions, and load balancing is inexpensive because almost no data movement is necessary.

7.4.1 Design overview

Transactions in a typical OLTP workload access a very small subset of records. The system relies on index structures because sequential scans are prohibitively expensive by comparison. PLP therefore centers around the indexing structures of the database. The left-most graph of Figure 7.4 gives a high-level overview of a physiologically-partitioned system. We adapt the traditional B+Tree [BM70] for PLP by splitting it into multiple sub-trees, each of which covers a contiguous subset of the key space. A partitioning table becomes the new root and maintains the partitioning as well as pointers to the corresponding sub-trees. We call the resulting structure a multi-rooted B+Tree (MRBTree). The MRBTree partitions the data but, unlike a horizontally-partitioned workload (e.g. top right of Figure 7.3), all sub-trees belong to the same database file and can exchange pages easily; the partitioning, though durable, is dynamic and malleable rather than static.

Figure 7.4: Different variations of physiological partitioning (PLP). PLP-Regular logically partitions the index page accesses, while PLP-Partition and PLP-Leaf logically partition the heap page accesses as well.

With the MRBTree in place, the system assigns a single worker thread to each sub-tree, guaranteeing it exclusive access for latch-free execution. A partition manager layer controls all partition tables and makes assignments to worker threads. The worker threads in PLP do not reference partition tables during normal processing, which might otherwise become a bottleneck; instead, the partition manager ensures that all work given to a worker thread involves only data that this thread owns. The transactions are broken down into a directed graph of potentially parallel partition accesses, which are passed to worker threads that collectively assemble a complete transaction. Since in PLP each table is assigned a different set of worker threads, whenever a transaction touches more than one table it becomes a multi-site transaction, according to the terminology of [JAM10]. However, multi-site transactions are not as expensive as in a shared-nothing system, because PLP has a shared-everything setting.

All indexes in the system (primary, secondary, clustered, non-clustered) can be implemented as MRBTrees; data are stored directly in clustered indexes, or in tightly integrated heap file pages referenced by record ID (RID). Secondary (non-clustered) indexes that can align to the partitioning scheme (i.e. contain the fields that are used for the partitioning decision) are managed by the worker thread of the corresponding partition.
On the other hand, secondary indexes that cannot align to the partitioning scheme are accessed as in the conventional system, but each leaf entry contains the associated fields used for the partitioning, so that the result of each probe can be passed to its partition's owning thread for further processing, as we discussed in Section 6.3.2.

7.4.2 Multi-rooted B+Tree

The "root" of an MRBTree is a partition table that identifies the disjoint subsets of the key range assigned to each sub-tree, as well as a pointer to the root of each sub-tree. Because the routing information is cached in memory as a ranges map by the partition manager, the on-disk layout favors simplicity rather than optimal access performance. We therefore employ a standard slotted page format to store key-root pairs. If the partitioning information cannot fit on a single page (for example, if the number of partitions is large or the keys are very long), the routing page is extended as a linked list of routing pages. In our experiments we have never encountered the need to extend the routing page, however, as several dozen mappings fit easily in 8KB, even assuming rather large keys.

Record insertion (deletion) takes place as in regular B+Trees. When the key to insert (delete) is given, the ranges map routes it to the sub-tree that corresponds to the key range the key belongs to, and the insert (delete) operation is performed in that sub-tree as in a regular B+Tree. The other sub-trees, the ranges map, and the routing page are not affected by the insert (delete) operation at all.

When deployed in a conventional shared-everything system, the MRBTree eliminates latch contention at the index root; fewer threads attempt to grab the latch for the same index root at a time. Partitioning also reduces the expected tree height by at least one level, which reduces the index probe time.

7.4.3 Heap page accesses

In PLP a heap file scan is distributed to the partition-owning threads and performed in parallel. Large heap file scans reduce the concurrency of OLTP applications, and PLP has little to offer there. Still, heap page management opens up an additional design option, since we can extend the partitioning of the accesses to the heap pages. That is, when records reside in a heap file rather than in the MRBTree leaf pages, PLP can ensure that accesses to heap pages are partitioned in the same way as index pages. There are three options for how to place and access records in the heap pages, leading to three variations of PLP, depicted in Figure 7.4:

• Keep the existing heap page design, called PLP-Regular.

• Each heap page keeps records of only one logical partition, called PLP-Partition.

• Each heap page is pointed to by only one leaf page of the primary MRBTree, called PLP-Leaf.

PLP-Regular simply keeps the existing heap page operations. Without any modification, the heap pages still need to be latched, because they can be accessed by different threads in parallel. But heap page accesses are not the biggest fraction of the total page accesses in OLTP. According to Figure 7.2, where we categorized the types of pages that are latched in a conventional OLTP system, the heap pages can be as low as only 30% of the pages that are latched. Thus, there is room for significant improvement even if we ignore them. However, allowing heap pages to span partitions prevents the system from responding automatically to false sharing or other sources of heap page contention.
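To connect the pieces of Sections 7.4.1 and 7.4.2, the following sketch shows what the per-partition execution might look like; the queue and action types are hypothetical, not the prototype's code. Because the partition manager enqueues only work that falls within the thread's own sub-tree, the loop body can touch the sub-tree's index pages (and, under PLP-Leaf, its heap pages) without acquiring any latches.

```cpp
// Hypothetical sketch of a PLP worker thread serving one partition.
#include <deque>
#include <functional>

struct Action {
    std::function<void()> run_latch_free;  // touches only this partition's pages
    std::function<void()> notify_rvp;      // report completion to the RVP
};

struct Partition {
    std::deque<Action> input_queue;  // filled only by the partition manager
                                     // (a real queue would be synchronized)
    bool shutting_down = false;
};

void worker_loop(Partition& p) {
    while (!p.shutting_down) {
        if (p.input_queue.empty()) continue;      // real code would block/park
        Action a = std::move(p.input_queue.front());
        p.input_queue.pop_front();
        a.run_latch_free();   // index probe / record access, no page latches
        a.notify_rvp();       // the RVP may enqueue the next phase's actions
    }
}
```

Multi-site transactions simply place actions in several such queues and meet at a rendezvous point, without a distributed commit protocol, since all partitions live in the same shared-everything engine.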
In PLP-Partition and PLP-Leaf the MRBTree and heap operations are modified so that heap page accesses are partitioned as well. The difference between the two is that in the former a heap page can be pointed by many leaf pages as long as they belong to the same partition, while in the latter a heap page is pointed by only one leaf page. Even though those two variations provide latch-free heap page accesses, they also have some disadvantages. Forcing a heap page to contain records that belong to a specific partition results in some fragmentation. In the worst case, each leaf has room for one more entry than fits in the heap page, resulting in nearly double the space requirement (Section 7.7.8 measures this cost). Further, in PLP-Leaf every leaf split must also split one or more heap pages, increasing the overhead of record insertion (deletions are simple because a leaf may point to many heap pages). On the other hand, in PLP-Partition allowing multiple leaf pages from a partition to share a heap page forces the system to reorganize potentially significant numbers of heap pages with every repartitioning, which goes against the philosophy of physiological partitioning. We therefore opt for the PLP-Partition option and favor the PLP-Leaf. The two extensions impose one additional piece of complexity for record inserts. When a typical system inserts a record to a table it follows a straightforward procedure. It first inserts the record in a free slot of a random heap page and then updates any index to that table, by inserting a corresponding entry for the new record along with its record ID, that identifies the heap page and slot. In contrast, when PLP-Partition and PLP-Leaf insert a new record they must first identify the leaf page in the MRBTree that will eventually point to the record, and then find an free slot in an existing heap page (or allocate a new one) according to the partitioning strategy. Because the storage management layer is completely unaware of the partitioning strategy (by design), it must make callbacks into the upper layers of the system to identify an appropriate heap page for each insertion. Similarly, a partition split in PLP-Partition and PLP-Leaf may split heap pages as well, invalidating the record IDs of migrated records. The storage manager, therefore, exposes another callback so the metadata management layer can update indexes and other structures 7.4. PHYSIOLOGICAL PARTITIONING 151 that reference the stale RIDs. We note that when PLP-Leaf splits leaf pages during record insertion, the same kinds of record relocations arise and use the same callbacks. 7.4.4 Page cleaning Page cleaning cannot be performed naively in PLP. Conventionally there is a set of page cleaning threads in the system that are triggered when the system needs to clean dirty pages (for example, when it needs to truncate log entries). Those threads may access arbitrary pages in the buffer pool, which breaks the invariant of PLP where a single thread can access a page at each point of time. To handle the problem of page cleaning in PLP each worker thread does the page cleaning for its logical partition. Each logical partition has an additional input queue which is for system requests, and the page cleaning requests go to that queue. The system queue has higher priority than the queue of completed actions. Their execution won’t be delayed by more than the execution time of one action (typically very short). 
In addition, because page cleaning is a read-only operation, the worker thread can continue to work (and even re-dirty pages) during the write-back I/O.
7.4.5 Benefits of physiological partitioning
Under physiological partitioning, each partition is permanently locked for exclusive physical access by a single thread, which then handles all the requests for that partition. This allows the system to avoid several sources of overhead, as described in the following paragraphs.
Latching contention and overhead. Though page latching is inexpensive compared with acquiring a database lock, the sheer number of page latches acquired imposes some overhead and can serialize B+Tree operations as transactions crab down the tree during a probe. The problem becomes more acute when the lower levels of the tree do not fit in memory, because a thread which fetches a tree node from disk holds a latch on the node's parent until the I/O completes, preventing access to 80-100 other siblings which may well be memory-resident. Section 7.7.3 evaluates a case where latching becomes expensive for B+Tree operations and how PLP can eliminate this problem by allowing latch-free accesses on index pages.
False sharing of heap pages. One significant source of latch contention arises when multiple threads access unrelated records which happen to reside on the same physical database page. In a conventional system false sharing requires padding to force problematic database records to different pages. A PLP design that allows latch-free heap page accesses achieves the same effect automatically (without the need for expensive tuning) as it splits hot pages across multiple partitions. Section 7.7.3 evaluates this case as well.
Serialization of structural modification operations (SMOs). The traditional ARIES/KVL indexes [Moh90] allow only one structural modification operation (SMO), such as a leaf split, to occur at a time, serializing all other accesses until the SMO completes. Partitioning the tree physically with MRBTrees eases the problem by distributing SMOs across sub-trees (whose roots are fixed) without having to apply more complicated protocols, such as those described in [ML92, JSSS06]. The benefits of parallel SMOs are apparent in the case of insert-heavy workloads, which we evaluate in Section 7.7.5.
Repartitioning and load monitoring costs. In PLP, repartitioning can occur at a higher level in the partition manager and therefore can be latch-free as well; the partition manager simply quiesces affected threads until the process completes. Moreover, MRBTrees require very few pointer updates and little data movement in order to map to an updated partitioning assignment. In addition, the partition manager can easily determine if there are any load imbalances in the system, by simply inspecting the incoming request queues of each partition. The combination of those characteristics enables the implementation of a robust and lightweight dynamic load balancing mechanism, which we further discuss in Section 7.5 and Section 7.6 and evaluate in Section 7.7.9.
Code complexity. Finally, with all latching eliminated, the code paths that handle contention and failure cases can be eliminated as well, simplifying the code significantly, to the extent that the index could be substituted with a much simpler implementation.
For example, a huge source of complexity in traditional B+Trees arises due to the sophisticated protocols that maintain consistency during an SMO in spite of concurrent probes from other threads. The simpler code is not only more efficient but also easier to maintain. While building the prototype used in the evaluation section, we did not attempt the code refactoring required to exploit these opportunities, and the performance results we report are therefore conservative. We note that index probes are the most expensive remaining component of PLP. Therefore, we expect significant performance improvements if we substitute the B+Tree implementation of our prototype with, for example, a cache-conscious [RR99, RR00] or prefetching-based B+Tree [CGMV02].
7.5 Need and cost of dynamic repartitioning
Although partitioning is an increasingly popular solution for scaling up the performance of database management systems, it is not a panacea, since there are many challenges associated with it. One of these challenges is the system behavior under skewed and dynamically changing workloads, which are the rule rather than the exception in real settings – consider, for example, the Slashdot effect [Adl05]. This section shows how even mild access skew can severely hurt performance in a statically partitioned database, rendering partitioning useless in many realistic workloads (Section 7.5.1). Then it shows that PLP provides an adequate infrastructure for dynamic repartitioning, mainly because it is based on data-oriented execution and because of its use of MRBTrees (Section 7.5.2). The low repartitioning cost facilitates the implementation of a robust yet lightweight dynamic load balancing mechanism for PLP, which is presented in the following section (Section 7.6).
7.5.1 Static partitioning and skew
This dissertation argues that in order to scale up the performance of transaction processing we need to reduce the level of unscalable communication within the system by employing a form of partitioning, like data-oriented execution or physiological partitioning. In general, one of the disadvantages of partitioning-based transaction processing designs is that those systems are vulnerable to skewed and dynamically changing workloads, in contrast with shared-everything systems, which do not employ any form of partitioning and tend to suffer less. Unfortunately, skewed and dynamically changing workloads are the rule rather than the exception in transaction processing. Thus, it is imperative for partitioning-based designs to alleviate the problem of skewed and dynamically changing accesses.
To show how vulnerable partitioning-based systems are to skew, Figure 7.5 plots the throughput of a non-partitioned (shared-everything) system and a statically partitioned system when all the clients in a TATP database submit the GetSubscriberData read-only transaction [NWMR09]. Initially the distribution of requests is uniform across the entire database. But at time point 10 (sec) the load distribution changes, with 50% of the requests being sent to 30% of the database (see Section 7.7.1 for experimental setup details). As we can see from the graph, initially and as long as the distribution of requests is uniform the performance of the non-partitioned system is around 15% lower than the partitioned one.
After the load change the performance of the non-partitioned system remains pretty much the same (at around 325Ktps), while the performance of the partitioned system drops sharply, by around 35% from its initial 375Ktps.

Figure 7.5: Throughput of a statically partitioned system when the load changes at runtime. Initially the requests are distributed uniformly; at time t=10, 50% of the requests are sent to 30% of the database.

The drop in performance is severe even though the skew is not that extreme; easily a higher fraction of the requests could go to a smaller portion of the database, for example following the 80-20 rule of thumb where 80% of the accesses go to only 20% of the database.
There are two ways to attack the problem of skewed access in partitioning-based transaction processing systems: proactively, by configuring the system with an appropriate initial partitioning scheme; and reactively, by using a dynamic load balancing mechanism. Starting with the appropriate partitioning configuration is key. If the workload characteristics are known a priori, previously proposed techniques [RZML02, CJZM10] can be used to create effective initial configurations. If the workload characteristics are not known, then simpler approaches like round-robin, hash-based, and range-based partitioning can be used [DGS+90]. As time progresses, however, skewed access patterns gradually lead to load imbalance and lower performance, as the initial partitioning configuration eventually becomes useless no matter how carefully it was chosen. Thus, it is far more important and challenging to dynamically balance the load through repartitioning based on the observed, and ever changing, access patterns. A robust dynamic load balancing mechanism should eliminate any bad choices made during initial assignment.
7.5.2 Repartitioning cost
As the previous subsection argued, a dynamic load balancing mechanism would be useless if the cost of repartitioning in a partitioning-based transaction processing system is high. The lower the cost of repartitioning, the more frequently the system can trigger load balancing procedures and the faster it will react to load changes. This subsection models the cost of repartitioning for a physically-partitioned (shared-nothing) system and the three PLP variations to highlight the clear advantage of PLP-Regular and PLP-Leaf. It also describes the way to perform repartitioning for the three PLP designs.
The basic case of repartitioning whose cost we need to calculate is when a partition needs to split into two. Thus, for all the PLP variations and the physically-partitioned case our repartitioning cost model calculates the number of records and index entries that have to be moved, the number of update/insert/delete operations on the indexes, the number of pointer updates on the index pages and the routing page, and the remaining number of read operations that have to be performed when a partition is split into two. We also discuss merging two partitions but do not give as detailed a cost model.
Let's assume that there is a heap file (table) with an index on it, which in the case of PLP is an MRBTree. When a partition needs to be split into two, that means that a sub-tree in the index needs to be split in two as well.
In that case we define: h as the height of the tree; n as the number of entries in an internal B+Tree node; m_i as the number of entries to be moved from the B+Tree at level i; and M as the number of records in the heap file that have to be moved. The number of read operations during a key value search in the B+Tree is omitted since it is the same for all the systems (a binary search at each level from root to leaf).
7.5.3 Splitting non-clustered indexes
The first case we consider is when the heap file that needs to be re-partitioned has a unique non-clustered primary index and a secondary index, and the data are partitioned based on the primary index key values.
PLP-Regular. The cost of repartitioning in PLP-Regular is very low. Only a few index entries need to move from one sub-tree of the MRBTree index(es) to another newly created sub-tree. Algorithm 1 shows the procedure that needs to be executed to split an MRBTree sub-tree. First, we need to find the leaf page where the starting key of the new partition should reside (Lines 4–8 in Algorithm 1). Let's assume that there are m_1 entries that are greater than or equal to the starting key on the leaf page where the slot for this key is found. All that needs to be done is to move these m_1 entries on that leaf page to a newly created (MRBTree) index node page, and this procedure has to be repeated as the tree is traversed from this leaf page to the root (Lines 9–13 in Algorithm 1). It is not necessary to move any entry from the pages that keep the key values greater than the ones in the leaf page containing the starting key. Setting the previous/next pointers of the pages at the boundaries of the old and new partitions is sufficient. Finally, a new entry should be added to the routing page for the new partition. The overall cost is given in the first row of Table 7.1.

Algorithm 1 Splitting an MRBTree sub-tree.
1: {The binary-search routine used below performs binary search to find the key on the page. If an exact match for the key is found, found is returned as true and the function returns the slot for the key on the page. Otherwise, found is false and the function returns the slot on the page where the key should reside.}
2: page = root
3: found = false
4: while page != NULL and !found do
5:   slot = binary-search(page, key, found)
6:   slots.push(slot)
7:   pages.push(page)
8:   page = page[slot].child
9: while pages.size > 0 do
10:   slot = slots.pop()
11:   page = pages.pop()
12:   Create page_new
13:   Move entries starting from slot at page to page_new

Table 7.1: Repartitioning costs for splitting a partition into two
System PLP-Regular PLP-Leaf PLP-Partition Physically-partitioned #Records Moved (M) #Entries Moved h mk k=1 h m1 k=1 mk h−2 h−l−1 h m1 + (n × (mh−l − 1)) k=1 mk m1 + l=0 h−2 (nh−l−1 × (mh−l − 1)) l=0 PLP (Clustered) Shared Nothing (Clustered) m1 + h−2 l=0 m1 (nh−l−1 × (mh−l − 1)) h k=2 - Primary Index Secondary Index #Reads #Pages Read #Pointer Updates Changes Changes 2×h+1 M 1 2×h+1 M updates M updates M M mk 1+ 1+ M −m1 n M −m1 n 2×h+1 - - - 2×h+1 - - - M updates M updates M inserts M deletes - M inserts M deletes M updates M inserts M deletes M inserts M deletes

The cost model in Table 7.1 describes the worst case scenario for PLP-Regular.
If the starting key of the new partition is in one of the internal index node pages, there is no need to move any entries from the pages that are below this page because the moved entries from the internal node page already have pointers to their corresponding child pages; resulting in fewer reads, updates, and moved entries. PLP-Leaf. The partition splitting cost related with the MRBTree index structure is the same as in PLP-Regular. But, as mentioned in Section 7.4.3, in addition to modifying the index structure, when repartitioning in PLP-Leaf, we also have to move records from the 7.5. NEED AND COST OF DYNAMIC REPARTITIONING (a) (b) 157 (c) Figure 7.6: Example of splitting a partition in PLP-Leaf, which is a three-step process. In the worst case, m1 heap pages need to be touched. Algorithm 2 Splitting heap pages in PLP-Leaf and PLP-Partition. 1: leaf = leftmost leaf node 2: Create pagenew 3: while leaf ! = N U LL {Omit for PLP-Leaf } do 4: for all t pointed by leafcurrent do 5: if pagenew does not have space then 6: Create pagenew 7: Move t to pagenew 8: Update pointers at all the secondary indexes 9: leaf = leaf.next {Omit for PLP-Leaf } heap file to new heap pages. Figure 7.6 shows the three-step process for splitting a partition to two in PLP-Leaf. The height of the sub-tree is 2 and the dark slot in Figure 7.6 (a) indicates the slot which contains the leaf entry with the starting key of the new partition. Figure 7.6 (b) shows that a new sub-tree is created as a result of the split. Those two steps are the same with the repartitioning process in PLP-Regular. In PLP-Leaf, however, we also have to move the records at the heap file that belong to the new partition to a new set of heap data pages. Algorithm 2 shows the pseudo code for updating the heap pages upon a partition split in PLP-Leaf (and PLP-Partition). The dark records on the heap pages in Figure 7.6 (b) indicate those records that belong to the new partition (sub-tree) and need to move. Those records are pointed by the m1 leaf page entries that moved to the newly created sub-tree. Thus, in the worst case m1 records have to move (Lines 4-7 in Algorithm 2). Since the index is non-clustered, we have to scan these m1 entries in order to get the RIDs of the records to be moved and spot their heap pages. The result of the split after the records are moved is shown in Figure 7.6 (c). Whenever a record 158 CHAPTER 7. PHYSIOLOGICAL PARTITIONING Figure 7.7: Example of splitting a partition in PLP-Partition. To identify which records from the heap pages need to move to the new partition, the system needs to scan all the heap pages of the old partition, increasing significantly the cost of repartitioning. moves its RID changes. Thus, once all the records are moved, all the indexes (primary and secondary) need to update their entries (Line 8 in Algorithm 2). The cost for repartitioning in PLP-Leaf is given in the second row of Table 7.1. This cost, again, illustrates the worst case scenario. If the starting key of the new partition is found in one of the internal nodes, then no record movement has to be done since there will be no leaf page splits and the constraint of having all heap pages pointed by only one leaf page is already preserved. Moreover, even if the key is found on the leaf page, we might not have to move all the records that are specified by the model above. If all the records on a heap page are pointed only by leaf entries of the new partition, then these records can stay on that heap page. PLP-Partition. 
In PLP-Partition, the process for splitting the index structure is the same as in PLP-Regular and PLP-Leaf. Therefore, it is omitted from Figure 7.7, which shows the rest of the process for splitting a partition into two in PLP-Partition. In the worst case, in PLP-Partition we may have to move records from all the heap pages that belong to the old partition. Those records are indicated with the dark rectangles in the heap pages of Figure 7.7 (a). The number of records be to moved is equal to the number of entries that are on the leaf pages of the new sub-tree. As in PLP-Leaf, the RIDs of the records are retrieved with an index scan of the newly created sub-tree, the records are moved to new heap pages and they get new RIDs, and all the indexes are updated with the new RIDs after the record movement is completed (shown in Lines 3-9 in Algorithm 2). The result of the partitioning is shown in Figure 7.7 (b); while the cost model for PLP-Partition is given in the third row of Table 7.1. Physically-partitioned (shared-nothing). In a physically-partitioned or shared-nothing system, the cost for the record movement is equal to the worst case of PLP-Partition. Be- 7.5. NEED AND COST OF DYNAMIC REPARTITIONING 159 cause, the entire old partition needs to be scanned for records that belong to the new partition. But, in addition to that, the cost of index maintenance may be prohibitively expensive. That is, in a physically-partitioned system each record move across partitions results to a deletion of an index entry (or entries if there are multiple indexes) from the old partition and an insertion of an index entry to the new partition. In contrast with the PLP variant where every record move is a result of a few MRBTree entries updates. The cost of index maintenance when repartitioning physically-partitioned systems sometimes can be prohibitive. In order to avoid the index maintenance, a common technique is to drop and bulkload the index from scratch upon every repartition. For physically-partitioned systems which employ replication, like H-Store [SMA+ 07], this procedure has to be repeated for all the partition replicas. The repartitioning cost for one replica in a physically-partitioned system is given as in the fourth row of Table 7.1. Given how expensive repartitioning can be, physically-partitioned systems are reluctant in frequently triggering repartitioning. 7.5.4 Splitting clustered indexes Let’s consider the case where we have a unique clustered primary index and a secondary index, and the data partitioning is done using the primary index key columns. In this setup, no heap file exists, since the primary index contains the actual data records rather than RIDs, and the three PLP variations are equivalent, because their differences lie on how they treat the records in the heap pages. When the actual records are part of the clustered primary index, the cost of record movement for PLP equals with the number of leaf page entries that need to move. While the cost of the primary index maintenance equals with the entry movements in the internal nodes of the MRBTree index. The cost model is given in the fifth row of Table 7.1. On the other hand, the repartitioning cost for the physically-partitioned system is similar to the non-clustered case. Because there is not a common index structure and data need to move from the index of the one partition to the other. 
The only difference is that there is no need to scan the leaf page entries to get the RIDs of the records to be moved, since the leaf pages contain the actual records. Therefore, the repartitioning cost model for a replica is given in the last row of Table 7.1.
7.5.5 Moving fewer records
With some additional information we can actually move less data during repartitioning, at the cost of an increased number of reads. For example, in PLP-Partition, instead of directly moving all the records that belong to a new partition, we can scan all the index leaf pages that are going to be split and collect information for all the records. With this information, we can determine whether a heap page has more records that belong to the old partition or the new partition and act accordingly. That is, if a heap page has more records that belong to the new partition, we can move out of the page the records that belong to the old partition.

Table 7.2: Repartitioning costs when splitting a partition with 466 MB data in half (U: Updates, D: Deletes, I: Inserts).
System PLP-Regular PLP-Leaf PLP-Partition Physically-partitioned PLP (Clustered) Physically-partitioned (Clustered) Records Moved 8.3KB 233MB 233MB 8.3KB 233MB Primary Index Secondary Entries #Pages #Pointer Changes Index Moved Read Updates Changes 8KB 7 8KB 1 7 85 U 85 U 8KB 14365 7 2.44M U 2.44M U 14365 2.44M I + 2.44M D 2.44M I + 2.44M D 5.3KB 7 85 U - - - 2.44M I + 2.44M D 2.44M I + 2.44M D

The number of reads when scanning the leaf pages can easily become a bottleneck in disk-resident databases, due to the number of I/O operations that have to be performed. On the other hand, in in-memory databases or systems that use flash storage devices, the I/O bottleneck can be prevented [Che09] and the above-mentioned technique can reduce the amount of data movement during repartitioning. This technique, unfortunately, cannot be used in a physically-partitioned system because the pages of the two partitions do not share the same storage space.
7.5.6 Example of repartitioning cost
Table 7.2 gives an example of the repartitioning cost for the different systems under consideration, based on the cost model given in Table 7.1. In this example, a partition which contains 466MB of 100-byte data records in a heap file is split in half. We assume that there is a primary index of height 3 with 170 32-byte entries on each page. The first four rows of the table assume there is a unique non-clustered primary index and a secondary index in the system, whereas for the last two rows there is a unique clustered primary index and a secondary index. The cost for the physically-partitioned system is just for one replica (if we assume that it uses replication for durability). For the PLP variations the number of moved records represents the worst case scenario.
As Table 7.2 shows, the PLP variations, except for PLP-Partition, move very few records compared to the physically-partitioned one. In the worst case, PLP-Partition moves the same number of records as the physically-partitioned system. For the clustered index case, PLP is cheaper to repartition than the physically-partitioned system, both in terms of record movement and index maintenance. When we calculate the corresponding costs for a larger heap file with an index of height 4, the repartitioning cost for the physically-partitioned system (and PLP-Partition) becomes prohibitive.
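To make the worst-case formulas of the cost model concrete, the following small C++ sketch computes the number of index entries and heap records moved for a split under the PLP variations, given the per-level entry counts m_i. The helper names and the concrete values in main() are illustrative assumptions, not numbers taken from the prototype or from Table 7.2.

```cpp
// Worst-case repartitioning cost for a split, following Section 7.5.3.
// Inputs: h = tree height, n = entries per internal node, m[i] = entries
// moved at level i (m[1] is the leaf level). All values are illustrative.
#include <cstdint>
#include <cstdio>
#include <vector>

struct SplitCost {
    uint64_t entries_moved;  // MRBTree entries moved to the new sub-tree
    uint64_t records_moved;  // heap records moved (the M of the cost model)
};

// PLP-Regular and PLP-Leaf move sum_{k=1..h} m_k index entries;
// PLP-Regular moves no records, PLP-Leaf moves the m_1 leaf-level records.
SplitCost plp_leaf_cost(const std::vector<uint64_t>& m) {
    SplitCost c{0, m[1]};
    for (size_t k = 1; k < m.size(); ++k) c.entries_moved += m[k];
    return c;
}

// PLP-Partition / physically-partitioned worst case:
// M = m_1 + sum_{l=0}^{h-2} n^(h-l-1) * (m_{h-l} - 1).
SplitCost plp_partition_cost(const std::vector<uint64_t>& m, uint64_t n, int h) {
    SplitCost c = plp_leaf_cost(m);
    uint64_t records = m[1];
    for (int l = 0; l <= h - 2; ++l) {
        uint64_t fanout = 1;
        for (int p = 0; p < h - l - 1; ++p) fanout *= n;  // n^(h-l-1)
        records += fanout * (m[h - l] > 0 ? m[h - l] - 1 : 0);
    }
    c.records_moved = records;
    return c;
}

int main() {
    int h = 3;
    uint64_t n = 170;                           // entries per internal node (assumed)
    std::vector<uint64_t> m = {0, 100, 50, 2};  // m[1..3]: assumed per-level counts
    SplitCost leaf = plp_leaf_cost(m);
    SplitCost part = plp_partition_cost(m, n, h);
    std::printf("PLP-Leaf: %llu entries, %llu records moved\n",
                (unsigned long long)leaf.entries_moved,
                (unsigned long long)leaf.records_moved);
    std::printf("PLP-Partition (worst case): %llu records moved\n",
                (unsigned long long)part.records_moved);
}
```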
7.5.7 Cost of merging two partitions Another cost related to repartitioning is the cost of merging two partitions to one. For any PLP variation, the merge operation only requires index reorganization and no data movement, again in contrast with the physically-partitioned design. During the index reorganization in PLP, there are three cases to consider depending on the height of the two sub-trees to be merged, which we present in the next paragraphs. After the merging of the two sub-trees is completed, the partitioning table of the MRBTree is updated accordingly. One entry of the two that correspond to the two partitions in the partitioning map is removed while the other is updated with the new key range and possibly a new root page id. Merging two sub-trees that have the same height. When the two sub-trees to be merged have the same height, the entries of Th ’s root are appended at the end of the entries of Tl ’s root. Since the entries of the root page have information about the pointers to the internal nodes, copying the entries of the root page is sufficient for this merge operation. In this case the cost of the merge operation only depends on the number of entries in the root page of Th . If the number of entries destined to the new root exceeds the page capacity, a new root page is created the same way a page split happens after a record insert (through a structure modification operation – SMO). Merging a sub-tree with lower key values (Tl ) which is taller than the other subtree. When Tl is taller than Th , Tl is traversed down to one level higher than the height of Th . Then an entry is inserted at the right-most node of this level that points to Th and has the key value equal to the starting key of the key range of Th . Therefore, the cost of the merge operation is only a tree traversal, which depends on the height difference between the two trees and an insert operation. Merging a sub-tree with higher key values (Th ) which is taller than the other sub-tree. When Th is taller, the merge operation is very similar to the second case and the cost is the same. Th is traversed down to one level higher than the height of Tl and instead of the right-most node, the left-most node gets the entry that points to Tl and has the key value equal to the starting key of the key range of Tl . CHAPTER 7. PHYSIOLOGICAL PARTITIONING 162 Overall the cost of merging two partitions in PLP is quite low. On the other hand, a physically-partitioned system has to copy all the records from one partition to the other and insert the corresponding index entries at the resulting partition. Therefore, in a physicallypartitioned system the cost of the merge operation is proportional to the number of records in a partition and its way higher than the merge cost for any PLP variation. We conclude that, in contrast with physically-partitioned systems, the PLP-Regular and PLP-Leaf designs provide low repartitioning costs which allow frequent repartitioning attempts and facilitate the implementation of responsive and lightweight dynamic load balancing mechanisms. We present one such mechanism in the next section. 7.6 A dynamic load balancing mechanism for PLP At the high level, any dynamic load balancing mechanism performs the same functionality. During normal execution it has to observe the access patterns and detect any skew that causes load imbalance among the partitions. Once the mechanism detects the troublesome imbalance, it triggers a repartition procedure. 
It is very important for the detection mechanism to incur minimal overhead during normal operation and to not trigger repartitioning when it is not really needed. After the mechanism decides to proceed with a repartition, it needs to determine a new partitioning configuration, so that the load is again uniformly distributed. This decision depends on various parameters, such as the recent load of each partition and the available hardware parallelism. Finally, after the new configuration has been determined, the system has to perform the actual repartitioning. The repartitioning should be done in a way that minimizes the drop in performance and the duration of the process. Thus, any dynamic load balancing mechanism that we build on top of PLP (or any partitioning-based system in general) should:
• Perform lightweight monitoring.
• Make robust decisions on the new partition configuration.
• Repartition efficiently, when such a decision is made.
We have already shown in Section 7.5 that PLP provides the infrastructure for efficient repartitioning. In this section, we present techniques for lightweight monitoring and decision making. The overall mechanism is called DLB.

Figure 7.8: A two-level histogram for MRBTrees and the aging algorithm

7.6.1 Monitoring
DLB needs to monitor some indicators of the system behavior and, based on the collected information, decide (a) when to trigger a repartition operation and (b) what the new partitioning configuration should be. Candidate indicators are the overall throughput of the system, the frequency of accesses in each partition, and the amount of work each partition should do. DLB needs to continuously collect information on multiple indicators. For example, let's consider that DLB monitors only the overall throughput of the system and raises flags when changes in throughput are larger than a threshold value. If the initial partitioning configuration of the system was not optimal (for example, with load imbalance among partitions), then its throughput would be low but stable; there would be no fluctuation for a throughput-only monitor to catch, and the monitoring would fail. Or there could be uniform drops or increases in the incoming request traffic, which would trigger unnecessary repartitioning. Thus, DLB needs to maintain additional information about the load of each partition. In addition, the information about the throughput is not useful for the component that decides on the new configuration (presented in Section 7.6.2). Thus, DLB needs to collect and maintain information about the load not only across partitions, but also within each partition.
To that end, DLB uses the length of the request queue of each partition and a two-level histogram structure that employs aging. The histogram structure is depicted on the left side of Figure 7.8. To monitor the differences in the load across partitions, DLB monitors the number of requests waiting at each partition's request queue. To have accurate information about the load distribution within each partition, in addition to the one bucket it maintains for each partition (left side of the figure), the histogram has sub-buckets on ranges within each partition's key range (shown on the right side of the figure). The number of sub-buckets within each partition is tunable and determines the monitoring granularity. DLB frequently checks whether the partition loads are balanced or not.
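The following C++ sketch shows one possible shape for this monitoring state, assuming the two-level layout of Figure 7.8 (one bucket per partition, a tunable number of sub-buckets, and an array of age-buckets per sub-bucket). The type names, the number of age-buckets, and the exact trigger rule are illustrative assumptions rather than the prototype's implementation.

```cpp
// Sketch of DLB's monitoring state: queue lengths per partition plus a
// two-level aging histogram. Names and constants are illustrative.
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int kAgeBuckets = 8;  // assumed number of age-buckets per sub-bucket

struct SubBucket {
    uint64_t age[kAgeBuckets] = {0};  // access counts, one per age interval
};

struct PartitionStats {
    uint64_t queue_length = 0;   // pending requests in the partition's queue
    std::vector<SubBucket> sub;  // sub-buckets over the partition's key range
    explicit PartitionStats(int sub_buckets) : sub(sub_buckets) {}
};

struct DlbMonitor {
    std::vector<PartitionStats> parts;
    int active_age = 0;  // index of the currently active age-bucket

    // Called on every record access: bump the active age-bucket of the
    // sub-bucket that covers the accessed key.
    void record_access(int partition, int sub_bucket) {
        parts[partition].sub[sub_bucket].age[active_age]++;
    }

    // Called at regular intervals: advance the age and reset the new bucket.
    void advance_age() {
        active_age = (active_age + 1) % kAgeBuckets;
        for (auto& p : parts)
            for (auto& s : p.sub) s.age[active_age] = 0;
    }

    // Cheap balance check on queue lengths only; the histograms are analyzed
    // (Section 7.6.2) only when this reports an imbalance.
    bool queues_imbalanced(double tolerance) const {
        uint64_t total = 0, max_q = 0;
        for (const auto& p : parts) {
            total += p.queue_length;
            max_q = std::max(max_q, p.queue_length);
        }
        const double ideal = static_cast<double>(total) / parts.size();
        return max_q > (1.0 + tolerance) * ideal;
    }
};
```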
The load of each partition is calculated based on an aging algorithm. Each bucket in the histogram is implemented as an array of age-buckets, shown on the right side of Figure 7.8. At any point in time there is one active age-bucket. When a record is accessed, the active age-bucket of the sub-bucket whose range the record belongs to is incremented by one. At regular time intervals the age of the histogram increases. Whenever the age of the histogram increases, the next age-bucket is reset and starts to count the accesses. When calculating the load of a sub-bucket in the histogram, the recent age-buckets are given more weight than the older ones. More specifically, if a sub-bucket consists of A age-buckets, the load for the ith age-bucket is l_i, and the current age-bucket is the cth bucket, then we calculate the total load L for the sub-bucket as follows:

L = \sum_{i=c}^{A+c-1} \frac{100 \times l_{i \bmod A}}{i - c + 1}

Figure 7.8 (right) shows an example of the aging algorithm, when the load to a particular sub-bucket increases by 10 for five consecutive time intervals (T1 to T5). W is the weight of each age-bucket and L is the load value of this sub-bucket at each interval, calculated by the formula above.
Because they are both lightweight, DLB very frequently monitors the throughput and the length of the request queues. On the other hand, the histograms are analyzed only whenever an imbalance is observed. The overall monitoring mechanism does not incur much overhead and it also provides adequate information for DLB to decide on the new partitions.
7.6.2 Deciding new partitioning
The algorithm DLB employs for reconfiguring the partition key-ranges is highly dependent on the request queues and the two-level aging-histogram structure discussed previously. First we describe the algorithm that determines the partitioning configuration within a single table, and then we consider the case where we decide the partitioning across all tables.
Deciding the partitioning within a single table
To describe the algorithm, let N be the total number of partitions, and Q_i be the number of requests at the request queue of the ith partition. Then, the ideal number of requests for each partition's queue is:

Q_{ideal} = \frac{\sum_{i=1}^{N} Q_i}{N}

Knowing Q_ideal, we have to decide on the ideal data access load for each partition. Let L_i be the aging load of the ith partition, which can be calculated as the sum of the aging loads of its sub-buckets. We have to calculate the ideal data access load for partition i, L_i^I, based on the ideal request load and how much request load, Q_i, each L_i creates. Therefore, L_i^I is:

L_i^I = Q_{ideal} \times \frac{L_i}{Q_i}

Because the granularity of the load information is determined by the number of sub-buckets in the histogram, it is difficult for DLB to achieve the precise ideal loads. That is why DLB only tries to approximate the ideal value. Algorithm 3 sketches how the new key-ranges are assigned. It iterates over all partitions except the last one.

Algorithm 3 Calculating ideal loads.
1: for i = 1 → N − 1 do
2:   while L_i < L_i^I − t do
3:     Move leftmost sub-bucket range from partition i + 1 to partition i
4:     L_i ⇐ L_i + L_subbucket
5:     if L_i > L_i^I + t then
6:       Distribute sub-bucket range into µ sub-buckets
7:   while L_i > L_i^I + t do
8:     Move rightmost sub-bucket range from partition i to partition i + 1
9:     L_i ⇐ L_i − L_subbucket
10:     if L_i < L_i^I − t then
11:       Distribute sub-bucket range into µ sub-buckets
While the estimated load L_i at a partition is less than L_i^I − t for some t value, it moves the range of the leftmost sub-bucket from the (i+1)th partition to the ith. Similarly, while the load at a partition is larger than L_i^I + t, it moves the range of the rightmost sub-bucket from the ith partition to the (i+1)th. If the moved sub-bucket causes a significant change in the calculated load (more than 2 × t), then this sub-bucket is substituted by a larger number of sub-buckets to observe that range at a finer granularity.

Figure 7.9: Example of how DLB decides on the new partition ranges.

Figure 7.9 shows an example of how Algorithm 3 is applied. In the example, there are three partitions on a table and Figure 7.9 shows the two-level histogram for each partition. The first level of the histogram tracks the number of accesses to a partition's range, which is 40 units in this example. The second level of the histogram, the 4 sub-buckets, keeps the number of accesses to sub-ranges in a partition, which is 10 units in this example. A higher bar in a sub-bucket indicates that the sub-range that corresponds to that sub-bucket has more load. Initially each partition has equal key-ranges, shown in the left part of Figure 7.9. If we assume that each partition has to perform an equal amount of work per request, the loads in this configuration are not balanced among the partitions. In that case, the repartition manager triggers repartitioning. Based on Algorithm 3, the new partitions are decided by moving around the sub-buckets to create almost-equal loads among the partitions. The result is shown on the right part of Figure 7.9; the most loaded regions end up in partitions with smaller ranges, like the second partition in Figure 7.9, and the lightly loaded regions are merged together.
Deciding the number of partitions of each table
The algorithm presented previously is just for one table and assumes that the number of partitions before and after the repartitioning operation does not change. Determining how many partitions a table should have is another issue and requires knowledge of all the tables in the database. Next, we provide a formulation to determine the number of partitions for a table.
In our setting, the number of partitions for a table is initially determined automatically to be equal to the number of hardware contexts supported by the underlying machine. To find what the number of partitions for a table should be dynamically, based on the workload trends, let T be the number of tables, N_total be the upper limit on the total number of partitions for the whole database, QT_i be the total number of requests for table i, N_i be the number of partitions for table i, QT_avg be the average number of requests for all the tables, N_avg be the average number of partitions for a table, and #CTX be the total number of available hardware contexts supported by the machine that executes the transactions run on this database. Based on the initial total number of partitions, we define N_total as:

N_{total} = T \times \#CTX

As a result, N_avg will be:

N_{avg} = \frac{N_{total}}{T} = \#CTX

The QT_i values are known from the request queues and therefore QT_avg can be calculated as:

QT_{avg} = \frac{\sum_{i=1}^{T} QT_i}{T}

The goal is to find the N_i values, which can be derived from the following formula:

\frac{QT_{avg}}{N_{avg}} = \frac{QT_i}{N_i}

Using the formulas and algorithm presented above, DLB efficiently decides on the new partitioning configuration.
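As a compact illustration of these calculations, here is a hedged C++ sketch that computes the aging load of a sub-bucket, the ideal per-partition loads L_i^I, and the number of partitions per table. Function names, the number of age-buckets, and the rounding of N_i to whole partitions are illustrative assumptions, not the prototype's code.

```cpp
// Sketch of DLB's load calculations (Sections 7.6.1-7.6.2). Assumes per
// sub-bucket access counts l[0..A-1] with active age-bucket c, per-partition
// queue lengths Q, per-partition aging loads L, and per-table request counts QT.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kAgeBuckets = 8;  // assumed value of A

// Aging load of one sub-bucket: L = sum_{i=c}^{A+c-1} 100 * l_{i mod A} / (i - c + 1)
double aging_load(const uint64_t (&l)[kAgeBuckets], int c) {
    double L = 0.0;
    for (int i = c; i <= kAgeBuckets + c - 1; ++i)
        L += 100.0 * l[i % kAgeBuckets] / (i - c + 1);
    return L;
}

// Ideal data-access load per partition: L_i^I = Q_ideal * L_i / Q_i,
// where Q_ideal is the average request-queue length across partitions.
std::vector<double> ideal_loads(const std::vector<uint64_t>& Q,
                                const std::vector<double>& L) {
    double q_ideal = 0.0;
    for (uint64_t q : Q) q_ideal += q;
    q_ideal /= Q.size();
    std::vector<double> LI(Q.size());
    for (size_t i = 0; i < Q.size(); ++i)
        LI[i] = (Q[i] > 0) ? q_ideal * L[i] / Q[i] : L[i];  // guard empty queues
    return LI;
}

// Partitions per table: N_i = N_avg * QT_i / QT_avg, with N_avg = #CTX.
std::vector<int> partitions_per_table(const std::vector<uint64_t>& QT, int hw_contexts) {
    double qt_avg = 0.0;
    for (uint64_t q : QT) qt_avg += q;
    qt_avg /= QT.size();
    std::vector<int> N(QT.size(), hw_contexts);
    if (qt_avg == 0.0) return N;  // no requests observed: keep the default
    for (size_t i = 0; i < QT.size(); ++i)
        N[i] = std::max(1, (int)std::lround(hw_contexts * QT[i] / qt_avg));
    return N;
}
```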
7.6.3 Using control theory for load balancing In our prototype implementation, the system immediately tries to adjust to a new configuration, once a target load value is determined for each partition. Thus there is always the danger of over-fitting, especially for the workloads that observe access skew with frequently changing hot-spots. Since repartitioning is not expensive for PLP (except for PLP-Partition), it can repartition again very quickly to alter the bad effects of a previous bad partitioning choice. Rather than directly aiming to reach the target load, a more robust technique would be to employ control theory while converging to the target load [LSD+ 07]. Control theory can increase the robustness of our algorithm, prevent the system from repartitioning unnecessarily and/or resulting with wrong partitions, and reduce the downtime faced by PLP-Partition during repartitioning. Nevertheless, it is orthogonal with the remaining infrastructure, and it could be easily integrated in the current design. The prototype implementation does not employ control theory techniques. But the evaluation, presented next, shows that DLB allows PLP to balance the load effectively. 7.7 Evaluation The evaluation consists of four parts. 1. In the first part we measure how useful PLP can be. In particular. Section 7.7.2 quantifies how different designs impact page latching and critical section frequency; Section 7.7.3 examines how effectively PLP reduces latch contention on index and heap page latches; and Section 7.7.4 shows the performance impact of those changes. 2. In the second part we try to quantify any overheads related to PLP. To do that we measure PLP’s behavior in challenging workloads that seem to not fit well with physiological partitioning, such as transactions with joins (Section 7.7.6) and secondary index 168 CHAPTER 7. PHYSIOLOGICAL PARTITIONING accesses that can be aligned with the partitioning or not (Section 7.7.7). In addition, Section 7.7.8 inspects the fragmentation overhead of the three PLP variations. 3. In the third part (Section 7.7.5) we quantify how useful MRBTrees can be also for nonPLP systems, like in conventional or logically-partitioned systems. 4. In the last part of the evaluation, we measure the overhead and effectiveness of the dynamic load balancing mechanism of PLP (Sections Section 7.7.9–Section 7.7.10). Finally, In Section 7.7.11, we highlight the key conclusions of the whole evaluation. 7.7.1 Experimental setup To ensure reasonable comparisons, all the prototypes are built on top of the same version of the Shore-MT storage manager [JPH+ 09] (and Section 5.2), incorporate the logging optimizations of [JPS+ 10] (and Section 5.4), and share the same driver code. We consider five different designs: • A optimized version of a conventional, non-partitioned system, labeled as “Conventional”. This system employs speculative lock inheritance [JPA09] (and Section 5.3.1) to reduce the contention in the lock manager, and essentially corresponds to the second bar of Figure 7.1. • Logical-only or DORA is a data-oriented transaction processing prototype [PJHA10] (and Chapter 6) that applies logical-only partitioning. • PLP or PLP-Regular prototypes the basic PLP variation. This variation accesses the MRBTree index pages without latching. • PLP-Partition extends PLP-Regular, so that one logical partition “owns” each heap page, allowing latch-free both index and heap page accesses. 
• PLP-Leaf assigns heap pages to leaves of the primary MRBTree index, also allowing latch-free index and heap page accesses. In addition, we experiment with the PLP variations with the dynamic load balancing mechanism integrated. We label those systems with a “-DLB” suffix (PLP-Reg-DLB, PLPPart-DLB, and PLP-Leaf-DLB). All experiments were performed on two machines: an x64 box, with four sockets of quadcore AMD Opteron 8356 processors, clocked at 2.4GHz and running Red Hat Linux 5; and a Sun UltraSPARC T5220 server with a 64-core Sun Niagara II chip clocked at 1.4GHz and running Solaris 10. Due to unavailability of a suitably fast I/O sub-system, all the Thousands Page latches acquired 7.7. EVALUATION 169 800 700 600 500 INDEX 400 HEAP 300 CATALOG / SPACE 200 100 0 Conventional Logical PLP-Regular PLP-Leaf Figure 7.10: Average number of page latches acquired by the different systems when the run the TATP benchmark. The PLP variants by design eliminate the majority of page latching. experiments are with memory-resident databases. But the relative behavior of the systems will be similar with larger databases. 7.7.2 Page latches and critical sections First we measure how PLP reduces the number of page latch acquisitions in the system. Figure 7.10 shows the number and type of page latches acquired by the conventional, the logically-partitioned and two variations of the PLP design, PLP-Regular and PLP-Leaf. Each system executes the same number of transactions from the TATP benchmark. PLP-Regular reduces the amount of page latching per transaction by more than 80%; while PLP-Leaf reduces the total further to roughly 1% of the initial page latching. The remaining latches are associated with metadata and free space management. The two right bars of Figure 7.1 compare total critical section entries of PLP vs. the conventional and logically-partitioned systems. The two PLP variants eliminate the vast majority of lock- and latch-related critical sections, leaving only metadata and space management latching as a small fraction of the critical sections. Transaction management, which is the largest remaining component, mostly employs fixed-contention communication to serialize threads that attempt to modify the transaction object’s state. Similarly, the buffer pool-related critical sections are mostly due to the communication between cleaner threads, which again do not impact scalability. Overall, PLP-Leaf acquires 85% and 65% fewer contentious critical sections than the conventional and logically-partitioned systems respectively. CHAPTER 7. PHYSIOLOGICAL PARTITIONING 160 140 120 100 80 60 40 20 0 Other Heap Latch Cont. Idx Latch Cont. 16 32 48 PLP Logical Conv. PLP Logical Conv. PLP Logical Conv. PLP Logical Latching Conv. Time breakdown (per xct) 170 64 # HW Contexts Figure 7.11: Time breakdown per transaction in an insert/delete-heavy benchmark. 7.7.3 Reducing index and heap page latch contention Having established that PLP effectively reduces the number of page latch acquisitions and critical sections, we next measure what is the impact of that change in the time breakdown. Figure 7.11 shows the impact in the transaction execution time as PLP eliminates the contention on index page latches. The graph gives the time breakdown per transaction for the different designs as an increasing number of threads run an insert/delete-heavy workload on the TATP database. 
In this benchmark, each transaction makes an insertion or a deletion to the CallFwd table, causing page splits and contention for the index pages that lead to the records being inserted/deleted. As Figure 7.11 shows, the conventional and the logicallypartitioned systems experience contention on the index page latches. They both spend 15-20% of their time waiting, while PLP eliminates the contention. We expect PLP to achieve proportional performance improvements. Similarly, Figure 7.12 shows the time breakdown per transaction when 16 and 40 hardware contexts are utilized by the conventional, the logically-partitioned and PLP-Partition systems when they run a slightly modified version of the StockLevel transaction of the TPC-C benchmark. StockLevel contains a join, and in this version, 2000 tuples are joined. We see that the conventional system wastes 20-25% of its time in contention in the lock manager and for page latching. Interestingly enough, the logically-partitioned system eliminates the contention in the lock manager, but this elimination is not translated to performance improvements. Instead the contention is shifted and aggravated to the page latches. On the other hand, PLP eliminates the contention both in the lock manager and for page latches and achieves higher performance. Time breakdown (per xct) 7.7. EVALUATION 171 60 50 Other 40 Btree 30 BPool 20 Latching 10 Locking 0 16 40 Conventional 16 40 Logical 16 40 PLP-Partition #HW Contexts Figure 7.12: Time breakdown per TPC-C StockLevel transaction, when 2000 tuples joined. The PLP variation eliminates the contention related to both locking and latching. Figure 7.13 gives the time breakdown per transaction when we run the TPC-B benchmark [TPC94]. In this experiment we do not pad records to force them onto different pages. Transactions often wait for others because the record(s) they update happen to reside on latched heap pages. The conventional, logically-partitioned, and PLP-Regular all suffer from this false sharing of heap pages. At high utilization this contention wastes more than 50% of execution time. On the other hand, PLP-Leaf is immune, reducing response time by 13-60% and achieving proportional performance improvement. In a way, PLP-Leaf provides automatic and more robust padding for the workloads that require manual padding in the conventional system to reduce contention on the heap pages. 7.7.4 Impact on scalability and performance Since PLP effectively reduces the contention (and the time wasted) to acquire and release index and heap page latches, we next measure its impact on performance and overall system scalability. The four graphs of Figure 7.14 show the throughput of the optimized conventional system, as well as the DORA and PLP prototypes, as we increase hardware utilization of the two multicore machines. On the two left-most graphs the workload consists of clients that repeatedly submit the GetSubscriberData transaction of the TATP benchmark [NWMR09], while on the two right-most graphs the workload consists of clients that repeatedly submit the StockLevel transaction of the TPC-C benchmark [TPC07]. Both transactions are read-only and ideally should impose no contention whatsoever. Those two workloads corresponding to the time breakdowns presented in Figure 7.11 and Figure 7.13 respectively. CHAPTER 7. PHYSIOLOGICAL PARTITIONING 350 300 250 200 150 100 50 0 Useful Heap Latch Cont. Idx Latch Cont. 16 32 48 PLP-Leaf PLP-Reg DORA Conv. PLP-Leaf PLP-Reg DORA Conv. PLP-Leaf PLP-Reg DORA Conv. 
PLP-Leaf PLP-Reg DORA Latching Conv. Time breakdown (per xct) 172 64 # HW Contexts Figure 7.13: Time breakdown per transaction in TPC-B with false sharing on heap pages. As expected, PLP shows superior scalability, evidenced by the widening performance gap with the other two systems as utilization increases. For example, from the right-most graph we see that for StockLevel DORA delivers a 11% speedup over the baseline case in the 4-socket Quad x64 system. In its turn, PLP delivers an additional 26% over DORA, or nearly 50% over the conventional. The corresponding improvements in the Sun machine’s slower but more numerous cores are 13% and 34%. Note that eight cores of the x64 machine match the fully-loaded Sun machine, so the latter does not expose bottlenecks as strongly in spite of its higher parallelism. A significant fraction of the speedup actually comes from the MRBTree probes, which are effectively one level shallower, since threads bypass the “root” partition table node during normal operation. 7.7.5 MRBTrees in non-PLP systems The MRBTree can improve performance even in the case of conventional systems in three ways. First, since it effectively reduces the height of the index by one level, each index probe traverses one fewer node and hence it is faster. Second, any possible delay due to contention on the root index page is also reduced roughly proportionally with the number of sub-trees. We see the effect of those two in Figure 7.15, which highlights the difference in the peak performance of the conventional and the logically-partitioned system when they run with and without MRBTrees. The workload is the TATP benchmark. In both case the improvement in performance is in the order of 10%. Third, MRBTrees allow each sub-tree to have a structure modification operation (SMO) in flight at any time; in contrast with traditional B+Trees that can have only one SMO in 7.7. EVALUATION 173 TATP GetSubscriberData TPCC StockLevel 2.5 700 600 Thousands Throughput (Ktps/sec) 400 350 300 250 200 150 100 50 0 0 16 32 48 64 HW ctxs util. Sun Niagara II 2.0 500 400 1.5 300 1.0 200 100 0 0 4 8 12 16 HW ctxs util. Intel x64 6 5 4 3 2 0.5 1 0.0 0 0 16 32 48 64 HW ctxs utl. Sun Niagara II Opt. Shore-MT DORA PLP-Partition 0 4 8 12 16 HW ctxs util. Intel x64 Figure 7.14: Throughput when the systems run the GetSubscriberData transaction of the TATP benchmark, and the StockLevel transaction of the TPC-C benchmark in two multicore machines. PLP shows superior scalability, as evidenced by the widening performance gap with the other two systems as utilization increases. flight. Consequently, in workloads with high entry insertion (deletion) rates, the MRBTree improves performance by parallelizing the SMOs. Figure 7.16 shows the time breakdown of the conventional system with and without MRBTrees as we run a microbenchmark that consists of either a record probe or insert, and we increase the percentage of inserts. Without MRBTrees, the system spends an increasing amount of time blocked waiting for SMOs to complete as the insertion rate increases. When MRBTrees are used, there is no time wasted waiting for SMOs and performance improves by up to 25%. Overall, there are compelling reasons for systems other than PLP to adopt MRBTrees. 7.7.6 Transactions with joins in PLP Next we turn our attention to workloads that seem to not fit well with physiological partitioning. First, we inspect how PLP behaves on workloads with transactions with join operations. 
To evaluate the performance of PLP on transactions with joins, we slightly modified the StockLevel transaction from the TPC-C benchmark [TPC07] to determine the number of tuples joined. (This experiment is the same with the one presented in Section 6.4.6.) In its un-modified version, StockLevel joins 200 tuples between two tables. We created different versions of the transaction where 20, 200, 2000, 20000, and 200000 tuples are joined. For CHAPTER 7. PHYSIOLOGICAL PARTITIONING Throughput (Ktps/cpu) 174 70 60 50.0 50 55.8 65.3 59.4 40 30 20 10 0 Normal MRBT Normal MRBT Conv. Logical 75 Other Bpool 50 TxMgr Log 25 Locking Latch-smo 0% 20% 40% 60% Percentage of Inserts 80% MRBT Normal MRBT Normal MRBT Normal MRBT Normal MRBT Normal MRBT 0 Normal Time breakdown (per xct) Figure 7.15: Performance of the conventional and the logically-partitioned system in TATP. MRBTree is beneficial also for non-PLP systems. 100% Figure 7.16: Time-breakdown of conventional transactions when parallel SMOs are allowed with MRBTrees. each different number of tuples joined, Figure 7.17 plots the maximum throughput the conventional, the logically-partitioned and the PLP-Partition systems achieved, normalized to the maximum throughput of the conventional. The three systems achieved their maximum throughput when the 4-socket Quad x64 machine was 100% utilized, which means that there were no significant scalability bottlenecks. Figure 7.17 shows that the PLP variation achieves higher performance than the conventional system regardless of the number of tuples joined. When only 20 tuples are joined PLP achieves 2.1x higher performance than conventional, while when 200K tuples are joined PLP achieves 33% higher performance. PLP achieves higher performance because it eliminates the contention for page latches, as Figure 7.12 Normalized Throughput 7.7. EVALUATION 175 2.25 2.00 1.75 1.50 1.25 1.00 0.75 0.50 0.25 0.00 Conventional Logical PLP-Partition 20 200 2000 20000 200000 Tuples Joined Figure 7.17: Maximum throughput when running the TPC-C StockLevel transaction, normalized the throughput of Conventional. illustrates. That is in contrast with the logically-partitioned system (DORA), which for large number of tuples joined performs lower than conventional. 7.7.7 Secondary index accesses Non-clustered secondary indexes are pervasive in transaction processing, since they are the only means to speed up transactions that access records using non-primary key columns. Nevertheless, secondary index accesses pose several challenges to PLP, which we explore in Figure 7.18. We break the analysis of secondary index accesses to two cases: when the secondary index is aligned with the partitioning scheme and when it is not. We conduct an experiment where we modify TATP’s GetSubscriberData transaction to perform a range index scan on the secondary index with built on the names of the Subscribers and we control the number of matched records. In the original version of the transaction only one Subscriber is found. In the modified version, we probe for 10, 100, 1000, and 10000 Subscribers, even though index scans for thousands of records are not typical in high-throughput transactional workloads. This experiment is very similar to the one conducted in Section 6.4.5. If the secondary index columns are a subset of the routing columns, then the secondary index is aligned with the partitioning scheme. 
In that case, a secondary index scan may return a large number of matched RIDs (record ids of entries that match the selection criteria) from several partitions. All the executors need to send the probed data to a coordination point where an aggregation of the partial results takes place. As the range of the index scans become larger (or the selectivity drops), this causes a bottleneck due to excessive data transfers. When CHAPTER 7. PHYSIOLOGICAL PARTITIONING 176 300 50 Range = 10 Throughput (Ktps) 250 40 200 30 150 20 100 50 10 0 0 PLP-Aligned PLP-NonAligned Conventional 0 4 5 Throughput (Ktps) Range=100 8 12 #HW Contexts 0 16 4 0.50 Range = 1000 8 12 16 # HW Contexts Range=10000 4 0.40 3 0.30 2 0.20 PLP-Aligned 1 0.10 PLP-NonAligned 0 0.00 0 4 8 12 # HW Contexts 16 Conventional 0 4 8 12 16 # HW Contexts Figure 7.18: Performance on transactions with aligned and non-aligned secondary index scans. the secondary index is not aligned with the partitioning scheme, then on top of the above mentioned bottleneck there is also an important overhead. This overhead is because each record probe becomes a two step process, where the secondary index probe is done by one thread conventionally and then requests from the appropriate executor threads to retrieve the selected records. Figure 7.18 compares the performance of Conventional system with PLP-Part-Aligned, which performs partitioning aligned secondary index accesses, and PLP-Part-NonAligned, which performs non-partitioning aligned secondary index accesses, as more hardware contexts are utilized in the system. PLP-Part-Aligned improves performance over Conventional by 46%, 14%, 8%, and 1% respectively for ranges 10, 100, 1000, 10000. On the other hand, even though PLP-Part-NonAligned improves performance by 11% when 10 records are scanned, for larger ranges it hinders performance. PLP-Part-Aligned is 3%, 11%, and 38% slower than Conventional for ranges 100, 1000, and 10000, respectively. Normalized # of Heap Pages 7.7. EVALUATION 177 2.00 1.75 1.50 1.25 1.00 0.75 0.50 0.25 0.00 Conventional PLP-Regular PLP-Partition PLP-Leaf 1MB 10MB 100MB 1GB 10GB 1MB 10MB 100MB 1GB 100B 10GB 1000B Record and Database Size Figure 7.19: Space overhead of the PLP variations. As expected, the performance improvement for PLP-Part-Aligned gets smaller as the range of the index scan increases. However, as long as the index scans of partitioning-aligned secondary indexes are selective and touch a relatively small number of records, PLP provides decent performance improvement. For PLP-Part-NonAligned, however, such workloads are very unfriendly, though unless the scan range is over 1000 records it is not disastrous. 7.7.8 Fragmentation overhead PLP-Partition and PLP-Leaf, create some fragmentation on the heap file since they change the regular heap file structure (see Section 7.4.3). Given the increased number of data pages due to fragmentation, we expect the heap file scan times to increase proportionally. Figure 7.19 shows the ratio between the number of pages used in the three PLP variations and the conventional system as we increase the database size. The x-axis shows the total size of the database when each record is 100B (left side of the graph) and 1000B (right side of the graph). The y-axis is the ratio between the number of pages used in each design and the conventional system. The conventional system has one partition, where the PLP variations have 100 and 10 partitions for the cases where record size is 100B and 1000B, respectively. The heap page size is 8KB. 
As expected, PLP-Regular does not create any fragmentation, since it maintains the regular heap file format. For PLP-Partition, the amount of fragmentation becomes negligible as the database size increases for small records. However, PLP-Leaf uses up to 80% more heap pages than the conventional system for the same case, creating visible fragmentation on the heap file. On the other hand, as we increase the record size the fragmentation decreases, because each heap page keeps fewer records and thus the amount of empty space left on each heap page is smaller.

Overall, among the PLP variations, only PLP-Leaf may introduce significant fragmentation when a heap page can keep many database records. As the number of records a heap page can keep decreases, this cost becomes less significant. We also note that PLP is a design optimized for high-performing transactional applications, where entire heap file scans are rare.

7.7.9 Overhead and effectiveness of DLB

In this section we first quantify the overhead of the dynamic load balancing mechanism (DLB) under normal operation. Then we measure how quickly and effectively DLB reacts to skew and load imbalances. All the experiments use the GetSubscriberData transaction from the TATP benchmark.

Overhead in normal operation

Under normal operation, DLB should impose minimal overhead. DLB's monitoring component performs three operations: it maintains the histograms with access information, it continuously monitors the throughput, and it periodically analyzes the request queues of the worker threads for load imbalances. In an optimally configured system (where the load is precisely balanced across partitions), we measure performance as we increase the load (the number of concurrent clients that submit transactions).

Figure 7.20: Overhead of DLB under normal operation (throughput as the number of utilized CPUs increases, with no histogram and with 0, 2, 5, 10, and 20 sub-buckets).

Figure 7.20 shows the overhead caused by updating the aging histogram for each data access. As we utilize more CPUs, the number of threads that try to update the histogram increases, and so does the overhead of updating it. On the other hand, increasing the number of sub-buckets does not have much effect.

Figure 7.21: Example of dynamic load balancing in action. At time t=10, 50% of the requests are sent to 30% of the database.

Overall, we observe that the monitoring component of DLB is fairly lightweight. On average, histogram updates cause a 6% drop in throughput compared to the system running without a histogram, and the maximum drop is 7-8%. Since the transaction we execute is read-only, we actually evaluate the worst-case behavior here. For a transaction with updates, the number of transactions executed per second, and hence the number of data accesses, would be lower. Fewer data accesses would cause fewer histogram updates and therefore less overhead.

Reacting to load imbalances

In order to evaluate how effectively DLB handles load imbalances, we execute the same experiment as the one in Figure 7.5.
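Before walking through that experiment, a minimal sketch of the kind of bookkeeping DLB's monitor performs may be useful. The structure below keeps per-partition access counters split into sub-buckets, decays them periodically so that recent accesses weigh more (the aging step), and flags an imbalance when the hottest and coldest partition loads differ by more than a threshold t. The decay factor, counter layout, and detection rule are our simplifying assumptions; the actual DLB implementation may differ.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Minimal sketch of an aging access histogram for load-imbalance detection.
// Assumptions (ours): one bucket per partition, a fixed number of sub-buckets
// per bucket, exponential decay at every aging step, and an imbalance declared
// when (max - min) / max across partition loads exceeds the threshold t.
struct AgingHistogram {
    std::vector<std::vector<double>> counts;  // counts[partition][sub_bucket]
    double decay;                             // e.g. 0.5: old accesses lose half their weight per step

    AgingHistogram(size_t partitions, size_t sub_buckets, double decay_factor)
        : counts(partitions, std::vector<double>(sub_buckets, 0.0)), decay(decay_factor) {}

    // Called on every data access; in the real system this must be very cheap.
    void record_access(size_t partition, size_t sub_bucket) {
        counts[partition][sub_bucket] += 1.0;
    }

    // Called periodically (e.g. every second) so that recent accesses weigh more.
    void age() {
        for (auto& bucket : counts)
            for (auto& c : bucket) c *= decay;
    }

    double partition_load(size_t p) const {
        double sum = 0.0;
        for (double c : counts[p]) sum += c;
        return sum;
    }

    // True if the load gap between the hottest and coldest partition exceeds t.
    bool imbalanced(double t) const {
        double lo = partition_load(0), hi = lo;
        for (size_t p = 1; p < counts.size(); ++p) {
            double l = partition_load(p);
            lo = std::min(lo, l);
            hi = std::max(hi, l);
        }
        return hi > 0.0 && (hi - lo) / hi > t;
    }
};

int main() {
    AgingHistogram h(64, 5, 0.5);                               // 64 partitions, 5 sub-buckets each
    for (int i = 0; i < 1000; ++i) h.record_access(3, i % 5);   // partition 3 runs hot
    h.age();
    printf("imbalanced at t=10%%: %s\n", h.imbalanced(0.10) ? "yes" : "no");
}
```

The sub-bucket dimension corresponds roughly to the finer second level of the two-level histogram; varying its size is what Figure 7.20 explores.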
The PLP variations (PLP-Regular, PLP-Reg-DLB, PLP-Part-DLB, and PLP-Leaf-DLB) use 64 partitions, apply aging every 1 sec, and the load difference threshold value t is 10%. Initially the requests are distributed uniformly, and at time point 10 (sec) 30% of the database starts to receive 50% of the requests. As Figure 7.21 shows, the change in the access pattern causes a 30% drop in the throughput of PLP-Regular, making its performance worse than that of the non-partitioned Conventional system. On the other hand, the DLB-integrated PLP variations quickly detect the skew and bring the performance back to the pre-skew levels in less than 10 secs. In particular, 2 secs after the change in the access pattern DLB has already decided on the new partitioning configuration, and around 8 secs later it has performed 126 repartition operations (63 splits and 63 merges). The throughput has some spikes for a short time after repartitioning, but in the end settles down.

Figure 7.22: Partitions Before & After the repartitioning.

In PLP-Reg-DLB, very few index entries are updated, leading to a shorter dip in throughput during repartitioning. PLP-Leaf-DLB experiences an almost equally short dip. PLP-Part-DLB suffers a much longer dip. For statically partitioned PLP, Figure 7.21 shows only the results for PLP-Regular, since the drop in throughput is almost the same for the other two statically partitioned PLP variations (PLP-Partition and PLP-Leaf). DLB triggers a global repartitioning process which affects all the partitions in the system. PLP-Regular and PLP-Leaf can handle this process very well. However, such global repartitioning is not suitable for PLP-Partition. PLP-Partition is the closest to a physically-partitioned (shared-nothing) system in terms of repartitioning cost, since it reorganizes a large number of heap pages (see Section 7.5.2). Therefore, its non-optimal behavior with DLB is as expected.

Speeding up accesses to hot spots

When DLB is effective, the "hot" regions end up in narrow partitions. The indexes for these partitions are shallower and provide shorter access times for the "hot" records. In addition, "hot" records that previously could belong to the same partition, due to their key proximity, end up in different partitions. Figure 7.22 graphically illustrates the impact of DLB on the ranges of 10 partitions before and after a repartitioning. The area within the rectangular region highlights the "hot" range; it is 10% of the total area and receives 50% of the total load. Initially, labeled Before, the system has equal-length range partitions. After DLB kicks in and repartitioning completes, labeled After, the "hot" region has shorter-length range partitions while the not-so-loaded regions have larger-length partitions.

Table 7.3: Average index probe times (in microseconds) for a hot record, as skew increases.

Skewed region (%)    Before Skew    After Skew    After Repartitioning
50                   69             67            65
20                   67             66            63
10                   69             66            62
5                    68             64            61
2                    68             64            60

Table 7.4: Average record probes per second for a hot record, as skew increases.

Skewed region (%)    After Skew    After Repartitioning
50                   13            13
20                   7             29
10                   7             73
5                    32            108
2                    63            155

Table 7.3 shows the average index probe time (in microseconds) for a hot record as we increase the skew. For this experiment we use a single table with 640000 records, for a total size of around 1GB. There is an index on this table with 8KB pages, and the primary key is an integer (4B).
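To build intuition for the probe times in Table 7.3, the sketch below estimates the height of a partition's sub-tree from its record count. The per-node capacities are assumptions made purely for illustration (the actual MRBTree node layout and fanout are not restated here), as is the size of the narrowed hot partition.

```cpp
#include <cmath>
#include <cstdio>

// Rough, illustrative estimate of a range partition's B+Tree sub-tree height.
// The per-node capacities below are assumptions, not the MRBTree's parameters.
static int subtree_height(double records, double leaf_cap, double fanout) {
    double nodes = std::ceil(records / leaf_cap);  // number of leaf nodes
    int height = 1;                                // count the leaf level
    while (nodes > 1.0) {                          // add internal levels until a single root remains
        nodes = std::ceil(nodes / fanout);
        ++height;
    }
    return height;
}

int main() {
    const double leaf_cap = 100, fanout = 100;  // assumed entries per 8KB node
    // 640,000 records over 10 equal-range partitions: 64,000 records each.
    printf("equal-range partition: height %d\n", subtree_height(64000, leaf_cap, fanout));
    // A narrowed "hot" partition holding, say, a few thousand records (hypothetical).
    printf("narrow hot partition:  height %d\n", subtree_height(4000, leaf_cap, fanout));
}
```

With these assumed capacities an equal-range partition needs three levels while a partition narrowed to a few thousand records needs only two; one fewer level to traverse per probe is the qualitative effect behind the shorter probe times in Table 7.3, even though the real node capacities differ.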
When there are 10 equal-range partitions, the height of each partition's sub-tree is 3. Each row in the table shows the average access time of a randomly picked record from a "hot" region which gets 50% of all the requests, as the range of the "hot" region decreases (and the skew increases). The first column ("Before Skew") shows the average access time when the requests are uniformly distributed. The second column ("After Skew") shows the average access time when DLB is disabled and the request distribution is skewed. The third column shows the average access time after DLB has kicked in and completed a repartitioning.

As Table 7.3 shows, the access times for the randomly picked record are lower after we introduce the skew. This is probably due to a caching effect, since the record is accessed more frequently when there is skew in the data accesses. However, the access time after repartitioning is the shortest, since the height of the sub-tree in the new "hot" partition is 2 whereas in the old partition it was 3 (the height of the sub-trees for the other partitions remains 3).

Table 7.4 shows the number of finished requests for the "hot" record after the skew and after DLB's repartitioning. Before repartitioning, fewer requests are satisfied for the picked record because its partition is highly loaded with requests for other records in the same "hot" partition range. DLB distributes the "hot" range between multiple shorter-range partitions. Therefore, a single partition can serve more requests for the "hot" record. This results in the small throughput increase observed after repartitioning in Figure 7.21.

Figure 7.23: Overhead of updating secondary indexes during repartitioning, for PLP-Leaf (left) and PLP-Partition (right) with zero to four secondary indexes. At time t=5, 50% of the requests are sent to only 10% of the database, which triggers repartitioning.

7.7.10 Overhead of updating secondary indexes for DLB

In PLP-Leaf and PLP-Partition, whenever a record moves, every non-clustered index of the table needs to be updated with the record's new RID (see Section 7.4.3). In this section, we measure the overhead of updating the secondary indexes during repartitioning. Figure 7.23 shows the effect of repartitioning on throughput as we increase the number of secondary indexes of a table, for PLP-Leaf (left) and PLP-Partition (right). For this experiment we use the Subscribers table of the TATP database. Initially, there are 2 partitions of 320000 records each that receive uniform requests. After 5 seconds, 50% of the requests are sent to only 10% of the table and DLB triggers a repartitioning. We measure the throughput of the system as we increase the number of secondary indexes on the table, from none up to 4.

Figure 7.23 (left) shows that the overhead for PLP-Leaf to update the secondary indexes is relatively low, because few or none of the records need to be moved. On the other hand, the overhead for PLP-Partition is much higher. PLP-Partition has to move more records and update more entries in the secondary indexes. Therefore, repartitioning in PLP-Partition takes longer as we increase the number of secondary indexes of a table.

7.7.11 Summary

As the experimental results show, PLP successfully eliminates two major sources of unscalable critical sections in conventional shared-everything systems: locking and latching.
In addition, it provides a good infrastructure for easy repartitioning and dynamic load balancing. It is important to note that each PLP variation has its drawbacks. For example, PLP-Leaf comes with some fragmentation (Section 7.7.8) and PLP-Partition cannot repartition efficiently (Section 7.7.9 and Section 7.7.10). Considering the long-lasting throughput drops during repartitioning for PLP-Partition, we favor PLP-Leaf for workloads that need dynamic load balancing. If the workload does not heavily suffer from heap page latching, but only from index page latching, then PLP-Regular is a great design choice, because it neither causes fragmentation nor faces long and sharp drops in throughput during repartitioning.

7.8 Related work

The related work for this chapter can be categorized in three areas: analyzing and reducing the critical sections in DBMSs, partitioned B+Trees and concurrency control mechanisms, and dynamic load balancing and repartitioning.

7.8.1 Critical Sections

The complexity and overheads of database management systems are well known. For example, [HAMS08] shows that, even in a single-threaded OLTP system, logging, locking, latching, and buffer pool accesses contribute roughly equal overheads and together account for the majority of machine instructions executed during a transaction. The previous chapters and related work show that these overheads become scalability burdens on multicore hardware [JPH+ 09, PJHA10]. PLP eliminates in its entirety one category of serializations, page latching, along with the corresponding bottlenecks.

In the shared-everything arena, the two techniques presented in the previous chapter, speculative lock inheritance [JPA09] and data-oriented transaction execution [PJHA10], minimize the need for interaction with a centralized lock manager. Whereas speculative lock inheritance allows the system to spread lock operations across multiple transactions to reduce contention, data-oriented systems replace the central lock manager with thread-local lock management. Reducing lock contention with data-oriented execution has also been studied for data-stream operators [DAAEA09].

Other proposals tackle the weakness posed by the centralized log manager, with [JPS+ 10] (and Section 5.4) presenting a scalable log buffer and [Che09] exploiting flash technology to reduce logging latencies. These proposals show that even seemingly pervasive forms of communication can be reduced or sidestepped to great effect. However, none of them addresses physical data accesses involving page latching and the buffer pool, the other two major overheads in the system, which PLP eliminates.

Oracle RAC, with Cache-Fusion [LSC+ 01], allows database instances in a shared-disk cluster to share their buffer pools and avoid accesses to the shared disk. It can also partition the data to reduce both logical and physical contention on a particular portion of the data. However, it does not enforce that each partition is accessed only by a single thread. Therefore, it does not eliminate physical latch contention while accessing pages from the shared cache as much as PLP does.

As discussed previously, shared-nothing systems [Sto86, DGS+ 90, SMA+ 07] have an appealing design that eliminates critical sections altogether. However, they struggle both proactively, to reduce the need to execute distributed transactions through efficient partitioning [CJZM10, PJZ11], and reactively, to reduce overheads when distributed transactions cannot be avoided [JAM10].
On the other hand, PLP, in addition to eliminating a large portion of the unscalable critical sections, offers a less costly way of load balancing and communication for distributed transactions, since partitions share the same memory space.

7.8.2 B+Trees and alternative concurrency control protocols

Alternatives to the traditional B+Tree concurrency control protocol have been studied to allow multiple concurrent SMOs [ML92, JSSS06]. The MRBTree index structure provides an alternative to these techniques, allowing concurrent SMOs with less code complexity. However, these techniques could be implemented alongside MRBTrees to achieve concurrency within a partition, should that be desirable for a conventional system. In addition to what these techniques offer, MRBTrees also allow multiple root split operations in parallel. Several earlier works propose B+Trees with multiple roots to reduce contention due to locking [MOPW00, Gra03]. However, again, none of these proposals targets physical latch contention in the system.

In addition, there are latch-free B+Tree implementations that use alternative synchronization methods. The CO B-Tree [BFGK05] uses load-linked/store-conditional (LL/SC) instead of latching to synchronize operations on a B+Tree. However, it does not eliminate contention on the B+Tree. PALM [SCK+ 11] eliminates both page latching and contention on the B+Tree by using the Bulk Synchronous Parallel model. However, it has to perform B+Tree operations in batches in order to exploit this technique, which might not always be desirable and is harder to integrate within a database management system.

Finally, optimistic and multiversioning concurrency control schemes [KR81, BG83, LBD+ 12] may improve concurrency by resolving conflicts lazily at commit time instead of eagerly blocking at the moment of a potential conflict. When conflicts are rare, this allows the system to avoid the overhead of enforcing database locks. On the other hand, if conflicts occur frequently, the performance of the system drops rapidly, since the transaction abort rate is high. Moreover, there is work that compares the concurrency control schemes in database systems. Notable is the work by Agrawal et al. [ACL87], while the book of Bernstein et al. [BHG87] and Thomasian's survey [Tho98] are good starting points for the interested reader. The focus of PLP, however, is on the contention for latches rather than on the concurrency control scheme used.

We also note that there is a large body of work on cache-conscious index implementations (e.g. [RR99, RR00, CGMV02]). Such indexes are typically not used in transaction processing systems. Instead, they target business intelligence workloads, which lack updates and therefore do not need complicated concurrency control mechanisms. PLP eliminates the need for latching and concurrency control at the index level. Therefore, we expect a significant performance boost if we substitute the index implementation with a cache-friendlier B+Tree alternative, since the B+Tree probes are the most expensive remaining component of PLP.

7.8.3 Load balancing

There is a large body of related work, but most of it focuses on clustered (shared-nothing) environments. For example, [AON96] analyzes and compares different approaches for index reorganization during repartitioning in shared-nothing deployments. Lee et al. [LKO+ 00] propose an index structure similar to the MRBTree, which eases the index reorganization during repartitioning in a shared-nothing system, and Mondal et al.
[MKOT01] extend this design by keeping statistics for each branch pointed to by the root node of a partition's sub-tree. While the structure of [MKOT01] enables the observation of access patterns at a fine granularity, all the accesses have the same weight, no matter how recent or old they are. Our two-level aging-based histogram assigns higher weight to recent accesses. This allows us to have a more accurate view of skewed access patterns and to detect load imbalances quickly.

Shinobi [WM11] uses a cost model to decide whether the benefits of a new partitioning configuration are worth the cost of repartitioning. Shinobi focuses on insert-heavy workloads where data is rarely queried and, when queried, the queries focus on a small region of the most recently inserted records. Its benefits primarily come from not indexing the large, infrequently accessed parts of the database. We consider mainstream transactional workloads where the entire database is accessed and no indexes can be dropped.

The histogram-based technique we use is influenced by previous work on maintaining dynamic histograms of data distributions for accurately estimating the selectivity of query predicates [GMP02, DIR00]. In DLB's case, we are interested in the frequency of accesses to a particular region and in the access pattern, rather than in the data distribution.

Finally, our work is orthogonal to techniques that decide the initial partitioning configuration. For example, Schism [CJZM10] creates partitions that minimize the number of distributed transactions by representing the workload as a graph and using a graph partitioning algorithm. Houdini [PJZ11] uses a Markov model in order to decide the partitioning, while in [RZML02] the query optimizer is used to get suggestions for the initial partitions. These tools only create the initial configuration; if the workload characteristics change over time, however, the initial configuration becomes useless and the system has to re-calculate the partitioning configuration and perform the repartitioning.

7.8.4 PLP and future hardware

As multicore hardware trends evolve, PLP becomes increasingly attractive for several reasons. Conventional OLTP is ill-suited to modern and upcoming hardware for at least three reasons:

• The code of an OLTP system is full of unscalable critical sections [JPH+ 09].
• The access patterns are so unpredictable [SWH+ 04] that even the most advanced prefetchers fail to detect them [SWAF09].
• The majority of the accesses are shared read-write, and hence they under-perform on caches with non-uniform access latency [BW04, HFFA09].

As we have seen, PLP, combined with previous advances in logging, eliminates all three problems. The majority of unscalable critical sections are completely eliminated, access patterns are regularized by the thread assignments, and threads no longer share data to communicate, eliminating the shared read-write problem. This regularity will become increasingly important as hardware continues to make more and more demands of the software. For example, it is almost inevitable that processor cache access latencies will be non-uniform [BW04, HPJ+ 07, HFFA09]. Unfortunately, OLTP will only be able to utilize these new architectures effectively if it can eliminate the majority of accesses that are shared among multiple processors.

Another important trend in hardware design is toward non-coherent many-core processors that are based on message passing, e.g. [V+ 07, H+ 10].
In the area of operating systems, this trend has already been recognized and message-passing system designs, such as Barrelfish [BBD+ 09], have been proposed. PLP by design requires only a small amount of communication between threads. There is no fundamental difficulty in extending its design to a pure message-passing shared-everything transaction processing system in order to fit naturally on such hardware. In short, by eliminating a large class of non-crucial communication, PLP leaves OLTP engines much better poised to take advantage of upcoming hardware, whatever form it may take.

7.9 Conclusions

Unlike conventional systems, which embrace either a fully shared-everything or a shared-nothing philosophy, physiological partitioning takes the best features of both to produce a hybrid system that operates nearly latch- and lock-free, while still retaining the convenience of a common underlying storage pool and log. We achieve this result with a new multi-rooted B+Tree structure and careful assignment of threads to data, adopting the thread-to-data transaction execution principle. This design allows easy repartitioning and enables a lightweight, robust, and efficient dynamic load balancing mechanism.

Chapter 8

Future Direction and Concluding Remarks

8.1 Hardware/data-oriented software co-design

We already argued in Section 7.8.4 that systems built around data-oriented execution are very well-suited for emerging hardware. But we can go beyond that with a hardware/data-oriented software co-design.

8.1.1 Hardware enhancements

The hardware enhancements to data-oriented software can range from something as simple as hardware mechanisms to efficiently pass messages from one thread to another (e.g. [RSV87]) to implementations of entire sub-components. In addition, data-oriented software designs can benefit from various hardware optimizations which do not seem to be very beneficial to mainstream software, and thus have not gained popularity until now. For example, since the memory accesses of the threads in a data-oriented system can be clearly separated, optimistic hardware technologies could potentially work very well with it. Two of them are hardware transactional memory [Her91] and speculative lock elision [RG01]. Both technologies rely on the assumption that only few conflicts actually happen between concurrent threads, in a way similar to optimistic concurrency control in transaction processing [KR81]. Unfortunately, the harsh realization was that commercial software, such as database workloads, did not exhibit such behavior. As a result, those technologies never gained popularity and, for example, in 2009 Sun canceled its $1B Rock processor project, which featured hardware transactional memory [CCE+ 09].1

1 http://bits.blogs.nytimes.com/2009/06/15/sun-is-said-to-cancel-big-chip-project/

8.1.2 Co-design for energy-efficiency

Let us consider the problem of energy-efficient database processing, since one of the biggest challenges for the years to come will be the implementation of energy-efficient and energy-proportional systems [BH07, Ham08]. A recent study showed that if we can modify only the database system configuration, then the most energy-efficient configuration is nearly always the one that has the highest performance [THS10].
On the other hand, recent work in the computer architecture community showed that major improvements in energy-efficiency are achieved with custom hardware designs and appropriate modifications of the application [HQW+ 10]. This particular work implements a 720p HD H.264 encoder which is orders of magnitude more energy-efficient. Achieving something similar for an application as complex as a transaction processing system will be a far more difficult and challenging task, but the dividends will also be much greater. A hardware/data-oriented transaction processing co-design is very appealing. For example, data-oriented software gives the opportunity to drastically reduce the complexity of the transaction processing codepaths (see Section 7.4.5), making large parts of them implementable in hardware.

8.2 Summary and conclusion

The overall goal of this dissertation was to improve the scalability of transaction processing. First, we provided evidence that conventional transaction processing designs will inevitably face significant scalability problems, due to their complexity and the unpredictability of their access patterns, a result of the way they assign work to the concurrent worker threads. Then, we showed that not all points of serialization (also known as critical sections) are threats to the scalability of software systems, even though their sheer number imposes significant overhead on single-thread performance. Based on this categorization, we attacked the biggest lurking scalability problems in a conventional design, providing solutions based on caching data across transactions and downgrading specific critical sections. But no matter how much we optimized, the codepath of the conventional design was still full of critical sections.

To alleviate the problems of conventional execution, we then made the case for data-oriented transaction execution. Data-oriented execution is based on a thread-to-data work assignment policy that results in coordinated accesses. This coordination of accesses allows all sorts of optimizations, breaking the inherent limitations of conventional transaction processing. To prove that, we presented two designs, each of them removing a significant source of unscalable critical sections: those inside the centralized lock manager, and page latching.

Finally, we showed how difficult it is to scale the performance of transaction processing on non-uniform hardware, such as multisocket multicores. Only software systems that distribute the accesses, such as data-oriented systems, can fully exploit such non-uniform or heterogeneous hardware; certainly conventional transaction processing cannot. We project that as hardware parallelism and heterogeneity continue to increase, the gap between conventional and data-oriented transaction execution will only continue to widen.

Bibliography

[A+ 85] Anon. et al. A measure of transaction processing power. Datamation, 31(7), 1985. 2.4.1, 6.4.1 [ACL87] Rakesh Agrawal, Michael J. Carey, and Miron Livny. Concurrency control performance modeling: alternatives and implications. ACM TODS, 12, 1987. 3.2, 6.6, 7.8.2 [ADH01] Anastasia Ailamaki, David J. DeWitt, and Mark D. Hill. Walking four machines by the shore. In CAECW, 2001. 5.2 [ADHW99] Anastasia Ailamaki, David J. DeWitt, Mark D. Hill, and David A. Wood. DBMSs on a modern processor: Where does time go? In VLDB, 1999. 1.2 [Adl05] Stephen Adler.
The Slashdot effect: An analysis off three internet publications, 2005. Available at: http://hup.hu/old/stuff/slashdotted/SlashDotEffect.html. 7.5 [AFR09] Mohammad Alomari, Alan Fekete, and Uwe Röhm. A robust technique to ensure serializable executions with snapshot isolation dbms. In ICDE, 2009. 2.2.6, 5.5.1 [Amd67] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS, 1967. 1.2, 3.3.3 [And90] Thomas E. Anderson. The performance of spin lock alternatives for sharedmemory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1(1), 1990. 4.3.1 [AON96] Kiran J. Achyutuni, Edward Omiecinski, and Shamkant B. Navathe. Two techniques for on-line index modification in shared nothing parallel databases. In SIGMOD, 1996. 7.8.3 [AVDBF+ 92] Peter Apers, Care Van Den Berg, Jan Flokstra, Paul Grefen, Martin Kersten, and Annita Wilschut. PRISMA/DB: A parallel main memory relational DBMS. IEEE TKDE, 4, 1992. 3.1 [BAC+ 90] Haran Boral, William Alexander, Larry Clay, George P. Copeland, Scott Danforth, Michael J. Franklin, Brian E. Hart, Marc G. Smith, and Patrick Val- 194 BIBLIOGRAPHY duriez. Prototyping Bubba, a highly parallel database system. IEEE Transactions on Knowledge and Data Engineering, 2, 1990. 3.1 [BBD+ 09] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP, 2009. 7.8.4 [BDGR97] Edouard Bugnion, Scott Devine, Kinshuk Govil, and Mendel Rosenblum. DISCO: running commodity operating systems on scalable multiprocessors. ACM TOCS, 15(4), 1997. 6.6 [Bea11] Peter Beaumont. The truth about Twitter, Facebook and the uprisings in the Arab world. The Guardian, 2011. Available at http://www.guardian.co.uk/world/2011/feb/25/twitter-facebook-uprisingsarab-libya. 1.1 [BFGK05] Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Bradley C. Kuszmaul. Concurrent cache-oblivious B-trees. In SPAA, 2005. 7.8.2 [BG83] Philip A. Bernstein and Nathan Goodman. Multiversion concurrency control—theory and algorithms. ACM TODS, 8(4), 1983. 2.2.6, 6.6, 7.8.2 [BGB98] Luiz André Barroso, Kourosh Gharachorloo, and Edouard Bugnion. Memory system characterization of commercial workloads. In ISCA, 1998. 1.2 [BGM+ 00] Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: a scalable architecture based on single-chip multiprocessing. In ISCA, 2000. 1.2 [BGMP79] Mike Blasgen, Jim Gray, Mike Mitoma, and Tom Price. The convoy phenomenon. SIGOPS Oper. Syst. Rev., 13(2), 1979. 4.3.1, 6.4.7 [BH07] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40, 2007. 8.1.2 [BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency control and recovery in database systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1987. 6.6, 7.8.2 [BJB09] Luc Bouganim, Bjön Jónsson, and Philippe Bonnet. uFLIP: Understanding flash IO patterns. In CIDR, 2009. 2.3.1 BIBLIOGRAPHY 195 [BJK+ 97] William Bridge, Ashok Joshi, M. Keihl, Tirthankar Lahiri, Juan Loaiza, and N. MacNaughton. The Oracle universal server buffer. In VLDB, 1997. 2.2.6, 5.5.1 [BM70] Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indices. In SIGFIDET, 1970. 1.7, 2.2.3, 7.4.1 [Bre00] Eric A. Brewer. 
Towards robust distributed systems (abstract). In PODC, 2000. 7.3 [BW04] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In IEEE MICRO, 2004. 6.3.3, 7.8.4 [BWCM+ 10] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An analysis of Linux scalability to many cores. In OSDI, 2010. 1.2 [BZN05] Peter Boncz, Marcin Zukowski, and Niels Nes. Monetdb/X100: Hyperpipelining query execution. In VLDB, 2005. 1.4, 6.5 [CAA+ 10] Shimin Chen, Anastasia Ailamaki, Manos Athanassoulis, Phillip B. Gibbons, Ryan Johnson, Ippokratis Pandis, and Radu Stoica. TPC-E vs. TPC-C: Characterizing the new TPC-E benchmark via an I/O comparison study. SIGMOD Record, 39, 2010. 9, 2.4.4 [CASM05] Christopher B. Colohan, Anastasia Ailamaki, J. Gregory Steffan, and Todd C. Mowry. Optimistic intra-transaction parallelism on chip multiprocessors. In VLDB, 2005. 6.6 [CCE+ 09] Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Hakan Zeffer, and Marc Tremblay. Rock: A highperformance sparc cmt processor. IEEE Micro, 29(2), 2009. 8.1.1 [CDF+ 94] Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffrey F. Naughton, Daniel T. Schuh, Marvin H. Solomon, C. K. Tan, Odysseas G. Tsatalos, Seth J. White, and Michael J. Zwilling. Shoring up persistent applications. In SIGMOD, 1994. 3.1, 3.3.1, 4.4.3, 1, 5.2, 6.3.4, 6.6 [CDG+ 06] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, BIBLIOGRAPHY 196 2006. 1.4, 2.4 [CGMV02] Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, and Gary Valentin. Fractal prefetching B+-Trees: optimizing both cache and disk performance. In SIGMOD, 2002. 7.4.5, 7.8.2 [Che09] Shimin Chen. FlashLogging: exploiting flash devices for synchronous logging performance. In SIGMOD, 2009. 2.3.1, 5.4, 7.5.5, 7.8.1 [CJZM10] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. Schism: a workload-driven approach to database replication and partitioning. PVLDB, 3, 2010. 1.7, 6.5, 7.1, 7.3, 7.5.1, 7.8.1, 7.8.3 [Cra93] Travis S. Craig. Building FIFO and priority-queueing spin locks from atomic swap. Technical Report TR 93-02-02, University of Washington, Department of Computer Science, 1993. 4.3.1 [DAAEA09] Sudipto Das, Shyam Antony, Divyakant Agrawal, and Amr El Abbadi. Thread cooperation in multicore architectures for frequency counting over multiple data streams. PVLDB, 2, 2009. 6.6, 7.8.1 [DG92] David J. DeWitt and Jim Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35, 1992. 3.1 [DGS+ 90] David J. Dewitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-i Hsiao, and Rick Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering - TKDE, 2 (1):44–62, 1990. 3.1, 6.1, 7.1, 7.3, 7.5.1, 7.8.1 [DHJ+ 07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6), 2007. 1.4, 2.4, 6.1, 6.6 [DIR00] Donko Donjerkovic, Yannis E. Ioannidis, and Raghu Ramakrishnan. Dynamic histograms: Capturing evolving data sets. In ICDE, page 86, 2000. 7.8.3 [DKO+ 84] David J. DeWitt, Randy H. 
Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. Implementation techniques for main memory database systems. In SIGMOD, 1984. 2.3.1, 5.5.2 [DLMN09] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a commercial hardware transactional memory implementation. In ASP- BIBLIOGRAPHY 197 LOS, 2009. 2 [DLO05] John D. Davis, James Laudon, and Kunle Olukotun. Maximizing CMP throughput with mediocre cores. In PACT, 2005. 1.2, 3.3.1 [FNPS79] Ronald Fagin, Jurg Nievergelt, Nicholas Pippenger, and H. Raymond Strong. Extendible hashing–a fast access method for dynamic files. ACM TODS, 4, 1979. 2.2.3 [FR04] Mikhail Fomitchev and Eric Ruppert. Lock-free linked lists and skip lists. In PODC, 2004. 4.3.2 [GHOS96] Jim Gray, Pat Helland, Patrick O’Neil, and Dennis Shasha. The dangers of replication and a solution. In SIGMOD, 1996. 5.5.2 [GL92] Vibby Gottemukkala and Tobin J. Lehman. Locking and latching in a memoryresident database system. In VLDB, 1992. 3.2 [GMP02] Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. ACM TODS, 27, 2002. 7.8.3 [GMS87] Hector Garcia-Molina and Kenneth Salem. Sagas. SIGMOD Rec., 16(3), 1987. 6.6 [GR92] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992. 1.1, 1.4, 2.1, 1, 5.2, 5.3, 6.1.2, 6.3.1, 7.3 [Gra90] Goetz Graefe. Encapsulation of parallelism in the Volcano query processing system. In SIGMOD, 1990. 4.2.2 [Gra03] Goetz Graefe. Sorting and indexing with partitioned B-trees. In CIDR, 2003. 7.8.2 [Gra07a] Goetz Graefe. Hierarchical locking in B-tree indexes. In BTW, 2007. 6.3.1, 6.6 [Gra07b] Jim Gray. Tape is dead, disk is tape, flash is disk, RAM locality is king. In CIDR, 2007. 3.1 [GSHS09] Colleen Graham, Bhavish Sood, Hideaki Horiuchi, and Dan Sommer. Market share: Database management system software, worldwide, 2009. See http://www.gartner.com/DisplayDocument?id=1044912. 1.1 [H+ 10] Jason Howard et al. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS. In IEEE ISSCC, 2010. 7.8.4 198 BIBLIOGRAPHY [Ham08] James R. Hamilton. Where does the power go and what to do about it? In HotPower, 2008. 8.1.2 [HAMS08] Stavros Harizopoulos, Daniel J. Abadi, Sam Madden, and Michael Stonebraker. OLTP through the looking glass, and what we found there. In SIGMOD, 2008. 1.3, 5.3, 5.3.2, 5.5.2, 6.6, 7.1, 7.8.1 [Hel07] Pat Helland. Life beyond distributed transactions: an apostate’s opinion. In CIDR, 2007. 1.7, 5.3, 6.6, 7.1, 7.3 [Her91] Maurice Herlihy. Wait-free synchronization. ACM Trans. Program. Lang. Syst., 13(1), 1991. 4.3.2, 8.1.1 [HFFA09] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: near-optimal block placement and replication in distributed caches. In ISCA, 2009. 6.3.3, 7.8.4 [HM93] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. SIGARCH Comput. Archit. News, 21 (2), 1993. 3.2, 4.3.2 [HM08] Mark D. Hill and Michael R. Marty. Amdahl’s law in the multicore era. Computer, 41, 2008. 4.2 [HP02] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. 1.2 [HPJ+ 07] Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastasia Ailamaki, and Babak Falsafi. Database servers on chip multiprocessors: Limitations and opportunities. In CIDR, 2007. 
3.1, 7.8.4 [HQW+ 10] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ISCA, 2010. 8.1.2 [HSA05] Stavros Harizopoulos, Vladislav Shkapenyuk, and Anastasia Ailamaki. QPipe: a simultaneously pipelined relational query engine. In SIGMOD, 2005. 6.6 [HSH07] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. Architecture of a database system. Foundations and Trends (R) in Databases, 1(2), 2007. 1 BIBLIOGRAPHY 199 [HSIS05] Bijun He, William N. Scherer III, and Michael L. Scott. Preemption adaptivity in time-published queue-based spin locks. In HiPC, 2005. 4.3.1, 6.2 [HSL+ 89] Pat Helland, Harald Sammer, Jim Lyon, Richard Carr, Phil Garrett, and Andreas Reuter. Group commit timers and high volume transaction systems. In HPTS, 1989. 2.3.1 [HSY04] Danny Hendler, Nir Shavit, and Lena Yerushalmi. A scalable lock-free stack algorithm. In SPAA, 2004. 5.4.1 [IBM11] IBM. IBM DB2 9.5 information center Linux, UNIX, and Windows, 2011. Available http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp. 5.5.1 for at [Int12] Intel. Intel solid-state drive 520 series: Product specification, 2012. Available at http://www.intel.com/content/www/us/en/solid-statedrives/ssd-520-specification.html. 2.3.4, 7 [JAM10] Evan Jones, Daniel J. Abadi, and Samuel Madden. Low overhead concurrency control for partitioned main memory databases. In SIGMOD, 2010. 1.7, 6.6, 7.4.1, 7.8.1 [JASA09] Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. A new look at the roles of spinning and blocking. In DaMoN, 2009. 6.2 [JFRS07] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. Automating the detection of snapshot isolation anomalies. In VLDB, 2007. 2.2.6, 5.3, 5.5.1 [JN07] Tim Johnson and Umesh Nawathe. An 8-core, 64-thread, 64-bit power efficient SPARC SoC (Niagara2). In ISPD, 2007. 1.3, 5.3.2, 6.1.1, 6.4.1 [Jos91] Ashok M. Joshi. Adaptive locking strategies in a multi-node data sharing environment. In VLDB, 1991. 5.5.1, 6.6 [JPA08] Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. Critical sections: Re-emerging scalability concerns for database storage engines. In DaMoN, 2008. 1.3, 3.2, 1 [JPA09] Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. Improving OLTP scalability using speculative lock inheritance. PVLDB, 2(1), 2009. 1, 6.6, 7.2, 7.7.1, 7.8.1 200 BIBLIOGRAPHY [JPH+ 09] Ryan Johnson, Ippokratis Pandis, Nikos Hardavellas, Anastasia Ailamaki, and Babak Falsafi. Shore-MT: a scalable storage manager for the multicore era. In EDBT, 2009. 1.3, 1.9, 3, 2.3.4, 1, 4.2, 4.4.3, 1, 5.1, 6.1, 6.1.1, 6.2, 6.3.4, 7.1, 7.2, 7.7.1, 7.8.1, 7.8.4 [JPS+ 10] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. Aether: a scalable approach to logging. PVLDB, 3, 2010. 1.4, 2.3.1, 4.2, 1, 5.4, 10, 5.4.2, 5.17, 7.2, 7.3, 7.7.1, 7.8.1 [JPS+ 11] Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, and Anastasia Ailamaki. Scalability of write-ahead logging on multicore and multisocket hardware. The VLDB Journal, 20, 2011. 2.3.1, 1, 10, 11, 7.3 [JSAM10] Ryan Johnson, Radu Stoica, Anastasia Ailamaki, and Todd C. Mowry. Decoupling contention management from scheduling. SIGPLAN Not., 45(3), 2010. 4.3.1 [JSSS06] Ibrahim Jaluta, Seppo Sippu, and Eljas Soisalon-Soininen. B-tree concurrency control and recovery in page-server database systems. ACM TODS, 31:82–132, 2006. 
7.4.5, 7.8.2 [KAO05] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE MICRO, 25(2), 2005. 3.1, 3.3.1, 4.3.3 [KCK+ 00] Jong Min Kim, Jongmoo Choi, Jesung Kim, Sam H. Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. In OSDI, 2000. 2.2.5 [Kel11] Kate Kelly. How twitter is transforming trading in commodities, 2011. Available at http://www.cnbc.com/id/41948275. 1.1 [KN11] Alfons Kemper and Thomas Neumann. HyPer – a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, 2011. 7.1, 7.1.1 [KR81] H. T. Kung and John T. Robinson. On optimistic methods for concurrency control. ACM TODS, 6, 1981. 6.6, 7.8.2, 8.1.1 [KSSF10] Ron Kalla, Balaram Sinharoy, William J. Starke, and Michael Floyd. Power7: IBM’s next-generation server processor. IEEE MICRO, 30(2), 2010. 1.3 BIBLIOGRAPHY 201 [KSUH93] Orran Krieger, Michael Stumm, Ron Unrau, and Jonathan Hanna. A fair fast scalable reader-writer lock. In ICPP, 1993. 4.3.1, 4.3.3 [LARS92] Dave Lomet, Rick Anderson, T. K. Rengarajan, and Peter Spiro. How the Rdb/VMS data sharing system became fast. Technical Report CRL-92-4, DEC, 1992. 7.3 [LBD+ 12] Per-Ake Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling. High-performance concurrency control mechanisms for main-memory databases. In VLDB, 2012. 6.1, 6.6, 7.8.2 [Lit80] Witold Litwin. Linear hashing: A new tool for file and table addressing. In VLDB, 1980. 2.2.3 [LKO+ 00] Mong-Li Lee, Masaru Kitsuregawa, Beng Chin Ooi, Kian-Lee Tan, and Anirban Mondal. Towards self-tuning data placement in parallel database systems. In SIGMOD, 2000. 7.8.3 [LMP+ 08] Sang-Won Lee, Bongki Moon, Chanik Park, Jae-Myung Kim, and Sang-Woo Kim. A case for flash memory SSD in enterprise database applications. In SIGMOD, 2008. 2.3.1, 5.4, 8 [Lof96] Geoffrey R. Loftus. Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5(6), 1996. 1.1 [LSC+ 01] Tirthankar Lahiri, Vinay Srihari, Wilson Chan, N. MacNaughton, and Sashikanth Chandrasekaran. Cache fusion: Extending shared-disk clusters with shared caches. In VLDB, 2001. 6.6, 7.8.1 [LSD+ 07] Sam Lightstone, Maheswaran Surendra, Yixin Diao, Sujay S. Parekh, Joseph L. Hellerstein, Kevin Rose, Adam J. Storm, and Christian GarciaArellano. Control theory: a foundational technique for self managing databases. In ICDE Workshops, 2007. 7.6.3 [LW92] Monica S. Lam and Robert P. Wilson. Limits of control flow on parallelism. SIGARCH Comput. Archit. News, 20(2), 1992. 1.2 [Mal11] Eric Malinowski. Hoops 2.0: Inside the NBA’s data-driven revolution. Wired, 4, 2011. Available at http://www.wired.com/playbook/2011/04/nba-datarevolution/. 1.1 202 BIBLIOGRAPHY [MCS91a] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1), 1991. 4.3.1 [MCS91b] John M. Mellor-Crummey and Michael L. Scott. Scalable reader-writer synchronization for shared-memory multiprocessors. SIGPLAN Not., 26(7), 1991. 4.3.1 [MDO94] Ann Marie Grizzaffi Maynard, Colette M. Donnelly, and Bret R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. SIGPLAN Not., 29(11), 1994. 1.2 [MHL+ 92] C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 
ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM TODS, 17(1), 1992. 2.2.2, 2.3.1, 5, 5.2, 5.4, 5.4.1, 5.5.2, 6.3.4 [Mic02] Maged M. Michael. High performance dynamic lock-free hash tables and listbased sets. In SPAA, 2002. 4.3.2 [MKOT01] Anirban Mondal, Masaru Kitsuregawa, Beng Chin Ooi, and Kian-Lee Tan. Rtree-based data migration and self-tuning strategies in shared-nothing spatial databases. In GIS, 2001. 7.8.3 [ML92] C. Mohan and Frank Levine. ARIES/IM: an efficient and high concurrency index management method using write-ahead logging. In SIGMOD, 1992. 5.5.2, 7.4.5, 7.8.2 [MLH94] Peter S. Magnusson, Anders Landin, and Erik Hagersten. Queue locks on cache coherent multiprocessors. In ISPP, 1994. 4.3.1 [MM03] Nimrod Megiddo and Dharmendra S. Modha. ARC: A self-tuning, low overhead replacement cache. In FAST, 2003. 2.2.5 [MNSS05] Mark Moir, Daniel Nussbaum, Ori Shalev, and Nir Shavit. Using elimination to implement scalable and lock-free FIFO queues. In SPAA, 2005. 1.5, 4.2.2, 5.4.1 [Moh90] C. Mohan. ARIES/KVL: a key-value locking method for concurrency control of multiaction transactions operating on B-tree indexes. In VLDB, 1990. 5.2, 5.5.2, 7.4.5 BIBLIOGRAPHY 203 [Moo65] Gordon Moore. Cramming more components onto integrated circuits. Electronics, 38(6), 1965. 1.2 [MOPW00] Peter Muth, Patrick O’Neil, Achim Pick, and Gerhard Weikum. The LHAM log-structured history data access method. The VLDB Journal, 8, 2000. 7.8.2 [NWMR09] Simo Neuvonen, Antoni Wolski, Markku Manner, and Vilho Raatikka. Telecom application transaction processing benchmark (TATP), 2009. See http://tatpbenchmark.sourceforge.net/. 2.4.5, 6.1.1, 6.1.1, 6.4.1, 7.5.1, 7.7.4 [ONH+ 96] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang. The case for a single-chip multiprocessor. In ASPLOS-VII, 1996. 1.2 [Ora05] Oracle. Asynchronous commit: Oracle database advanced application developer’s guide, 2005. Available at http://download.oracle.com/docs/cd/B19306 01/appdev.102/b14251/ adfns sqlproc.htm. 2.3.1 [PJHA10] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and Anastasia Ailamaki. Data-oriented transaction execution. PVLDB, 3(1), 2010. 1.3, 2.3.4, 1, 7.1, 7.2, 7.7.1, 7.8.1 [PJZ11] Andrew Pavlo, Evan P. C. Jones, and Stanley Zdonik. On predictive modeling for optimizing transaction execution in parallel oltp systems. PVLDB, 5(2), 2011. 1.7, 6.5, 7.8.1, 7.8.3 [Pos10] PostgreSQL. PostgreSQL 9.0.3 documentation: Asynchronous commit, 2010. Available at http://www.postgresql.org/docs/9.0/static/wal-asynccommit.html. 2.3.1 [Pos11] PostgreSQL. PostgreSQL archives: literature on write-ahead logging, 2011. Available at http://archives.postgresql.org/pgsql-hackers/201106/msg00701.php. 12 [PR01] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In ESA, 2001. 4.4.3 [PTB+ 11] Ippokratis Pandis, Pinar Tözün, Miguel Branco, Dimitris Karampinas, Danica Porobic, Ryan Johnson, and Anastasia Ailamaki. A data-oriented transaction execution engine and supporting tools. In SIGMOD, 2011. 6, 6.3.2, 6.5, 6.5 204 BIBLIOGRAPHY [PTJA11] Ippokratis Pandis, Pinar Tözün, Ryan Johnson, and Anastasia Ailamaki. PLP: page latch-free shared-everything OLTP. PVLDB, 4(10), 2011. 1.3, 2.3.4, 1, 6.3.2, 1 [Raw10] Mazen Rawashdeh. eBay - how one fast growing company is solving its infrastructure and data center challenges, 2010. Keynote at Gartner Data Center Conference. 1.1 [RD89] Abbas Rafii and Donald DuBois. 
Performance tradeoffs of group commit logging. In CMG Conference, 1989. 2.3.1, 2.3.1 [RD01] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, pages 329–350, 2001. 4.2.2 [RG01] Ravi Rajwar and James R. Goodman. Speculative lock elision: enabling highly concurrent multithreaded execution. In IEEE MICRO, 2001. 4.3.4, 8.1.1 [RG02] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based programs. In ASPLOS-X, 2002. 4.3.2 [RG03] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. McGraw-Hill, Inc., New York, NY, USA, 2003. 2.1, 1 [RGAB98] Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, and Luiz André Barroso. Performance of database workloads on shared-memory systems with out-of-order processors. In ASPLOS-VIII, 1998. 1.2 [RK79] David P. Reed and Rajendra K. Kanodia. Synchronization with eventcounts and sequencers. Commun. ACM, 22(2), 1979. 4.3.1 [Rob85] John T. Robinson. A fast general-purpose hardware synchronization mechanism. In SIGMOD, 1985. 3.2 [RR99] Jun Rao and Kenneth A. Ross. Cache conscious indexing for decision-support in main memory. In VLDB, 1999. 7.4.5, 7.8.2 [RR00] Jun Rao and Kenneth A. Ross. Making B+-trees cache conscious in main memory. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000. 7.4.5, 7.8.2 [RS84] Larry Rudolph and Zary Segall. Dynamic decentralized cache schemes for mimd parallel processors. In ISCA, 1984. 3.3.2, 4.3.1 BIBLIOGRAPHY 205 [RSV87] Umakishore Ramachandran, Marvin Solomon, and Mary Vernon. Hardware support for interprocess communication. In ISCA, 1987. 8.1.1 [RZML02] Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy Lohman. Automating physical database design in a parallel database. In SIGMOD, 2002. 7.5.1, 7.8.3 [SAB+ 05] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-store: a column-oriented DBMS. In VLDB, 2005. 1.4, 6.5 [SCK+ 11] Jason Sewall, Jatin Chhugani, Changkyu Kim, Nadathur Satish, and Pradeep Dubey. PALM: Parallel architecture-friendly latch-free modifications to b+trees on many-core processors. PVLDB, 4(11), 2011. 7.8.2 [SKPO88] Michael Stonebraker, Randy H. Katz, David A. Patterson, and John K. Ousterhout. The design of XPRS. In VLDB, 1988. 3.1 [SLSV95] Dennis Shasha, Francois Llirbat, Eric Simon, and Patrick Valduriez. Transaction chopping: algorithms and performance studies. ACM TODS, 20, 1995. 6.6 [SMA+ 07] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era: (it’s time for a complete rewrite). In VLDB, 2007. 1.4, 1.7, 3.1, 5.4.2, 5.5.2, 6.1, 6.6, 7.1, 7.1.1, 7.3, 7.5.3, 7.8.1 [Smi78] Alan Jay Smith. Sequentiality and prefetching in database systems. ACM TODS, 3, 1978. 2.2.5, 5.2 [SMK+ 01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, pages 149–160, 2001. 4.2.2 [SR86] Michael Stonebraker and Lawrence A. Rowe. The design of POSTGRES. SIGMOD Rec., 15(2), 1986. 3.1, 3.3.1 [SSY95] Eljas Soisalon-Soininen and Tatu Ylönen. Partial strictness in two-phase locking. In ICDT, 1995. 2.3.1 [ST95] Nir Shavit and Dan Touitou. Software transactional memory. In PODC, 1995. 
4.3.2 BIBLIOGRAPHY 206 [ST97] Nir Shavit and Dan Touitou. Elimination trees and the construction of pools and stacks. Theory of Computing Systems, Special Issue, 30, 1997. 5.4.1, 5.4.1 [STH+ 10] Jinuk Luke Shin, Kenway Tam, Dawei Huang, Bruce Petrick, Ha Pham, Changku Hwang, Hongping Li, Alan Smith, Timothy Johnson, Francis Schumacher, David Greenhill, Ana Sonia Leon, and Allan Strong. A 40nm 16-core 128-thread CMT SPARC SoC processor. In IEEE ISSCC, 2010. 1.3 [Sto86] Michael Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9, 1986. 1.7, 7.1, 7.3, 7.8.1 [SWAF09] Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak Falsafi. Spatio-temporal memory streaming. In ISCA, 2009. 6.3.3, 7.8.4 [SWH+ 04] Stephen Somogyi, Thomas F. Wenisch, Nikolaos Hardavellas, Jangwoo Kim, Anastasia Ailamaki, and Babak Falsafi. Memory coherence activity prediction in commercial workloads. In WMPI, 2004. 1.3, 6.3.3, 7.1, 7.8.4 [TA10] Alexander Thomson and Daniel J. Abadi. database systems. PVLDB, 3, 2010. 5.5.2 The case for determinism in [Tho98] Alexander Thomasian. Concurrency control: methods, performance, and analysis. ACM Comput. Surv., 30, 1998. 6.6, 7.8.2 [THS10] Dimitris Tsirogiannis, Stavros Harizopoulos, and Mehul A. Shah. Analyzing the energy efficiency of a database server. In SIGMOD, 2010. 8.1.2 [TPC94] TPC. TPC benchmark B standard specification, revision 2.0, 1994. Available at http://www.tpc.org/tpcb. 2.4.2, 6.2, 6.4.1, 7.7.3 [TPC06] TPC. TPC benchmark H (decision support) standard specification, revision 2.6.0, 2006. Available at http://www.tpc.org/tpch. 6.4.1 [TPC07] TPC. TPC benchmark C (OLTP) standard specification, revision 5.9, 2007. Available at http://www.tpc.org/tpcc. 2.4.3, 6.1.1, 6.1.1, 6.3.1, 6.4.1, 6.5, 7.7.4, 7.7.6 [TPC10] TPC. TPC benchmark E standard specification, revision 1.12.0, 2010. Available at http://www.tpc.org/tpce. 2.4.4, 6.5 [TPJA11] Pinar Tözün, Ippokratis Pandis, Ryan Johnson, and Anastasia Ailamaki. Scalable and dynamically balanced shared-everything OLTP with physiological partitioning. Technical Report EPFL-REPORT-170525, EPFL, 2011. 6.3.2, 1 BIBLIOGRAPHY 207 [TSJ+ 10] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, 2010. 1.1 [V+ 07] Sriram Vangal et al. An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In IEEE ISSCC, 2007. 7.8.4 [Vog09] Werner Vogels. Eventually consistent. Commun. ACM, 52, 2009. 1.4, 2.4, 5.3 [Vog12] Werner Vogels. Amazon DynamoDB - a fast and scalable NoSQL database service designed for internet scale applications, 2012. See http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html. 1.4 [Wei99] Mark Weiser. The computer for the 21st century. SIGMOBILE Mob. Comput. Commun. Rev., 3, 1999. 1.1 [WM11] Eugene Wu and Samuel Madden. Partitioning techniques for fine-grained indexing. In ICDE, 2011. 7.8.3 [WOT+ 95] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA, 1995. 4.2.2 [ZL11] Paul Zubulake and Sang Lee. The High Frequency Game Changer: How Automated Trading Strategies Have Revolutionized the Markets. Wiley Trading. John Wiley & Sons, 2011. 1.1