DBMS on a modern processor: where does time go?
Transcription
DBMS on a modern processor: where does time go?
Anastassia Ailamaki, David DeWitt, Mark Hill and David Wood
University of Wisconsin-Madison
Presented by: Bogdan Simion

Current DBMS Performance
• Where is query execution time spent?
• Identify performance bottlenecks in the CPU and memory

Outline
• Motivation
• Background
• Query execution time breakdown
• Experimental results and discussion
• Conclusions

Hardware performance standards
• Processors are designed and evaluated with simple programs
• Benchmarks: SPEC, LINPACK
• What about DBMSs?

DBMS bottlenecks
• Initially, the bottleneck was I/O
• Nowadays, DBMSs are memory- and compute-intensive applications
• Modern platforms offer:
  – sophisticated execution hardware
  – fast, non-blocking caches and memory
• Still, DBMS hardware behaviour is suboptimal compared to scientific workloads

Execution pipeline
[Pipeline diagram: fetch/decode unit → instruction pool → dispatch/execute unit → retire unit, backed by the L1 I-cache, L1 D-cache, L2 cache, and main memory]
• Stalls can be overlapped with useful work!

Execution time breakdown
TQ = TC + TM + TB + TR - TOVL
• TC: computation
• TM: memory stalls (L1D, L1I, L2D, L2I, DTLB, ITLB)
• TB: branch mispredictions
• TR: stalls on execution resources (functional units, dependency stalls)
• TOVL: stall time overlapped with useful work

DB setup
• Database is memory-resident => no I/O interference
• No dynamic or random parameters; no concurrency control among transactions

Workload choice
• Simple queries:
  – single-table range selections (sequential, indexed)
  – two-table equijoins
• Easy to set up and run
• Fully controllable parameters
• Isolate basic operations
• Enable iterative hypotheses!
• Building blocks for complex workloads?
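As a rough illustration of the breakdown formula above, here is a minimal sketch that turns the five terms into per-component shares of total query time. The counter values in the example are made up, not measurements from the paper; in practice the terms would come from hardware performance counters:

```python
# Sketch of the query execution time breakdown:
#   T_Q = T_C + T_M + T_B + T_R - T_OVL
# T_OVL is stall time overlapped with useful work, so it is
# subtracted from the total but not from any single component.

def breakdown(t_c, t_m, t_b, t_r, t_ovl):
    """Return each component's share of total query time T_Q, in percent.
    Because T_OVL shrinks only the denominator, the shares can sum
    to more than 100% -- stalls overlap with computation."""
    t_q = t_c + t_m + t_b + t_r - t_ovl
    return {
        "computation": 100.0 * t_c / t_q,
        "memory":      100.0 * t_m / t_q,
        "branch":      100.0 * t_b / t_q,
        "resource":    100.0 * t_r / t_q,
    }

# Hypothetical cycle counts, chosen only to make the arithmetic obvious.
shares = breakdown(t_c=40, t_m=50, t_b=5, t_r=15, t_ovl=10)
```

With these made-up numbers, memory stalls account for 50% of query time, mirroring the paper's observation that stalls dominate.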
Execution Time Breakdown (%)
[Charts: query execution time (%) for the 10% sequential scan, 10% indexed range selection, and join (no index), on DBMSs A–D; bars split into computation, memory, branch mispredictions, and resource stalls]
• Stalls take at least 50% of execution time
• Memory stalls are the major bottleneck

Memory Stalls Breakdown (%)
[Charts: memory stall time (%) for the same three queries on DBMSs A–D; bars split into L1 data, L1 instruction, L2 data, and L2 instruction stalls]
• The roles of the L1 data cache and L2 instruction cache are unimportant
• L2 data and L1 instruction stalls dominate
• Memory bottlenecks vary across DBMSs and queries

Effect of Record Size
[Charts: L2 data misses and L1 instruction misses per record for the 10% sequential scan, at record sizes of 20, 48, 100, and 200 bytes, on systems A–D]
• L2D misses increase with record size: reduced locality + page crossing (except on D)
• L1I misses increase with record size: page-boundary crossing costs

Memory Bottlenecks
• Memory is important:
  – increasing memory–processor performance gap
  – deeper memory hierarchies expected
• Stalls due to L2 cache data misses:
  – expensive fetches from main memory
  – L2 grows (8 MB), but will be slower
• Stalls due to L1 I-cache misses:
  – buffer pool code is expensive
  – the L1 I-cache is not likely to grow as much as L2

Branch Mispredictions Are Expensive
[Charts: branch misprediction rates and their contribution to query execution time (%) for sequential scan, index scan, and join (no index), on DBMSs A–D]
• Rates are low, but their contribution to execution time is significant
• Prediction is a compiler task, but decisive for L1I performance

Mispredictions vs. L1-I Misses
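The record-size effect admits a simple back-of-the-envelope model: a cold sequential scan with no reuse touches roughly one cache line per `record_size / line_size` bytes, and a record straddles a page boundary with probability about `(record_size - 1) / page_size` when records are packed contiguously. A minimal sketch of that model (the 32-byte line and 8 KB page are assumed defaults, not the measured systems' parameters):

```python
def l2d_misses_per_record(record_size, line_size=32):
    """Rough lower bound on L2 data misses per record for a cold
    sequential scan: one miss per cache line the record spans,
    amortized over contiguously packed records."""
    return record_size / line_size

def page_crossing_fraction(record_size, page_size=8192):
    """Approximate fraction of records straddling a page boundary,
    assuming contiguous packing and record_size <= page_size.
    Page crossings add the extra misses noted on the slide."""
    return (record_size - 1) / page_size
```

For example, growing records from 20 to 200 bytes with 32-byte lines raises the modeled lower bound from under one to over six L2D misses per record, consistent with the upward trend in the charts.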
[Charts: branch mispredictions and L1 I-cache misses per 1000 instructions for the 10% sequential scan, 10% indexed range selection, and join (no index), on DBMSs A–D]
• More branch mispredictions incur more L1I misses
• Index code is more complicated and needs optimization

Resource-related Stalls
[Charts: dependency-related stalls (TDEP) and functional-unit-related stalls (TFU) as % of query execution time for sequential scan, index scan, and join (no index), on DBMSs A–D]
• High TDEP on all systems: low ILP opportunity
• A's sequential scan: memory-unit load buffers?

Microbenchmarks vs. TPC: CPI Breakdown
[Charts: clock ticks per instruction on systems B and D, comparing sequential scan with TPC-D and 2ary index with TPC-C; bars split into computation, memory, branch misprediction, and resource stalls]
• The sequential scan breakdown is similar to TPC-D
• 2ary index and TPC-C: higher CPI, dominated by memory stalls (mostly L2 data and instruction)

Conclusions
• The execution time breakdown shows clear trends
• L1I and L2D are the major memory bottlenecks
• We need to:
  – reduce page-crossing costs
  – optimize the instruction stream
  – optimize data placement for the L2 cache
  – reduce stalls at all levels
• TPC may not be necessary to locate bottlenecks

Five years later – Becker et al. 2004
• Same DBMSs, setup and workloads (memory-resident), and same metrics
• Outcome: stalls still take a lot of time
  – sequential scans: L1I stalls and branch mispredictions much lower
  – index scans: no improvement
  – joins: improvements, similar to sequential scans
  – bottleneck shifts to L2D misses => must improve data placement
  – what works well on some hardware doesn't on other hardware

Five years later – Becker et al. 2004
• C on a quad P3 at 700 MHz, 4 GB RAM, 16 KB L1, 2 MB L2
• B on a single P4 at 3 GHz, 1 GB RAM, 8 KB L1D + 12K-uop trace cache, 512 KB L2, BTB 8x larger than the P3's
• P3 results:
  – similar to 5 years ago: the major bottlenecks are L1I and L2D
• P4 results:
  – memory stalls almost entirely due to L1D and L2D stalls
  – L1D stalls higher: smaller cache and larger cache line
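The CPI comparison above is just total clock ticks divided by retired instructions, with the cycles attributed to the same four components as the time breakdown. A minimal sketch with made-up counter values (not the paper's measurements):

```python
def cpi_components(cycles_by_component, instructions):
    """Split CPI (cycles per instruction) into per-component
    contributions; the parts sum to the overall CPI, so stacking
    them reproduces the bar charts' structure."""
    parts = {name: cycles / instructions
             for name, cycles in cycles_by_component.items()}
    overall = sum(cycles_by_component.values()) / instructions
    return parts, overall

# Hypothetical counters for one memory-bound query.
parts, cpi = cpi_components(
    {"computation": 800, "memory": 1500, "branch": 100, "resource": 200},
    instructions=1000,
)
```

Here a CPI above 2 with memory contributing more than half of it matches the slide's point about the 2ary index and TPC-C workloads being memory-bound.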
  – L1I stalls removed by the trace cache (especially for sequential scan; some remain for index scans)
• Hardware awareness is important!

References
• DBMS on a modern processor: where does time go? Revisited. CMU Tech Report, 2004
• Anastassia Ailamaki, VLDB'99 talk slides

Questions?