The PicoSecond is Dead, Long Live the PicoJoule
Transcription
Slide 1: The PicoSecond is Dead, Long Live the PicoJoule
Christos Kozyrakis, Stanford University
http://mast.stanford.edu

Slide 2: Scaling Systems without Technology Help

Slide 3: Systems Scaling
• Higher performance
• More data
• Constant cost (power, $)

Slide 4: Scaling Past [Source: CPUDB]
• Transistors ↑ + frequency ↑ = performance ↑

Slide 5: Scaling Past
• Power ≈ C · Vdd² · f
• Transistors ↑ + frequency ↑
• Capacitance ↓ + voltage ↓ ⇒ power density roughly constant

Slide 6: Scaling Present [Source: CPUDB]
• Voltage scaling ✖
• Transistors ↑ ⇒ power density ↑, frequency flat

Slide 7: Scaling Present
• Power = C · Vdd² · f
• All chips are now power limited
• The switch to multi-core does not change this

Slide 8: Scaling without Technology Help
[Chart: performance/cost (log scale) vs. time, 80s through 40s; the region after technology scaling ends is labeled "Our Challenge". Source: Hill & Kozyrakis '12]

Slide 9: Scaling without Technology Help
• Power = Ops/sec × Energy/op (e.g., a 100 W budget at 10 pJ/op sustains 10^13 ops/s)
• Reduce energy/op through better HW
• Reduce the ops/sec needed through better SW
• Trade off performance vs. power

Slide 10: Better HW with Specialization
• Assume temporal locality for now: caches work

Slide 11: Energy Overheads
Operation        Energy    Scale
8-bit add        0.03 pJ   1×
32-bit add       0.10 pJ   3×
8-bit mult       0.20 pJ   6×
32-bit mult      3.00 pJ   100×
16-bit FP mult   1.00 pJ   30×
32-bit FP mult   4.00 pJ   133×
• Obvious thought: specialize data types

Slide 12: But Wait…
• Same table, plus: RISC instruction, >70.00 pJ (>2,300×)
• Instruction overheads dominate!

Slide 13: Energy Overheads
• Per instruction: I-cache 25 pJ, register file 6 pJ, control 20+ pJ, D-cache 25 pJ, plus the op itself
• First: amortize instruction and control overheads; avoid accesses to register files and data caches
• Then: specialize data types

Slide 14: SIMD & Vectors
• SIMD amortizes I-cache and control overheads across many ops
• 10x energy efficiency ⇒ must use
• Not enough on its own: must also amortize register and D-cache accesses

Slides 15–17 (build): Specialization for Convolutions
• Convolutional patterns: signal/photo/video processing, computer vision, …
• Map-reduce patterns over 1D/2D/3D stencils

Slide 18: Specialization for Convolutions
• Specialized unit
• Register files that support shifts ⇒ no data reloading
• Map-reduce SIMD instructions ⇒ amortized control, shared data
• Specialized data types ⇒ no waste

Slide 19: Specialization Opportunity
[Chart: performance and energy savings (log scale, 1–1000) for 4 cores, +ILP, +SIMD, +custom instructions, and an ASIC. Source: ISCA '10]
• >100x opportunity in performance & energy
• Close to ASIC efficiency

Slide 20: Specialization Challenge #1
• Efficiency vs. generality [Source: Xilinx]
• Solution #1: domain-specific accelerators
• Solution #2: automatic generation of accelerators

Slide 21: Domain-specific Accelerators
• Convolution engine [Source: ISCA '13]
• Generalized engine for convolution-like computations
• Wider storage, general ALUs, general reduction, …
• 10x better than SIMD, ~2x worse than a custom unit
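To make the pattern concrete, here is the 1D stencil map-reduce from slides 15–18 that the convolution engine accelerates, written as plain C. This is an illustrative kernel, not code from the talk: each output "maps" the input window against the stencil taps and "reduces" with a sum, which is exactly the structure a specialized unit can issue as one instruction with shifting register files instead of k multiply-accumulate instructions at ~70 pJ of overhead each.

```c
#include <stdio.h>
#include <stddef.h>

/* 1D convolution as map-reduce over a stencil: for each output,
 * "map" multiplies k neighboring inputs by the stencil taps and
 * "reduce" sums the products. A convolution engine keeps the
 * sliding window in shifting register files, so each input is
 * fetched from memory once instead of k times. */
static void conv1d(const short *in, size_t n,
                   const short *taps, size_t k, int *out)
{
    for (size_t i = 0; i + k <= n; i++) {
        int acc = 0;                         /* reduce: accumulator */
        for (size_t j = 0; j < k; j++)
            acc += in[i + j] * taps[j];      /* map: multiply       */
        out[i] = acc;
    }
}

int main(void)
{
    short in[] = {1, 2, 3, 4, 5, 6};
    short taps[] = {1, 0, -1};               /* simple edge filter  */
    int out[4];
    conv1d(in, 6, taps, 3, out);
    for (int i = 0; i < 4; i++)
        printf("%d ", out[i]);               /* prints: -2 -2 -2 -2 */
    printf("\n");
    return 0;
}
```

On a general-purpose core, the 8- or 16-bit multiplies in this loop cost well under 1 pJ each against the table above, so nearly all of the energy goes to instruction and data-supply overhead; that is the gap the specialized unit closes.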
Slide 22: Automatic Generation of Accelerators
[This slide reproduces material from the LINQits paper: Figure 7, the LINQits hardware template for partitioning, grouping, and hashing; Table 4, the area of the template configured for GroupBy, Join, Black-Scholes, and KeyCount; and Figures 12–13, power and energy of optimized C (1 and 2 ARM CPUs) vs. LINQits. Source: Chung '13]
• Recoverable results from the excerpt: on a ZYNQ SoC, LINQits achieves roughly 2–3 orders of magnitude better performance than single-threaded C#, 10.7x to 38.1x better performance than optimized multithreaded C on the ARM cores, and 8.9x to 30.6x energy reduction vs. multithreaded C, even though ~50% of total power goes to DRAM; the HAT uses multi-pass, in-place partitioning so hash tables fit in the ZYNQ's limited (~1 MB) on-die memory
• LINQits engine: configurable engine for LINQ map-reduce
• Ops: select, selectMany, where, join, groupBy, aggregate
• 10x performance & 10x energy efficiency
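Since the slide lists the operator vocabulary, a tiny example helps pin down what the groupBy + aggregate pair computes. The C kernel below is a hypothetical word-count-style illustration (names like group_count are mine, not LINQits code); LINQits maps this group/aggregate structure onto its hardware template after hash-partitioning the stream so each group fits on-chip.

```c
#include <stdio.h>

#define NKEYS 256   /* assume a small integer key space for illustration */

/* groupBy(key) followed by aggregate(count): the core pattern of
 * LINQ-style map-reduce queries. The accelerator first partitions
 * the input by key so each partition fits in on-die memory, then
 * runs the aggregation within each partition. */
static void group_count(const unsigned char *keys, size_t n,
                        unsigned counts[NKEYS])
{
    for (size_t i = 0; i < NKEYS; i++) counts[i] = 0;
    for (size_t i = 0; i < n; i++)
        counts[keys[i]]++;   /* group by key, aggregate by counting */
}

int main(void)
{
    unsigned char data[] = {3, 7, 3, 1, 7, 3};
    unsigned counts[NKEYS];
    group_count(data, sizeof data, counts);
    printf("key 3 -> %u\n", counts[3]);   /* prints: key 3 -> 3 */
    return 0;
}
```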
Slide 23: Specialization for I/O
• 40 Gbps NIC [Source: NVMW '14]
• Data-serving accelerators (KVS, noSQL, …)
• Common case in HW, complex cases in SW
• Example: 20M IOPS @ 4–50 µs (DRAM + SSD)
• 10x throughput & 10x energy efficiency of x86 servers

Slide 24: Domain-specific Specialization
[Chart: throughput (Mqps, cache hit) vs. key + value size (0–500 bytes) for the 40 Gbps line rate, HW flash KVS GET and SET, the best published HW, and an 8-core x86 Xeon. Source: NVMW '14]
• Achieved 20M qps
• Accelerator (DRAM): 20M IOPS, 3.8 µs, 170 Kqps/W
• Flash: 10M IOPS, 50 µs, 85 Kqps/W
• x86 server: 2M IOPS, 300 µs, 14 Kqps/W

Slide 25: Specialization Challenge #2
• ASICs: expensive and inflexible
• FPGAs: high overheads (bit-level configurability, I/O interface), up to 10x efficiency loss
• Solutions: coarse-grain FPGAs; 2.5D and 3D integration; CPU + FPGA integration
[Figure: a traditional SoC (CPUs, cache, xPUs) vs. a large chip vs. a multi-die SoC with an FPGA on an interposer]

Slide 26: Specialization Challenge #3
• What if there is limited temporal locality?
• Graphs, in-memory analytics, …

Slide 27: Energy Overheads
Operation          Energy        Scale
8-bit add          0.03 pJ       1×
32-bit add         0.10 pJ       3×
32-bit FP mult     4.00 pJ       133×
RISC instruction   70.00 pJ      2,300×
8 KB cache access  10.00 pJ      300×
1 MB cache access  100.00 pJ     3,000×
DRAM access        2,000.00 pJ   60,000×
• Memory overheads dominate!

Slide 28: Full-system Energy Analysis
[Chart: energy breakdown (core, LLC, memory, other) for histogram, word count, kmeans, pagerank, sssp, bc, spmv, mergesort; 6 Xeon cores, HT, 3 DDR3-1333 channels]
• ~50% of energy due to memory
• >50% of CPU energy due to idling

Slide 29: Full-system Runtime Analysis
[Chart: execution-time breakdown (instructions issued, memory stalls, other stalls) for the same benchmarks and platform]
• >50% of the time spent waiting for memory

Slide 30: Memory-side Computing
• ✖ Data moved to compute; ✔ compute moved to data
• Avoid interconnect overheads
• Move the computation function to the data
• Streaming computations
[Diagram: CPU flanked by memory controllers]

Slide 31: Memory-side Computing
[Chart, log scale 0.001–1: DDR base, HMC base, and MSC compared on delay, energy, and energy × delay]
• 10x performance & 10x energy improvements
• Compute capability scales with memory capacity
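The per-op numbers from slide 27 plug directly into the Power = Ops/sec × Energy/op identity from slide 9. The toy calculation below is my illustration, not the talk's: the assumed workload of one DRAM access per useful add approximates a low-locality, pointer-chasing computation like the graph workloads above, and shows both why memory sets the energy bill and roughly what a 10x cheaper access, as in memory-side computing, buys.

```c
#include <stdio.h>

/* Per-op energies from the talk's table, in picojoules. */
#define E_ADD32  0.10     /* 32-bit add                */
#define E_RISC   70.0     /* RISC instruction overhead */
#define E_DRAM   2000.0   /* off-chip DRAM access      */

int main(void)
{
    /* Hypothetical low-locality workload: every useful add drags in
     * one instruction's overhead and one DRAM access. */
    double pj_per_op = E_ADD32 + E_RISC + E_DRAM;
    printf("energy/op: %.1f pJ (%.1f%% of it DRAM)\n",
           pj_per_op, 100.0 * E_DRAM / pj_per_op);

    /* At a fixed power budget, ops/sec = power / energy-per-op. */
    double ops = 100.0 / (pj_per_op * 1e-12);
    printf("100 W sustains %.2e ops/s\n", ops);

    /* If memory-side compute made each access 10x cheaper ... */
    double pj_msc = E_ADD32 + E_RISC + E_DRAM / 10.0;
    printf("with 10x cheaper access: %.2e ops/s\n",
           100.0 / (pj_msc * 1e-12));
    return 0;
}
```

The arithmetic says the DRAM access is ~97% of the energy per operation, so cutting data-movement cost by 10x raises the sustainable operation rate by nearly 8x at the same power, consistent with the ~10x claim on slide 31.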
32" Technology Options 3D integration Buffer-on-board Stacking w/ edge-bonding 33" Better Software 34" Better Software Better algorithms Match SW to modern HW Reduce bloat Specialization Elasticity 35" Matching SW to Modern HW Core SIMD VT-x/d Core SIMD Core VT-x/d SIMD Core VT-x/d New? SIMD VT-x/d Shared Cache NIC Net Offload Crypto Zip New? PCIe DRAM Time for SW to exploit modern HW 36" Example: System Software Hardware is fast 10 GbE widely deployed >10 cores per server Large distributed apps should be fast Millions of QPS for small messages 10-20 us RTT 37" But Software is the Bottleneck 70" Mostly Unloaded 6" 60" Peak'Load' 5" 50" 4" Millions'QPS' HW"Limit" 40" 30" Linux" 20" 10" 0" 3" 2" 1" 0" 99th"PercenWle"Latency"(us)" Throughput"(QPS)" Example: Linux + memcached w/ 200 bytes values Single-socket SandyBridge + 10GbE NIC 38" Conventional Wisdom Bypass the kernel Move TCP to user-space Avoid protection domain crossings Replace TCP Offload to hardware (TOE) Use a different transport protocol (UDP, new) Replace Ethernet hardware Use a different fabric (Infiniband) Offload I/O processing to HW (rDMA) How about looking into system SW? 39" Network I/O in Linux System Calls And VFS Complex Interface Packet Scheduling TCP/IP Ethernet + ARP Scheduling and Buffering Interrupts And Deferred Work [Source:"h`p://www.linuxfoundaWon.org/collaborate/workgroups/networking/kernel_flow]" 40" Rethinking System Software Efficiency mechanisms Run to completion Idle" Timer" Free" RX"Queue" TX"Queue" Recv" App" Send" 41" Rethinking System Software Efficiency mechanisms Run to completion Adaptive batching Timer" Free" RX"Queue" TX"Queue" Recv" App" Send" 42" Rethinking System Software Efficiency mechanisms Run to completion Adaptive batching Flow-consistent hashing 43" Rethinking System Software Efficiency mechanisms Run to completion Adaptive batching Flow-consistent hashing Zero copy 44" Rethinking System Software Efficiency mechanisms Run to completion Adaptive batching Flow-consistent hashing Zero copy Scalable + practical APIs POSIX sockets ✖, events/light-weight threads/futures ✔ 45" System SW Implementation 3-way protection C C C C C C C VMX" NonERoot" CPL"3" httpd libIX Your App memcached Custom Runtime libIX VMX" NonERoot" CPL"0" IX IX (Driver+TCP/IP) (Driver+TCP/IP) Custom Transport VMX" Root" Dune" Module" Domain-specific I/O stack Linux" Kernel" Control plane for coarsegrain resource assignment Full API compatibility NICs" [Source:OSDI’14]" 46" Better Software Impact 400 IX Avg IX 99th IX 99.9th Linux Avg Linux 99th Linux 99.9th 350 Latency (US) 300 250 200 ~10x''improvement'' @'100us'99.9th'SLA' 150 100 50 0 0 0.5 1 1.5 2 2.5 Throughput (millions of QPS) 3 3.5 [Source:OSDI’14]" 10x throughput &1/3 latency over Linux Scalable, elastic, secure 47" Utilization 48" Why Does Utilization Matter? The cloud makes specialized HW/SW practical But it does not make them free 49" Cloud Economics Total Cost of Ownership 6%" 3%" Servers' Energy' 14%" 16%" Cooling' 61%" Networking' Other' [Source: James Hamilton] Hardware dominates TCO Must use it as well as possible 50" Cloud Utilization Twitter [Source:ASPLOS’14] Google [Source:Barroso’09] Low utilization in private & public clouds (<20%) Despite the aggressive use of multi-tenancy Specialized systems can make it worse 51" Why is Utilization Low? [Source:ASPLOS’14] Overprovisioning @ Twitter Similar challenges across the industry 52" Performance Heterogeneity Scale-up Scale-out ! 
Slide 48: Utilization

Slide 49: Why Does Utilization Matter?
• The cloud makes specialized HW/SW practical
• But it does not make them free

Slide 50: Cloud Economics
[Pie chart, total cost of ownership: servers 61%, energy 16%, cooling 14%, networking 6%, other 3%. Source: James Hamilton]
• Hardware dominates TCO
• Must use it as well as possible

Slide 51: Cloud Utilization
[Charts: utilization at Twitter [Source: ASPLOS '14] and Google [Source: Barroso '09]]
• Low utilization in private & public clouds (<20%)
• Despite the aggressive use of multi-tenancy
• Specialized systems can make it worse

Slide 52: Why is Utilization Low? [Source: ASPLOS '14]
• Overprovisioning @ Twitter
• Similar challenges across the industry

Slide 53: Provisioning in the Cloud is Hard
[Charts: performance vs. servers/cores for scale-up and scale-out (performance heterogeneity); reservations vs. cores (overprovisioning); performance vs. input load and input size (interference)]
• Repeat whenever code or HW changes!

Slide 54: The Path to High Utilization
• Predictable systems: reduce interference
• Elastic software
• Reservations ✖, QoS goals ✔
• Machine learning to manage complex systems

Slides 55–56 (build): High Utilization in the Cloud [Source: ASPLOS '14]
• Quasar: apps state QoS goals + ML guides resource selection
• >70% cluster utilization
• Predictable app performance

Slide 57: Summary
• The bad news: all systems are power limited
• The good news: we can build much better systems, with lower energy/op + fewer ops/task
• The challenges: design complexity, heterogeneity, utilization, …
• Think vertical: HW + system + app + management
[Source: ThinkStock]