Near-Data Computation: It's Not (Just) About Performance

Transcription

Near-Data Computation: It's Not (Just) About Performance
Steven Swanson
Non-Volatile Systems Laboratory, Computer Science and Engineering, University of California, San Diego

Solid State Memories
•  NAND flash
  –  Ubiquitous, cheap
  –  Sort of slow, idiosyncratic
•  Phase change, spin-torque MRAMs, etc.
  –  On the horizon
  –  DRAM-like speed
  –  DRAM or flash-like density

(Figure: log-log plot of bandwidth relative to disk vs. 1/latency relative to disk for hard drives (2006), PCIe-Flash (2007, 2012), PCIe-PCM (2010, 2015?), and DDR fast NVM (2016?); the trend is a 5917x gain in bandwidth and a 7200x gain in latency, roughly 2.4x/yr each.)

Programmability in High-Speed Peripherals
•  As hardware gets more sophisticated, programmability emerges
  –  GPUs: fixed function → programmable shaders → full-blown programmability
  –  NICs: fixed function → MAC offloading → TCP offload
•  Storage has been left behind
  –  It hasn't gotten any faster in 40 years

Why Near-data Processing?
Well-studied benefits:
•  Internal bandwidth ≫ interconnect bandwidth
•  Moving data takes energy
•  NDP compute can be energy efficient

Worth a closer look (data-dependent accesses, trusted computation, semantic flexibility):
•  Storage latencies ≤ interconnect latency
•  Storage latencies ≈ CPU latency
•  NDP compute can be trusted
•  Improved real-time capabilities

Diverse SSD Semantics
•  File system offload/OS bypass
  –  [Caulfield, ASPLOS 2012]
  –  [Zhang, FAST 2012]
•  Caching support
  –  [Bhaskaran, INFLOW 2013]
  –  [Saxena, EuroSYS 2012]
•  Database transaction support
  –  [Coburn, SOSP 2013]
  –  [Ouyang, HPCA 2011]
  –  [Prabhakaran, OSDI 2008]
•  NoSQL offload
  –  Samsung's SmartSSD
  –  [De, FCCM 2013]
•  Sparse storage address space
  –  FusionIO DFS

Lessons learned: implementation costs are high, and the fit with applications is uncertain. Our goal: make programming an SSD easy and flexible.

The Software-Defined SSD
(Figure: SDSSD architecture. The host machine runs the application over a generic SSDAPP interface and connects to the SDSSD over PCIe at 4 GB/s. Inside the SDSSD, a PCIe controller and bridge fan out at 3 GB/s to several units, each with an SSD CPU, a memory controller, and memory banks.)

SDSSD Programming Model
•  The host communicates with SDSSD processors via a general RPC interface (sketched below).
•  Host-CPU
  –  Send and receive RPC requests
•  SSD-CPU
  –  Send and receive RPC requests
  –  Manage access to NV storage banks

(Figure: host CPU and SDSSD CPU connected over PCIe, with an RPC ring and a data streamer between them and memory behind the SDSSD CPU.)
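As a rough illustration of this RPC-based model, here is a minimal host-side sketch in C. The ring layout, opcode and app-id fields, and the doorbell are assumptions for illustration, not the NVSL prototype's actual interface.

```c
/* Sketch of posting one request on a host/SSD-CPU RPC ring.
 * All names and layouts here are illustrative assumptions. */
#include <stdint.h>

struct sdssd_rpc {
    uint32_t opcode;    /* e.g. read, write, or an app-defined command */
    uint32_t app_id;    /* which SSD app should handle the request */
    uint64_t args[4];   /* command-specific arguments */
};

struct rpc_ring {
    volatile uint32_t head;        /* producer index (host writes) */
    volatile uint32_t tail;        /* consumer index (SSD CPU writes) */
    struct sdssd_rpc slots[256];   /* request slots shared over PCIe */
};

/* Copy the request into a free slot and ring the doorbell register. */
static int rpc_post(struct rpc_ring *ring, volatile uint32_t *doorbell,
                    const struct sdssd_rpc *req)
{
    uint32_t next = (ring->head + 1) % 256;
    if (next == ring->tail)
        return -1;                      /* ring full */
    ring->slots[ring->head] = *req;     /* publish the request */
    ring->head = next;
    *doorbell = 1;                      /* notify the SSD CPU */
    return 0;
}
```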
The SDSSD Prototype
•  FPGA-based implementation
•  DDR2 DRAM emulates PCM
  –  Configurable memory latency
  –  48 ns reads, 150 ns writes
  –  64 GB across 8 controllers
•  PCIe: 2 GB/s, full duplex

SDSSD Case Studies
•  Basic IO
•  Caching
•  Transaction processing
•  NoSQL databases

Basic IO

Normal IO Operations: Base-IO
•  Base-IO App provides read/write functions.
•  Just like a normal SSD.
•  Applications make system calls to the kernel
•  The kernel block driver issues RPCs

(Figure: 1. syscall: the user accesses the kernel module; 2. the kernel module issues read/write RPCs to the SDSSD CPU and NVM.)
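From the application's point of view, Base-IO is just ordinary file I/O against a block device; the kernel driver turns each call into an RPC. The device path below is hypothetical.

```c
/* Ordinary reads and writes against the Base-IO block device.
 * "/dev/sdssd0" is a made-up path for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/dev/sdssd0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    if (pread(fd, buf, sizeof(buf), 0) < 0)      /* -> read RPC */
        perror("pread");
    if (pwrite(fd, buf, sizeof(buf), 4096) < 0)  /* -> write RPC */
        perror("pwrite");

    close(fd);
    return 0;
}
```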
Faster IO: Direct access IO
•  OS installs access permissions on behalf of a process.
•  Applications issue RPCs directly to HW
•  The SDSSD hardware checks permissions on each access

(Figure: 1. the application asks the kernel module for a userspace channel; 2. the kernel installs permissions; 3. direct IO accesses flow from userspace straight to the SDSSD CPU and NVM.)
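A minimal sketch of how a process might obtain such a channel, assuming a hypothetical device node and ioctl number; the real prototype's interface may differ.

```c
/* Sketch of setting up a Direct-IO channel. Device path and ioctl
 * number are assumptions for illustration only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define SDSSD_IOC_INSTALL_PERMS 0x1234   /* hypothetical ioctl number */

/* Returns a pointer to the mapped per-process channel, or NULL. */
void *open_direct_channel(void)
{
    /* 1. ask the kernel module for a userspace channel */
    int fd = open("/dev/sdssd0_channel", O_RDWR);
    if (fd < 0) { perror("open"); return NULL; }

    /* 2. the kernel installs access permissions for this process */
    if (ioctl(fd, SDSSD_IOC_INSTALL_PERMS, 0) < 0) { perror("ioctl"); return NULL; }

    /* 3. map the channel; later RPCs bypass the kernel entirely, and
     *    the hardware checks the installed permissions on every access */
    void *chan = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return chan == MAP_FAILED ? NULL : chan;
}
```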
Preliminary Performance Comparison

(Figure: read and write bandwidth in MB/s vs. request size from 0.5 KB to 128 KB for Base-IO, Direct-IO, and Direct-IO-HW.)

Caching
•  NVMs will be caches before they are storage
•  Cache hits should be as fast as Direct-IO
•  Caching App
  –  Tracks dirty data
  –  Tracks usage statistics

(Figure: the application issues file system reads and writes through a userspace cache library; cache hits go to the SDSSD caching app, cache misses go through the kernel cache manager and device driver to disk.)
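A sketch of the cache library's read path under these assumptions; cache_lookup(), sdssd_read(), and the entry layout are illustrative, not the actual library.

```c
/* Sketch of a read through the cache library: hits stay on the SDSSD,
 * misses fall through to the disk-backed file. Helpers are assumed. */
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

struct cache_entry {
    uint64_t disk_block;   /* block of the backing file */
    uint64_t nvm_addr;     /* where it lives on the SDSSD */
    bool     dirty;        /* tracked so write-back knows what to flush */
    uint64_t accesses;     /* usage statistics for the eviction policy */
};

struct cache_entry *cache_lookup(uint64_t disk_block);      /* host-side index */
int sdssd_read(uint64_t nvm_addr, void *buf, size_t len);   /* Direct-IO style read */

int cached_read(int disk_fd, uint64_t block, void *buf, size_t len)
{
    struct cache_entry *e = cache_lookup(block);
    if (e) {                                   /* cache hit: serve from the SDSSD */
        e->accesses++;
        return sdssd_read(e->nvm_addr, buf, len);
    }
    /* cache miss: go to disk; a real library would also insert the
     * block into the cache and possibly evict something */
    return pread(disk_fd, buf, len, block * len);
}
```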
4KB Cache Hit Latency

Transaction Processing

Atomic Writes
•  Write-ahead logging guarantees atomicity
•  New interface for atomic operations (usage sketch below)
  –  LogWrite(TID, file, file_off, log, log_off)
•  Transaction commit occurs at the memory interface → full memory bandwidth

(Figure: the logging module inside the SDSSD. A transaction table tracks TID states, e.g. TID 15 PENDING, TID 24 FREE, TID 37 COMMITTED, alongside a metadata file, a log file, and a data file holding new and old versions of blocks A–D.)
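A usage sketch of the LogWrite interface named above. The data and length arguments, the tid_t type, and the Commit/Abort calls are assumptions beyond what the slide shows.

```c
/* Sketch of an atomic multi-block update using the LogWrite-style
 * interface; signatures beyond the slide's are assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t tid_t;

int LogWrite(tid_t tid, int file, uint64_t file_off,
             int log, uint64_t log_off, const void *data, size_t len);
int Commit(tid_t tid);   /* assumed: marks the TID COMMITTED and applies the log */
int Abort(tid_t tid);    /* assumed: frees the TID without applying the log */

int update_two_blocks(tid_t tid, int data_file, int log_file,
                      const void *a, const void *b)
{
    /* Stage both new versions in the log under one transaction... */
    if (LogWrite(tid, data_file, 0,    log_file, 0,    a, 4096) ||
        LogWrite(tid, data_file, 4096, log_file, 4096, b, 4096))
        return Abort(tid);
    /* ...then commit inside the SDSSD: both blocks update or neither does. */
    return Commit(tid);
}
```

The transaction table in the figure above (FREE, PENDING, COMMITTED) is what tracks whether the logged updates have been applied.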
Atomic Writes
•  SDSSD atomic writes outperform a pure software approach by nearly 2.0x
•  SDSSD atomic writes achieve performance comparable to a pure hardware implementation
(Figure: AtomicWrite bandwidth in GB/s vs. request size from 0.5 KB to 128 KB for SDSSD-Tx-Opt, SDSSD-Tx, Moneta-Tx, and Soft-atomic.)

Key-Value Store
•  Key-value store is a fundamental building block for many enterprise applications
•  Supports simple operations
  o  Get to retrieve the value corresponding to a key
  o  Put to insert or update a key-value pair
  o  Delete to remove a key-value pair
•  Open-chaining hash table implementation.

Direct-IO based Key-Value Store
(Figure: Get(K3) with the Direct-IO key-value store. The host computes Hash(K3), then issues I/O to read the index structure and walk the bucket's chain of key-value pairs (K1/V1, K2/V2, ..., K9/V9) stored on the SSD, comparing keys on the host until it finds the match K3/V3; sketched in code below.)
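A sketch of that host-driven lookup, with an illustrative record layout and helper functions (not the actual implementation); note the one-I/O-per-hop chain walk.

```c
/* Sketch of Get() in the Direct-IO version: the host hashes the key,
 * then walks the on-SSD chain itself, paying one I/O per chain entry. */
#include <stdint.h>
#include <string.h>

struct kv_record {
    char     key[32];
    char     value[224];
    uint64_t next;          /* NVM address of the next record, 0 = end of chain */
};

uint64_t bucket_head(uint32_t bucket);                      /* host-side index */
int sdssd_read(uint64_t nvm_addr, void *buf, size_t len);   /* Direct-IO read */

int kv_get_host(const char *key, char *value_out)
{
    uint32_t h = 0;
    for (const char *p = key; *p; p++)          /* toy hash function */
        h = h * 31 + (uint8_t)*p;

    struct kv_record rec;
    for (uint64_t addr = bucket_head(h % 1024); addr; addr = rec.next) {
        if (sdssd_read(addr, &rec, sizeof(rec)))   /* one I/O per hop */
            return -1;
        if (strcmp(rec.key, key) == 0) {           /* compare keys on the host */
            memcpy(value_out, rec.value, sizeof(rec.value));
            return 0;
        }
    }
    return -1;   /* not found */
}
```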
SDSSD Key-Value Store App

(Figure: Get(K3) with the SDSSD key-value store app. The host computes Hash(K3) and sends a single RPC(Get) carrying the hash index and key K3; the SDSSD SPU walks the chain and compares keys next to the data, then DMAs the matching value V3 back to the host; sketched in code below.)
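The offloaded version collapses the chain walk into a single round trip; the opcode and the kv_rpc_call() helper below are assumptions for illustration.

```c
/* Sketch of the offloaded Get(): one RPC carries the hash index and key;
 * the SSD app does the chain walk and key comparison near the data. */
#include <stdint.h>
#include <string.h>

#define KV_OP_GET 1   /* hypothetical app-defined opcode */

struct kv_get_req {
    uint32_t hash_idx;   /* which bucket to search */
    char     key[32];    /* key to match inside the SSD */
};

/* Issue one RPC to the key-value SSD app; resp_buf receives the
 * matching value by DMA. */
int kv_rpc_call(uint32_t opcode, const void *req, size_t req_len,
                void *resp_buf, size_t resp_len);

int kv_get_offloaded(const char *key, char *value_out)
{
    struct kv_get_req req = { 0 };
    uint32_t h = 0;
    for (const char *p = key; *p; p++)      /* same toy hash as the host version */
        h = h * 31 + (uint8_t)*p;
    req.hash_idx = h % 1024;
    strncpy(req.key, key, sizeof(req.key) - 1);

    /* One round trip regardless of chain length. */
    return kv_rpc_call(KV_OP_GET, &req, sizeof(req), value_out, 224);
}
```

Because traversal happens next to the data, the cost no longer grows with chain length, which is why the MemcacheDB plots that follow sweep the average chain length.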
MemcacheDB Performance

(Figure: throughput in million ops/s for Direct-IO, Key-Value, and Key-Value-HW on Workload-A and Workload-B Gets and Puts, plotted against average chain length from 1 to 32.)

Ease-of-Use
Conclusion
•  New memories argue for near-data processing
  –  But not just for bandwidth!
  –  Trust, low latency, and flexibility are critical too!
•  Semantic flexibility is valuable and viable
  –  15% reduction in performance compared to a fixed-function implementation
  –  6-30x reduction in implementation time
  –  Every application could have a custom SSD

Thanks!

Questions?