2012-03-15 - HPC Lugano - Handout
BUILDING HIGH AVAILABILITY SSD
• Company overview
• Architecture & Performance
• Reliability
• Maximizing SSD
• Q&A
Adam Chunn
15 March 2012
Lugano - Switzerland
1 - 56

Select RamSan Facts…
• The largest SSD installations in production in the world
• Currently operating in 10 major financial exchanges worldwide
• Used today by 7 out of 11 of the world’s largest telecoms
• Installed and in production in over 34 countries
• Conducted a financial trade
• Sent a text message
• Shopped online
• Placed online bet
• Used pre-paid wireless
• Booked a cruise or flight
• Gamed online
• Used an ATM
…RamSan is Everywhere
2 - 56

Background on TMS
Solid State Storage Leader
• Industry’s highest performance, highest reliability, lowest latency, lowest power SSD solutions
Deep Domain Expertise
• 33 years experience designing SSDs; 30+ patents granted and pending; many trade secrets
Global Enterprise Customers
• Growing enterprise customer base in over 34 countries
Strong Financial Performance
• No Venture Capital/Long Term Debt
World Class Team
• Strong management and engineering teams
• Over 400 man-years of SSD experience
4 - 56

Key References
See all of these and more in the Success Stories section of our web site at www.ramsan.com.
5 - 56

ARCHITECTURE & PERFORMANCE
6 - 56

L = λW
The long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W.¹
Above is Little’s Law, which is just a fancy way to say that performance is based on Latency and Parallelism.
¹ Paraphrased from Little’s Law, John D.C. Little and Stephen C. Graves, MIT
7 - 56

Flash Controller Design Basics
[Diagram: FLASH Media, Lookup Tables, Write Buffer, Flash Controller FPGA, CPU, RAM, I/O Interface]
• Each controller handles 10 flash chips
• The Lookup Tables and Write Buffer are RAM accessible from the controller only
• The I/O Interface and CPU controller are both separate FPGAs
• The CPU is an embedded processor that handles all out-of-band operations
• DMAs are all processed completely in FPGA hardware
8 - 56

DMAs are hardware only
[Diagram: FLASH Media, Lookup Tables, Write Buffer, Flash Controller FPGA, CPU, RAM, I/O Interface]
• DMAs are all processed completely in FPGA hardware
9 - 56

Decreasing Latency: The Embedded CPU
[Diagram: FLASH Media, Lookup Tables, Write Buffer, Flash Controller FPGA, CPU, RAM, I/O Interface]
• Remove from the DMA path all non-critical flash memory book-keeping
  – Write setup
  – Garbage collection
  – Error handling
  – Health calculation
  – Wear Leveling
  – Statistics collection
  – Formatting
  – Backup/Restore
  – Key Generation
10 - 56

Increasing Parallelism
[Diagram: FLASH Media, Lookup Tables, Write Buffer, Flash Controller FPGA, CPU, RAM, I/O Interface]
• Parallelism grows with the number of flash chips that can run concurrently
• That number is raised by increasing the number of flash chip controllers
• Each TMS flash chip controller can do 36 4KB DMAs in parallel
• (40 if you include the CPU background chip RAID, or VSR, operations)
• A RamSan-70 has 8 controllers, so it can do 288 4KB operations simultaneously
• A RamSan-810 has 40 controllers, so it can do 1440 4KB operations simultaneously
11 - 56

L = λW
So, what else affects Latency and Parallelism?
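The relationship between Little's Law and the controller counts above can be sketched in a few lines of Python. This is an illustrative sketch, not TMS code: the function names are hypothetical, the 36 DMAs per controller and the controller counts come from the Increasing Parallelism slide, and the 650,000 IOPS and 100 µs inputs are figures quoted for the RamSan-70 elsewhere in this deck, used here only as sample inputs.

```python
import math

# From the "Increasing Parallelism" slide: 36 parallel 4KB DMAs per controller.
DMAS_PER_CONTROLLER = 36


def concurrent_ops(controllers: int, per_controller: int = DMAS_PER_CONTROLLER) -> int:
    """Number of 4KB operations a device can have in flight at once."""
    return controllers * per_controller


def required_in_flight(iops: float, latency_s: float) -> float:
    """Little's Law, L = lambda * W: in-flight I/Os needed to sustain
    a given arrival rate (IOPS) at a given average latency (seconds)."""
    return iops * latency_s


def achievable_iops(in_flight: float, latency_s: float) -> float:
    """Little's Law rearranged, lambda = L / W. This is an idealized
    ceiling that ignores host CPU, bus, and driver limits."""
    return in_flight / latency_s


ramsan_70_ops = concurrent_ops(8)     # 8 controllers
ramsan_810_ops = concurrent_ops(40)   # 40 controllers

# Sample inputs (assumed for illustration): 650,000 IOPS at 100 microseconds.
needed = required_in_flight(650_000, 100e-6)

if __name__ == "__main__":
    print(f"RamSan-70 concurrent 4KB ops:  {ramsan_70_ops}")
    print(f"RamSan-810 concurrent 4KB ops: {ramsan_810_ops}")
    print(f"In-flight I/Os for 650K IOPS at 100 us: {needed:.1f}")
```

Under these sample inputs the required queue depth is far below what the controllers can hold in flight, which is consistent with the deck's point that the host side (CPU, bus, driver, application) is what usually limits parallelism.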
12 - 56

L = λW
What else affects Latency?
• CPU Speed
  – not number of cores
  – not number of chips
• Bus architecture
  – North/south bridges
  – PCIe hierarchy
  – PCIe controller
• CPU Usage (so in a convoluted way, cores and chip counts do matter)
13 - 56

L = λW
What else affects Latency?
• Operating system and file system
  – OSes and file systems optimized for disks tend to count on slow data access to hide processing
  – Modern OSes and file systems are now written to maximize SSD
• The driver, the bridge between the OS and the hardware
  – It must be thin or else it adds latency
  – Linux, Windows, Solaris, VMware, OS X, AIX
  – We are actively trying to push the driver into the Linux kernel
• If measuring at the application layer, middleware (for example, databases) can inject latency
14 - 56

L = λW
What else affects Parallelism?
• Large Blocks
  – RamSan products break apart large block DMAs into multiple, parallel DMAs
  – For example, a 64kB DMA is converted into 16 parallel 4kB DMAs
• A single application can be written to have either multiple threads of synchronous I/O or a single thread that allows multiple outstanding asynchronous I/Os
  – Most high-performance middleware does just this (such as Microsoft SQL Server, Oracle, et cetera)
• Running multiple applications can provide the same effect as a single application running multiple threads
  – CPU becomes more and more of a bottleneck, however
15 - 56

CSCS Benchmark
• CSCS = Swiss National Supercomputing Centre
• Independent evaluation of PCIe SSDs
• RamSan-70 results:
  – “…by far the best IOPS result we have ever measured…” (300K+ random 4K IOPS)
  – “Unlike the FusionIO and Virident TachIOn devices, the bandwidth is almost independent of block size…”
16 - 56

RamSan Flash Product Portfolio
            | RamSan-70 | RamSan-710/810 | RamSan-720/820 | RamSan-630
Flash       | SLC | SLC/eMLC | SLC/eMLC | SLC
Capacity    | 900GB | 5/10TB | 12/24TB | 10TB
IOPS        | 1.2M | 400K/320K | 500K/450K | 1M
Bandwidth   | 2.5GB/s | 5/4GB/s | 5/4GB/s | 10GB/s
Form factor | Full-height, half-length PCIe x8 2.0 | 1U rackmount, 4x IB or FC ports (both 1U families) | 3U rackmount, 10x IB or FC ports
Typical use | RamSan-70: Single Server Apps; Distributed filesystems | Rackmount systems: Clustered Server Apps; Shared-storage filesystems (GPFS, GFS2, etc)
17 - 56

SPC Price/Performance Leader
[Chart: Top 10 SPC-1 IOPS™. x-axis: SPC-1 IOPS™; y-axis: SPC-1 IOPS™ / Total TSC Price (USD). Marked points: TMS RamSan400, TMS RamSan-630]
[Chart: Top 10 SPC-2 MBPS™. x-axis: SPC-2 MBPS™; y-axis: SPC-2 MBPS™ x1k / Total TSC Price (USD). Marked point: TMS RamSan-630]
18 - 56

Keys to Performance
• Hardware-only Data Path
  – FPGA & Hardware Logic
  – Faster than software-shared memory
• Software cannot add performance
  – Virtualization is a software overhead for utilizing additional hardware
  – QoS is a software overhead for giving one application priority over another on shared hardware
19 - 56

RELIABILITY
20 - 56

Flash Quality
• Flash type matters!
  – SLC in most RamSans
  – Enterprise MLC (eMLC) in RamSan-8x0
• SLC is best but most expensive/least dense
• eMLC chips last 10x longer vs. normal MLC
• TMS technologies like Variable Stripe RAID™ lengthen system life
[Chart: Typical Chip Endurance. x-axis: Flash Type (MLC, eMLC, SLC); y-axis: P/E Cycles (Thousands), 0 to 100]
21 - 56

Combat Endurance
• Endurance of system is calculated:
  Endurance = (Flash Capacity × Flash Quality) / Media Write Bandwidth
22 - 56

Combat Endurance
5TB RamSan-710 (SLC Flash): (5TB × 100,000) / 1 GBps = 15.8 Years Endurance
10TB RamSan-810 (eMLC Flash): (10TB × 30,000) / 1 GBps = 9.5 Years Endurance
23 - 56

Combat Endurance
• Fight endurance with increased capacity
• eMLC has 2x Capacity for same cost
  – 2/3rd endurance of SLC
• MLC is 3000 Writes where eMLC is 30000 Writes
• MLC is ~1/4th price of eMLC storage
  – Sustained writes do not make sense for MLC
  – MLC will last less than a year from sustained writes at same cost and half the write workload
(1TB × 3,000) / 500 MBps = Less than a year
24 - 56

Flash Problems and TMS Solutions
Problem | Solution
Limited write-erase cycles | Wear leveling
Bit errors | ECC
Block/plane/device failures | Block remapping, RAID, Variable Stripe RAID™
Disturb errors | Voltage and timing adjustments (read, write, erase)
Erases need big blocks and take a long time | Overprovisioning
25 - 56

Four Layers of Data Correction
Layer | Protection
System-level RAID 5 (RamSan-720/820 only) | Module failure; managed by centralized RAID controllers
Module-level Variable Stripe RAID™ | Sub-chip failure, System Longevity; managed by each module across its chips
Module-level RAID 5 | Chip failure; managed by each module across its chips
Chip-level ECC | Bit and block errors; managed by each module using its chips
RamSan-720/820 introduce System-Level RAID 5 across Flash modules, plus the other mechanisms found on all RamSan Flash storage systems.
26 - 56

Variable Stripe RAID™ (VSR)
• Patented VSR allows RAID stripe sizes to vary.
• If one die fails in a ten-chip stripe, only the failed die is bypassed, and then data is restriped across the remaining nine chips.
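The endurance formula from the Combat Endurance slides can be checked numerically. The following Python sketch reproduces the three worked figures; it is an illustration under stated assumptions, not a TMS sizing tool. It assumes decimal units (1 TB = 10^12 bytes), a 365-day year, perfectly even wear, and no write amplification, none of which the slides state explicitly. The function name is hypothetical.

```python
# Assumed unit conventions (decimal), not stated in the slides.
TB = 10**12
GB = 10**9
MB = 10**6
SECONDS_PER_YEAR = 365 * 24 * 3600


def endurance_years(capacity_bytes: float, pe_cycles: float,
                    write_bytes_per_s: float) -> float:
    """Endurance = (Flash Capacity x Flash Quality) / Media Write Bandwidth,
    where Flash Quality is the rated program/erase cycles per cell.
    Assumes ideal wear leveling and no write amplification."""
    total_writable_bytes = capacity_bytes * pe_cycles
    seconds = total_writable_bytes / write_bytes_per_s
    return seconds / SECONDS_PER_YEAR


# Inputs taken from the slides.
slc_710 = endurance_years(5 * TB, 100_000, 1 * GB)    # 5TB SLC at 1 GBps
emlc_810 = endurance_years(10 * TB, 30_000, 1 * GB)   # 10TB eMLC at 1 GBps
mlc_1tb = endurance_years(1 * TB, 3_000, 500 * MB)    # 1TB MLC at 500 MBps

if __name__ == "__main__":
    print(f"RamSan-710 (SLC):  {slc_710:.1f} years")
    print(f"RamSan-810 (eMLC): {emlc_810:.1f} years")
    print(f"1TB MLC:           {mlc_1tb * 365:.0f} days")
```

With these assumptions the SLC and eMLC cases land close to the 15.8 and 9.5 years quoted on the slide, and the MLC case comes out at roughly two to three months, consistent with "less than a year".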
[Diagram: Variable Stripe RAID™ stripe across 10 Chips, 16 Planes each, with one failed element marked FAIL]
27 - 56

2D Flash RAID™ (RS-720/820)
[Diagram labels: External Interfaces (FC, IB); Interface (×2); RAID Controllers; RAID Controller (×2); TMS 2D Flash RAID™]
• RAID 5 within Flash Modules (9 data + 1 parity)
• RAID 5 across Flash Modules (10 data + 1 parity + 1 hot spare)
28 - 56

RamSan-70 Overview
[Photo of the card with numbered callouts]
1. PCIe 2.0 x8
2. PowerPC CPU
3. Xilinx FPGAs
4. 900GB usable SLC Flash (1374GB raw)
5. 4GB DRAM
6. Super-Capacitors
7. Half-length card
• Usable 450-900GB
• 650,000 4K IOPS
• 2.5 GB/s Bandwidth
• 30 µs sustained 4K Write Latency / 100 µs 4K Read Latency
• 10 Years Life Expectancy
• Series-7™ Flash Controller
29 - 56

MAXIMIZING SSD
30 - 56

Segregation of Workload
• Metadata, Working Data, Archived Data
• Metadata is typically accessed the most, but takes up the least space
• Archived Data is accessed the least, but takes up the most space
• Moving high-access data into a high-performance medium has the greatest impact
But the question is, what data makes sense to store on SSD?
31 - 56

Performance per Capacity
• Historically, TMS has designed DRAM-based SSD devices that performed GB/sec per GB of storage [Metadata]
• Our flash-based SSD devices perform GB/sec per TB of storage [Metadata, Working Data]
• Disk-based products typically grossly underperform SSD, but offer economical performance at >>TB of storage [Archive, Large Working Data]
32 - 56

Algorithm Matrix
Low CPU Utilization + Low I/O Wait = Algorithm needs to provide more work
Low CPU Utilization + High I/O Wait = Great fit for SSD!!
High CPU Utilization + Low I/O Wait = In-memory work
High CPU Utilization + High I/O Wait = Using Asynchronous I/O
• Add disks for growing capacity
• Add SSD for same size capacity
33 - 56

Q&A
34 - 56

RamSan-440 Overview
• 128 - 512 GB capacity
• 600,000 IOPS
• 4.5 GB/s throughput
• Latency 15 µs
• 2-8 FC Ports
• Industry Firsts:
  – 512GB Non-volatile RAM storage
  – RAM SSD with Flash backup
  – RAID protected RAM and Flash modules
  – TMS patented IO2 Instant-on Input-Output option
35 - 56

RamSan-440 Architecture
[Diagram labels]
• RAID Protected RAM Boards
• Management Control Processor
• 4 Dual-ported Fibre Channel or InfiniBand Interfaces
• Redundant Batteries
• Hot Swappable Redundant Power Supplies
• Redundant Fans
• RAID Protected Backup Flash
• 4U Chassis
36 - 56

Series-7 Flash Controller Design
[Diagram labels]
• 4 GB RAM Cache: Lookup Tables, Write Buffer
• Best Performance: 4K aligned I/O
• CPU (out of Primary Data Path): Write setup, Garbage collection, Error handling; out of the data path activities
• Flash Controller FPGA (processes all of the “IN DATA” activities)
• Super Capacitors: Memory Backup
• I/O Interface
37 - 56

RamSan-630 Overview
• 1-10TB capacity
• 1 Million IOPS
• 10 GB/s throughput
• Latency 80-250 µs
• Highest density SLC Flash SSD system available.
• Leverages proven flash core from the RamSan-20 and RamSan-620
• Easily shared and multipathed through ten 8 Gbit Fibre Channel ports or QDR InfiniBand ports
• Enterprise Reliability
  – Single-Level Cell (SLC) Flash
  – Fault Tolerant Flash (FTF) Architecture
  – Active Spare Flash
38 - 56

RamSan-630 Architecture
[Diagram labels]
• 5 Dual-ported FC or IB Interfaces
• 1-10TB of SLC Flash Boards
• Management Control Processor
• Redundant Power Supplies
• Redundant Fans
• 3U Chassis
39 - 56

RamSan-630 Flash Board
[Diagram labels]
• RAID-5 Protected Flash
• Embedded PowerPC
• 480 GB usable, 640 GB raw
• On Board RAM, ECC Protected
• Gateway FPGA
• 4 Flash Controllers
• Super capacitors
40 - 56

RamSan-710 Overview
• 1-5 TB Usable capacity (6.8 TB Raw)
• 400,000 IOPS
• 5 GB/s throughput
• 35-175 µs latency
• 150K+ Write/Erase Cycles per Cell
• Highest density SLC Flash SSD system available in a 1U
• Series-7™ Flash Controller
• Four 8 Gbit Fibre Channel ports or QDR InfiniBand ports
• Enterprise reliability
  – Single-Level Cell (SLC) Flash
  – Variable Stripe RAID (VSR)™
  – Active Spare
41 - 56

RamSan-710 Overview
[Diagram labels]
• 4-20 Flash modules + 1 “Active Spare”
• 2 dual-ported 8Gb FC or QDR IB interfaces
• management control processor
• redundant power supplies
• 1U chassis
• N+1 batteries
• redundant fans
42 - 56

RamSan-710 Overview
[Image only]
43 - 56

RamSan-810 Overview
• 2-10 TB Usable capacity (13.7 TB Raw)
• 320,000 IOPS
• 5 GB/s throughput
• 70-225 µs latency (est.)
• 30K+ Write/Erase Cycles per Cell
• Highest density eMLC Flash SSD system available in a 1U
• Series-7™ Flash Controller
• Four 8 Gbit Fibre Channel ports or QDR InfiniBand ports
• Enterprise reliability
  – enterprise Multi-Level-Cell (eMLC) Flash
  – Variable Stripe RAID (VSR)™
  – Active Spare
44 - 56

RamSan-810 Architecture
[Diagram labels]
• 4-20 Flash modules + 1 “Active Spare”
• 1-2 interface modules
• management control processor
• redundant power supplies
• 1U chassis
• N+1 batteries
• redundant fans
45 - 56

Motherboard
[Image only]
46 - 56

[Photo labels]
• Toshiba eMLC Flash
• Series-7 Flash Controller FPGAs
• Gateway FPGA
• DDR DRAM
• PowerPC CPU @ 400 MHz
47 - 56

Applications Suited for eMLC
• Data Warehousing
• Web Content Hosting
• Low Bandwidth Log Files
• READ Intensive, Low WRITE Application
For Users Writing at 600 MB/s, the Lifetime of the eMLC RamSan-810 is rated at 10 years*.
  – *2TB = 10TB WRITES per Day
  – *6TB = 30TB WRITES per Day
  – *10TB = 50TB WRITES per Day
48 - 56

RamSan-70 Overview
[Image only]
49 - 56

RamSan-70 Architecture
[Photo of the card with numbered callouts]
1. PCIe 2.0 x8
2. 900GB usable SLC Flash (1374GB raw)
3. PowerPC CPU, 333 MHz
4. Xilinx FPGAs
5. 4GB DRAM
6. Super-Capacitors
7. Half-length card
• 450-900GB
• 650,000 IOPS (4K)
• 2.5 GB/s Bandwidth
• 30 µs Write Latency
• 10 Years Life Expectancy (25% writes)
• Series-7™ Flash Controller
50 - 56

RamSan-720 Overview
• 6 or 12 TB Usable capacity (~7.8 or ~15.6 TB Raw)
• 500,000 IOPS (4K)
• 5 GB/s throughput
• <100 µs latency
• No Single Point of Failure (nSPoF)
• Hot Swappable Flash Cards
• Highest density SLC Flash SSD system available in a 1U
• Series-7™ Flash Controller
• Four 8 Gbit Fibre Channel ports or QDR InfiniBand ports
• High Enterprise reliability
  – Single-Level-Cell (SLC) Flash
  – Variable Stripe RAID (VSR)™
  – 2D Flash RAID™
51 - 56

RamSan-820 Overview
• 12 or 24 TB Usable capacity (~15.6 or ~31.2 TB Raw)
• 450,000 IOPS (4K)
• 5 GB/s throughput
• <100 µs latency
• No Single Point of Failure (nSPoF)
• Hot Swappable Flash Cards
• Highest density eMLC Flash SSD system available in a 1U
• Series-7™ Flash Controller
• Four 8 Gbit Fibre Channel ports or QDR InfiniBand ports
• High Enterprise reliability
  – enterprise Multi-Level-Cell (eMLC) Flash
  – Variable Stripe RAID (VSR)™
  – 2D Flash RAID™
52 - 56

RamSan-Green IT
53 - 56

Bandwidth
Latency
Speed & IOPS
HAHNSTÄTTEN • MÜNCHEN CONFIDENTIAL • COPYRIGHT BY PSP

World’s Fastest Storage Since 1978
Pure SSD-Racepower

Thanks for your attention…
56 - 56