High-performance Power-efficient Bulldozer-core x86-64
HIGH-PERFORMANCE POWER-EFFICIENT X86-64 SERVER AND DESKTOP PROCESSORS
Using the core codenamed “Bulldozer”
Sean White
19 August 2011

THE DIE
Overview, Floorplan, and Technology

THE DIE | Photograph
[Die photograph]

THE DIE | What’s on it?
Eight “Bulldozer” cores
– High-performance, power-efficient AMD64 cores
First generation of a new execution core family from AMD (Family 15h)
Two cores in each Bulldozer module
– 128 KB of Level1 Data Cache, 16 KB/core, 64-byte cacheline, 4-way associative, write-through
– 256 KB of Level1 Instruction Cache, 64 KB/Bulldozer module, 64-byte cacheline, 2-way associative
– 8 MB of Level2 Cache, 2 MB/Bulldozer module, 64-byte cacheline, 16-way associative
Integrated Northbridge which controls:
– 8 MB of Level3 Cache, 64-byte cacheline, 16-way associative, MOESI
– Two 72-bit wide DDR3 memory channels
– Four 16-bit receive/16-bit transmit HyperTransport™ links

THE DIE | Floorplan (315 mm²)
[Floorplan figure. Labeled blocks: four Bulldozer Modules, four 2MB L2 Caches, four 2MB L3 Caches, Northbridge, four HyperTransport™ Phys, DDR3 Phy, two MiscIO blocks]

THE DIE | Process Technology
32-nm Silicon-On-Insulator (SOI) Hi-K Metal Gate (HKMG) process from GlobalFoundries
11-metal-layer stack
Low-k dielectric
Dual strain liners and eSiGe to improve performance.
Multiple VT (HVT, RVT, LVT) and long-channel transistors.
Layer     Type    Pitch
CPP (1)   -       130 nm
M01       1x      104 nm
M02       1x      104 nm
M03       1x      104 nm
M04       1.3x    130 nm
M05       1.3x    130 nm
M06       2x      208 nm
M07       2x      208 nm
M08       2x      208 nm
M09       4x      416 nm
M10       16x     1.6 µm
M11       16x     1.6 µm
(1) Contacted poly pitch

THE BULLDOZER CORE
A few highlights

THE BULLDOZER CORE | What’s different about this core?
A Bulldozer module contains 2 cores
Functions with high utilization that can’t be shared without significant compromises exist for each core
– Ex: Integer pipelines, Level1 data caches
Other functions are shared between the cores
– Ex: Floating point pipelines, Level2 cache
This allows two cores to each use a larger, higher-performance function (ex: floating point unit) as they need it for less total die area than having separate, smaller functions for each core
[Block diagram of a “Bulldozer” module. Labeled blocks: Fetch, Decode, two Integer Schedulers each with four Pipelines and an L1 DCache, FP Scheduler with two 128-bit FMACs, Shared L2 Cache]

THE BULLDOZER CORE | Microarch: Shared Frontend
Decoupled predict and fetch pipelines
Prediction-directed instruction prefetch
Instruction cache: 64K Byte, 2-way
32-Byte fetch
Instruction TLBs:
– Level1: 72 entries, mixed page sizes
– Level2: 512 entries, 4-way, 4K pages
Branch fusion
[Module block diagram. Labeled blocks: Prediction Queue, L1 BTB, L2 BTB, ICache, Fetch Queue, Ucode ROM, 4 x86 Decoders; per core: Int Scheduler, EX/MUL, EX/DIV, two AGens, Instr Retire, Ld/ST Unit, L1 DTLB, L1 DCache; shared: FP Scheduler, two 128-bit FMACs, two MMX pipes, FP Ld Buffer, Data Prefetcher, Shared L2 Cache]

THE BULLDOZER CORE | Microarch: Separate Integer Units
Thread retire logic
Physical Register File (PRF)-based register renaming
Unified scheduler per core
Fully out-of-order load/store
Data TLB: 32-entry fully associative
Way-predicted 16K Byte Level1 Data cache
– 2 128-bit loads/cycle
– 1 128-bit store/cycle
– 40-entry Load queue
– 24-entry Store queue
[Module block diagram]

THE BULLDOZER CORE | Microarch: Shared FPU
Co-processor organization
Reports completion back to parent core
PRF-based register renaming
A single floating point scheduler for both cores
Dual 128-bit Floating Point Multiply/Accumulate (FMAC) pipelines
Dual 128-bit Packed Integer pipelines
[Module block diagram]

THE BULLDOZER CORE | New Floating Point Instructions
New Instructions: Applications/Use Cases

SSSE3, SSE4.1, SSE4.2 (AMD and Intel)
• Video encoding and transcoding
• Biometrics algorithms
• Text-intensive applications

AESNI, PCLMULQDQ (AMD and Intel)
• Applications using AES encryption
• Secure network transactions
• Disk encryption (MSFT BitLocker)
• Database encryption (Oracle)
• Cloud security

AVX (AMD and Intel)
Floating point intensive applications:
• Signal processing / Seismic
• Multimedia
• Scientific simulations
• Financial analytics
• 3D modeling

FMA4 (AMD Unique)
HPC applications

XOP (AMD Unique)
• Numeric applications
• Multimedia applications
• Algorithms used for audio/radio

THE BULLDOZER CORE | Microarch: Shared Level 2 Cache
16-way unified L2 cache
L2 TLB and page walker
– 1024-entry, 8-way
– Services both Instruction and Data requests
Multiple data prefetchers
23 outstanding L2 cache misses for memory system concurrency
[Module block diagram]

THE NORTHBRIDGE
A few of its highlights

THE NORTHBRIDGE | Simplified Block Diagram
[Block diagram. Labeled blocks: 4 Bulldozer modules (CPU cores C0/C1, L2 Cache), Synchronization, 8 MB L3 Cache (L3 Arrays, L3 CTL), System Request Queue (SRQ), Crossbar, APIC (interrupts), Memory CTL, 2 DRAM channels (DRAM CTL0, DRAM CTL1, DDR PHYs), 4 HyperTransport™ links (HT CTLs, HT/PCI-E PHYs)]

THE NORTHBRIDGE | SRQ, Crossbar, Memory Controller
System Request Queue and Crossbar
The northbridge SRQ and crossbar form the traffic hub of the design, routing from any requestor to the appropriate destination.
Examples: A core request to the L3 cache; a request from another die or chip from a HyperTransport™ link to the memory controller, etc.

Memory Controller
Each memory controller enforces cache coherency and proper order of operations for the memory space it “owns”, distributing this responsibility across multi-die or multi-chip systems.
This allows multi-die or multi-chip systems to scale rather than bottleneck at a single coherency/ordering choke point in the system.
Obviously, the memory controller also sends the read/write requests to the DRAM controllers.

THE NORTHBRIDGE | DRAM Controllers, L3 Cache, Probe Filter
DDR3 DRAM Controllers
Two per die
Supports Unbuffered (UDIMM), Registered (RDIMM) and Load-Reduced (LRDIMM) memory, 1.50V, 1.35V and 1.25V depending on the processor package, with frequencies up to DDR3-1866
New power-saving features: Low-voltage DDR3, fast self-refresh entry with internal clocks gated, aggressive precharge power down, tristate addr/cmd/bank w/chip-select deassertion, throttle activity for thermal or power reduction, put RX circuits in standby when no reads active, etc.
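
As a rough illustration of what "up to DDR3-1866" on two channels means in bandwidth terms, the sketch below computes the theoretical peak transfer rate. It assumes that 64 of the 72 bits on each channel carry data and the other 8 carry ECC, and it ignores refresh, bank conflicts and command overhead, so sustained bandwidth will be lower. The function name and parameters are illustrative only.

```python
def ddr3_peak_gbytes_per_s(mega_transfers_per_s: int,
                           channels: int = 1,
                           data_bits: int = 64) -> float:
    """Theoretical peak DDR3 bandwidth in GByte/s (10**9 bytes per second).

    Assumption: only `data_bits` of each 72-bit channel carry payload;
    the remaining bits are treated as ECC and not counted.
    """
    bytes_per_transfer = data_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# One DDR3-1866 channel, then the two channels of a single die.
print(round(ddr3_peak_gbytes_per_s(1866), 3))
print(round(ddr3_peak_gbytes_per_s(1866, channels=2), 3))
```

The same helper can be called with 1600 for the DDR3-1600 limit quoted for the server packages.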

L3 Cache and Probe Filter
Up to 8 MB of Level3 cache, normally shared, can be partitioned
L3 data ECC protected (single-bit correct, double-bit detect)
Probe Filter: the northbridge can filter coherency traffic to improve system bandwidth
Space in the L3 array is used to hold Probe Filter data when it’s enabled

THE NORTHBRIDGE | Four 16-bit HyperTransport™ Links
High-speed unidirectional signaling, 16-bit receive/16-bit transmit
A 16-bit link can be unganged into two 8-bit links
Up to 6.4 Gigatransfers/sec/bit (16-bit link = 12.8 GByte/s RX, 12.8 GByte/s TX)
Maximum link frequency depends on package, chipset capability and printed circuit board signal integrity
Number of links varies with package
– AM3+ has a single link to the chipset
– C32 has three links for chipset and/or dual-processor coherent interconnect
– G34 has four external links for chipset and/or multi-processor coherent interconnect (plus internal links between the dice in the multi-chip module)
In retry mode, bit errors on the link are detected by CRC and packets are reissued for highly reliable operation.
Power consumption is minimized based on configuration (ex: link width/frequency) and activity.
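
The 12.8 GByte/s figure follows directly from the transfer rate and the link width. The short sketch below reproduces that arithmetic for a ganged 16-bit link and for one half of an unganged link. It is a raw signaling-rate calculation only and does not account for packet headers or CRC overhead. The function name is illustrative.

```python
def ht_link_gbytes_per_s(gigatransfers_per_s: float, width_bits: int = 16) -> float:
    """Raw one-direction HyperTransport bandwidth in GByte/s.

    Each transfer moves `width_bits` bits in one direction, so the rate is
    transfers/s * bits per transfer / 8 bits per byte.
    """
    return gigatransfers_per_s * width_bits / 8


full_link = ht_link_gbytes_per_s(6.4, 16)   # ganged 16-bit link, one direction
half_link = ht_link_gbytes_per_s(6.4, 8)    # one 8-bit half of an unganged link
print(full_link, half_link)
```

Because the signaling is unidirectional in each direction, the receive and transmit figures are independent and each direction gets the full rate.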

POWER MANAGEMENT
A few key capabilities

POWER MANAGEMENT | Key Features and Functions
The Bulldozer core is architected to be power-efficient
– Minimize silicon area by sharing functionality between two cores
All blocks and circuits have been designed to minimize power (not just in the Bulldozer core)
– Extensive flip-flop clock-gating throughout design
– Circuits power-gated dynamically
Numerous power-saving features under firmware/software control
– Core C6 State (CC6)
– Core P-states/AMD Turbo CORE
– Application Power Management (APM)
– DRAM power management
– Message Triggered C1E

POWER MANAGEMENT | Core C6 State (CC6)
Core C6: if a core isn’t active, remove power
Implemented in this physical design by a power gating ring that isolates the Core VSS for each Bulldozer module from the “Real” VSS
CC6 entry: when both Bulldozer cores in the module are idle, flush caches and dump register state to CC6 save space, then gate Core VSS
CC6 exit: ungate Core VSS, reload CC6 saved state, resume execution (ex: service interrupts, etc.)
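
The CC6 entry and exit sequence described above can be sketched as a small state model. This is a toy illustration of the ordering of steps only, not a description of the actual hardware or firmware: the class, its fields and the log strings are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class BulldozerModule:
    """Toy model of one two-core module and its CC6 power-gating sequence."""
    core_idle: list = field(default_factory=lambda: [False, False])
    registers: dict = field(default_factory=dict)       # stand-in for core state
    cc6_save_space: dict = field(default_factory=dict)  # survives power gating
    power_gated: bool = False
    log: list = field(default_factory=list)

    def set_idle(self, core: int, idle: bool) -> None:
        self.core_idle[core] = idle
        # CC6 entry requires BOTH cores in the module to be idle.
        if all(self.core_idle) and not self.power_gated:
            self._enter_cc6()

    def interrupt(self, core: int) -> None:
        # A wake event (ex: an interrupt) forces CC6 exit before execution resumes.
        if self.power_gated:
            self._exit_cc6()
        self.core_idle[core] = False

    def _enter_cc6(self) -> None:
        self.log.append("flush caches")
        self.cc6_save_space = dict(self.registers)
        self.log.append("save state to CC6 save space")
        self.registers.clear()  # modeled as lost once Core VSS is gated
        self.power_gated = True
        self.log.append("gate Core VSS")

    def _exit_cc6(self) -> None:
        self.power_gated = False
        self.log.append("ungate Core VSS")
        self.registers = dict(self.cc6_save_space)
        self.log.append("reload CC6 saved state")
        self.log.append("resume execution")


module = BulldozerModule(registers={"rax": 1})
module.set_idle(0, True)   # one idle core is not enough to gate the module
module.set_idle(1, True)   # both idle: module enters CC6
module.interrupt(0)        # wake: state is restored before execution resumes
print(module.log)
```

The point of the model is that power gating is per module, so a single busy core keeps the whole module powered.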
[Power gating figure. Labels: “Real” VSS, Core VSS, Core I/O, PG I/O, Bulldozer Module, Power Gating Ring Control, WAKE and RUN signals, Power Gating FETs (Small FET, Large FET)]

POWER MANAGEMENT | P-states, AMD Turbo CORE
Core P-states specify multiple frequency and voltage points of operation
– Higher frequency P-states deliver greater performance but require higher voltage and thus more power
– The hardware and operating system vary which P-state a core is in to deliver performance as needed, but use lower frequency P-states to save power whenever possible
AMD Turbo CORE: when the processor is below its power/thermal limits the frequency and voltage can be boosted above the normal maximum and stay there until it gets back to the power/thermal limits

TURBO CORE | Base, All Core Boost and Max Boost
[Figure panels: Base frequency with TDP headroom; All core boost activated; Max turbo activated]
All Core Boost
When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds across all cores.
Max Turbo Boost
When a lightly threaded workload sends half the Bulldozer modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds for the active Bulldozer modules.
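
The three operating points on this slide can be summarized as a simple decision rule. The sketch below is a deliberately simplified model, assuming that "headroom" means measured power below a TDP limit and that max boost needs at least half of the modules in CC6. The real control loop is not described in this presentation, and the function name, parameters and wattages are hypothetical.

```python
def turbo_core_mode(power_w: float,
                    tdp_w: float,
                    modules_in_cc6: int,
                    total_modules: int = 4,
                    wants_max_performance: bool = True) -> str:
    """Pick an operating point: "base", "all_core_boost" or "max_boost".

    Assumptions (not from the slides): headroom is power_w < tdp_w, and
    max boost is allowed once at least half the modules are in CC6.
    """
    if power_w >= tdp_w:
        return "base"  # at the power/thermal limit: no boost
    if wants_max_performance and 2 * modules_in_cc6 >= total_modules:
        return "max_boost"  # lightly threaded: boost the active modules further
    return "all_core_boost"  # headroom available with most modules active


# Hypothetical wattages, for illustration only.
print(turbo_core_mode(power_w=125.0, tdp_w=125.0, modules_in_cc6=0))
print(turbo_core_mode(power_w=95.0, tdp_w=125.0, modules_in_cc6=0))
print(turbo_core_mode(power_w=80.0, tdp_w=125.0, modules_in_cc6=2))
```

In this model the boost decision is re-evaluated continuously, which matches the statement that the boosted state holds only until the processor is back at its power/thermal limits.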

THE 1-SOCKET DESKTOP AM3+ PROCESSOR CODENAMED “ZAMBEZI”
940-pin lidded micro PGA package
1.27mm pin pitch
31 row x 31 column pin array
C4 die attach

ZAMBEZI | Performance Desktop Processor
“Zambezi”, for the AM3+ Platform
AM3+ socket infrastructure adds:
– Support for low-voltage DRAM
– Increased ILDT current for higher frequency HyperTransport™ link (2.0A maximum per Gen3 link)
– Increase in IDDR current (4.0A maximum)
Older AM3 processors plug-in compatible with AM3+ motherboards
2 memory channels, unbuffered DIMMs, up to DDR3-1866
1 HyperTransport™ link, up to 5.2 GT/s
For use with AMD 9-Series chipsets
– AMD 990FX, AMD 990X, AMD 970, AMD SB950
[Block diagram. Labels: four 2-core modules with L2, Northbridge, L3 Cache, DRAM CTL’s, PHY with 2 Memory Channels, four HT PHYs (one HT Link, three NC)]

THE 1 OR 2-SOCKET SERVER C32 PROCESSOR CODENAMED “VALENCIA”
1207-land lidded micro LGA package
1.10mm land pitch
35 row x 35 column land array
C4 die attach

VALENCIA | 1-2 Socket Server Processor
“Valencia”, for the C32 Platform
Valencia is compatible with existing C32 motherboards (AMD Opteron™ 4000 series processor-based platform) with appropriate BIOS update
2 memory channels, UDIMM, RDIMM or LRDIMM, up to DDR3-1600
3 HyperTransport™ links, up to 6.4 GT/s
– Most designs use only 2 links to achieve lower Thermal Design Power (TDP)
Advanced Platform Management Link (APML)
1 or 2 socket systems
For use with AMD server chipsets
– AMD SR5690, AMD SR5670, AMD SR5650, AMD SP5100
[Block diagram. Labels: four 2-core modules with L2, Northbridge, L3 Cache, DRAM CTL’s, PHY with 2 Memory Channels, HT PHYs (HT Links and NC), APML]

VALENCIA | 2-socket system example
System design optimized to provide maximum performance for minimum cost and power for 1-2 socket servers
Up to 16 cores in a two socket system
Two DDR3 Memory channels per socket
PCI-Express expansion I/O
– AMD SR5690: 42 PCI Express® lanes
– AMD SR5670: 30 PCI Express® lanes
– AMD SR5650: 22 PCI Express® lanes
SP5100 Southbridge: SATA, PCI, USB
[System diagram. Labels: two Valencia processors, SR56x0 Northbridge, SP5100 Southbridge, PCI-Express. Green = Coherent HyperTransport™ links; Red = Non-coherent HyperTransport™ links; Blue = PCI Express® expansion I/O]

THE 1 TO 4-SOCKET SERVER G34 PROCESSOR CODENAMED “INTERLAGOS”
1944-land lidded LGA package
1.00mm land pitch
57 row x 40 column land array
C4 die attach

INTERLAGOS | 1-4 Socket Server Processor
“Interlagos”, for the G34 Platform
Two die in a multichip module for 1 to 4 socket systems
Compatible with existing G34 motherboards (AMD Opteron™ 6000 series processor-based platform) with appropriate BIOS update
• Up to 16 MB of combined Level3 cache
• 4 memory channels, UDIMM, RDIMM or LRDIMM, up to DDR3-1600
• 4 external HyperTransport™ links, up to 6.4 GT/s (plus internal die-to-die links)
• Advanced Platform Management Link (APML)
• For use with AMD server chipsets
• AMD SR56x0, AMD SP5100
OS views it as one multi-core processor with up to 16 cores
[Block diagram. Labels: Master Die and Slave Die, each with four 2-core modules with L2, Northbridge, L3 Cache, DRAM CTL’s, PHY with 2 Memory Channels and HT PHYs; APML; “2 and ½ HT Links”; “1 and ½ HT Links”]

INTERLAGOS | 4-socket system example
System design optimized to provide maximum performance for 1-4 socket servers
Up to 64 cores in a four socket system to support highly multi-threaded workloads
– Database, Web, Virtualization, Cloud Computing, High-Performance Computing, etc.
16 DDR3 channels w/4 sockets
– To support demanding memory-bandwidth-intensive workloads
High-performance PCI Express® links via SR56x0 chipset to support demanding I/O-intensive workloads (high-speed network, storage, etc.)
[System diagram. Labels: four Interlagos processors, two SR56x0 chipsets, SP5100 Southbridge, PCI-Express. Green = Coherent HyperTransport™ links; Red = Non-coherent HyperTransport™ links; Blue = PCI Express® expansion I/O]

Trademark Attribution
AMD, the AMD Arrow logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. PCI Express is a registered trademark of PCI-SIG.
©2011 Advanced Micro Devices, Inc. All rights reserved.