Lecture 13: Memory Hierarchy—Ways to Reduce Misses
Transcription
Review: Who Cares About the Memory Hierarchy?
• Processor only thus far in course: CPU cost/performance, ISA, pipelined execution
• [Figure: CPU-DRAM performance gap, 1980-2000. "Moore's Law": µProc performance grows 60%/yr, DRAM only 7%/yr, so the processor-memory performance gap grows about 50% per year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

The Goal: Illusion of Large, Fast, Cheap Memory
• Fact: large memories are slow; fast memories are small
• How do we create a memory that is large, cheap, and fast (most of the time)?
• Hierarchy of levels
  – Uses smaller and faster memory technologies close to the processor
  – Fast access time in the highest level of the hierarchy
  – Cheap, slow memory furthest from the processor
• The aim of memory hierarchy design is to have access time close to the highest level and size equal to the lowest level

Recap: Memory Hierarchy Pyramid
• [Figure: pyramid of memory levels 1, 2, 3, ..., n beneath the processor (CPU), connected by a bus datapath. Moving toward the CPU: decreasing distance and decreasing access time (memory latency). Moving away from the CPU: increasing distance, decreasing cost/MB, and increasing size of memory at each level.]

Memory Hierarchy: Terminology
• Hit: data appears in level X
  – Hit Rate: the fraction of memory accesses found in the upper level
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
• Hit Time: time to access the upper level, which consists of time to determine hit/miss + memory access time
• Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Note: Hit Time << Miss Penalty

Current Memory Hierarchy
• [Figure: processor (control, datapath, registers) with an on-chip L1 cache, backed by an L2 cache, main memory, and secondary memory.]

  Level             Speed (ns)    Size (MB)   Cost ($/MB)   Technology
  Registers         0.5           0.0005      -             Regs
  L1 cache          2             0.05        $100          SRAM
  L2 cache          6             1-4         $30           SRAM
  Main memory       100           100-1000    $1            DRAM
  Secondary memory  10,000,000    100,000     $0.05         Disk
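The terminology above (hit rate, hit time, miss penalty) already lets us estimate an effective access time for a two-level hierarchy. The sketch below is not from the slides; it uses the L1 and main-memory latencies from the table above and an assumed 95% hit rate purely as an illustration.

```python
# Effective access time for a two-level hierarchy (illustrative only).
# Latencies come from the "Current Memory Hierarchy" table above; the
# 95% hit rate is an assumed example value, not a figure from the lecture.

def effective_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average time per access = hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

l1_hit_time = 2.0         # ns, L1 SRAM from the table
main_memory_time = 100.0  # ns, DRAM from the table, used as the miss penalty
miss_rate = 0.05          # assumed: 95% of accesses hit in L1

print(effective_access_time(l1_hit_time, miss_rate, main_memory_time))
# -> 7.0 ns: close to the fast level most of the time, as the hierarchy intends
```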
Memory Hierarchy: Why Does It Work? Locality!
• [Figure: probability of reference vs. address (0 to 2^n - 1); references cluster in small regions of the address space.]
• Temporal locality (locality in time) => keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space) => move blocks consisting of contiguous words to the upper levels
• [Figure: the upper-level memory exchanges blocks (Blk X, Blk Y) with the lower-level memory; individual words move to/from the processor.]

Memory Hierarchy Technology
• Random access:
  – "Random" is good: access time is the same for all locations
  – DRAM: Dynamic Random Access Memory
    » High density, low power, cheap, slow
    » Dynamic: needs to be "refreshed" regularly
  – SRAM: Static Random Access Memory
    » Low density, high power, expensive, fast
    » Static: content lasts "forever" (until power is lost)
• "Not-so-random" access technology:
  – Access time varies from location to location and from time to time
  – Examples: disk, CD-ROM
• Sequential access technology: access time linear in location (e.g., tape)
• We will concentrate on random access technology
  – Main memory: DRAMs; caches: SRAMs

Introduction to Caches
• A cache
  – is a small, very fast memory (SRAM, expensive)
  – contains copies of the most recently accessed memory locations (data and instructions): temporal locality
  – is fully managed by hardware (unlike virtual memory)
  – has storage organized in blocks of contiguous memory locations: spatial locality
  – the unit of transfer to/from main memory (or L2) is the cache block
• General structure
  – n blocks per cache, organized in s sets
  – b bytes per block
  – total cache size n*b bytes

Cache Organization
(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?
• The answers to (1) and (2) depend on the type or organization of the cache
• In a direct mapped cache, each memory address is associated with one possible block within the cache
  – Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache

Simplest Cache: Direct Mapped
• [Figure: a 16-block main memory (block addresses 0-15) mapped onto a 4-block direct mapped cache (indices 0-3); e.g., memory blocks 0010, 0110, 1010, and 1110 all map to cache index 10.]
• The memory block address is divided into a tag and an index
• The index determines the block in the cache: index = (address) mod (# blocks)
• If the number of cache blocks is a power of 2, the cache index is just the lower n bits of the memory address [n = log2(# blocks)]

Issues with Direct-Mapped
• If block size > 1, the rightmost bits of the index are really the offset within the indexed block
• Address fields: tag | index | byte offset
  – tag: checked to see if we have the correct block
  – index: selects the block
  – byte offset: selects the byte within the block

64KB Cache with 4-word (16-byte) Blocks
• [Figure: address bit positions 31...16 form a 16-bit tag, 15...4 a 12-bit index, bits 3-2 the block (word) offset, and bits 1-0 the byte offset. The cache has 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data; the tag comparison produces Hit, and a mux uses the block offset to select one of the four 32-bit words.]

Direct-mapped Cache Contd.
• The direct mapped cache is simple to design and its access time is fast (Why?)
• Good for L1 (on-chip) cache
• Problem: conflict misses, so low hit ratio
  – Conflict misses are misses caused by accessing different memory locations that are mapped to the same cache index
  – In a direct mapped cache there is no flexibility in where a memory block can be placed in the cache, contributing to conflict misses
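As a concrete illustration of the tag/index/offset split described above, the following sketch (not from the slides) decomposes a 32-bit address for the 64 KB direct mapped cache with 16-byte blocks: 4K blocks give a 4-bit byte offset, a 12-bit index, and a 16-bit tag.

```python
# Illustrative sketch (not from the slides): splitting a 32-bit address into
# tag / index / byte-offset fields for the 64 KB direct mapped cache above
# (4K blocks of 16 bytes => 4 offset bits, 12 index bits, 16 tag bits).

BLOCK_SIZE  = 16     # bytes per block
NUM_BLOCKS  = 4096   # 64 KB / 16 B
OFFSET_BITS = 4      # log2(16)
INDEX_BITS  = 12     # log2(4096)

def split_address(addr):
    offset = addr & (BLOCK_SIZE - 1)                    # bits 3..0
    index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)   # bits 15..4 = (block address) mod (# blocks)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)         # bits 31..16
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x1234 0x567 0x8
```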
Another Extreme: Fully Associative
• Fully associative cache (8-word block)
  – Omit the cache index; place an item in any block!
  – Compare all cache tags in parallel
• [Figure: the address is split into a 27-bit cache tag and a byte offset; the tag is compared in parallel, one comparator per entry, against every valid entry's tag over the cache data blocks B0, B1, ..., B31.]
• By definition: conflict misses = 0 for a fully associative cache

Fully Associative Cache
• Must search all tags in the cache, as an item can be in any cache block
• The search for the tag must be done by hardware in parallel (other searches are too slow)
• But the necessary parallel comparator hardware is very expensive
• Therefore, fully associative placement is practical only for a very small cache

Compromise: N-way Set Associative Cache
• N-way set associative: N cache blocks for each cache index
  – Like having N direct mapped caches operating in parallel
  – Select the one that gets a hit
• Example: 2-way set associative cache
  – The cache index selects a "set" of 2 blocks from the cache
  – The 2 tags in the set are compared in parallel
  – Data is selected based on the tag result (which one matched the address)

Example: 2-way Set Associative Cache
• [Figure: the address is split into tag, index, and offset; the index selects one entry (valid bit, cache tag, cache data) in each of the two ways. Both tags are compared against the address tag in parallel; a match in either way asserts Hit and drives a mux that selects the matching way's cache block.]

Set Associative Cache Contd.
• Direct mapped and fully associative can be seen as just variations of the set associative block placement strategy
• Direct mapped = 1-way set associative cache
• Fully associative = n-way set associativity for a cache with exactly n blocks

Alpha 21264 Cache Organization
• [Figure of the Alpha 21264 cache organization; not transcribed.]

Block Replacement Policy
• N-way set associative or fully associative caches have a choice of where to place a block (which block to replace)
  – Of course, if there is an invalid block, use it
• Whenever we get a cache hit, record the cache block that was touched
• When we need to evict a cache block, choose one which hasn't been touched recently: "Least Recently Used" (LRU)
  – Past is prologue: history suggests it is the least likely of the choices to be used soon
  – The flip side of temporal locality

Review: Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully associative, set associative, direct mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Write strategy)
  – Write back or write through (with write buffer)

Write Policy: Write-Through vs Write-Back
• Write-through: all writes update the cache and the underlying memory/cache
  – Can always discard cached data - the most up-to-date data is in memory
  – Cache control bit: only a valid bit
• Write-back: all writes simply update the cache
  – Can't just discard cached data - it may have to be written back to memory
  – Flagged write-back
  – Cache control bits: both valid and dirty bits
• Other advantages:
  – Write-through:
    » Memory (or other processors) always has the latest data
    » Simpler management of the cache
  – Write-back:
    » Needs much lower bus bandwidth due to infrequent accesses
    » Better tolerance to long-latency memory?

Write Through: Write Allocate vs Non-Allocate
• Write allocate: allocate a new cache line in the cache
  – Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  – Alternative: per-word valid bits
• Write non-allocate (or "write-around"):
  – Simply send write data through to the underlying memory/cache - don't allocate a new cache line!
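To make the set associative lookup and LRU replacement above concrete, here is a minimal Python sketch (not from the lecture) of a read-only 2-way set associative cache. The block size, number of sets, and the example address stream are assumed values chosen only to show two addresses that would conflict in a direct mapped cache coexisting in one set.

```python
# Minimal sketch (not from the slides): 2-way set associative cache with LRU
# replacement, modelling only read hits/misses. Sizes and the address stream
# below are assumed example values.

BLOCK_SIZE = 16   # bytes per block
NUM_SETS   = 4    # number of sets
WAYS       = 2    # 2-way set associative

# Each set is a list of tags, ordered most-recently-used first (LRU at the end).
sets = [[] for _ in range(NUM_SETS)]

def access(addr):
    """Return 'hit' or 'miss' for a read of byte address addr, updating LRU state."""
    block = addr // BLOCK_SIZE       # block address
    index = block % NUM_SETS         # index bits select the set
    tag   = block // NUM_SETS        # remaining high bits are the tag
    ways  = sets[index]
    if tag in ways:                  # hit: move the tag to the MRU position
        ways.remove(tag)
        ways.insert(0, tag)
        return "hit"
    if len(ways) == WAYS:            # set full: evict the LRU block
        ways.pop()
    ways.insert(0, tag)              # fill the block; it is now MRU
    return "miss"

# 0x000 and 0x040 map to the same index; in a direct mapped (1-way) cache they
# would conflict, but a 2-way set holds both, so the second round of accesses hits.
for addr in [0x000, 0x040, 0x000, 0x040]:
    print(hex(addr), access(addr))
# miss, miss, hit, hit
```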
Write Buffers
• Write buffers (for write-through)
  – Buffer the words to be written to the L2 cache/memory along with their addresses
  – 2 to 4 entries deep
  – All read misses are checked against pending writes for dependencies (associatively)
  – Allow reads to proceed ahead of writes
  – Can coalesce writes to the same address
• Write-back buffers
  – Sit between a write-back cache and L2 or main memory
  – Algorithm:
    » Move the dirty block to the write-back buffer
    » Read the new block
    » Write the dirty block to L2 or main memory
  – Can be associated with a victim cache (later)
• [Figure: CPU - L1 - write buffer - L2 datapath.]

Write Merge
• [Figure illustrating write merging in the write buffer; not transcribed.]

Review: Cache Performance
• Miss-oriented approach to memory access:
  CPU time = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
  CPU time = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime
  – CPI_Execution includes ALU and memory instructions
• Separating out the memory component entirely
  – AMAT = Average Memory Access Time
  – CPI_ALUOps does not include memory instructions
  CPU time = IC x (AluOps/Inst x CPI_ALUOps + MemAccess/Inst x AMAT) x CycleTime
  AMAT = HitTime + MissRate x MissPenalty
       = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
       + (HitTime_Data + MissRate_Data x MissPenalty_Data)

Impact on Performance
• Suppose a processor executes at
  – Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations (data) get a 50-cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
      + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• About 48% (1.5/3.1) of the time the processor is stalled waiting for data memory!
• Total number of memory accesses = one per instruction + 0.3 for data = 1.3
  Thus, AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54 cycles, instead of one cycle.

Impact of Change in cc
• Suppose a processor has the following parameters:
  – CPI = 2 (w/o memory stalls)
  – Memory accesses per instruction = 1.5
• Compare AMAT and CPU time for a direct mapped cache and a 2-way set associative cache, assuming:

                         Direct mapped   2-way set associative
  cc (clock cycle time)  1 ns            1.25 ns (why?)
  Hit cycles             1               1
  Miss penalty           75 ns           75 ns
  Miss rate              1.4%            1.0%

  AMATd = hit time + miss rate * miss penalty = 1*1 + 0.014*75 = 2.05 ns
  AMAT2 = 1*1.25 + 0.01*75 = 2 ns < 2.05 ns
  CPU time_d = (CPI*cc + mem. stall time per instruction)*IC = (2*1 + 1.5*0.014*75)*IC = 3.575*IC
  CPU time_2 = (2*1.25 + 1.5*0.01*75)*IC = 3.625*IC > CPU time_d !
• A change in cc affects all instructions, while a reduction in miss rate benefits only memory instructions.
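The comparison above is easy to reproduce programmatically. The sketch below (not from the slides) plugs the same parameters into the AMAT and per-instruction CPU time expressions to show why the 2-way cache wins on AMAT but loses on total CPU time once the slower clock cycle is charged to every instruction.

```python
# Sketch (not from the slides) reproducing the direct mapped vs 2-way example:
# a lower miss rate does not help overall if the clock cycle time (cc) grows,
# because cc stretches every instruction, not just memory accesses.

def amat_ns(hit_cycles, cc_ns, miss_rate, miss_penalty_ns):
    return hit_cycles * cc_ns + miss_rate * miss_penalty_ns

def cpu_time_per_inst_ns(base_cpi, cc_ns, mem_refs_per_inst, miss_rate, miss_penalty_ns):
    return base_cpi * cc_ns + mem_refs_per_inst * miss_rate * miss_penalty_ns

# Parameters from the "Impact of Change in cc" slide
configs = [("direct mapped", 1.00, 0.014),
           ("2-way assoc.",  1.25, 0.010)]

for name, cc, miss_rate in configs:
    amat = amat_ns(1, cc, miss_rate, 75)
    cpu  = cpu_time_per_inst_ns(2, cc, 1.5, miss_rate, 75)
    print(f"{name}: AMAT = {amat:.3f} ns, CPU time = {cpu:.3f} x IC ns")
# direct mapped: AMAT = 2.050 ns, CPU time = 3.575 x IC ns
# 2-way assoc.:  AMAT = 2.000 ns, CPU time = 3.625 x IC ns  (better AMAT, worse CPU time)
```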
Miss Penalty for Out-of-Order (OOO) Execution Processors
• In OOO processors, memory stall cycles are overlapped with the execution of other instructions. The miss penalty should not include this overlapped part.
  mem stall cycles per instruction = mem misses per instruction x (total miss penalty - overlapped miss penalty)
• For the previous example: suppose 30% of the 75 ns miss penalty can be overlapped. What are the AMAT and CPU time?
  – Assume a direct mapped cache, with cc = 1.25 ns to handle out-of-order execution.
  AMATd = 1*1.25 + 0.014*(75*0.7) = 1.985 ns
  With 1.5 memory accesses per instruction,
  CPU time = (2*1.25 + 1.5*0.014*(75*0.7))*IC = 3.6025*IC < CPU time_2

Lock-Up Free Cache Using MSHRs (Miss Status Holding Registers)
• [Figure: a file of MSHR entries (mshr 1, mshr 2, ..., mshr n). Each entry holds a valid bit (1 bit), the block request address (32 bits), and source node bits (16 bits), with a comparator per entry to match incoming addresses against outstanding misses.]

Avg. Memory Access Time vs. Miss Rate
• Associativity reduces the miss rate, but increases the hit time due to the increase in hardware complexity!
• Example: for an on-chip cache, assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the direct mapped CCT

  Cache Size (KB)   1-way   2-way   4-way   8-way
  1                 2.33    2.15    2.07    2.01
  2                 1.98    1.86    1.76    1.68
  4                 1.72    1.67    1.61    1.53
  8                 1.46    1.48    1.47    1.43
  16                1.29    1.32    1.32    1.32
  32                1.20    1.24    1.25    1.27
  64                1.14    1.20    1.21    1.23
  128               1.10    1.17    1.18    1.20

  (In the original slide, red entries mark cases where A.M.A.T. is not improved by more associativity.)

Unified vs Split Caches
• Unified vs separate I&D caches
  – [Figure: one organization has the processor backed by a unified L1 (Unified Cache-1) and a unified L2 (Unified Cache-2); the other has separate I-Cache-1 and D-Cache-1 in front of Unified Cache-2.]
• Example:
  – 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
  – 32KB unified: aggregate miss rate = 1.99%
• Which is better (ignoring the L2 cache)?
  – Assume 33% data ops, so 75% of accesses come from instructions (1.0/1.33)
  – Hit time = 1, miss time = 50
  – Note that a data hit incurs 1 extra stall cycle in the unified cache (it has only one port)
  AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
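The split vs. unified comparison above can be checked with a short calculation. The sketch below (not from the slides) uses the same miss rates and penalty; the extra_data_stall parameter is my own name for the one-cycle penalty that the single-ported unified cache adds to data accesses.

```python
# Sketch (not from the slides) reproducing the split vs unified AMAT example.
# extra_data_stall models the single-ported unified cache, which costs data
# accesses one additional cycle when competing with instruction fetches.

def amat(inst_frac, inst_miss, data_miss, hit, penalty, extra_data_stall=0):
    inst_time = hit + inst_miss * penalty
    data_time = hit + extra_data_stall + data_miss * penalty
    return inst_frac * inst_time + (1 - inst_frac) * data_time

split   = amat(0.75, 0.0064, 0.0647, hit=1, penalty=50)
unified = amat(0.75, 0.0199, 0.0199, hit=1, penalty=50, extra_data_stall=1)
print(f"split (Harvard): {split:.3f} cycles")   # about 2.05, as on the slide
print(f"unified:         {unified:.3f} cycles") # about 2.24, as on the slide
```

Despite its lower aggregate miss rate, the unified cache loses here because its single port serializes instruction and data accesses, which is exactly the point the slide makes.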