Computer System Architecture Final Examination Sample Problems May 12, 1999 Professor Arvind
Name: _______________________

Remember to write your name on every page!!!

This is an open book, open notes exam.
180 Minutes
21 Pages
New Questions: 3-7, Part 4

Question 1 (2 parts): ________ 15 Points
Question 2 (1 part):  ________ 24 Points
Question 3 (4 parts): ________ 26 Points
Question 4 (3 parts): ________ 15 Points
Question 5 (4 parts): ________ 29 Points
Question 6 (3 parts): ________ 26 Points
Question 7 (3 parts): ________ 25 Points
Total:                ________ 160 Points

Question 1. Virtual Memory

Consider a byte-addressed system with 32-bit virtual and 24-bit physical addresses and 4096-byte pages.

[Figure: 32-bit virtual address, bit 31 down to bit 0]

Question 1.1 (5 points)
How large can a direct-mapped, physically addressed cache be in a design where the cache and TLB are accessed in parallel?

Question 1.2 (10 points)
The initial contents of the TLB and page table are shown below:

TLB:
VPN     PPN   V   D
00020   017   1   0
00046   089   1   0

Page Table:
VPN     PPN   V   D
00020   017   1   0
00021   083   0   0
00022   022   1   0
….
00046   089   1   0
00047   054   0   0
00048   035   0   0
00049   073   1   0
00050   054   1   0
….

Note: All addresses are shown in hexadecimal. The virtual page number is specified by the high-order bits of the virtual address. For example, given the virtual address 0x64f3c, the corresponding VPN would be 00064.

Suppose you are given the following code:

         Address   Instruction
line 1   0x20fec   addi R1, R0, 0x46000
line 2   0x20ff0   lw   R2, 0x0(R1)
line 3   0x20ff4   lw   R3, 0x1000(R1)
line 4   0x20ff8   lw   R4, 0x1100(R1)
line 5   0x20ffc   lw   R5, 0x3000(R1)
line 6   0x21000   lw   R6, 0x3100(R1)
line 7   0x21040   sw   R5, 0x3000(R1)
         ….

Identify the lines that cause TLB misses.

Identify the lines that cause page faults.

Question 2. Analysis of Two-level Caches

buf is an R-byte character array. The inner loop in the following program fetches, in order, the characters whose position is a multiple of S in buf. The outer loop repeats the process indefinitely.
    char buf[R];
    while( true ) {
        i = 0;
        while (i < R) {
            dummy = buf[i];      ;; Time this Fetch ;;
            i = i + S;
        }
    }

The program is executed on a byte-addressable machine where a character is a byte. For a range of different R's and S's, we measured the average latency (in clock ticks) of the character fetch in the inner loop. The results are tabulated below (rows are R, columns are S; entries exist only for S <= R):

R\S    2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9 2^10 2^11 2^12 2^13 2^14 2^15 2^16 2^17 2^18 2^19 2^20 2^21 2^22 2^23
2^0      1
2^1      1    1
2^2      1    1    1
2^3      1    1    1    1
2^4      1    1    1    1    1
2^5      1    1    1    1    1    1
2^6      1    1    1    1    1    1    1
2^7      1    1    1    1    1    1    1    1
2^8      1    1    1    1    1    1    1    1    1
2^9      1    1    1    1    1    1    1    1    1    1
2^10     1    1    1    1    1    1    1    1    1    1    1
2^11     1    1    1    1    1    1    1    1    1    1    1    1
2^12     1    1    1    1    1    1    1    1    1    1    1    1    1
2^13     1    1    1    1    1    1    1    1    1    1    1    1    1    1
2^14     1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
2^15   1.5  2.1  3.2  5.5   10   10   10   10   10   10   10   10    1    1    1    1
2^16   1.5  2.1  3.2  5.5   10   10   10   10   10   10   10   10   10    1    1    1    1
2^17   1.5  2.1  3.2  5.5   10   10   10   10   10   10   10   10   10   10    1    1    1    1
2^18   1.5  2.1  3.2  5.5   10   10   10   10   10   10   10   10   10   10   10    1    1    1    1
2^19   1.5  2.1  3.2  5.5   10   10   10   10   10   10   10   10   10   10   10   10    1    1    1    1
2^20     2  3.3  5.7   10   20   30   50   50   50   50   50   50   50   50   50   50   50    1    1    1    1
2^21     2  3.3  5.7   10   20   30   50   50   50   50   50   50   50   50   50   50   50   50    1    1    1    1
2^22     2  3.3  5.7   10   20   30   50   50   50   50   50   50   50   50   50   50   50   50   50    1    1    1    1
2^23     2  3.3  5.7   10   20   30   50   50   50   50   50   50   50   50   50   50   50   50   50   50    1    1    1    1

Question 2.1 (24 points)
You are informed that the computer has two levels of caches and does not support virtual address translation (i.e., programs use physical addresses directly). L2 is inclusive of L1. The caches use LRU when appropriate. Based on the table, deduce the cache size, the block size (a.k.a. cacheline size), and the associativity of L1 and L2. If you do not have enough information, you should give the tightest bound possible. Support your answer with a brief explanation. (Simply identifying a collection of rows or columns is not acceptable.)

Cache   Parameter       Value       Explanation
L1      Cache Size
        Block Size
        Associativity
L2      Cache Size
        Block Size
        Associativity

Question 3.
Branch Prediction

In this problem, we will examine two branch prediction schemes and compare their relative performance.

Specification of the Benchmark

To analyze the performance of these branch prediction schemes, let's consider the following code sequence. (Assume i_max and j_max are both greater than or equal to one.)

    for (i=1; i<=i_max; i++) {
        for(j=1; j<=j_max; j++) {
            <body statements here>
        }
    }

Suppose the above code is compiled to the following assembly code sequence.

            ADDI Ri, R0, _i_max     ; Ri <- i_max
    _outer: ADDI Rj, R0, _j_max     ; Rj <- j_max
    _inner: <body statements here>  ;
            . . .                   ;
            ADDI Rj, Rj, #-1        ; Rj <- Rj - 1
    Bj:     BNEZ Rj, _inner         ; branch if (Rj != 0)
            ADDI Ri, Ri, #-1        ; Ri <- Ri - 1
    Bi:     BNEZ Ri, _outer         ; branch if (Ri != 0)

The actual branch outcomes for this code exhibit the following pattern: Bj is taken j_max-1 times, then Bj is not taken once, then Bi is taken once. The same pattern is then repeated i_max-1 more times, except Bi is not taken for the last repetition, when the program terminates. For example, for i_max=4, j_max=3, the pattern is 1101 1101 1101 1100 (1 = taken, 0 = not taken).

Two-bit Saturation Counter Prediction Scheme

[Figure: the branch PC indexes a table of 2-bit counters, which supplies the prediction]

We have studied this scheme in lecture (Slide L15-16) and in the practice final exam ('97, Question 3.1). As shown in the figure above, the lower-order bits of the branch address are used as an index into a table of two-bit counters. The content of these counters is the same as the BP bits in a branch target buffer (BTB). A two-bit counter encodes four states, and is updated as shown below. Note that the branch prediction is equal to the high-order bit of the counter (1 = taken, 0 = not taken).
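As a quick check on the benchmark's outcome pattern described above, the following Python sketch replays the two count-down loops and records each resolution of Bj and Bi. The function name branch_outcomes and the grouping of the output by outer iteration are assumptions made for illustration; they are not part of the exam.

```python
def branch_outcomes(i_max, j_max):
    """Return the taken (1) / not-taken (0) pattern of Bj and Bi,
    grouped by outer-loop iteration and separated by spaces."""
    groups = []
    for ri in range(i_max, 0, -1):          # Ri counts down from i_max to 1
        bits = []
        for rj in range(j_max, 0, -1):      # Rj counts down from j_max to 1
            # Bj is evaluated after "ADDI Rj, Rj, #-1": taken while Rj-1 != 0
            bits.append('1' if rj - 1 != 0 else '0')
        # Bi is evaluated after "ADDI Ri, Ri, #-1": taken while Ri-1 != 0
        bits.append('1' if ri - 1 != 0 else '0')
        groups.append(''.join(bits))
    return ' '.join(groups)


if __name__ == "__main__":
    print(branch_outcomes(4, 3))  # 1101 1101 1101 1100
```

For i_max=4, j_max=3 this reproduces the pattern quoted in the problem statement; other values can be tried the same way when working through the prediction questions.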
current state   prediction   next state if          next state if
                             actually taken         actually not taken
00              not taken    01                     00
01              not taken    11                     00
10              taken        11                     00
11              taken        11                     10

[Figure: state-transition diagram of the two-bit counter]

Question 3.1 (6 points)
Assume all the counters are initialized to weakly-taken (state 10). Give the number of mispredictions (in terms of i_max and j_max, if applicable) and circle the final states of the counters for the branches Bi and Bj when the above code sequence is executed using the two-bit saturation counter prediction scheme.

Bi: _________ mispredictions, final state of counter is: 00 01 10 11 (circle one).
Bj: _________ mispredictions, final state of counter is: 00 01 10 11 (circle one).

N-bit Global History Correlating Prediction Scheme

[Figure: the global history register indexes a table of 2-bit counters, which supplies the prediction]

In this scheme, an N-bit global history register is used to store the outcomes of the last N branch resolutions. This history register is used as an index into the counter table (with 2^N 2-bit counters), from which the branch prediction is taken. The two-bit counters are updated in the same manner as in the two-bit saturation counter prediction scheme. Each time a branch outcome is resolved, the global history register is also updated as follows: all bits are left-shifted by one, and the right-most bit is updated with the most recent branch outcome (1 = taken, 0 = not taken).

Question 3.2 (8 points)
Assume the global history register is initialized to all zeros and all the counters are initialized to weakly-taken (state 10). For i_max=100, j_max=2, and N=3, fill in the states of the table of two-bit counters when Ri=17, Rj=1, and the PC is in the loop body part of the benchmark assembly code.

Global History Register Bits     2-bit Counter Bits
(oldest … newest)
000
001
010
011
100
101
110
111

Question 3.3 (6 points)
Assume all the counters are initialized to weakly-taken (state 10).
Give the number of mispredictions and circle the final states of the counters for the branches Bi and Bj when the above code sequence is executed using the N-bit global history correlating counter prediction scheme.

Bi: _________ mispredictions, final state of counter is: 00 01 10 11 (circle one).
Bj: _________ mispredictions, final state of counter is: 00 01 10 11 (circle one).

Question 3.4 (6 points)
How well does this scheme do for larger values of j_max (i.e., when the number of iterations of the inner loop increases)? Explain your answer by giving a rough estimate of the number of mispredictions (in terms of i_max and j_max, if appropriate).

Question 4. Cache Coherence

After examining the rules for cache coherence given in (L22-12-14), Ben Bitdiddle comes up with the following rule, which he thinks will improve the performance of his protocol.

Pushout Rule (Child to Parent)
    <id, Cell(a, u, (Ex, W(id_k))) | m>, <id_k, Cell(a, v, (Ex, R(dir))) | m_k>
  → <id, Cell(a, v, (Ex, R(ε))) | m>, <id_k, m_k>

Question 4.1 (5 points)
Show that the Pushout Rule is correct because its behavior can be simulated by the other caching rules.

Question 4.2 (5 points)
Give a scenario which shows that the Pushout Rule can do better than the rules given in class.

Question 4.3 (5 points)
Suppose we replace the Writeback Rule (L22-14) with the Pushout Rule. Can the new set of rules still show all the behaviors that the original set of rules could show? Argue by showing that the Writeback Rule can be simulated by the Pushout Rule.

_____________________________________________________________________________
(Do not write below this line)

Question 5. Superscalar AX

This problem explores the issues in building a superscalar AX from the pipelined AX described in Lecture 13.
Question 5.1 (2 points)
Suppose the following instructions are in the pipeline of the pipelined AX, all of which have already passed the decode (ID) stage but none of which has completed the write-back (WB) stage:

I1: ADD R1, R2, R3;     Regs[R1] <- Regs[R2] + Regs[R3]
I2: LW  R4, 0(R3);      Regs[R4] <- Mem[0+Regs[R3]]
I3: ADD R5, R6, R7;     Regs[R5] <- Regs[R6] + Regs[R7]

Suppose the following instruction is in the fetch (IF) stage:

I4: ADD R8, R4, R5;     Regs[R8] <- Regs[R4] + Regs[R5]

Can we dispatch instruction I4 into the decode stage? Explain.

_____________________________________________________________________________
(Do not write below this line)

Recall from Lecture Slide L13-11, the Op decode rule for the pipelined AX is given by:

Op decode rule
    Proc((ia, rf, IB(sia, r:=Op(r1, r2));bsD, bsE, bsM, bsW), im, dm)
      if r1 ∉ Dest(bsE) and r2 ∉ Dest(bsE) and
         r1 ∉ Dest(bsM) and r2 ∉ Dest(bsM) and
         r1 ∉ Dest(bsW) and r2 ∉ Dest(bsW)
  → Proc((ia, rf, bsD, bsE;ITB(sia, r:=Op(rf[r1], rf[r2])), bsM, bsW), im, dm)

To simplify the boolean expression in the predicate, let's define the following notation:

Sources(inst)       ≡ source register(s) of instruction inst
Dest(inst)          ≡ destination register of instruction inst (or instruction template), if any
Dest(bs)            ≡ the union of all destination registers of all instructions in buffer bs
Dests(a1,a2,...,an) ≡ Dest(a1) ∪ Dest(a2) ∪ ... ∪ Dest(an)

With these new definitions, the decode stage rules can be combined into one rule:

Decode rule
    Proc((ia, rf, IB(sia, inst1);bsD, bsE, bsM, bsW), im, dm)
      if for all s ∈ Sources(inst1), s ∉ Dests(bsE, bsM, bsW)
  → Proc((ia, rf, bsD, bsE;ITB(sia, it1), bsM, bsW), im, dm)
    where it1 is the instruction template for instruction inst1 (i.e., the source registers of inst1 have been replaced by appropriate values from the register file)

Now, suppose we extend the pipelined AX to support the dispatching of two instructions at a time.
For this new machine, which we will call AX2, at every clock cycle:
(1) Two instructions are fetched from the instruction memory;
(2) Up to two instructions can propagate to the next stage in the pipeline.
Here, we assume the additional hardware (e.g., extra read/write ports for the register file and memory, another ALU, additional data paths and muxes) necessary for implementing AX2 is available.

Question 5.2 (2 points)
Ben Bitdiddle is writing a new set of TRS rules for the AX2. He begins with the fetch stage rule:

Fetch stage rule
    Proc((ia, rf, bsD, bsE, bsM, bsW), im, dm)
  → Proc((ia+1, rf, bsD;IB(ia, inst1);IB(ia, inst2), bsE, bsM, bsW), im, dm)
    where inst1 = im[ia], inst2 = im[ia+1]

Circle and correct the one mistake Ben made in writing the fetch stage rule for AX2.

Question 5.3 (10 points)
There are two decode stage rules for AX2: one rule for dispatching two instructions to the execute stage; another rule for dispatching only one instruction to the execute stage. Complete the two decode stage rules by providing the predicates and filling in the blanks in the terms to the right of the arrows.

Decode stage rules
    Proc((ia, rf, IB(sia,inst1);IB(sia+1,inst2);bsD, bsE, bsM, bsW), im, dm)
      if
  → Proc((____________________________________________________), im, dm)
    where it1 and it2 are the instruction templates for instructions inst1 and inst2, respectively.

    Proc((ia, rf, IB(sia,inst1);IB(sia+1,inst2);bsD, bsE, bsM, bsW), im, dm)
      if
  → Proc((____________________________________________________), im, dm)
    where it1 and it2 are the instruction templates for instructions inst1 and inst2, respectively.

Question 5.4 (15 points)
There are six execute stage rules, depending on the types of instructions to be executed.
Ben wrote one of the execute stage rules:

Execute stage rules
    Proc((ia, rf, bsD, ITB(sia, it1);ITB(sia+1, it2);bsE, bsM, bsW), im, dm)
      if it1 ≠ r:=Jz(-,-) and it2 ≠ r:=Jz(-,-)
  → Proc((ia, rf, bsD, bsE, bsM;ITB(sia, it1*);ITB(sia+1, it2*), bsW), im, dm)
    where it1* and it2* are executed versions of it1 and it2, respectively

Complete the remaining five execute stage rules by filling in the blanks in the terms to the right of the arrows. Note that the terms (but not the predicates) to the left of the arrows are identical for all six execute stage rules. For clarity, only the predicates are provided below.

      if it1 = r:=Jz(0,nia)
  → Proc((____________________________________________________), im, dm)

      if it1 = r:=Jz(v,-), v ≠ 0 and it2 = r:=Jz(0,nia)
  → Proc((____________________________________________________), im, dm)

      if it1 ≠ r:=Jz(-,-) and it2 = r:=Jz(v,-), v ≠ 0
  → Proc((____________________________________________________), im, dm)

      if it1 ≠ r:=Jz(-,-) and it2 = r:=Jz(0,nia)
  → Proc((____________________________________________________), im, dm)

      if it1 = r:=Jz(v,-), v ≠ 0 and it2 = r:=Jz(v,-), v ≠ 0
  → Proc((____________________________________________________), im, dm)

Question 6. Sequential Consistency and Out-of-order Execution

Consider the following parallel program for two processors.

Processor 1             Processor 2
Store α, 10             Store γ, 100
R1 ← Load β             R1 ← Load α
R2 ← Load γ             Store β, R1
R3 ← R1 + R2

α, β, and γ are three distinct addresses. Initially mem[α]=mem[β]=mem[γ]=0.

Question 6.1 (8 points)
Suppose Processor 1 and Processor 2 are the speculative, out-of-order processors (Ps) described in (L15-7~12). Assume the processors have no caches and their data memories (dm) are shared. Ps rules speculate and reorder the execution of instructions other than Loads and Stores, which are dispatched from the ROB in order.
Ps-Load Rule:
    Proc((ia, rf, ROB(t, ia, r:=Load(a));rob, btb), im, dm)
  → Proc((ia, rf, ROB(t, ia, r:=dm[a]);rob, btb), im, dm)

Ps-Store Rule:
    Proc((ia, rf, ROB(t, ia, Store(a, v));rob, btb), im, dm)
  → Proc((ia, rf, rob, btb), im, dm[a:=v])

These rules can ensure sequential consistency in a system without caches. What are the possible values in R3 of Processor 1 at the end of an execution?

Question 6.2 (8 points)
PSR is identical to Ps except in its Load and Store dispatch rules, which are given below.

PSR-Load Rule:
    Proc((ia, rf, rob1;ROB(t, ia, r:=Load(a));rob2, btb), im, dm)
      if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1
  → Proc((ia, rf, rob1;ROB(t, ia, r:=dm[a]);rob2, btb), im, dm)

PSR-Store Rule:
    Proc((ia, rf, rob1;ROB(t, ia, Store(a, v));rob2, btb), im, dm)
      if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1 and
         -:=Load(a) ∉ rob1 and -:=Load(t’’) ∉ rob1
  → Proc((ia, rf, rob1;rob2, btb), im, dm[a:=v])

Give a value for R3 of Processor 1 at the end of an execution that is allowed by PSR but not by PS, and number the instructions below from 1 to 7 to indicate an execution order that would lead to your answer.

R3=__________________.

Order   Processor 1             Order   Processor 2
____    Store α, 10             ____    Store γ, 100
____    R1 ← Load β             ____    R1 ← Load α
____    R2 ← Load γ             ____    Store β, R1
____    R3 ← R1 + R2

Question 6.3 (10 points)
A memory-barrier instruction is introduced in PSRB to restore sequential consistency. PSRB is identical to PSR except in its Load and Store dispatch rules and the mem-barrier dispatch rule.
PSRB-Load Rule:
    Proc((ia, rf, rob1;ROB(t, ia, r:=Load(a));rob2, btb), im, dm)
      if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1 and mem-barrier ∉ rob1
  → Proc((ia, rf, rob1;ROB(t, ia, r:=dm[a]);rob2, btb), im, dm)

PSRB-Store Rule:
    Proc((ia, rf, rob1;ROB(t, ia, Store(a, v));rob2, btb), im, dm)
      if Store(a,-) ∉ rob1 and Store(t’,-) ∉ rob1 and
         -:=Load(a) ∉ rob1 and -:=Load(t’’) ∉ rob1 and mem-barrier ∉ rob1
  → Proc((ia, rf, rob1;rob2, btb), im, dm[a:=v])

PSRB-Mem-Barrier Rule:
    Proc((ia, rf, ROB(t, ia, mem-barrier);rob, btb), im, dm)
  → Proc((ia, rf, rob, btb), im, dm)

Memory barriers can be inserted in a program to make its behavior sequentially consistent, i.e., the same as PS, as shown below:

Processor 1             Processor 2
Store α, 10             Store γ, 100
mem-barrier             mem-barrier
R1 ← Load β             R1 ← Load α
mem-barrier             mem-barrier
R2 ← Load γ             Store β, R1
mem-barrier
R3 ← R1 + R2

Cross out the extra mem-barrier instructions that are not necessary to guarantee sequential consistency for this particular program.

Question 7. Atomic Operations and Cache Coherence Too . . .

Consider the function below to increment a counter.

    Increment(int *counter) {
        R=M[counter];
        R=R+1;
        M[counter]=R;
    }

If multiple processes could increment the same counter simultaneously, reading and updating the counter needs to be performed atomically. In this problem, you are asked to implement the function using different atomic operations. You may use a mixture of pseudocode and DLX assembly. You can assume the existence of temporary registers. Assuming your implementation is intended for a cache-coherent multiprocessor system, make your implementation as efficient as possible in terms of memory and cache subsystem operations.

Question 7.1 (10 points)
Give an implementation of Increment() using the Swap instruction (L20-19).

    Swap(m,R):
        Rt ← M[m];
        M[m] ← (R);
        R ← (Rt);

Question 7.2 (10 points)
Give an implementation of Increment() using the Compare&Swap instruction (L20-21).
    Compare&Swap(m,Rt,Rs):
        if (Rt==M[m])
        then M[m]=Rs;
             Rs=Rt;
             status ← success;
        else status ← fail;

Question 7.3 (5 points)
In terms of memory and cache subsystem operations, give an advantage or a disadvantage of the Compare&Swap atomic instruction relative to the Load-reserve/Store-conditional combination (L20-22).

    Load-reserve(m,R):
        <reserve, address> ← <1, m>;
        R ← M[m];

    Store-conditional(m,R):
        if <reserve, address> = <1, m>
        then cancel other processors’ reservation on m;
             M[m] ← (R);
             status ← succeed;
        else status ← fail;

Part 4: Scheduling an Irregular Instruction Pipeline

TIPS Inc. hires you to add a multiply instruction to their integer DLX2000. The original DLX2000 only supports a subset of memory (LW, SW only) and ALU/ALUi (ADD, ADDI, SUB, SUBI, etc.) instructions. You can ignore branch/jump for this part of the quiz. Their 5-stage pipeline is similar to what was presented in Lectures 9 and 10. Operands to all instructions are required at the beginning of the E stage. The result of an ALU instruction is available by the end of the E stage. The result of LW is available by the end of the M stage. Unless stalled, all instructions must follow the same execution template, even though some stages may be idle for some instructions.

        t      t+1    t+2    t+3    t+4
F       Inst
D              Inst
E                     Inst
M                            Inst
W                                   Inst

The pipeline is fully bypassed. The F and D stages only stall in the following condition. (The right-hand-side signals are defined in L10-10.)

Stalloriginal = ( opcodeE == LW ) and
                { [ ( wsE == rf1D ) and re1D ] or [ ( wsE == rf2D ) and re2D ] }

IMUL performs half-length integer multiplication (only computes a 32-bit product) on 2 source and 1 destination registers in GPR.

IMUL:   Regs[Rf3] ← Regs[Rf1] × Regs[Rf2]

The integer multiply unit (IMU) is separate from the main ALU. IMU has 3 stages and is pipelined to accept a new multiplication on each cycle. The multiplicands are required at the beginning of IMU1 and the product is available at the end of IMU3.
The template for IMUL is:

        t      t+1    t+2    t+3    t+4    t+5
F       IMUL
D              IMUL
E
M
W                                          IMUL
IMU1                  IMUL
IMU2                         IMUL
IMU3                                IMUL

Question 13 (6 points):
TIPS requires you to implement an in-order issue and in-order completion design. Briefly describe any new pipeline hazards introduced by the addition of IMUL and IMU to DLX2000.

Data Hazard:

Structural Hazard:

Question 14 (6 points):
Assuming all feasible data bypasses to the D stage have been provided, specify the necessary conditions for stalling the F and D stages to resolve the remaining hazards. (For questions 14 and 15, you must assume none of the other stages can be stalled.) Your equation can make use of weX, wsX, re1X, re2X, rf1X, rf2X and opcodeX from stages: D, E, M, W, IMU1, IMU2, and IMU3. (The signals from L10-10 are extended for IMUL below.) The opcode of an unoccupied stage is NOP. Your answer can also make use of Stalloriginal from the original DLX2000 pipeline without IMU.

wsX  = Case opcodeX
         ALU, IMUL                               ⇒ rf3
         ALUi, LW                                ⇒ rf2
         JAL, JALR                               ⇒ 31

weX  = Case opcodeX
         ALU, ALUi, LW, IMUL, JAL, JALR          ⇒ (wsX ≠ r0)
         ...                                     ⇒ off

re1X = Case opcodeX
         ALU, ALUi, LW, SW, IMUL, BZ, JR, JALR   ⇒ on
         J, JAL                                  ⇒ off

re2X = Case opcodeX
         ALU, SW, IMUL                           ⇒ on
         ...                                     ⇒ off

Stallnew =

Question 15 (5 points): This is the last question of the midterm
Complete the pipeline resource diagram for the following instruction sequence. You need to minimize the number of stalls. The first instruction has been filled in for you.

I1: LW    0(r1), r2      ;; r2 ← M[(r1)]
I2: IMULT r2, r3, r4     ;; r4 ← (r2) × (r3)
I3: IMULT r1, r5, r1     ;; r1 ← (r1) × (r5)
I4: ADD   r4, r5, r6     ;; r6 ← (r4) × (r5)
I5: IMULT r7, r6, r8     ;; r8 ← (r7) × (r6)
I6: ADD   r2, r3, r9     ;; r9 ← (r2) + (r3)

        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
F       I1
D           I1
E               I1
M                   I1
W                       I1
IMU1
IMU2
IMU3
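The signal definitions and Stalloriginal given for Question 14 can be read as ordinary predicates over the instructions held in each stage. The Python sketch below encodes only what the problem statement gives; it does not attempt Stallnew. The dictionary representation of an instruction, the helper inst(), and the lower-case function names are assumptions made for illustration and are not part of the exam.

```python
# Opcode groups copied from the case tables for wsX, re1X and re2X.
WRITES_RF3 = {"ALU", "IMUL"}
WRITES_RF2 = {"ALUi", "LW"}
WRITES_R31 = {"JAL", "JALR"}
READS_RF1 = {"ALU", "ALUi", "LW", "SW", "IMUL", "BZ", "JR", "JALR"}
READS_RF2 = {"ALU", "SW", "IMUL"}


def inst(opcode="NOP", rf1=0, rf2=0, rf3=0):
    """Hypothetical record for the instruction in one stage (NOP if empty)."""
    return {"opcode": opcode, "rf1": rf1, "rf2": rf2, "rf3": rf3}


def ws(x):
    """wsX: destination register number, or None if none is written."""
    if x["opcode"] in WRITES_RF3:
        return x["rf3"]
    if x["opcode"] in WRITES_RF2:
        return x["rf2"]
    if x["opcode"] in WRITES_R31:
        return 31
    return None


def we(x):
    """weX: on only when a destination exists and it is not r0."""
    dest = ws(x)
    return dest is not None and dest != 0


def re1(x):
    """re1X: the instruction reads the register named by rf1."""
    return x["opcode"] in READS_RF1


def re2(x):
    """re2X: the instruction reads the register named by rf2."""
    return x["opcode"] in READS_RF2


def stall_original(d, e):
    """Stalloriginal: a load in E whose destination is a source of D."""
    if e["opcode"] != "LW":
        return False
    dest = ws(e)
    return (dest == d["rf1"] and re1(d)) or (dest == d["rf2"] and re2(d))
```

For example, with I1 (LW 0(r1), r2) in E and I2 (IMULT r2, r3, r4) in D, stall_original(inst("IMUL", 2, 3, 4), inst("LW", 1, 2)) evaluates to True, which is the usual load-use interlock.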