Some Sample Problems For Exam 2
1. (30%) Consider the following code segment in MIPS.

    LOOP: LD    F2, 100(R2)     : load to F2 value from memory
          ADDD  F4, F2, F8      : F4 = F2+F8
          MULD  F6, F4, F10     : F6 = F4*F10
          SD    100(R2), F6     : store F6 to memory
          ADDDI R2, R2, #8      : R2 = R2+8
          BNE   R2, R6, LOOP

Assume that you have the normal 5-stage pipeline (Instruction Fetch, Instruction Decode, Execute, Memory and Write-back). We will use the following latencies:

    Integer operations    1
    LD and STORE          2
    Floating Add          3
    Floating Mult         5

a) Show how many cycles are needed to complete one iteration of the loop without reordering the code.
b) Use delayed branch and show the number of cycles needed. You can also reorder the code if you see opportunities.

Key.
a) Note that the latencies indicate when the results are available for use. So the result of LD can be used 2 cycles after LD starts, and if we assume the data can be forwarded to the instruction that needs it, we need only one stall after LD before ADDD can use the LD result. By the same reasoning, ADDD (latency 3) is followed by 2 stalls and MULD (latency 5) by 4 stalls.

    LOOP: LD    F2, 100(R2)
          Stall
          ADDD  F4, F2, F8
          Stall
          Stall
          MULD  F6, F4, F10
          Stall
          Stall
          Stall
          Stall
          SD    100(R2), F6
          ADDDI R2, R2, #8
          BNE   R2, R6, LOOP
          Stall

I included a stall after BNE since we do not know if the branch will be taken or not. With this we have 14 cycles to complete one iteration.

b) We can move SD into the branch delay slot and also reorder some instructions to reduce the number of cycles needed. We have to modify the offset of the SD instruction since R2 is incremented by 8 before SD executes.

    LOOP: LD    F2, 100(R2)
          Stall
          ADDD  F4, F2, F8
          Stall
          Stall
          MULD  F6, F4, F10
          Stall
          Stall
          ADDDI R2, R2, #8
          BNE   R2, R6, LOOP
          SD    92(R2), F6

Now we need 11 cycles to complete one iteration.

Comments. Some of you assumed that a latency of 2 meant you needed 2 stalls. Some of you had difficulty in reordering instructions or using delay slots.

2. (30%) Consider that you want to build a pipelined processor to implement ACCUMULATOR type instructions.
For the purpose of this problem, we will assume that there is only one general purpose register, the accumulator (ACC for short). So we will have instructions like:

    LOAD address    : load to accumulator from address
    ADD  address    : add to accumulator the value at address; result is stored back into ACC

Likewise you can have other instructions for Subtract, Multiply and Store. We can also have ADDI, with “address” replaced by a constant value:

    ADDI #literal

Branch instructions can compare the accumulator to zero, so we can have BZ, BNZ, BNEG and BPOS. For these instructions, “address” is the displacement that will be added to PC. Sketch pipelined data paths for this type of architecture.

Key. For this problem I will use 4 stages: Instruction Fetch, Decode/Memory, Execute and Writeback.

Instruction Fetch. Fetch the instruction and increment PC by 4.

Decode/Memory. Decode the instruction and access data memory (either get an operand or store the contents of the accumulator into memory); also read the data in the accumulator to be used in an arithmetic operation. Note that if the instruction has a literal value, then the literal value will be used in the arithmetic operation instead of the data read from memory. If the instruction is Store, write the contents of the accumulator to memory.

Execute. If the instruction is an arithmetic instruction, perform the arithmetic operation on the value from the accumulator and either the data from memory or the literal value from the instruction. At the same time we can also test the value in the accumulator to see if it is zero, non-zero, positive or negative.

Writeback. Write the result of the instruction back to the accumulator. The result may come from the arithmetic unit or from memory. Change PC as needed depending on the condition (if the instruction is a branch).

Comments. Some of you seem to be completely lost. Some of you used the same data paths as those for the MIPS architecture.
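The four stage descriptions above can be sketched as a small functional model in Python. This is only an illustration of which work happens in which stage, not a cycle-accurate pipeline: instructions are processed one at a time, so hazards between overlapping instructions are not modelled. The tuple encoding, the dictionary memory, the 4-byte instruction size and the choice to add the branch displacement to the already incremented PC are all assumptions made for the sketch.

```python
# Functional sketch of the 4-stage accumulator machine (not cycle-accurate).
# Assumptions, not from the problem: 4-byte instructions, memory as a dict,
# branch displacement added to the already incremented PC.

ARITH = {
    "ADD": lambda acc, x: acc + x,
    "SUB": lambda acc, x: acc - x,
    "MUL": lambda acc, x: acc * x,
}
BRANCH = {
    "BZ": lambda acc: acc == 0,
    "BNZ": lambda acc: acc != 0,
    "BNEG": lambda acc: acc < 0,
    "BPOS": lambda acc: acc > 0,
}


def run(program, mem, max_steps=1000):
    """Process (op, arg) pairs stage by stage and return the final ACC."""
    pc, acc = 0, 0
    for _ in range(max_steps):
        if pc < 0 or pc // 4 >= len(program):
            break
        # Instruction Fetch: read the instruction and increment PC by 4.
        op, arg = program[pc // 4]
        pc += 4
        # Decode/Memory: fetch the memory operand, or store ACC to memory.
        operand = None
        if op == "STORE":
            mem[arg] = acc
        elif op == "LOAD" or op in ARITH:
            operand = mem[arg]
        elif op == "ADDI":
            operand = arg  # the literal replaces the memory operand
        # Execute: arithmetic on ACC and the operand; test ACC for branches.
        result, taken = None, False
        if op in ARITH:
            result = ARITH[op](acc, operand)
        elif op == "ADDI":
            result = acc + operand
        elif op == "LOAD":
            result = operand
        elif op in BRANCH:
            taken = BRANCH[op](acc)
        # Writeback: update ACC, and change PC if the branch is taken.
        if result is not None:
            acc = result
        if taken:
            pc += arg
    return acc


# Count ACC down from mem[0] to zero, then store it to mem[4].
memory = {0: 3, 4: 99}
countdown = [("LOAD", 0), ("ADDI", -1), ("BNZ", -8), ("STORE", 4)]
final_acc = run(countdown, memory)
```

In the example, BNZ sits at address 8, so after the fetch PC is 12 and a displacement of -8 returns to the ADDI at address 4.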
[Figure: pipelined data path for the accumulator machine, showing PC with a +4 integer adder, Data Memory, Accumulator, Arithmetic Unit, the test for the branch condition, and the pipeline latches IF/MEM, MEM/EX and EX/WB.]

3. (15%) In a standard 5-stage pipeline, the decision about a branch is known in the 3rd stage (Execute). Consider two choices for branch instructions.

a) Use a delayed branch instruction with one delay slot (i.e., one instruction after the branch will be executed) and stop fetching any additional instructions upon discovering a branch instruction.
b) Use a delayed branch with 2 delay slots (i.e., two instructions after the branch will be executed).

In the first case, we have a stall on a branch and lose one cycle (even if we save a cycle because of the delay slot). In the second case there will be no stalls. However, it is more difficult to find two useful instructions that can be placed after a branch (and fill the 2 delay slots). If we cannot find a useful instruction for a delay slot, we use a NOOP, and for the purpose of this example we will assume that it is a wasted slot. Compare these two alternatives if 20% of all instructions are branches, and an optimizing compiler can find a useful instruction for one delay slot 80% of the time and can find useful instructions to fill 2 delay slots only 25% of the time.

Key. Consider the case with 1 delay slot. We still lose one cycle even if we can use the delay slot, and we lose 2 cycles if we cannot use the delay slot. Since we can use the one delay slot only 80% of the time, we have

    20%*1*80% + 20%*2*20% = 0.16 + 0.08 = 0.24 cycles lost per instruction.

In the second case, if we can use both delay slots we have no loss of cycles; if only one delay slot can be used we have a loss of one cycle; and if neither delay slot can be used we have a loss of two cycles. Thus we have

    20%*1*(100-25)%*80% + 20%*2*(100-25)%*20% = 0.12 + 0.06 = 0.18 cycles lost per instruction.

In this example, creating two delay slots is better.

Comments. Some of you compared the stalls just for branch instructions (that is OK with me).
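The arithmetic in the key for problem 3 can be checked with a few lines of Python. The variable names are illustrative; the probabilities are the ones given in the problem, and the two-slot option is split into its three cases (both slots filled, one filled, none filled).

```python
branch_fraction = 0.20   # fraction of all instructions that are branches
fill_one = 0.80          # compiler can fill a single delay slot
fill_both = 0.25         # compiler can fill both of two delay slots

# Option (a): one delay slot plus one stall on every branch.
# Slot filled: 1 cycle lost. Slot not filled: 2 cycles lost.
lost_a = branch_fraction * (fill_one * 1 + (1 - fill_one) * 2)

# Option (b): two delay slots, no stall.
p_both = fill_both                           # 0 cycles lost
p_one = (1 - fill_both) * fill_one           # 1 cycle lost
p_none = (1 - fill_both) * (1 - fill_one)    # 2 cycles lost
lost_b = branch_fraction * (p_both * 0 + p_one * 1 + p_none * 2)

print(f"one slot: {lost_a:.2f}, two slots: {lost_b:.2f}")
# prints "one slot: 0.24, two slots: 0.18"
```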
Some of you did not account for all the cases in the second option. You may be able to fill both delay slots with useful instructions (25% of the time), fill only one of the two delay slots with a useful instruction (80% of the remaining 75% of the time), or fill neither of the two delay slots.

4. (25%) For the code in problem #1 above, use the Scoreboard technique and show the contents of the Instruction Status and Functional Unit Status tables for two snapshots:

a) At the initial state when no instruction has completed execution.
b) When LD and ADDD have completed.

Remember that for Scoreboard we will assume one Integer unit (for LD, SD and integer arithmetic instructions), 2 Floating point Multiply units, one Floating point Add unit and one Floating point Divide unit. For your convenience, I am giving you the template for the Instruction Status and Functional Unit Status tables.

Key.
a) Before any instruction completed execution

Instruction Status

    Instruction            Issue   Read Operands   Execute   Write results
    LD    F2, 100(R2)        X          X
    ADDD  F4, F2, F8         X
    MULD  F6, F4, F10        X
    SD    100(R2), F6
    ADDDI R2, R2, #8
    BNE   R2, R6, Loop

Functional Unit Status

    Name      Busy   Op     Fi   Fj   Fk    Qj        Qk   Rj   Rk
    Integer   Yes    Load   F2   R2                        No
    MULT-1    Yes    Mult   F6   F4   F10   ADD            No   Yes
    MULT-2    No
    ADD       Yes    Add    F4   F2   F8    Integer        No   Yes
    DIV       No

Note that we show No under Rj for the Load instruction since in this snapshot we have already read R2 and we want to indicate that WAR is no longer an issue (if there is an instruction waiting to write to R2, it can proceed).
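Which instructions appear as issued in snapshot (a) follows from the scoreboard issue rule: an instruction issues, in program order, only if a functional unit of the required type is free and no active instruction has the same destination register (no WAW). The sketch below applies just that rule under the assumption that no unit is released in the meantime, which matches a snapshot taken before anything completes. The table `UNITS_FOR` and the function name are illustrative, and sending BNE to the Integer unit is an assumption consistent with the integer-unit remark in this key.

```python
# Functional units each opcode may use (one Integer, two FP multiply, one FP add).
UNITS_FOR = {
    "LD": ["Integer"], "SD": ["Integer"],
    "ADDDI": ["Integer"], "BNE": ["Integer"],
    "ADDD": ["ADD"],
    "MULD": ["MULT-1", "MULT-2"],
}

# (opcode, destination register or None) in program order.
PROGRAM = [
    ("LD", "F2"), ("ADDD", "F4"), ("MULD", "F6"),
    ("SD", None), ("ADDDI", "R2"), ("BNE", None),
]


def issue_until_blocked(program):
    """Issue in order until a structural or WAW hazard blocks issue."""
    busy = {}      # unit name -> destination register of its instruction
    issued = []
    for op, dest in program:
        free = [u for u in UNITS_FOR[op] if u not in busy]
        waw = dest is not None and dest in busy.values()
        if not free or waw:
            break  # in-order issue: later instructions must wait too
        busy[free[0]] = dest
        issued.append(op)
    return issued, busy


issued, busy = issue_until_blocked(PROGRAM)
print(issued)  # ['LD', 'ADDD', 'MULD']
```

SD is blocked because the single Integer unit is still held by LD, so ADDDI and BNE cannot issue either.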
b) After LD and ADDD completed

Instruction Status

    Instruction            Issue   Read Operands   Execute   Write results
    LD    F2, 100(R2)        X          X             X            X
    ADDD  F4, F2, F8         X          X             X            X
    MULD  F6, F4, F10        X          X
    SD    100(R2), F6        X
    ADDDI R2, R2, #8
    BNE   R2, R6, Loop

Functional Unit Status

    Name      Busy   Op     Fi   Fj   Fk    Qj   Qk       Rj   Rk
    Integer   Yes    SD          R2   F6         Mult-1   No   No
    MULT-1    Yes    Mult   F6   F4   F10                 No   No
    MULT-2    No
    ADD       No
    DIV       No

Note that for Multiply, I am assuming that the instruction has read its operands; thus both F4 and F10 are read and any other instruction waiting to write to these registers can proceed (no longer a WAR issue), as indicated by No in the Rj and Rk fields. Note that ADDI and BNE cannot proceed since they are waiting for the integer unit, which is now waiting to execute SD. If we reordered the code (and possibly used a delayed branch), we could have moved ADDI and BNE before SD, and we could have completed those instructions.

Comments. Some of you did not issue MULTD in the first snapshot. Remember you can issue an instruction if the functional unit is available and there is no WAW dependency on the destination register.

5. (35%) In some older architectures an indirect memory address mode is permitted. Consider the following instruction.

    LWI Rd, disp

This instruction uses disp as an indirect address. That is, use disp as a memory address, read the contents of memory at that address, and use the value just read as the address of the real operand (that is, read memory again and store the value in Rd). Consider the following example.

    LWI R2, 100

Let us assume that at memory address “100” we have a value of 1500. The instruction will use 100 as an indirect address and obtain the value 1500 stored at memory address 100. The actual address of the operand is 1500. If we have a value of -10 in memory location 1500, then the instruction will load -15 into R2. Note that such indirect addressing is applicable to both Load and Store. Describe how we can modify our pipeline design for DLX to implement the indirect address mode.
Show the pipeline stages and data paths to indicate which hardware units are accessed in each pipeline stage. Also describe in English the functionality of each stage. Indicate the number of read and write ports needed to the data memory to avoid structural hazards.

Key. As most of you discovered, there is a typo in the above example: the instruction using the indirect mode loads -10 into R2 (not -15 as indicated in the problem description). I am sorry for this typo that may have confused some of you.

Since indirect address mode requires that we access the memory twice, we need to design a pipeline with two Memory Access stages. Consider the following diagram.

[Figure: six-stage data path with stages Instr Fetch, Instr Decode/Reg. Fetch, Execute, Memory-1, Memory-2 and Write-Back. It shows PC with +4, Instr. Mem, Reg's, ALU and a Data Memory accessed in both memory stages, with paths labelled disp, Indirect Address, Indirect?, Operand (if direct), Operand (if indirect), store (if direct) and store (if indirect).]

To simplify, I have eliminated some of the data paths that are needed to handle branch instructions and to use immediate operands in instructions (i.e., sign extension hardware, testing for zero for branch instructions, etc.). The main change is the introduction of two separate memory stages. In Memory-1, we use the displacement from the instruction to fetch a data value. We check to see if the instruction is an indirect instruction; if so, we use the data fetched from memory as an address and fetch from memory again in Memory-2. The Mux in Memory-2 forwards either the data fetched in Memory-1 or the data fetched in Memory-2 to Write-back. Note that the decision is based on the opcode, which is obtained from the pipeline latches.

For store, we may also have direct or indirect addresses. If the opcode is SW (direct), then we store in the Memory-1 stage; if the opcode is SWI, we do the store in the Memory-2 stage. I hope the diagram is clear with these data paths.
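The behaviour of the two memory stages can be sketched in Python as follows. This models only what Memory-1 and Memory-2 do for direct and indirect loads and stores. It assumes the effective address is just the displacement and represents memory as a dictionary; both function names are illustrative.

```python
def load_stages(mem, disp, indirect):
    """Return the value passed to Write-back for LW (direct) or LWI (indirect)."""
    # Memory-1: use the displacement as an address and fetch a value.
    first = mem[disp]
    # Memory-2: for an indirect load, use that value as an address and fetch again.
    second = mem[first] if indirect else None
    # Mux in Memory-2: selected by the opcode carried in the pipeline latch.
    return second if indirect else first


def store_stages(mem, disp, value, indirect):
    """Perform SW (store in Memory-1) or SWI (store in Memory-2)."""
    if indirect:
        target = mem[disp]      # Memory-1 reads the real address
        mem[target] = value     # Memory-2 performs the store
    else:
        mem[disp] = value       # Memory-1 performs the store


# The example from the problem: address 100 holds 1500, address 1500 holds -10.
mem = {100: 1500, 1500: -10}
loaded_indirect = load_stages(mem, 100, indirect=True)    # LWI R2, 100
loaded_direct = load_stages(mem, 100, indirect=False)     # LW  R2, 100
print(loaded_indirect, loaded_direct)  # -10 1500
```

An indirect load reads the data memory in both stages and an indirect store reads in Memory-1 and writes in Memory-2, which is why the number of memory ports matters when such instructions overlap in the pipeline.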
As can be seen, we need two read ports and two write ports to the data memory to avoid stalls, since it is possible to have either two LWI in a sequence or a SWI followed by SW instructions in sequence. Consider the following examples.

    ...                  ...
    LWI R1, disp1        SWI disp1, R1
    LW  R2, disp2        SW  disp2, R2

The example on the left hand side shows why we need two read ports to data memory (assuming instructions are in a separate memory that will be accessed by IF; otherwise we need 3 read ports). The example on the right hand side shows why we need two write ports, since the second store (SW) stores in the Memory-1 stage while SWI stores in the Memory-2 stage.

Although I did not ask for dependencies, here is a bit of discussion on data dependencies. If we have the following sequence of instructions,

    LWI R1, disp
    ADD R3, R1, R2

the ADD will incur two stalls since the operand (R1) for ADD will not be available until LWI passes the Memory-2 stage (we can forward the data from MEM2/WB to ID/EX). Likewise, if we have

    LWI R1, disp
    Some instruction with no dependency on R1
    ADD R3, R1, R2

the ADD will incur one stall (and we need to forward the data from MEM2/WB to ID/EX). Note that we could have rearranged the pipe stages (say, IF, ID, Mem-1, Mem-2, EX, WB) to reduce or eliminate stalls from LW or LWI to an ALU instruction. However, this can cause additional stalls in branches, since the value of a register tested by a branch may not be available until EX.

6. (20%) Consider the following code segment.

    1: ADD  R3, R0, R7
    2: LW   R8, 0(R3)
    3: ADDI R3, R3, #4
    4: LW   R9, 0(R3)
    5: SUB  R1, R8, R9
    6: BNEG R1, Exit

Note R0 is hardwired to zero. List all dependencies (i.e., RAW, WAR, WAW) among these instructions. Use register renaming to eliminate as many of these dependencies as possible (and indicate which dependencies were eliminated).

Key.
RAW on R3 from 1 to 2.
RAW on R3 from 1 to 3.
RAW on R8 from 2 to 5.
RAW on R3 from 3 to 4.
RAW on R9 from 4 to 5.
RAW on R1 from 5 to 6.
WAW on R3 from 1 to 3.
WAR on R3 from 2 to 3.

If we assume that this code segment is not in a loop, we can eliminate the use of R3 for LW completely; R3 is the only register that causes the unnecessary WAW and WAR dependencies. So our new code looks like

    1: LW   R8, 0(R7)
    2: LW   R9, 4(R7)
    3: SUB  R1, R8, R9
    4: ADDI R3, R7, #4
    5: BNEG R1, Exit

Note that statement 4 is included because we do not know if the value of R3 is needed or not. Now we have only RAW (or true) dependencies on R8 and R9 between 1, 2 and 3, and a RAW on R1 between 3 and 5. If we assume compare and branch (Branch on Less Than to compare two registers), we can eliminate the RAW on R1.

    1: LW   R8, 0(R7)
    2: LW   R9, 4(R7)
    3: ADDI R3, R7, #4
    4: BLT  R8, R9, Exit

Note that when you rename R3 (as most of you did), you need to change the references to R3 to use the new register, like the second LW. Some of you did not consider that the value of R3 after the ADDI may be needed elsewhere.

7. (30%) You are given the following code. Note that floating-point instructions use floating-point registers labeled F. Integer instructions use integer registers labeled R. We are given the following latencies for instructions (that is, the dependent instruction must wait this many cycles for the data from the predecessor instruction).

    Floating Point Add/Sub                        2
    Floating Point Multiply                       3
    Load                                          1
    Integer arithmetic (using data forwarding)    0

    Loop: LD    F0, 0(R1)
          MULD  F0, F0, F2
          LD    F4, 0(R2)
          ADDD  F0, F0, F4
          SD    0(R2), F0
          SUBUI R1, R1, #8
          SUBUI R2, R2, #8
          BNEZ  R1, Loop

Assuming a single-issue pipeline, unroll the loop 3 times and schedule the instructions to minimize the number of cycles needed to execute the code.

Key. Let us look at the original code with the appropriate number of stalls.

    Loop: LD    F0, 0(R1)
          stall
          MULD  F0, F0, F2
          stall
          LD    F4, 0(R2)
          stall
          ADDD  F0, F0, F4
          stall
          stall
          SD    0(R2), F0
          SUBUI R1, R1, #8
          SUBUI R2, R2, #8
          BNEZ  R1, Loop

We need 13 cycles to complete one iteration of the loop.
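The 13-cycle count can be reproduced with a small in-order issue model. The sketch assumes that a consumer may issue no earlier than `latency + 1` cycles after its producer, that one instruction issues per cycle, and that no extra branch stall is counted (as in the listing above). The tuple encoding and the zero latencies given to SD and BNEZ, which produce no register value, are choices made for the sketch.

```python
# Cycles a dependent instruction must wait, from the problem statement.
LATENCY = {"LD": 1, "MULD": 3, "ADDD": 2, "SUBUI": 0, "SD": 0, "BNEZ": 0}

# (opcode, destination register or None, source registers)
LOOP = [
    ("LD",    "F0", ["R1"]),
    ("MULD",  "F0", ["F0", "F2"]),
    ("LD",    "F4", ["R2"]),
    ("ADDD",  "F0", ["F0", "F4"]),
    ("SD",    None, ["F0", "R2"]),
    ("SUBUI", "R1", ["R1"]),
    ("SUBUI", "R2", ["R2"]),
    ("BNEZ",  None, ["R1"]),
]


def issue_cycles(program):
    """Return the issue cycle of each instruction for in-order single issue."""
    ready = {}   # register -> earliest cycle at which a consumer may issue
    cycle = 0
    cycles = []
    for op, dest, srcs in program:
        cycle = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + LATENCY[op] + 1
    return cycles


cycles = issue_cycles(LOOP)
total_cycles = cycles[-1]
stalls = total_cycles - len(LOOP)
print(cycles, total_cycles, stalls)  # [1, 3, 4, 7, 10, 11, 12, 13] 13 5
```

This model lets LD F4 issue directly after MULD and places both of those stalls in front of ADDD, so the stall positions differ slightly from the listing, but the total of 5 stalls and 13 cycles is the same.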
Now look at the loop unrolled 3 times (I am using additional registers as well as correcting the displacements to load and store values from different array locations).

    Loop: LD    F0, 0(R1)
          LD    F4, 0(R2)
          LD    F6, -8(R1)
          LD    F8, -8(R2)
          LD    F10, -16(R1)
          LD    F12, -16(R2)
          MULD  F0, F0, F2
          MULD  F6, F6, F2
          MULD  F10, F10, F2
          stall
          ADDD  F0, F0, F4
          ADDD  F6, F6, F8
          ADDD  F10, F10, F12
          SD    0(R2), F0
          SD    -8(R2), F6
          SD    -16(R2), F10
          SUBUI R1, R1, #24
          SUBUI R2, R2, #24
          BNEZ  R1, Loop

We needed the one stall before ADDD F0, F0, F4 since MULD F0, F0, F2 needs 3 cycles before the data in F0 can be used. We can reorder the instructions (by moving SUBUI R1, R1, #24 into the stall slot) to eliminate this stall.

    Loop: LD    F0, 0(R1)
          LD    F4, 0(R2)
          LD    F6, -8(R1)
          LD    F8, -8(R2)
          LD    F10, -16(R1)
          LD    F12, -16(R2)
          MULD  F0, F0, F2
          MULD  F6, F6, F2
          MULD  F10, F10, F2
          SUBUI R1, R1, #24
          ADDD  F0, F0, F4
          ADDD  F6, F6, F8
          ADDD  F10, F10, F12
          SD    0(R2), F0
          SD    -8(R2), F6
          SD    -16(R2), F10
          SUBUI R2, R2, #24
          BNEZ  R1, Loop

Now we need 18 cycles to complete 3 iterations, or 6 cycles per iteration.

8. (20%) This problem deals with branch target buffers, BTB (to store the address to which a branch is taken). Assume that the branch misprediction penalty is 4 cycles. You are given that the branch misprediction rate is 10%, and the probability of finding the branch target (hit rate in the BTB) is 80%. On a miss in the BTB, the penalty (you have to complete execution of the branch instruction) is 3 cycles. 20% of all instructions are branch instructions. The base CPI is 1 cycle.

a) What is the CPI for branch instructions when using the BTB as described above?
b) What is the CPI for branch instructions if you are not using a BTB?

Key.
a) Notice we have two cases here: we find an entry in the BTB for the branch instruction or not, and on a hit the branch prediction is correct or not.

    Hit in the BTB (80% of the time):    80%*[90%*1 + 10%*4] = 1.04
    Miss in the BTB (20% of the time):   20%*3 = 0.6
    Total = 1.04 + 0.6 = 1.64 cycles

But we only have 20% of the instructions that are branches.
So the CPI for branches = 20%*1.64 = 0.328 cycles.

b) If there is no BTB, all branch instructions will take 3 cycles. Since 20% of all instructions are branches, the CPI = 20%*3 = 0.6 cycles.

9. (25%) Examine the following code segment. Assume a 5-stage pipeline and normal forwarding of data on Read-After-Write dependencies.

    Loop: LD    R3, 0(R5)
          ADDD  R7, R7, R3
          LD    R4, 4(R5)
          MULD  R8, R8, R4
          ADDD  R10, R7, R8
          SD    R10, 0(R5)
          BEQ   R10, R11, Loop

a) Show how many cycles are needed to execute this sequence of code.
b) Can you reorder the instructions to improve the number of cycles needed? Show the reordered code.

Key. Assuming normal forwarding, only LD followed by an arithmetic instruction that uses its result causes a stall. We will also assume that the branch causes a stall. Thus

    Loop: LD    R3, 0(R5)
          Stall
          ADDD  R7, R7, R3
          LD    R4, 4(R5)
          Stall
          MULD  R8, R8, R4
          ADDD  R10, R7, R8
          SD    R10, 0(R5)
          BEQ   R10, R11, Loop
          Stall

We need 10 cycles. If we use flow diagrams through the pipeline we have (F, D, E, M, W are the five stages and S is a stall):

    Cycle                   1  2  3  4  5  6  7  8  9  10 11 12 13
    LD    R3, 0(R5)         F  D  E  M  W
    ADDD  R7, R7, R3           F  S  D  E  M  W
    LD    R4, 4(R5)                  F  D  E  M  W
    MULTD R8, R8, R4                    F  S  D  E  M  W
    ADDD  R10, R7, R8                         F  D  E  M  W
    SD    R10, 0(R5)                             F  D  E  M  W
    BEQ   R10, R11, LOOP                            F  D  E  M  W
    (next fetch)                                       F  S  S  S

b) Reordering the code is easy. I will also assume one delay slot so that we will have no stalls.

    Loop: LD    R3, 0(R5)
          LD    R4, 4(R5)
          ADDD  R7, R7, R3
          MULD  R8, R8, R4
          ADDD  R10, R7, R8
          BEQ   R10, R11, Loop
          SD    R10, 0(R5)

Without any stalls, each iteration of the loop will take 7 cycles. The flow through the pipeline is shown below.

    Cycle                   1  2  3  4  5  6  7  8  9  10 11
    LD    R3, 0(R5)         F  D  E  M  W
    LD    R4, 4(R5)            F  D  E  M  W
    ADDD  R7, R7, R3              F  D  E  M  W
    MULTD R8, R8, R4                 F  D  E  M  W
    ADDD  R10, R7, R8                   F  D  E  M  W
    BEQ   R10, R11, LOOP                   F  D  E  M  W
    SD    R10, 0(R5)                          F  D  E  M  W

10. (30%) For the code in problem 9, let us assume that the latency for multiplication is 5 cycles and the latency for ADD is 3 cycles. The latency for all other instructions is 1 cycle.
Using a single-issue speculative processor, show a table similar to that on page 237 (note we are using single issue, unlike the figure on page 237). Show the table for 3 iterations of the loop.

Key.

    Iter  Instruction            Issued  Executes  Memory  Write to CDB  Commits  Comments
    1     LD    R3, 0(R5)           1        2        3         4           5
    1     ADDD  R7, R7, R3          2        5                  8           9     Wait for LD
    1     LD    R4, 4(R5)           3        4        5         6          10
    1     MULTD R8, R8, R4          4        7                 12          13     Wait for LD
    1     ADDD  R10, R7, R8         5       13                 16          17     Wait for MULTD
    1     SD    R10, 0(R5)          6       17       18                    18     Wait for ADDD
    1     BEQ   R10, R11, LOOP      7       17                             18     Wait for ADDD
    2     LD    R3, 0(R5)           8        9       10        11          19
    2     ADDD  R7, R7, R3          9       12                 15          20     Wait for LD
    2     LD    R4, 4(R5)          10       11       12        13          21
    2     MULTD R8, R8, R4         11       14                 19          22     Wait for LD
    2     ADDD  R10, R7, R8        12       20                 23          24     Wait for MULTD
    2     SD    R10, 0(R5)         13       24       25                    25     Wait for ADDD
    2     BEQ   R10, R11, LOOP     14       24                             25     Wait for ADDD
    3     LD    R3, 0(R5)          15       16       17        18          26
    3     ADDD  R7, R7, R3         16       19                 22          27     Wait for LD and previous ADDD
    3     LD    R4, 4(R5)          17       18       19        20          28     Wait for SD
    3     MULTD R8, R8, R4         18       21                 26          29     Wait for LD
    3     ADDD  R10, R7, R8        19       27                 30          31     Wait for MULTD
    3     SD    R10, 0(R5)         20       31       32                    32     Wait for ADDD
    3     BEQ   R10, R11, LOOP     26       31                             32

Note we are using single issue and not multiple issue. We will assume one FP adder, one FP multiplier and one LD/SD unit (I am not using multiple reservation stations with the adders and multiplier). We need to account for possible structural hazards in starting instructions. We may have to delay an issue if the required functional unit is not available (you can also show this as delaying execution). Likewise, we may have to delay posting results on the CDB if the CDB is being used by earlier instructions. In this example we need no delays due to structural hazards. Also in this example we are committing one instruction at a time (except for SD since there is no commit needed). However, it may be possible to commit all instructions that have completed.
So we have a total of 32 cycles to complete 3 iterations, or 10.67 cycles per iteration; that is 21 instructions in 32 cycles, for an IPC of 0.656 instructions per cycle.
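As a quick check of these two figures, here is the arithmetic in Python. The variable names are illustrative, and 7 is the number of instructions in the loop body.

```python
total_cycles = 32
iterations = 3
instructions_per_iteration = 7   # LD, ADDD, LD, MULTD, ADDD, SD, BEQ

cycles_per_iteration = total_cycles / iterations
ipc = iterations * instructions_per_iteration / total_cycles

print(f"{cycles_per_iteration:.2f} cycles per iteration, IPC = {ipc}")
```

The IPC evaluates to exactly 21/32 = 0.65625.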