Week 3 out-of-class notes, discussions and sample problems
We wrap up our discussion of scoreboard and Tomasulo-style dynamic issue processors with a look at their implementation details, costs and complicating factors. We start with the nature of WAW and WAR hazards.

First, let’s consider from a high-level language point of view what a WAW hazard is. Consider the following two statements:

    x = y + 1;
    x = z * 2;

Why would anyone write such code? The only rational explanation is that the programmer made a mistake: the second statement was meant to replace the first, but the programmer forgot (or was too lazy) to remove the first assignment. The first value stored in x is never used.

Now, a WAW hazard involves two consecutive writes to a register without an intervening read, where the two writes wind up occurring in the opposite order. The following MIPS FP code would result in a WAW hazard:

    MUL.D F0, F2, F4    // F0 = F2 * F4
    ADD.D F0, F6, F8    // F0 = F6 + F8

The reason for the hazard is that the multiply takes 3 more cycles than the add, so F6 + F8 is computed and stored first, and then the product F2 * F4 overwrites the sum in F0. F0 winds up with the wrong result.

Yet why would a programmer put those two instructions together like that? The product is never used, so there is no need for it. The reason for this combination of instructions is quite subtle. The compiler schedules code for us to optimize it for the given pipeline, and one of those optimizations is branch delay slot scheduling. It is possible that the compiler could make such a move on us. The following is such an example:

          BNEZ  R1, Foo
          DIV.D F0, F2, F4
          …
    Foo:  L.D   F0, …

Here, the branch delay slot holds a lengthy division operation. If the branch is taken, then we reach the L.D before the lengthy division completes, so the value loaded into F0 is eventually replaced by the division result.
If the branch is not taken, the WAW hazard may still exist; it depends on the number of instructions that appear in place of the … above. While two successive writes to one location would not normally be a problem, the different execution lengths cause the WAW hazard. The MIPS solution: as soon as the WAW hazard is detected, ensure that the earlier, longer-running instruction (the MUL.D in the first example) does not write to F0.

WAR hazards are a different type of problem that does not arise in MIPS until we move to the scoreboard. In any typical sequence of MIPS operations, whether integer or FP, operand reads always take place in the 2nd stage and execution takes place afterward. So no earlier instruction’s read will happen after a later instruction’s write.

Consider the following variation of a MIPS pipeline which could result in a WAR hazard. In this pipeline, ALU operations skip the MEM stage and memory operations have two cycles of MEM stages. The first cycle of the MEM stage tests the cache to ensure it is a cache hit, and the second cycle of the MEM stage obtains the datum from the register and sends it to the data cache. Recall that any register write occurs earlier in a stage than a register read during the same clock cycle. So what we see is that the DADD stores its result in R2 before the SW reads the datum.

    SW   R2, 0(R1)     IF   ID   EX   MEM1  MEM2  WB
    DADD R2, R3, R4         IF   ID   EX    WB

This example is not plausible because the datum is actually retrieved from the register file in the ID stage. However, as we move to dynamic issue, we will see changes to when register values are read.

First, in the scoreboard approach, registers are only read once both register values become available. Consider the following instruction sequence:

    L.D   F0, 0(R1)
    ADD.D F2, F0, F4
    L.D   F6, 0(R2)
    MUL.D F8, F2, F6

The add is issued to the adder, but its operands are not read until both F0 and F4 are available. F0 is being loaded from cache, so it may postpone the register access of F4 for a cycle (or more). This is not significant in this problem.
However, the MUL.D does not read either F2 or F6 until they are both available. The load will conclude prior to the add. This creates no hazard. But now consider this sequence instead:

    L.D   F0, 0(R1)
    MUL.D F2, F0, F4
    L.D   F6, 0(R2)
    MUL.D F8, F2, F6
    ADD.D F6, F10, F12

Replacing the second instruction with a longer MUL.D lengthens the time before F2 becomes available for the fourth instruction. While the fourth instruction waits for F2 to become available, it waits before reading F6 as well. Unfortunately, the next instruction, ADD.D, is able to issue, read its operands, and execute all in the time the second MUL.D is waiting. The result is that the ADD.D can write its sum to F6 before the second MUL.D is able to read the old value of F6. This is a WAR hazard.

As with the WAR hazard in the altered MIPS pipeline, this example is also an artifact of a peculiar design decision. Why shouldn’t the second MUL.D read F6 immediately upon being issued? The answer has to do with the number of buses available to move data between registers and functional units. To keep costs down, we only provide enough bus lines to accommodate 2 register reads per cycle. Those reads will be for the instruction issued earliest whose operands have become available. If the scoreboard could operate by having a functional unit read an operand as soon as it is available, it would avoid WAR hazards.

In Tomasulo’s approach, both WAW and WAR hazards are avoided through register renaming. The cost is (1) added logic in the issue stage to detect the hazards (this logic would be required in the scoreboard as well unless we prevent the WAR hazards as explained in the previous paragraph) and (2) the added temporary registers needed to implement renaming. Since we could avoid the WAR hazard as explained above, and WAW hazards should not normally arise, should we instead forgo the register renaming approach?
The answer is no because with dynamic scheduling, you can never be certain of when data will become available and when the functional units will read the data. Therefore, although there is added expense, the approach permits dynamic issue of instructions, which itself improves overall performance.

The benefits of dynamic issue may not be apparent yet. As we continue to expand the capability of the processor, we will see their advantages. However, with dynamic issue, loop unrolling can take place naturally without compiler optimization. The primary cost of dynamic issue (at least at this point) lies in the functional units (which we would have added anyway), the reservation stations and temporary registers (a minor cost today), and the added logic (fairly minimal). The main disadvantage of dynamic issue is the reliance on the single CDB (common data bus). We can alleviate this bottleneck slightly by having two, one for integer values and one for FP values.

The implementation of the Tomasulo architecture is given here. It is shown in figure 3.9 on page 180. The description below will hopefully be easier to read and understand.

Instruction fetch unit: fetches instructions one at a time, incrementing the PC and queuing each instruction in the instruction buffer. If a branch is issued, the behavior of the pipeline is different, as we will discuss below.

Issue stage: the instruction is decoded by type. Assume the instruction involves source registers rs and rt (some instructions do not have a second source operand) and destination register rd. If a reservation station for this type of instruction is available, send the instruction to that functional unit and store it in the available reservation station.
The following fields are set at issue:

    FP operation:
        Qi       ← this reservation station
        Qj, Qk   ← if the register value for the source operand is not available, the reservation station that will be forwarding the value to it; otherwise 0
        Vj, Vk   ← register value from the register file, if the operand is available
        Busy     ← yes (this reservation station’s busy flag)
    Integer operation: same as FP
    Load/store operation:
        Qj, Qk, Vj, Vk   same as FP operation
        A        ← the immediate datum field from the instruction
        Qi       ← reservation station number (for loads only; no Qi is used for stores)

The Qj/Qk value is where register renaming takes place. If a datum is coming from a reservation station, we record that location rather than the register file location. Thus, the registers in each reservation station are what implement renaming.

Execute: once both source operands are available, the instruction can execute. If two instructions obtain their source operands in the same cycle, the instruction to move to the functional unit is randomly selected. Since functional units are pipelined, any waiting instruction can begin executing in the next cycle.

    FP operation:  when Qj, Qk = 0, execute on Vj, Vk
    Integer:       same as FP
    Load/store:    when Qj = 0, A ← Vj + A (compute effective address)
                   Read from memory location [A] (loads)
                   Write Vk to memory location [A] (stores)

NOTE: loads and stores take 2 cycles to execute; the first cycle computes the effective address and the second performs the memory access.

Write result: two things need to take place here. First, the result has to be written to the register file, and second, the result has to be sent out on the CDB.

    All operations except store:
        Register[Qi] ← result
        If Qi is listed in any reservation station under its Qj or Qk, forward the result and set that Qj or Qk ← 0
        Qi ← 0 (indicate that this destination register is now available)
        Busy ← no
    Store:
        Send Vk to memory location [A]
        Busy ← no

The CDB broadcasts 1 result per clock cycle (maximum), but the result is broadcast to all waiting reservation stations.
Therefore, the write result stage writes the result to every waiting reservation station, the register file, and the store buffer at the same time.

The hardware for Tomasulo’s approach permits loop unrolling, but in fact it does not execute as we thought. The instruction fetch unit continues to fetch instructions sequentially until a branch is completed. What happens to instructions fetched sequentially after a branch was fetched and issued but not yet completed? If the branch was taken, we would have the wrong instructions in the queue and/or issued to reservation stations. How do we know which ones? Consider the following code:

          L.D    F0, 0(R1)
          ADD.D  F2, F0, F4
          C.LT.D F2, F6
          BC1F   foo
          MUL.D  F8, F10, F12
          …
    foo:

Here, we load a datum and use it in an add. Next, we compare the result in order to determine whether we should branch around a multiply or not. Assuming that the ADD.D takes 4 cycles to execute, and because the L.D will take 2 cycles to execute, the MUL.D operation will have been issued to an FP multiply unit before the ADD.D completes. Assuming F10 and F12 are available, the multiply will even begin executing before we have determined the branch condition. If the branch is taken, we want to shut down the MUL.D. But how does the FP multiply functional unit know that it was dependent on a branch? Unless we set up some mechanism to handle this, we would have to delay issuing the MUL.D because it comes after a branch.

The text cryptically mentions that instructions after a branch are postponed in the issue stage (see the second to last sentence in the caption for figure 3.9 on page 180). This would mean that our loop unrolling example wouldn’t actually execute as stated (L.D, MUL.D, S.D, L.D, MUL.D, S.D). The second iteration would be stalled in the issue stage until the first iteration’s branch completed. In that case, dynamic scheduling does nothing useful for us and we would want to rely on compiler-based loop unrolling instead!
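The issue and write-result bookkeeping described earlier can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the hardware design of figure 3.9: the class and method names (Tomasulo, Station, issue, complete) are hypothetical, timing is ignored (the caller decides the order in which stations finish), load/store buffers are omitted, and None stands in for the value 0 in the Qj/Qk fields. It replays the earlier WAW example (MUL.D then ADD.D, both writing F0) and the earlier WAR example (the second MUL.D waiting on F2 while ADD.D overwrites F6).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical operation table; only the two FP operations used below.
OPS = {"ADD.D": lambda a, b: a + b, "MUL.D": lambda a, b: a * b}

@dataclass
class Station:
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # producing station, or None (the notes' 0)
    qk: Optional[str] = None

class Tomasulo:
    def __init__(self, regs: Dict[str, float], names):
        self.regs = dict(regs)
        self.qi: Dict[str, str] = {}             # register -> station that will write it
        self.rs = {n: Station() for n in names}  # reservation stations by name

    def _source(self, reg: str) -> Tuple[Optional[float], Optional[str]]:
        # Renaming: if a station will produce reg, record the station, not the value.
        tag = self.qi.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, name: str, op: str, rd: str, rs: str, rt: str) -> bool:
        st = self.rs[name]
        if st.busy:
            return False                         # structural hazard: issue stalls
        st.vj, st.qj = self._source(rs)
        st.vk, st.qk = self._source(rt)
        st.busy, st.op = True, op
        self.qi[rd] = name                       # the latest writer owns rd
        return True

    def ready(self, name: str) -> bool:
        st = self.rs[name]
        return st.busy and st.qj is None and st.qk is None

    def complete(self, name: str) -> float:
        # Execute, then broadcast the result (the CDB step of write result).
        st = self.rs[name]
        if not self.ready(name):
            raise RuntimeError(f"{name} is still waiting for an operand")
        result = OPS[st.op](st.vj, st.vk)
        for other in self.rs.values():
            if other.qj == name:
                other.vj, other.qj = result, None
            if other.qk == name:
                other.vk, other.qk = result, None
        for reg in [r for r, tag in self.qi.items() if tag == name]:
            self.regs[reg] = result              # only if this station still owns reg
            del self.qi[reg]
        self.rs[name] = Station()                # Busy <- no
        return result

def waw_demo() -> float:
    m = Tomasulo({"F0": 0.0, "F2": 2.0, "F4": 3.0, "F6": 5.0, "F8": 1.5},
                 ["Mult1", "Add1"])
    m.issue("Mult1", "MUL.D", "F0", "F2", "F4")
    m.issue("Add1", "ADD.D", "F0", "F6", "F8")
    m.complete("Add1")     # the shorter add finishes first
    m.complete("Mult1")    # the late product no longer owns F0
    return m.regs["F0"]

def war_demo() -> Tuple[float, float]:
    m = Tomasulo({"F0": 2.0, "F2": 0.0, "F4": 3.0, "F6": 5.0, "F8": 0.0,
                  "F10": 1.0, "F12": 1.5}, ["Mult1", "Mult2", "Add1"])
    m.issue("Mult1", "MUL.D", "F2", "F0", "F4")
    m.issue("Mult2", "MUL.D", "F8", "F2", "F6")   # waits on Mult1, copies old F6
    m.issue("Add1", "ADD.D", "F6", "F10", "F12")
    m.complete("Add1")     # F6 is overwritten while Mult2 is still waiting
    m.complete("Mult1")    # F2 is forwarded to Mult2
    m.complete("Mult2")    # uses the old F6 captured at issue
    return m.regs["F6"], m.regs["F8"]
```

In this sketch, waw_demo leaves the sum in F0 even though the product is produced last, and war_demo computes F8 from the value F6 held when the second MUL.D was issued. Both follow from the same design choice: operands are copied (or tagged) at issue, and a register is written only by the station that currently owns it.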
For now, we will assume that branches do not stall the issue stage and that there is a mechanism available to flush the reservation stations/functional units of instructions issued after a branch if the branch is taken.

Next week, we will continue to expand our processor by focusing on the superscalar, a pipeline that permits multiple instruction issues per cycle. You can think of this either as parallel pipelines, or as a Tomasulo-style processor where 2 (or more) instructions are issued at the issue stage, each to independent functional units. We will see that the Tomasulo-based superscalar approach is common in today’s processors. We will also examine how to support branch speculation so that we can bypass the problem discussed in the previous two paragraphs.

The remainder of these notes covers some sample problems.

Sample Problems:

1. For each of the following situations, provide an example of MIPS code that will result in the given hazard for the MIPS floating point pipeline, or explain why the hazard cannot arise.
   a. Structural hazard in the EX stage
   b. Structural hazard in the MEM stage
   c. WAR hazard
   d. WAW hazard

Solution:
   a. Structural hazard in the EX stage: this arises when we have two division instructions within 25 cycles of each other, since the division unit is not pipelined.
   b. Structural hazard in the MEM stage: this arises whenever two instructions leave the EX stage during the same cycle, for instance an FP add followed 3 cycles later by a load:

          ADD.D    IF  ID  A1  A2  A3  A4  M   WB
          Instr2       IF  ID  EX  M   WB
          Instr3           IF  ID  EX  M   WB
          LW                   IF  ID  EX  M   WB

   c. WAR hazard: this cannot arise in the MIPS pipeline, whether integer or floating point, because all register reads happen in the 2nd stage and all writes happen later on, so no later instruction would write to the register file before an earlier instruction reads from the register file.
   d. WAW hazard: this can arise with two instructions that have out-of-order completion, such as:

          ADD.D F1, …    IF  ID  A1  A2  A3  A4  M   WB
          L.D   F1, …        IF  ID  EX  M   WB

2. Repeat #1 for the scoreboard and Tomasulo architectures.

Solution:
   a. The structural hazard in the EX stage exists if we do not pipeline our functional units. Additionally, if we have pipelined functional units, the structural hazard in the EX stage arises in Tomasulo if we run out of reservation stations.
   b. This does not exist because our MEM stage now has its own buffer.
   c. In the scoreboard, this exists when an instruction waiting at a functional unit to read its registers waits so long that a later instruction executes and writes its result to the same register as one that the waiting instruction needs to read. These hazards are avoided by stalling any such situation in the issue stage. With register renaming, WAR hazards cannot arise in the Tomasulo approach.
   d. Same as c.

3. Using the 7 cycle execution time for the MUL.D as presented in appendix C (as opposed to those of section 3.2), unroll and schedule the following loop to remove all stalls for the MIPS FP pipeline. Assume that the MUL.D and S.D can both enter the MEM and WB stages together.

    Loop: L.D   F0, 0(R1)
          MUL.D F4, F0, F2
          S.D   F4, 0(R1)
          DADDI R1, R1, #8
          BNE   R1, R3, Loop

Solution: the greatest source of stalls exists between the MUL.D and S.D (5 cycles worth; this would normally be 6 cycles worth if we could not accommodate both MUL.D and S.D in the MEM stage at the same time). We can improve on this by scheduling the DADDI between MUL.D and S.D and moving the S.D to the branch delay slot. This would reduce the number of stalls needed to 3 cycles. However, if we unroll the loop, we can only place one S.D in the branch delay slot, so at best, we still need to find 4 more instructions to place between the MUL.D and S.D other than the DADDI. We will unroll the loop for 5 total iterations.
    Loop: L.D   F0, 0(R1)
          L.D   F6, 8(R1)
          L.D   F10, 16(R1)
          L.D   F14, 24(R1)
          L.D   F18, 32(R1)
          MUL.D F4, F0, F2
          MUL.D F8, F6, F2
          MUL.D F12, F10, F2
          MUL.D F16, F14, F2
          MUL.D F20, F18, F2
          DADDI R1, R1, #40
          S.D   F4, -40(R1)
          S.D   F8, -32(R1)
          S.D   F12, -24(R1)
          S.D   F16, -16(R1)
          BNE   R1, R3, Loop
          S.D   F20, -8(R1)

4. For the following loop, first determine the stalls, next schedule the code to reduce the stalls, and finally determine, based on the number of stalls that remain, how many times you would have to unroll the loop in order to have enough code to schedule such that you remove all remaining stalls. Assume the MIPS FP pipeline and the FP latencies as presented in Appendix C, not chapter 3.2. Assume an FP operation and an S.D can reach the MEM stage at the same time but not two FP operations.

    Loop: L.D   F0, 0(R1)
          MUL.D F2, F0, F10
          L.D   F4, 4(R1)
          ADD.D F6, F2, F4
          S.D   F6, 8(R1)
          DADDI R1, R1, #12
          DSUBI R2, R2, #1
          BNE   R2, Loop

This code is roughly equivalent to the following for loop:

    for(i=0;i<3*n;i+=3) a[i+2]=a[i]*s+a[i+1];

Solution: the stalls occur as follows: 1 after the first L.D, 1 after MUL.D, 5 after the second L.D (which subsumes both the RAW hazard from L.D to ADD.D and from MUL.D to ADD.D), 2 after ADD.D, 1 after DSUBI, 1 after BNE. The biggest source of stalls is after the MUL.D. Notice that unlike the previous example, which had a RAW hazard between MUL.D and S.D, we have a hazard between MUL.D and ADD.D. The result is that we have 1 additional cycle of delay because ADD.D needs the datum in the A1 stage, whereas S.D needed it in the MEM stage. This extra cycle of delay, though, is subsumed by the second L.D operation. Unfortunately, unlike the MUL.D/S.D example where the MUL.D and S.D could both reach the MEM and WB stages at the same time, this is not true of MUL.D and ADD.D. So in fact, we have an additional cycle of stall to avoid that structural hazard!
We can schedule the code to remove some stalls as follows:

    Loop: L.D   F0, 0(R1)
          L.D   F4, 4(R1)
          MUL.D F2, F0, F10
          DSUBI R2, R2, #1
          DADDI R1, R1, #12
          ADD.D F6, F2, F4
          BNE   R2, Loop
          S.D   F6, -4(R1)

This code has 4 cycles of stalls between MUL.D and ADD.D because of the RAW hazard, plus 1 additional cycle of stall from the structural hazard of the MUL.D and ADD.D reaching the MEM stage at the same time! We also have 1 stall after the ADD.D before the S.D because of that RAW hazard. NOTE: we do not want to insert the stall after the BNE because that would fill the branch delay slot, so the stall has to go after ADD.D and before BNE.

Because we have 5 cycles worth of stalls, we need to fill the void between MUL.D and ADD.D with 5 instructions. We unroll the loop for a total of 6 iterations. The code would have 12 L.Ds followed by 6 MUL.Ds, followed by the DSUBI and DADDI, followed by 6 ADD.Ds, followed by 5 S.Ds, 1 BNE and the last S.D. We would have to alter the memory offsets appropriately.

5. For the following code, show a table of when each instruction is issued, reads operands, executes and writes results using the Scoreboard-based architecture of Appendix C (the table will look something like that of figure 3.20 on page 202, but it will fit the Scoreboard architecture, not Tomasulo’s). Assume that you have the following unpipelined functional units (along with their execution times):

    1 FP adder, which takes 4 cycles to execute
    2 FP multipliers, which take 10 cycles to execute
    1 FP divider, which takes 20 cycles to execute
    2 load/store units, which take 2 cycles to execute (the load/store unit has its own adder so the integer EX does not need to be used, but only 1 memory operation can be done in the same 2 cycle period)
    1 integer EX unit, which takes 1 cycle to execute

In addition, assume that a functional unit is busy from the time an instruction is issued to it through the cycle when it writes its results.
If an instruction is waiting for a functional unit to become available before it is issued, it can only be issued the cycle after the functional unit is freed. Also assume that a functional unit waiting to read registers can read them the cycle they are written (writes occur first in the cycle, then reads), and can begin executing the cycle after the read occurs. Note that only one functional unit can write to the register file in one cycle and only one functional unit can read operands in one cycle.

    L.S   F1, 0(R3)
    MUL.S F5, F1, F2
    L.S   F2, 4(R3)
    DIV.S F6, F2, F3
    L.S   F3, 8(R3)
    MUL.S F7, F6, F5
    SUB.S F4, F1, F2
    MUL.S F7, F1, F2
    ADD.S F8, F7, F4
    L.S   F1, 12(R3)
    ADD.S F2, F1, F8
    S.S   F2, 16(R3)
    DADDI R3, R3, #20

    // assume F2 already has a value
    // assume F3 already has a value

Solution:

    Instruction          Issue  Read operands  Execute  Write result  Comments
    L.S   F1, 0(R3)      1      2              3-4      5
    MUL.S F5, F1, F2     2      5              6-15     16            RAW hazard with previous L.S
    L.S   F2, 4(R3)      3      4              5-6      7
    DIV.S F6, F2, F3     4      7              8-27     28            RAW hazard with previous L.S
    L.S   F3, 8(R3)      6      8              9-10     11            No WAR hazard with previous DIV.S
    MUL.S F7, F6, F3     7      28             29-38    39            RAW hazard with DIV.S
    SUB.S F4, F1, F2     8      9              10-13    14
    MUL.S F7, F1, F2     40     41             42-51    52            Stalls because of WAW with MUL.S
    ADD.S F8, F7, F4     41     52             53-56    57            RAW hazard with previous MUL.S
    L.S   F1, 12(R3)     42     43             44-45    46
    ADD.S F2, F1, F8     58     59             60-63    64            Structural hazard: only 1 FP adder
    S.S   F2, 16(R3)     59     64             65-66                  RAW hazard with ADD.S
    DADDI R3, R3, #20    60     61             62       65            WAR hazard with S.S

6. For each of the following situations, explain how the MIPS floating point pipeline, the Scoreboard approach and Tomasulo’s approach each handle the situation, whether the situation might result in stalling the entire instruction stream or just the affected instruction, or whether the given situation cannot arise in that architecture (and if this is the case, why not).
   a. RAW data hazards
   b. WAW data hazards
   c. WAR data hazards
   d. Structural hazards from trying to enter the same FP functional unit
   e. Structural hazards from trying to read registers/source operands
   f. Structural hazards from trying to write results
   g. Structural hazard from performing load/store to memory
   h. Control hazards from a branch

For instance, the MIPS pipeline handles RAW hazards by forwarding when possible and stalling the pipeline when necessary, whereas WAR hazards cannot arise, and control hazards are handled by filling the branch delay slot or by assuming not taken and flushing the pipeline of the wrong instruction when the branch is taken.

Solution:

MIPS FP pipeline:
   a. Prevented by forwarding when possible, stalling when needed; stalls are notably more common with FP operations.
   b. WAW is only possible in FP operations; the earlier operation does not write its result, essentially becoming a no-op.
   c. WAR hazards are not possible since all reads occur in the 2nd stage and writes occur in stage 5 or later.
   d. The only structural hazard from the functional units is for divides; all others are pipelined or, in the integer unit’s case, take only 1 cycle.
   e. This structural hazard does not arise since all reads are in the ID stage.
   f. Since instructions must stall before entering the MEM stage if they are going to collide there, only one instruction enters the WB stage per cycle, so this hazard does not arise (see g).
   g. Up to 4 instructions might try to enter the MEM stage at any one time (one from each of EX, A4, M7, DIV), so all later instructions trying this will stall. This may or may not stall the earlier (IF/ID) parts of the pipe.
   h. Handled by either freezing/flushing the instruction in IF, or using the branch delay slot.

Scoreboard:
   a. Results are sent to registers when produced, and the Scoreboard alerts functional units when the values are available, so there is no stalling of instruction issue, but an instruction might wait at a functional unit for a while.
   b. The later instruction is stalled from being issued when the hazard is detected, until the earlier instruction completes and writes its result.
   c. Writing of results by the later instruction is postponed until the earlier instruction can read both operands, causing the writing instruction to wait in the functional unit until the write can be performed.
   d. If a needed functional unit is busy, the instruction stalls at the issue stage, stalling the instruction stream.
   e. Only one instruction can read registers in a cycle; the Scoreboard selects the instruction that has been waiting the longest to read its operands, and all others wait in their functional units.
   f. Only one instruction can write results to a register in any cycle; all others must wait in their functional units.
   g. A separate memory unit handles memory accesses. The memory unit stores instructions in a queue and services them one at a time, so it enforces one memory access per cycle; all others wait.
   h. Unless a branch target buffer is used, incorrect instructions might be issued and later must be turned into no-ops. If a target buffer is used, the next instruction will already be in the instruction stream when it is needed.

Tomasulo:
   a. RAW hazards are handled by forwarding results over the CDB and having all reservation stations keep an eye on the CDB for results from functional units that they are waiting for.
   b. WAW hazards are handled by disallowing the earlier instruction from writing, turning it into a no-op.
   c. WAR hazards are handled by register renaming: the register to be written by the later instruction is renamed, and all subsequent instructions that use this register as a source will have the register renamed as well.
   d. Usually there are more reservation stations than functional units, so instructions will be issued and will wait in reservation stations for a functional unit to become available and for the source operands to become available. If no reservation stations are available, the instruction issue stalls, stalling the instruction stream.
   e. While only one set of register reads can occur in one cycle, causing other reservation stations to wait, any number of reservation stations can read from the CDB.
   f. Only one reservation station can write a result to the CDB at any one time; later instructions stall in their functional units.
   g. Same answer as with the Scoreboard.
   h. Same answer as with the Scoreboard. We will see a better solution later in chapter 3 by using a “Reorder Buffer”, which collects results and allows those results to be stored to memory or a register only once we have determined that we predicted correctly.
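As a companion to the data hazard cases in problems 1, 2 and 6, the three hazard classes can be found mechanically from register names alone. The following is a minimal sketch under stated assumptions: the function name find_hazards and the (text, destination, sources) tuple layout are hypothetical, and the check is purely name-based, so it reports potential hazards between every pair of instructions in program order without modeling pipeline timing or intervening writes. Whether a reported pair actually causes a stall depends on the architecture, as the answers above describe. The sample program is the five-instruction WAR example from earlier in these notes.

```python
from typing import List, Optional, Sequence, Tuple

# (assembly text, destination register or None, source registers)
Instr = Tuple[str, Optional[str], Sequence[str]]

def find_hazards(program: Sequence[Instr]) -> List[Tuple[str, int, int, str]]:
    """Return (kind, earlier index, later index, register) for each potential hazard."""
    hazards = []
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i is not None and dest_i in srcs_j:
                hazards.append(("RAW", i, j, dest_i))   # later reads what earlier writes
            if dest_i is not None and dest_i == dest_j:
                hazards.append(("WAW", i, j, dest_i))   # both write the same register
            if dest_j is not None and dest_j in srcs_i:
                hazards.append(("WAR", i, j, dest_j))   # later writes what earlier reads
    return hazards

PROGRAM = [
    ("L.D   F0, 0(R1)",    "F0", ["R1"]),
    ("MUL.D F2, F0, F4",   "F2", ["F0", "F4"]),
    ("L.D   F6, 0(R2)",    "F6", ["R2"]),
    ("MUL.D F8, F2, F6",   "F8", ["F2", "F6"]),
    ("ADD.D F6, F10, F12", "F6", ["F10", "F12"]),
]

for kind, i, j, reg in find_hazards(PROGRAM):
    print(f"{kind} on {reg}: {PROGRAM[i][0]!r} -> {PROGRAM[j][0]!r}")
```

For this program the sketch reports the WAR pair discussed earlier (the second MUL.D reads F6, the ADD.D writes it), along with the RAW chain through F0, F2 and F6 and a WAW pair on F6 between the second L.D and the ADD.D.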