Sample Problems: Branch Prediction and Speculation

1. Consider the following two designs for alleviating the effect of branches.
a. The first design defines a branch with two delay slots and does not use
branch prediction. Instead, compile-time scheduling is used
to fill the delay slots with useful instructions where possible. Suppose that
for 30% of the branch instructions the compiler can fill both branch delay
slots, and for 60% of the branch instructions the compiler can fill only one delay slot.
b. The second design employs branch prediction and does not use delay
slots. The mis-prediction penalty is 3 cycles. The branch always costs one
cycle, and if mis-predicted, it will cost an additional three cycles.
What prediction accuracy is required in the second design to achieve the same
performance as the first design?
From a. we know that 10% of the branch instructions result in two pipeline bubbles
(neither slot is filled), while 60% result in a one-cycle bubble. We can compute the
increase in CPI for each design. The probability that an instruction is a branch is the
same in both designs, so it cancels when the two are equated.
The increase in CPI due to a. is prob_instr_is_a_branch * (0.6 * 1 + 0.1 * 2) =
prob_instr_is_a_branch * 0.8
The increase in CPI due to b. is prob_instr_is_a_branch * (p * 3) = prob_instr_is_a_branch * 3p
where p is the probability of misprediction.
Equating the two gives 3p = 0.8, so the critical value is p = 0.8/3 ≈ 0.267. The required
prediction accuracy is (1-p) ≈ 0.733, i.e., about 73.3%.
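The arithmetic for problem 1 can be checked with a short script. This is only a sketch of the calculation described above; the variable names are illustrative and are not part of the original problem.

```python
# Fraction of branches by how many of the two delay slots the compiler fills
# (design a). Names are illustrative, not from the original handout.
fill_both = 0.30
fill_one = 0.60
fill_none = 1.0 - fill_both - fill_one  # the remaining 10%

# Design a: each unfilled delay slot is one bubble.
bubbles_per_branch_a = fill_one * 1 + fill_none * 2

# Design b: each misprediction costs 3 extra cycles, so the extra cost per
# branch is 3 * p. Setting 3 * p equal to design a's cost gives p.
mispredict_penalty = 3
p_critical = bubbles_per_branch_a / mispredict_penalty
required_accuracy = 1.0 - p_critical

print(f"bubbles per branch, design a: {bubbles_per_branch_a:.3f}")
print(f"critical misprediction rate:  {p_critical:.3f}")
print(f"required prediction accuracy: {required_accuracy:.3f}")
```

The branch frequency is left out because it multiplies both sides of the comparison.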
2. Consider the following code sequence. Assume that each instruction is encoded in
one 32-bit word. We have a k-entry branch prediction buffer.
Address       Label   Instruction
0x40000000            DSUBUI  R3, R1, #2
0x40000004            BNEZ    R3, L1
0x40000008            DADD    R1, R0, R0
0x4000000c    L1:     DSUBUI  R3, R3, #2
0x40000010            BNEZ    R3, L2
0x40000014            DADD    R2, R0, R0
0x40000018    L2:     DSUBU   R3, R1, R2
0x4000001c            BEQZ    R3, L3
..
..
0x80008004    L3:     …
a. What is the minimum value of k to maximize prediction accuracy and
why?
In general it is 32. In this example, we need 5 bits to address the
prediction buffer without aliasing the branch addresses since the
addresses of the branch instructions differ in the least significant 5
bits. (In this example even 4 bits will suffice. Can you tell why?)
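One way to explore the aliasing question is to compute the buffer index for each branch address at several index widths. The sketch below assumes the buffer is indexed directly by the least significant bits of the branch address (including the two byte-offset bits), which is how the answer above counts bits; the helper names are hypothetical.

```python
def buffer_index(pc, index_bits):
    """Index formed from the index_bits least significant bits of the PC."""
    return pc & ((1 << index_bits) - 1)


def has_aliasing(pcs, index_bits):
    """True if at least two branch addresses map to the same buffer entry."""
    indices = [buffer_index(pc, index_bits) for pc in pcs]
    return len(set(indices)) < len(indices)


# Addresses of the three branches (BNEZ, BNEZ, BEQZ) in the code above.
branch_pcs = [0x40000004, 0x40000010, 0x4000001C]

for bits in range(1, 6):
    indices = [buffer_index(pc, bits) for pc in branch_pcs]
    print(bits, indices, "aliasing" if has_aliasing(branch_pcs, bits) else "no aliasing")
```

Printing the indices for each width shows which address bits actually distinguish the three branches.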
Alternatively, consider the use of a branch target buffer (Figure 3.19). Show
the possible contents of the following 4-element branch target buffer after one
execution of the prior code (assuming it is initially empty).
Branch address    Target address
0x40000004        0x4000000c
0x40000010        0x40000018
0x4000001c        0x80008004
3. Consider the use of a branch prediction buffer using n-bit saturating counters for
the code sequence shown below. The memory addresses for the instructions are
shown in hexadecimal notation. Assume the following loop code has been
executed 12 times. The branch at location 0x0044 has been taken 50% of the time
and the branch at location 0x0050 has been taken 50% of the time. Consider the
point in time of the start of the execution of the 13th iteration.
Address   Label   Instruction
0x0038    L3:     ..
0x003C            ..
0x0040            DSUBUI  R3, R1, #2
0x0044            BNEZ    R3, L1
0x0048            DADD    R1, R0, R0
0x004C    L1:     DSUBUI  R3, R2, #2
0x0050            BNEZ    R3, L2
0x0054            DADD    R2, R0, R0
0x0058    L2:     DSUBUI  R3, R1, R2
0x005C            BEQZ    R3, L3
a. Considering only the preceding code, how many entries should the branch
prediction buffer have to avoid the possibility of aliasing of branch
addresses?
The minimum number of least significant bits to ensure
no aliasing for these addresses is 4, hence the branch
prediction buffer would need 2^4 = 16 entries.
b. If all prediction buffer entries were initialized to 0, what can be the value
of the counters in the prediction buffer corresponding to these two branch
instructions?
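0x0044 BNEZ R3, L1: 0 ≤ value ≤ 6
0x0050 BNEZ R3, L2: 0 ≤ value ≤ 6
Each branch was taken 6 times in 12 iterations, so a counter that starts at 0 can end
anywhere from 0 to 6 depending on the order of the outcomes (assuming n ≥ 3, so that
the counter does not saturate below 6).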
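The range of counter values can be checked by brute force: enumerate every ordering of 6 taken and 6 not-taken outcomes and record where a saturating counter ends up. This is a sketch under the assumption that the counter increments on taken, decrements on not taken, and starts at 0; the function names are illustrative.

```python
from itertools import combinations


def final_counter(outcomes, n_bits):
    """Run an n-bit saturating counter from 0 over a list of outcomes."""
    max_value = (1 << n_bits) - 1
    counter = 0
    for taken in outcomes:
        if taken:
            counter = min(counter + 1, max_value)
        else:
            counter = max(counter - 1, 0)
    return counter


def reachable_values(iterations, taken_count, n_bits):
    """All final counter values over every ordering of the outcomes."""
    values = set()
    for taken_positions in combinations(range(iterations), taken_count):
        taken_set = set(taken_positions)
        outcomes = [i in taken_set for i in range(iterations)]
        values.add(final_counter(outcomes, n_bits))
    return values


values_3bit = reachable_values(12, 6, 3)
values_2bit = reachable_values(12, 6, 2)
print(sorted(values_3bit))
print(sorted(values_2bit))
```

With a 2-bit counter the upper bound drops to 3 because the counter saturates, which is why the 0 to 6 range depends on n being at least 3.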
c. Now consider the case where we use a global branch predictor with 3 bit
global history. Execution of the 13th iteration is about to start. Provide an
example of i) the value of feasible 3-bit global branch history, and ii) the
value of an infeasible global branch history. Ensure you clearly identify
the entries in the branch history with the branch instructions in the code
sequence.
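The first two branches compare two numbers, N1 (in R1) and N2 (in R2), with the number
2, and each is taken when its number is not equal to 2. When one of these branches is not
taken, the DADD that follows it clears the corresponding register to 0. The last branch is
taken if N1 = N2. If neither the first branch nor the second branch is taken (N1 = 2 and
N2 = 2), both registers are cleared to 0, so the last branch must be taken. Hence an
infeasible global history is 000 (the first branch in the program corresponds to the most
significant bit). A feasible history is 111, which occurs when N1 and N2 are equal to each
other but not equal to 2.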
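The feasible histories can also be found by enumeration. The sketch below models one pass through the three branches for a small range of starting values in R1 and R2; it is a hypothetical model of the code sequence, with 1 meaning taken and the first branch as the most significant bit.

```python
def branch_outcomes(r1, r2):
    """Return (b1, b2, b3) for one pass; 1 = taken, 0 = not taken."""
    b1 = int(r1 - 2 != 0)      # BNEZ R3, L1 with R3 = R1 - 2
    if not b1:
        r1 = 0                 # DADD R1, R0, R0 executes on fall-through
    b2 = int(r2 - 2 != 0)      # BNEZ R3, L2 with R3 = R2 - 2
    if not b2:
        r2 = 0                 # DADD R2, R0, R0 executes on fall-through
    b3 = int(r1 - r2 == 0)     # BEQZ R3, L3 with R3 = R1 - R2
    return b1, b2, b3


feasible = set()
for r1 in range(5):
    for r2 in range(5):
        b1, b2, b3 = branch_outcomes(r1, r2)
        feasible.add(f"{b1}{b2}{b3}")

all_histories = {f"{i:03b}" for i in range(8)}
infeasible = all_histories - feasible
print("feasible:  ", sorted(feasible))
print("infeasible:", sorted(infeasible))
```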
4. Consider the case where the execution pipeline has a single cycle branch delay
slot. Static scheduling can fill 30% of the delay slots. We can fill 60% of the
remaining slots if we use cancelling branches: instructions are cancelled if the
branch is mispredicted. These slots are filled with instructions assuming the
branch is not taken. If 18% of all instructions are branches and they are taken 62%
of the time, what is the net increase in CPI?
The number of stall cycles per instruction is
0.18 * 0.7 * (0.4 + 0.6 * 0.62) ≈ 0.097
70% of the branch delay slots cannot be filled by static scheduling.
Of these, 40% are left empty and each contributes a cycle. The remaining 60% are filled
using cancelling branches and contribute a cycle only when the instruction is cancelled,
i.e., when the branch is taken (which is 62% of the time).
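The same calculation, written out step by step as a sketch (the variable names are illustrative):

```python
branch_fraction = 0.18       # fraction of instructions that are branches
taken_fraction = 0.62        # fraction of branches that are taken
filled_statically = 0.30     # delay slots filled by static scheduling
filled_by_cancelling = 0.60  # share of the remaining slots filled via cancelling branches

not_filled_statically = 1.0 - filled_statically  # 0.7
left_empty = 1.0 - filled_by_cancelling          # 0.4 of the remaining slots

# An empty slot always costs a cycle; a cancelling slot costs a cycle only
# when the branch is taken and the slot instruction is cancelled.
stalls_per_branch = not_filled_statically * (
    left_empty * 1 + filled_by_cancelling * taken_fraction
)
cpi_increase = branch_fraction * stalls_per_branch

print(f"stall cycles per branch: {stalls_per_branch:.4f}")
print(f"increase in CPI:         {cpi_increase:.4f}")
```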
5. Consider the 5 stage integer pipeline with forwarding. Assume the branch
penalty is 1 cycle (branch condition computed in ID). Now assume we have
pipelined the memory system to three stages (rather than 1 stage) for both
instruction fetch and data fetch. Branches are resolved at the end of the EX stage.
We use a static branch-not-taken prediction strategy, i.e., if branches are taken we
incur the branch penalty. Assume conditional branches occur with a frequency of
14%.
a. If branches are taken 62% of the time, what is the increase in CPI due to
this prediction strategy?
The branch penalty is 3 cycles and incurred only when the branch is taken.
Increase in CPI = 0.62 * 0.14 * 3 = 0.2604
b. Alternatively, if we modify the pipeline and implement a delayed branch
with a single delay slot, and we are able to successfully fill 65% of the
slots, what is the increase in CPI?
35% of the time we are unable to fill these slots, with a penalty of 1 cycle.
Hence the increase in CPI is 0.35 * 0.14 = 0.049
c. Now consider the occurrence of load delay slots, where loads occur with a
probability of 24%, and 40% of these fetch data used by the immediately
following instruction. If we perform no instruction scheduling to fill delay
slots, what is the increase in CPI compared to the original pipeline (i.e.,
without pipelining the memory system).
The load stalls are now three cycles rather than one.
With no instruction scheduling we have 0.24 * 0.4 * 3 = 0.288
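The three parts of problem 5 follow the same pattern (event frequency times penalty), which the sketch below makes explicit. The names are illustrative, and part c reproduces the stall contribution exactly as computed in the answer above.

```python
branch_freq = 0.14

# a. Predict not taken: a 3-cycle penalty is paid only on taken branches.
taken_fraction = 0.62
taken_penalty = 3
cpi_a = branch_freq * taken_fraction * taken_penalty

# b. One delay slot: a 1-cycle penalty is paid when the slot cannot be filled.
unfilled_fraction = 1.0 - 0.65
cpi_b = branch_freq * unfilled_fraction * 1

# c. Loads whose result is used by the next instruction stall for 3 cycles.
load_freq = 0.24
dependent_fraction = 0.40
load_stall = 3
cpi_c = load_freq * dependent_fraction * load_stall

print(f"a: {cpi_a:.4f}  b: {cpi_b:.4f}  c: {cpi_c:.4f}")
```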
6. Consider the dynamically scheduled execution of the following code sequence
where a ROB buffer is used. Assume register F10 is initialized with the value 1.0
and memory locations 0(R1) and 0(R2) are initialized with 6 and 7 respectively.
All other registers are initialized to 0. Consider the first iteration through the loop.
LOOP:  1. L.D     F2, 0(R1)
       2. L.D     F4, 0(R2)
       3. MUL.D   F6, F2, F4
       4. SUB.D   F4, F6, F10
       5. DIV.D   F6, F12, F2
       6. S.D     F6, 0(R1)
       7. ADDD    F8, F8, F4
       8. DADDIU  R1, R1, #-8
       9. DADDIU  R2, R2, #-8
      10. BNE     R1, R4, LOOP
a. Show a valid state of a 4 entry ROB when instruction 7 is issued. Identify
the head and tail of the ROB.
          Destination   Value      Status
          F6            NO VALUE   PENDING
          0(R1)         NO VALUE   PENDING
TAIL ->   F8            NO VALUE   PENDING
HEAD ->   F4            41         COMPLETED
b. Register re-mapping is employed where architecture registers are
remapped to physical registers (PR). F6 in instruction 3 is remapped on issue to PR 9.
When the DIV instruction reaches the head of the ROB can PR 9 be freed? Justify
your answer.
Yes. This means that all instructions prior to the DIV.D have committed and all
instructions that used the mapped register for F6 have completed. Therefore, it can be
freed.
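The ROB state in part a can be reproduced with a small circular-buffer model. This is a minimal sketch, not a description of any particular machine: the class, its method names, and the assumed timing (instructions 1 to 3 commit before instructions 5 to 7 issue) are all hypothetical choices that happen to yield one valid state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    dest: str
    value: Optional[float] = None
    done: bool = False


class ROB:
    """Circular reorder buffer: allocate at the tail, commit from the head."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[Entry]] = [None] * size
        self.head = 0        # slot holding the oldest instruction
        self.next_free = 0   # slot the next issued instruction will take
        self.count = 0

    def issue(self, dest: str) -> int:
        assert self.count < len(self.slots), "ROB full: issue must stall"
        slot = self.next_free
        self.slots[slot] = Entry(dest)
        self.next_free = (slot + 1) % len(self.slots)
        self.count += 1
        return slot

    def complete(self, slot: int, value: float) -> None:
        entry = self.slots[slot]
        assert entry is not None
        entry.value, entry.done = value, True

    def commit(self) -> Entry:
        entry = self.slots[self.head]
        assert entry is not None and entry.done, "head has not completed"
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return entry

    def tail(self) -> int:
        """Slot of the most recently issued instruction."""
        return (self.next_free - 1) % len(self.slots)


rob = ROB(4)
s1 = rob.issue("F2")          # 1. L.D
s2 = rob.issue("F4")          # 2. L.D
s3 = rob.issue("F6")          # 3. MUL.D
s4 = rob.issue("F4")          # 4. SUB.D
rob.complete(s1, 6.0)
rob.complete(s2, 7.0)
rob.complete(s3, 6.0 * 7.0)   # 42
rob.complete(s4, 42.0 - 1.0)  # 41, completed but not yet committed
for _ in range(3):            # assumed: instructions 1 to 3 have committed
    rob.commit()
s5 = rob.issue("F6")          # 5. DIV.D
s6 = rob.issue("0(R1)")       # 6. S.D
s7 = rob.issue("F8")          # 7. ADDD

for i, e in enumerate(rob.slots):
    mark = "HEAD" if i == rob.head else "TAIL" if i == rob.tail() else ""
    print(f"{mark:5} {e.dest:6} {e.value} {'COMPLETED' if e.done else 'PENDING'}")
```

Other answers are equally valid; for example, SUB.D could still be pending if MUL.D had only just written back.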
7. Consider the following code sequence for a dynamically scheduled machine.
Assume register F10 is initialized with the value 1.0 and memory locations 0(R1)
and 0(R2) are initialized with 6 and 7 respectively. All other FP registers are
initialized to 0. Consider the first iteration through the loop.
1. L.D    F2, 0(R1)
2. L.D    F4, 0(R2)
3. DIV.D  F6, F12, F2
4. MUL.D  F6, F2, F4
5. SUB.D  F4, F6, F10
6. S.D    F6, 0(R1)
a. Assuming an exception occurs on instruction 3 and instructions 4 and 5
have completed execution, what are the contents of F4 and F6 in the following cases?
                                            F4    F6
Using a History Buffer only:                41    42
Using a ROB only:                           7     0
Using a Future Register File with an ROB:   7     0
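The difference between the first two rows can be illustrated with a small model of the register file at the moment the exception is detected. This sketch assumes the history-buffer values are read before any rollback has been performed, and that with a ROB the results of instructions 4 and 5 sit in ROB entries until commit; the data structures are hypothetical.

```python
# Architectural state after the two loads (instructions 1 and 2) have committed.
regs = {"F2": 6.0, "F4": 7.0, "F6": 0.0, "F10": 1.0, "F12": 0.0}

# Results of the two instructions that completed after the faulting DIV.D.
mul_result = regs["F2"] * regs["F4"]   # 4. MUL.D F6, F2, F4  -> 42
sub_result = mul_result - regs["F10"]  # 5. SUB.D F4, F6, F10 -> 41
completed = [("F6", mul_result), ("F4", sub_result)]

# History buffer: results go straight into the register file and the old
# value is logged so that it can be restored later.
hb_regs = dict(regs)
history = []
for dest, value in completed:
    history.append((dest, hb_regs[dest]))
    hb_regs[dest] = value

# ROB: results wait in ROB entries; the register file changes only at commit,
# and nothing after the faulting instruction commits.
rob_regs = dict(regs)
rob_entries = list(completed)

print("history buffer:", hb_regs["F4"], hb_regs["F6"])
print("ROB:           ", rob_regs["F4"], rob_regs["F6"])
```

The logged old values in `history` are what a later recovery step would use to bring the history-buffer register file back to the precise state.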
b. What is a precise exception?
An exception is precise if its occurrence is as if it occurred after instruction i and before
instruction i+1. All instructions up to and including i complete execution and all
instructions following and including i+1 have not executed.
8. Consider the dynamically scheduled execution of the following code sequence.
The first time through the loop an exception occurs on the DIV instruction.
Distinguish between how a precise exception will be handled using a ROB and a
history buffer. How does register renaming affect or not affect the handling of
exceptions.
LOOP:  1. L.D     F2, 0(R1)
       2. L.D     F4, 0(R2)
       3. MUL.D   F6, F2, F4
       4. SUB.D   F4, F6, F10
       5. DIV.D   F6, F12, F2
       6. S.D     F6, 0(R1)
       7. ADDD    F8, F8, F4
       8. DADDIU  R1, R1, #-8
       9. DADDIU  R2, R2, #-8
      10. BNE     R1, R4, LOOP
With an ROB: Exceptions for an executing instruction are flagged in its
ROB entry, but not raised. The processor raises an exception associated
with an instruction when that instruction reaches the head of the ROB.
Since instructions in the ROB are allotted entries in program order, they
are committed in program order. Instructions fetched speculatively on a
mispredicted branch are never committed. Therefore, all exceptions are
precise.
With an ROB, register renaming does not affect handling of precise
exceptions. This is because register renaming does not affect how the
instructions commit (always in program order) and exceptions can be
raised only at commit time.
In the case of a history buffer, instructions are allocated history buffer
entries that contain the old value (history) of the register being written. If
an exception occurs, the corresponding history buffer entry is marked.
When the excepting instruction reaches the head of the history buffer, the
history buffer is scanned from tail to head (newest entry first) and each
register is restored to the old value saved in its entry. This is needed
because instructions write directly to the register file.
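The history-buffer recovery step can be sketched as follows. The register values are hypothetical (the problem does not give initial values; F2 = 6, F4 = 7 and F10 = 1 are borrowed from problem 6, and the R1 and R2 values are made up), and the model assumes instructions 7 to 9 completed out of order before the DIV.D exception was handled.

```python
def rollback(regs, history, first_undone):
    """Undo history[first_undone:], newest entry first, restoring old values."""
    for dest, old_value in reversed(history[first_undone:]):
        regs[dest] = old_value
    del history[first_undone:]


# Register file as left by out-of-order completion (hypothetical values).
regs = {"F4": 41.0, "F6": 42.0, "F8": 41.0, "R1": 92, "R2": 192}

# One (destination, old value) entry per instruction, oldest first.
history = [
    ("F6", 0.0),   # 3. MUL.D
    ("F4", 7.0),   # 4. SUB.D
    ("F6", 42.0),  # 5. DIV.D   <- exception marked on this entry
    ("F8", 0.0),   # 7. ADDD
    ("R1", 100),   # 8. DADDIU
    ("R2", 200),   # 9. DADDIU
]

# Undo the faulting instruction and everything younger than it.
rollback(regs, history, first_undone=2)
print(regs)
```

Undoing newest first matters when a register is written more than once in the undone region: the oldest saved value must be the last one restored.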
9. We have a machine capable of retiring up to 4 instructions per cycle from the
ROB. Explain the conditions under which more than one instruction can indeed
be retired in a single cycle.
Multiple consecutive instructions starting at the current head of the ROB
must have completed execution & writeback of results (to their respective
entries in the ROB). All these instructions can be committed in a single
cycle. However, structural limitations (like, number of write-ports to the
register file and bandwidth to memory for committing stores) would put
hard limits on how many of the instructions at the head-of-queue in the
ROB may commit simultaneously.
Note that if multiple consecutive instructions are writing to the same
destination registers or memory location, the commit-hardware can still
commit them in the same cycle by ensuring that the value written to the
destination-register/memory-location comes from the result of the last
committed instruction that wrote that register/location. Finally, all
instructions are logically committed in program order.
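The in-order retirement condition can be expressed as a short function. This sketch models only the "consecutive completed instructions at the head" rule and a commit width; structural limits such as register-file write ports or store bandwidth are not modeled, and the function name is hypothetical.

```python
def retire_count(completed, commit_width=4):
    """Number of instructions that can retire this cycle.

    `completed` lists the completion flags of ROB entries in program order,
    starting at the head. Retirement stops at the first incomplete entry.
    """
    retired = 0
    for done in completed[:commit_width]:
        if not done:
            break
        retired += 1
    return retired


print(retire_count([True, True, False, True]))        # third entry blocks the fourth
print(retire_count([False, True, True, True]))        # head not done: nothing retires
print(retire_count([True, True, True, True, True]))   # limited by commit width
```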