Instructions - University of Calgary Webdisk Server

ENCM 369 Winter 2015 Lab 10
for the Week of March 30
Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
March 2015
Lab instructions and other documents for ENCM 369 can be found at
http://people.ucalgary.ca/~norman/encm369winter2015/
1 Administrative details
1.1 You may work in pairs on this assignment
You may complete this assignment individually or with one partner.
Students working in pairs must make sure both partners understand all of the
exercises being handed in. The point is to help each other learn all of the lab
material, not to allow each partner to learn only half of it! Please keep in mind
that you will not be able to rely on a partner to do work for you on the final exam.
Two students working together should hand in a single assignment with names
and lab section numbers for both students on the cover page. Names should be
complete and spelled correctly. If you as an individual are making the cover page,
please get the information you need from your partner. For partners who are not
both in the same lab section, please hand in the assignment to the collection box
for the student whose last name comes first in alphabetical order.
1.2 Due Dates
The Due Date for this assignment is 3:00pm Monday, April 6. (April 3 is a holiday.)
The Late Due Date is 9:30am Tuesday, April 7.
The penalty for handing in an assignment after the Due Date but before the Late
Due Date is 3 marks. In other words, X/Y becomes (X–3)/Y if the assignment is
late. There will be no credit for assignments turned in after the Late Due Date;
they will be returned unmarked.
1.3 Marking scheme
Exercise  Marks
A          4 marks
B          5 marks
C          4 marks
D          3 marks
E          3 marks
total     19 marks
1.4 How to package and hand in your assignments
Please see the Lab 1 instructions.
2 Exercise A: Tracing behaviour of an I-cache
2.1 Read This First
In learning about caches, it is useful to trace through all of the copying of bit
patterns that occurs in a sequence of memory accesses.
2.2 What to Do
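Figure 1: Computer with one level of cache, and no address translation.
[Block diagram: the processor core (control, registers, ALUs, f-p hardware) sends
32-bit physical addresses to an I-cache and to a D-cache. 32-bit instructions come
from the I-cache, and 32-bit data passes between the core and the D-cache. Main
memory is connected to both caches by 32-bit physical address lines and a 32-bit
path for instructions or data.]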
The table in Figure 2 shows some of the main memory contents for a program
running on a computer like the one shown in Figure 1, running the MIPS instruction
set (except that this computer does not have delayed jumps and branches).
Note that the first sequence of instructions is a complete procedure, but the second
sequence is only part of a procedure. The second sequence includes a loop; within
that loop there is a call to the procedure comprised of the first sequence of
instructions. You will be asked to trace the interaction between the I-cache and the
main memory starting with PC = 0x0040_2634, at the moment in time just before the
beq instruction is fetched.

Figure 2: Small fragments of main memory contents for Exercise A.
address      instruction at address  disassembly of instruction
...          ...                     ...
0x0040_1640  0x8c88_0000             P1: lw    $t0, ($a0)
0x0040_1644  0x0008_48c0                 sll   $t1, $t0, 3
0x0040_1648  0xac89_0000                 sw    $t1, ($a0)
0x0040_164c  0x03e0_0008                 jr    $ra
...          ...                     ...
0x0040_2634  0x1211_0004                 beq   $s0, $s1, L2
0x0040_2638  0x0200_2021             L1: addu  $a0, $s0, $zero
0x0040_263c  0x0c10_0590                 jal   P1
0x0040_2640  0x2610_0004                 addiu $s0, $s0, 4
0x0040_2644  0x1611_fffc                 bne   $s0, $s1, L1
0x0040_2648  0x2652_0000             L2: addiu $s2, $s2, 0

Figure 3: Initial state of I-cache for Exercise A. (Only sets 397–403 in the cache are
shown.)
set  valid  tag      instruction
397  1      0x00402  0x1211_0004
398  1      0x00400  0x256b_0008
399  0      0x00000  0x0000_0000
400  1      0x00401  0x8c88_0000
401  1      0x00401  0x0008_48c0
402  1      0x00401  0xac90_0000
403  1      0x00400  0x8dcd_0014
The I-cache for this computer is direct-mapped with 1024 sets, and the block
in each set contains one instruction. This structure exactly matches an example
presented in both lecture sections this term. For this cache, main memory addresses
are split like this:
bits 31–12: search tag (20 bits)
bits 11–2:  set bits (10 bits)
bits 1–0:   byte offset (always 00)
At the moment in time mentioned above, the state of sets 397–403 of the cache is
shown in Figure 3. Use 0x1000_1020 as the initial value for $s0 and 0x1000_1028 as
the initial value for $s1. Trace all the instruction fetches until after the instruction
at address 0x0040_2648 has been fetched. Record your answer in tabular form,
using this trace of the first two instructions as a model:
address      tag      set  action
0x0040_2634  0x00402  397  I-cache hit: no I-cache update
0x0040_2638  0x00402  398  I-cache miss: instruction 0x0200_2021 is copied into
                           “instruction” field in set 398, V-bit in that set is
                           changed to 1, tag to 0x00402
Hints: (1) Including the two instruction fetches given as examples, there will be a
total of 18 instruction fetches. (2) There is a useful C program in
encm369w15lab10/exA.
2.3 What to Hand In
Hand in your completed table.
3 Exercise B: Analysis of direct-mapped caches
3.1 Read This First
3.1.1 The base-two logarithm
The base-two logarithm is a simple and useful concept that has many useful
applications in computer systems, one of which is describing dimensions in caches. The
log2 function is simply the inverse of the function f(x) = 2^x, in the same way that
the ln function is the inverse of f(x) = e^x and the log10 function (often written as
simply log) is the inverse of f(x) = 10^x. For example,
log2 1 = 0, log2 4 = 2, log2 32 = 5, and log2 65536 = 16.

Figure 4: General organization of a direct-mapped cache. Wiring and some logic
components have been left out to reduce clutter.
[Diagram: a decoder driven by the set bits selects one of the sets, numbered set 0
through set S−1. There is 1 tag comparator for the whole cache.]
Key to colouring of storage cells:
• status bit(s): 1 valid bit, plus, possibly, another bit to help with writes
• tag bits
• block of data or instruction words
Note: Relative sizes are not to scale. A block might be as large as 64 bytes (512 bits),
which is difficult to describe graphically in proportion to a single status bit.
3.1.2 Dimensions within direct-mapped caches
Direct-mapped lookup is the simplest practical way to organize a data or instruction
cache. It is illustrated in Figures 8.7 and 8.12 in the course textbook. The general
structure of all direct-mapped caches is shown in Figure 4.
To search for an instruction word or data word in a direct-mapped cache, the
main memory address of that word is broken into three or four parts:
• tag: used to distinguish a block address among a group of block addresses
that all generate the same set bits;
• set bits: used to select one set among the S sets in the cache;
• block offset: used to select a single word within a multi-word block, so not
necessary if the cache has one-word blocks;
• byte offset: can be assumed to be zero when the cache access is to an entire
data or instruction word, but would be important for access to individual bytes
in instructions such as MIPS lb, lbu, and sb.
How wide is each of these fields? Let’s go right-to-left within an address. First,
byte offset width = log2 (number of bytes per word).
So with 4-byte words, as in textbook and lecture examples, the byte offset width is
log2 4 = 2, but with 8-byte words, the byte offset width would be log2 8 = 3.
Next,
block offset width = log2 (number of words per block).
For example, in textbook Figure 8.12, there are 4 words per block, so the block
offset is log2 4 = 2 bits wide. Note that that gives you four different bit patterns to
choose one of the four words within a block: 00, 01, 10, 11. Note also that log2 1 = 0,
consistent with the idea that if the block size is one word, no bits from the address
should be used as a block offset—that is what you see in textbook Figure 8.7.
Moving on, let’s let S stand for the number of sets within the cache. Then
number of set bits = log2 S.
I hope you can see a general pattern here: If X is a power of two, then log2 X bits
are needed to select one of X things.
Finally, the tag is everything in an address that hasn’t already been used:
tag width =
address width − number of set bits − block offset width − byte offset width
The capacity of a cache is usually defined as the maximum number of bytes
of data or instructions that a cache can hold. Note that that definition excludes
storage of status bits and tag bits. Let’s let Bpl (“bytes per line”) stand for the
size of a block. (Cache line is a synonym for cache block, and unlike “block”, “line”
does not confusingly start with the letter b.) From Figure 4 it should be clear that
for a direct-mapped cache
C = S × Bpl,
and if we want to think of block size measured in words instead of bytes,
C = S × words per block × bytes per word.
3.1.3 Example calculations
1. Suppose it has been decided that a direct-mapped cache should have 1024 entries, each with one 32-bit data word. How should addresses be split into parts?
And what will the capacity of this cache be?
Byte offset: A 32-bit word is 4 bytes, so the width is log2 4 = 2.
Block offset: There is no block offset, because there is only one word per
block.
Set bits: S = 1024, so we need log2 1024 = 10 set bits.
Tag: The tag is all the bits to the left of the set bits. So addresses should be
split this way:
bits 31–12: search tag (20 bits)
bits 11–2:  set bits (10 bits)
bits 1–0:   byte offset (always 00)
Capacity is
C = 1024 blocks × 1 word/block × 4 bytes/word = 4096 bytes = 4 KB.
Remark: This has been a review of the organization of the cache used in
Exercise A of this lab.
2. Suppose it has been decided to build a direct-mapped cache with a (very tiny)
capacity of 32 bytes, in which each block holds 4 32-bit words. What is S, the
number of sets? And how should addresses be split into parts?
To find S, solve for it in this equation:
32 bytes = 2^5 bytes = S × 2^2 words/block × 2^2 bytes/word
That gives
S = 2^5 bytes / (2^2 words/block × 2^2 bytes/word)
  = 2^(5−2−2) blocks = 2^1 blocks = 2 blocks.
(The equation gives S as a number of blocks rather than a number of sets,
but that’s fine, because in a direct-mapped cache there is one block per set.)
The address split is: byte offset, log2 4 = 2 bits; block offset, log2 4 = 2 bits;
set bits, log2 2 = 1 bit; tag, 32 − 1 − 2 − 2 = 27 bits. Graphically, that’s:
bits 31–5: search tag (27 bits)
bit 4:     set bit
bits 3–2:  block offset
bits 1–0:  byte offset (always 00)
Remark: This example has been a review of the organization of the cache in
textbook Figure 8.12.
3.2 What to Do
Write well-organized solutions to the following problems.
3. Suppose the specification for the cache of textbook Figure 8.12 is changed.
The block size is now supposed to be 8 words, and the capacity has been
increased to a much more practical size of 8 KB.
(a) What is S, the number of sets?
(b) How should addresses be split into parts? Show this with a diagram
like the previously given examples—include bit numbers marking the
boundaries between parts.
(c) The cache will be built using SRAM cells, one SRAM cell for every V-bit,
every bit within a tag, and every bit within a block of data or instruction
words. How many SRAM cells are needed for the whole cache? Show
your work carefully.
4. The term “64-bit processor” usually describes a processor in which
general-purpose registers are 64 bits wide, and memory addresses within the processor
core and in pointer variables are managed as 64-bit patterns. However, 2^64
bytes of DRAM is an enormously larger quantity of storage than can practically
be connected to a single processor chip, so a typical 64-bit design might
use only the least significant 44 bits of an address to access caches and main
memory. This is illustrated in Figure 5.
Do the following calculations for the computer of Figure 5. We’ll assume
direct-mapped design for all three caches, and we’ll consider the word size to
be 64 bits.
Figure 5: Simplified view of memory organization of a recent single-core 64-bit
system. (The major simplification here is the omission of address translation.) The
core can fetch two 32-bit instructions at once from the Level 1 I-cache. The width of
the data/instruction bus between the Level 1 and Level 2 caches is unspecified but
should be much wider than 64 bits to support speedy transfers of large blocks.
[Block diagram: the core sends 44-bit physical addresses to the L1 I-cache and to the
L1 D-cache; a 64-bit path carries instructions from the I-cache, and a 64-bit path
carries data between the core and the D-cache. Both L1 caches are connected to the
unified L2 cache by 44-bit physical address lines and by instruction/data paths of
unspecified width (marked ??). The L2 cache is connected to main memory by 44-bit
physical address lines and a 128-bit path for instructions or data.]
(a) The capacity of the L1 D-cache is 32 KB. The block size is 64 bytes.
Draw a diagram to show how a 44-bit address input to the cache would
be split into these fields: tag, set bits, block offset, and byte offset.
Indicate exactly how wide each field is.
(b) Repeat part (a) for the L2 cache, which has a capacity of 2 MB. The
block size is again 64 bytes.
(c) Use your results from (b) to determine how many one-bit SRAM cells
are needed for all the V-bits, tags, and data/instruction blocks in the L2
cache.
4 Exercise C: Simulating a Cache
4.1 Read This First
In this exercise you will work with a C program to simulate some aspects of the
behaviour of data caches. For two different sequences of memory accesses you will
determine which accesses are cache hits and which accesses are cache misses. (This
is relatively easy to do; a complete simulation that modeled write-through or
write-back to main memory would be a much more complex problem.)
The sequences of memory accesses were generated by determining the data memory accesses that would be made in sorting an array of 3000 integers using two
different sort algorithms on a machine similar to a 32-bit MIPS computer. Memory
is stored in 32-bit words, and byte addresses are 32 bits in length. All accesses to
data memory by the sort procedures are loads and stores of words, at addresses that
are multiples of four.
These sequences are available in text files. Here are the first ten lines of one of
the files:
r 804e6ac
r 804fe1c
w 804e6ac
r 804e6a8
r 804fe14
r 804fe18
w 804e6a8
w 804fe18
r 804e6a4
r 804fe0c
r means read (load) and w means write (store). The addresses are in hexadecimal
notation, even though there is no leading 0x.
Note that the actual data values are not included in the file, just the addresses
used to access data. It turns out that to count cache hits and misses, the sequence
of addresses used is all that really matters.
4.2 What to Do, Part I
Download the files from encm369w15lab10/exC. Note that two of the files in the
directory are large—if you are close to using up your disk quota (or if the file server
is having a bad day) your copy command might fail. To check that the copy was
successful, use the ls -l command (lower-case L, not number 1) to check the sizes
of the large files—they should be
heapsort_trace.txt    1251240 bytes
mergesort_trace.txt   1824492 bytes
(The files might be a little bigger if you have downloaded them to a Microsoft
Windows system, due to the Windows text file format.)
Read the C source file sim1.c carefully. The program does a simulation of a
data cache with 1024 words in 1-word blocks. Build an executable and run it with
both data files, following the instructions in a comment near the top of the source
file.
Here is some information about two memory access traces:
• In heapsort_trace.txt all data memory accesses are to the array elements.
Each of the 3000 elements is read at least once, and is likely to be read and
written many more times.
• In mergesort_trace.txt the data memory accesses are to all of the array
elements, to a 1500-word array of temporary storage needed by the mergesort
algorithm, and to 72 words of stack used to manage a sequence of recursive
function calls. So the total number of different words accessed is 4572.
In both cases the 1024-word cache is much too small to hold all of the different data
memory words being accessed. If you naïvely guess that access to memory words is
truly random, you would expect very high miss rates, such as (3000 − 1024)/3000 =
65.9% or worse for the heapsort run, and (4572 − 1024)/4572 = 77.6% or worse for
the mergesort run. However, you should see much lower miss rates due to locality
of reference in the memory accesses.
4.3 What to Do, Part II
Let’s examine the effect of changing the block size of the cache of Part I, while
maintaining capacity.
Make a copy of sim1.c called sim2.c, and edit it to simulate a direct-mapped
cache with 256 four-word blocks. You will not have to edit many lines of the
C code, but to do it correctly you will have to do some calculations like the ones in
Exercise B, then think carefully about how to isolate the correct set and tag bits.
Run the program using the two given input files. Make a record of your program
runs with termrec or some other suitable tool, and print that record and your source
file sim2.c.
Also, answer this question:
Compare results obtained in this part with results from Part I. Do they
suggest that there is significant spatial locality of reference in the memory accesses done by the heapsort and mergesort algorithms? Give a
brief explanation.
4.4 What to Do, Part III
Now let’s consider a direct-mapped cache with the same four-word block size as in
Part II, but with capacity doubled to 8 KB.
Make another copy of sim1.c; call this one sim3.c. Edit it to simulate a
direct-mapped cache with the appropriate number of four-word blocks.
Run the program using the two given input files. Make a record of your program
runs with termrec or some other suitable tool, and print that record and your source
file sim3.c.
4.5 Final Note
Cache-friendliness (a tendency towards low miss rates) is an important factor in
performance of algorithms on modern computer hardware.
But don’t use this exercise to draw any firm conclusions about which of the
two sort algorithms is more cache-friendly. (My own suspicion is that mergesort
may often be significantly more cache-friendly than heapsort.) The caches being
simulated are unrealistically small, and using one run of each algorithm for a single
array hardly constitutes enough input data for a good experiment.
Researchers doing simulation experiments to compare cache designs will use
traces of trillions of memory accesses from many different programs in order to
ensure that performance is being measured for many different patterns of memory
access.
4.6 What to Hand In
Hand in the printouts from Parts II and III, and your answer to the question in
Part II. Please label the printouts clearly so your TAs don’t have to guess which
printout is from which part.
5 Exercise D: Cache simulation in MARS
5.1 Read This First
MARS has a feature called a “Data Cache Simulator”. We’ll use this feature to
help consolidate understanding of a couple of concepts:
• caches with multi-word blocks reduce cache miss rates by exploiting spatial
locality;
• direct-mapped caches are susceptible to high miss rates when set conflicts
occur.
In all three parts, we’ll look at two MARS programs that walk through 8-element
arrays src1, src2, and dest, essentially as follows, but using pointer arithmetic
instead of array index arithmetic:
i = 0;
do {
    dest[i] = src1[i] + src2[i];
    i++;
} while (i != 8);
5.2 What to Do, Part I
Copy the directory encm369w14lab10/exD
Start up MARS. Open the file add-vectors1.asm, and read it to confirm that
the loop in main corresponds to the do-while loop given in “Read This First”.
Then click on the Assemble button. Before running the program, do these things:
• Set a breakpoint on the jr instruction at the end of main, so the program
will pause before getting to its “clean-up” code.
• Using the MARS Tools menu, start the Data Cache Simulator tool. Within
the Data Cache Simulation Tool window . . .
– select Direct Mapping for the Placement Policy;
– select 32 for the Number of blocks;
– select 1 for Cache block size (words);
– click on the button labeled Connect to MIPS.
If you have done this correctly you should see 128 as the value for Cache size
(bytes). You are about to see simulation of a ridiculously tiny direct-mapped
cache with 32 1-word blocks.
Run the program by clicking on the single-step button over and over again.
Make sure you observe at least a few passes through the loop in main.
You should see that every single memory access using lw or sw is a cache miss.
This should not be a surprise: the cache has 1-word blocks, and each word is
accessed only once. In other words, the data accesses made by the program totally
lack temporal locality of reference, so cache hits are impossible. After you have
seen enough cache misses, click the run button to get to the breakpoint you
set at the end of main. You should see in the Data Cache Simulation Tool window
that the Cache Hit Rate is 0% and that there are 24 red lines indicating 24 misses
in accesses to 24 of the 32 cache entries.
5.3 What to Do, Part II
This continues directly from Part I.
Click the MARS reset button so you can run the program again from the
beginning. In the cache simulation tool window, change the Number of blocks to 8
and change the Cache block size (words) to 4. If you have done this correctly you
should see that the Cache size is still 128 bytes, and there is a display of eight empty
cache blocks.
If you understand the concept of multi-word blocks, you should be able to make
a good guess at the hit rate before you run the program.
Run the program by single-stepping repeatedly. What you should see is that
there is a miss on the first access to a block, but the next three accesses to the
same block are hits. This is a simple demonstration that use of multi-word blocks
is helpful when programs have good spatial locality of reference.
5.4 What to Do, Part III
In this part you will repeat Part II using a very slightly different program. Load
the file add-vectors2.asm into the MARS editor.
Read the program. Observe that the only change is that some extra string
constants have been added to the data segment; this change will result in changes
to the address ranges for the arrays src2 and dest.
Assemble the program, set a breakpoint on the jr instruction at the end of main,
and click on Reset in the cache simulation window. (Also, make sure the number of
blocks and the block size are the same as in Part II.)
Run the program by single-stepping. You will discover that despite the excellent
spatial locality of memory accesses, the hit rate is not nearly as good as it was in
Part II. This is because I have manipulated the locations of the three arrays of ints
so that there will be some set bit conflicts.
When the program gets to the end of main you should observe that although
the combined size of the three arrays is 24 words, only 16 words (in four 4-word
blocks) of the cache have been used.
5.5 What to Do, Part IV
Make a copy of the following table, then fill in the empty cells with the correct 3-bit
set bit patterns generated by addresses of array elements in the program runs of
Parts II and III.
                 set bits, in base two
array element    add-vectors1.asm      add-vectors2.asm
index            src1  src2  dest      src1  src2  dest
0                100
1                100
2                100
3                100
4                101
5                101
6                101
7                101
Hints:
• The Labels window in MARS will quickly give you the base addresses for the
arrays.
• The set bits are bits 6–4 of an address. For example, the address of src1[0]
is 0x1001_0040 in both programs. In base two, the address is
0001 0000 0000 0001 0000 0000 0100 0000, so the set bits (bits 6–4) are 100.
In addition to the table, write a brief explanation of how you obtained the set bits
you wrote into the table.
5.6 What to Hand In
There is nothing to hand in for Parts I–III. For Part IV, hand in your table and
your explanation of how you found the set bits.
[Diagram with three panels, each showing sets or entries as rows of storage cells.]
N-way set-associative cache: Each way (way 0, way 1, ..., way N−1) is organized like
a direct-mapped cache, with sets numbered set 0 through set S−1 selected by a decoder
driven by the set bits. For lookup, the index selects a set, and all N blocks in that
set must be checked for a hit. N valid-bits are checked all at once; at the same time
N tags are compared with the search tag. N comparators for tags work in parallel.
Direct-mapped cache: Like a 1-way set-associative cache. The set bits decide
which set (set 0 through set S−1) is inspected to test for a hit. There is 1 tag
comparator for the whole cache.
Fully-associative cache: Like a set-associative cache with only one set, holding
entry 0 through entry N−1. No set bits are required for lookup. Instead all tags in
the entire cache are checked at the same time. Each tag has built-in comparison
logic to test for a match with the search tag.
Key to colouring of storage cells:
• status bit(s): 1 valid bit, plus, in the case of a write-back D-cache, 1 dirty bit
• tag bits
• block of data or instruction words
• bits that help with LRU or approximate-LRU replacement
Note: Relative sizes are not to scale. A block might be as large as 64 bytes (512 bits),
which is difficult to describe graphically in proportion to a single status bit.
Figure 6: Graphical comparison of set-associative caches, direct-mapped caches, and
fully-associative caches. After reading this caption you should probably use the “Rotate
Right” tool in your PDF reader to view the diagram. The term “search tag” refers to
the tag extracted from the address of the instruction or data item the processor core is
seeking. The diagrams are intended to show conceptual structure, and leave out wires and
some important logic elements (e.g., multiplexers). Layout in the diagram may or may
not reflect actual physical layout within integrated circuits.
6 Exercise E: Analysis of set-associative caches
6.1 Read This First
As explained in lectures, set-associative caches provide a good (but not perfect)
solution to the problem of set-bit conflicts in direct-mapped caches. The difference
between direct-mapped and set-associative structure is illustrated in Figure 6.
Just as for a direct-mapped cache, in a set-associative cache an address must be
split into parts: tag, set bits, block offset (if there are multi-word blocks), and byte
offset. The formulas for the widths of these address parts are exactly the same as
given for direct-mapped caches in Section 3.1.2.
However, the equations for the capacity of a set-associative cache are slightly
different from those for the capacity of a direct-mapped cache. If C is capacity in
bytes, S is the number of sets, N is the number of ways, and Bpl (“bytes per line”)
is the size in bytes of a block, then
C = S × N × Bpl.
To use a block size given in words instead of bytes:
C = S × N × words per block × bytes per word.
For units to make sense in these equations, it turns out that appropriate units for N
are blocks per set.
Here are some example calculations:
• What is the capacity of the cache of textbook Figure 8.9?
Obtaining all the dimensions we need from Figure 8.9,
C = S × N × words per block × bytes per word
  = 4 sets × 2 blocks/set × 1 word/block × 4 bytes/word
  = 32 bytes.
• Suppose we modify the design of the cache of Exercise A in this lab, so that it
is two-way set-associative instead of direct-mapped, and has four-word blocks
instead of one-word blocks. The capacity will remain 1024 words, that is
4096 = 2^12 bytes. How will addresses be split into parts to access this new
design?
We need to know S to determine the number of set bits.
S = C / (N × words per block × bytes per word)
  = 4 KB / (2 blocks/set × 4 words/block × 4 bytes/word)
  = (2^2 × 2^10 bytes) / (2^1 × 2^2 × 2^2 bytes/set)
  = 128 sets.
The number of set bits is log2 S = log2 128 = 7. The word size is 4 bytes, so
there is a 2-bit byte offset. The block size is 4 words, so the block offset is
also 2 bits wide. Here is the address split:
bits 31–11: 21-bit tag
bits 10–4:  7 set bits
bits 3–2:   2-bit block offset
bits 1–0:   2-bit byte offset
6.2 What to Do
Refer to Problem 4 in Exercise B (in Section 3.2). Do parts (a), (b), and (c) again,
with the following assumptions:
• no change in word size;
• no change in block size;
• no changes to the capacities of the caches;
• L1 D-cache changed to 8-way set-associative;
• L2 cache changed to 16-way set-associative.
6.3 What to Hand In
Hand in well-organized solutions for parts (a), (b), and (c).