# Instructions

## Transcription

Instructions
```page 1 of 13
ENCM 369 Winter 2015 Lab 11
for the Week of April 6
Steve Norman
Department of Electrical & Computer Engineering
University of Calgary
April 2015
Lab instructions and other documents for ENCM 369 can be found at
http://people.ucalgary.ca/~norman/encm369winter2015/
1 This Lab is important, but will not be marked
If the usual pattern for ENCM 369 labs were followed, this lab would be due Friday,
April 10. That would leave very little time for your TAs to get the lab marked
before the last lab periods on April 13.
So this lab will not be marked, and solutions will be posted sometime during
the week of April 6.
Please make a serious effort to solve the Exercises yourself before checking solutions!
2 Exercise A: Virtual and physical addresses of array elements
2.1 Read This First
To understand how virtual memory works, it’s useful to think about the common
operation of sequentially accessing all of the elements in an array. For example,
consider this simple C function:
int sum_array(const int *arr, int n)
{
    int sum = 0, i;
    for (i = 0; i < n; i++)
        sum += arr[i];
    return sum;
}
arr should point to element 0 of an array, and n should indicate the number of
elements belonging to the array. In a system with virtual memory, the pointer
argument arr will hold a virtual address.
Let’s suppose sum_array is called with arr pointing to element 0 of an array of
1300 elements, each of which is a 4-byte int, and our virtual memory system uses
32-bit virtual addresses, 32-bit physical addresses, and 4 KB pages. Further, let’s
suppose the virtual address of element 0 of the array is 0x1001_0320—if that’s true,
the layout of the array in virtual memory must look like the left side of Figure 1.
Figure 1: An array of 1300 4-byte ints. Addresses must be sequential in virtual address
space, but in physical address space, chunks of the array can be within pages that are far
away from each other.
  Virtual address space
    page 0x10011000 to 0x10011fff   elements 824 to 1299
    page 0x10010000 to 0x10010fff   elements 0 to 823 (element 0 at 0x10010320)
  Physical address space
    page 0x33714000 to 0x33714fff   elements 0 to 823
    page 0x25588000 to 0x25588fff   elements 824 to 1299
However, there are many, many possible arrangements of the array in physical
address space—the right side of Figure 1 shows just one possible arrangement.
If the arguments arr and n and local variables sum and i are all in registers, the
only data memory accesses the function will make will be reads of array elements,
using virtual addresses 0x1001_0320, 0x1001_0324, and so on, in sequence.
Because parts of the array are in two different pages, two different VPN-to-PPN
translations will be used by the load instructions that read the array element values.
The first translation is from VPN 0x10010 to PPN 0x33714 and the second translation is from VPN 0x10011 to PPN 0x25588. It’s possible that both translations
are in the D-TLB when the loop starts, in which case there will be no D-TLB misses
at all while the loop runs. At most there will be two D-TLB misses, one on the first
access to each of the two pages. (One of the effects of a TLB miss is to copy a
translation from a page table to the TLB.)
2.2 What to do, Part I
Consider the scenario outlined in “Read This First”, but with the address of arr[0]
changed to 0x1001_0fb0.
• Make a diagram similar to Figure 1, showing which array elements sit in
which pages. Be as detailed as you can. There is essentially only one correct
answer for the layout in virtual memory. But there’s a huge number of possible
correct answers for the organization in physical memory—come up with just
one, making sure that it is consistent with the given information.
• What is the largest possible number of D-TLB misses as the loop runs? Which
array element accesses could possibly cause D-TLB misses?
2.3 What to do, Part II
Consider the same loop, but with a much larger array. Suppose that the address of
arr[0] is 0x1001_2f00, and that the value of n is 50,000.
• Determine the total number of pages partly or completely occupied by elements of the array.
• Suppose the D-TLB has a capacity of 32 VPN-to-PPN translations. Find the
smallest and largest possible numbers of D-TLB misses as the loop runs from
beginning to end.
3 Exercise B: Integration of VM and caches
3.1
This problem asks you to trace some instruction fetches in a computer system that
has both virtual memory and caches.
The computer runs the MIPS instruction set, so all instructions are 32 bits in
size.
Both virtual and physical addresses are 32 bits wide. The page size is 4 KB.
There are two TLBs: one for instruction address translations, and one for data
address translations. The instruction TLB has room for 4 translations.
There are separate instruction and data caches. The instruction cache is direct-mapped
with 4-word blocks, and has a capacity of 256 bytes.
(The instruction TLB and instruction cache are both unrealistically small in
order to make this a viable pencil-and-paper exercise.)
3.2 What to Do
Suppose a process attempts to fetch three instructions using the following virtual
addresses, in the order given below:
0x0040_0ff8
0x0040_0ffc
0x0040_1010
Suppose that none of the instructions are loads or stores.
The page table for the process contains the following information:
virtual page number   valid bit   physical page number
0x00400               1           0x900c4
0x00401               1           0x900ce
0x00402               1           0x91076
0x10001               1           0x91023
0x7ffff               1           0x90fab
When the sequence of instruction fetches starts, the following information is in the
instruction TLB:
virtual page number   valid bit   physical page number
0x00400               0           0x91111
0x00400               1           0x900c4
0x00402               1           0x91076
0x80000               0           0x80000
When the sequence of instruction fetches starts, the following information is in the
instruction cache (along with 256 bytes of instructions):
set bits   valid bit   tag
0x0        1           0x900c4f
0x1        1           0x900ce0
0x2        1           0x900ce0
0x3        1           0x900c4f
0x4        1           0x900c4f
0x5        1           0x900c4f
0x6        0           0x000000
0x7        0           0x000000
0x8        1           0x800001
0x9        1           0x800001
0xa        1           0x800001
0xb        1           0x800001
0xc        1           0x900c4f
0xd        1           0x900c4f
0xe        1           0x900c4f
0xf        1           0x910764
For each of the three instruction fetches, answer the following questions:
• Is it a TLB hit or a TLB miss? Or is it not possible to tell with the given
information?
• If it is a TLB miss, does the miss cause a page fault?
• Is it a cache hit or a cache miss? Or is it not possible to tell with the given
information?
• Why is it helpful to know that none of the instructions being fetched are loads
or stores?
4 Exercise C: Why TLB miss handling must be fast
4.1
In programs with good spatial locality of reference, TLB misses will be relatively
rare, and handling these misses will not add significantly to running time of the
program.
However, certain important algorithms necessarily have bad spatial locality. An
example is binary search, which can very quickly find a number in a very large
sorted array of numbers. (You do not have to know what binary search is or how
it works to do this exercise.)
This exercise is designed to help you understand the importance of speed in a
TLB miss handler in a case where there are frequent TLB misses.
4.2 What to do
Here is MIPS assembly language for a key loop within a procedure that performs a
version of binary search:
L1:     srl     \$t1, \$t0, 1
        and     \$t9, \$t1, \$t8
        lw      \$t3, (\$t9)
        slt     \$t4, \$a0, \$t3
        movn    \$a2, \$t9, \$t4
        movz    \$a1, \$t9, \$t4
        addiu   \$t5, \$a1, 4        # mnemonic missing in the transcription; addiu assumed
        sltu    \$t6, \$t5, \$a2
        bne     \$t6, \$zero, L1
        addu    \$t0, \$a1, \$a2     # mnemonic missing in the transcription; addu per the note below
Note that it is coded for a true MIPS system with branch delays, so when the
branch is taken, the sequence of instructions is bne, then addu, and then srl.
Question 1. Assume that the code is run on a MIPS processor that uses a simple
pipeline that will normally start one instruction per clock cycle. In the given loop,
if there are no cache misses, one-cycle stalls will be needed only so that the slt
instruction can use the lw result, and the bne instruction can use the sltu
result. If there are no cache misses or TLB misses, how many clock cycles will it
take to run through the loop 20 times?
Question 2. To search a particular array of 1 million ints, the loop does in fact
run 20 times, and the sequence of virtual data addresses used by the lw instruction
is:
0x10208480, 0x102fc6c0, 0x102825a0, 0x10245510,
0x10226cc8, 0x102360ec, 0x1023dafc, 0x10239df4,
0x10237f70, 0x1023702c, 0x1023688c, 0x10236c5c,
0x10236e44, 0x10236f38, 0x10236fb0, 0x10236fec,
0x10236fcc, 0x10236fbc, 0x10236fc4, 0x10236fc8.
The VPN for this system is bits 31–12 of a virtual address. Assume that when
the loop runs, none of the needed VPN-to-PPN translations for data are in the
data TLB, but all of the needed data pages are in physical memory. Assume also
(very unrealistically) that there are no D-cache misses as a result of any of the loads.
Finally, assume there are no misses in the instruction TLB or instruction cache.
• If it takes 10 clock cycles to handle a miss in the data TLB, how many clock
cycles in total does TLB miss handling add to the answer to Question 1?
(This will require some analysis of the sequence of addresses used by the load
instruction.)
• Revise your answer, assuming now that it takes 100 cycles to handle a miss
in the data TLB.
• Would you say it doesn’t matter much whether the TLB miss handler takes
10 or 100 cycles, or would you say that it has a significant impact on the
running time of the loop?
4.3 Extra note
This problem asks you to think about TLB misses in code that makes data memory accesses with poor spatial locality. However, it is important to know that in
most real-world cases of bad spatial locality cache misses will tend to cause more
5 Exercise D: Page table organization
5.1
This exercise is designed to help you understand what a page table is, and what
some of the design considerations are in choosing how to organize a page table.
The details about page table entries (PTEs) and page table organization are a
blend of details taken from Section 8.4 of your textbook, address space and page
table management in 32-bit x86 Linux, and 32-bit MIPS TLB miss handling. So,
put together, the details don’t match any real computer system, past or current,
but are realistic enough to give some insight about how virtual memory works on
real computers.
We are considering a system with 32-bit virtual addresses, 32-bit physical addresses, and 4 KB pages, so, as seen in textbook and lecture examples, VPNs (virtual
page numbers) and PPNs (physical page numbers) will both be 20 bits wide. Our
system will use one 32-bit memory word for each PTE:
  bits 31-24   (unused)
  bits 23-20   Alloc, Valid, Dirty, and Ref status bits (Alloc is bit 23)
  bits 19-0    20-bit PPN field
The Valid, Ref, and Dirty bits play exactly the same roles as the V-bit described in
textbook section 8.4.2, and the U-bit and D-bit described in textbook section 8.4.5.
The purpose of the “Alloc” bit is to indicate whether or not a virtual page
exists at all for a given VPN. If a VPN is used to look up a PTE, and that PTE
has Alloc=1 and Valid=1, that means the corresponding virtual page is in physical
memory with the PPN given within the PTE. If that PTE has Alloc=1 and Valid=0,
the corresponding virtual page is on disk. But if the PTE has Alloc=0, that means
there is no virtual page for the given VPN.
Bits 31–24 in our PTE format could be used for more status bits, such as write-access (does the process have permission to write to the page, or only to read?)
or execute-permission (which would indicate whether a process is allowed to fetch
instructions from the page).
The simplest way to organize a page table is to simply make it a “flat” table, in
other words, just a big array of PTEs, with one PTE for each possible VPN. The
VPN would be used simply as an array index to find a PTE. Let’s assume that the
range of possible VPNs for a user process runs from 0x00000 to 0xbffff, which
happens to be the range of allowable VPNs for user processes in current 32-bit x86
Linux systems. Figure 2 shows an example of such a flat page table for a process
with three pages of instructions starting at virtual address 0x0040_0000, two pages
of static data starting at virtual address 0x1001_0000, and three pages of stack
starting at virtual address 0xbfff_d000. Notice that there are lots of PTEs filled
with 0 bits—in each of these PTEs bit 23, the Alloc bit, is zero. So most of the
page table is filled with information that indicates the nonexistence of many, many
virtual pages!
Let’s suppose our computer uses the MIPS instruction set. Then in assembly language the TLB miss handler would look something like the code shown in Figure 3.
The “quick” decision made in the second group of “special memory management
instructions” is:
• If Alloc=1 and Valid=1, quickly update a TLB with information from \$k0 and
\$k1, and restart a user process on whatever instruction caused the TLB miss.
• If Alloc=1 but Valid=0, jump to kernel code to handle a page fault (in other
words, get the kernel started on the disk operation needed to get the desired
virtual page into physical memory).
• If Alloc=0, jump to kernel code to deal with an attempted illegal memory
access by a user process.
Note that the MIPS register use conventions prohibit user processes from using
GPRs \$k0 and \$k1. This rule is in place to help optimize TLB miss handlers for
speed: a miss handler can use those two registers without first saving them to
memory to preserve data belonging to a user process.

Figure 2: Flat page table. PTEs are 32-bit words; in the diagram each nonzero PTE is
shown split into a 12-bit status-bit field and a 20-bit PPN field.
Index     status   PPN
0xbffff   0x00e    0x81f23
0xbfffe   0x00f    0x99012
0xbfffd   0x00f    0x87005
0xbfffc   0x0
...
0x10012   0x0
0x10011   0x008    0x44552
0x10010   0x00f    0x95232
0x1000f   0x0
...
0x00403   0x0
0x00402   0x00d    0x86aba
0x00401   0x00d    0x9a9a0
0x00400   0x00d    0x92777
0x003ff   0x0
...
0x00000   0x0

Figure 3: Sketch of TLB miss handler code for the page table organization of Figure 2.
A few special memory management instructions to copy the VPN
into \$k0 and the base address of the page table into \$k1.

        sll     \$k0, \$k0, 2        # \$k0 = VPN * 4
        addu    \$k1, \$k1, \$k0
        lw      \$k1, (\$k1)         # \$k1 = PTE
        srl     \$k0, \$k0, 2        # restore VPN in \$k0

A few more special memory management instructions to inspect
the page status bits within the PTE in \$k1 and make a quick
decision.
Here is an example of how one of the all-zero PTEs in Figure 2 could be used.
Suppose our computer runs a Linux-like operating system. A novice programmer
writes a program for it in MIPS assembly language. When the program runs as a
user process, suppose it happens to be given virtual memory exactly corresponding
to the page table shown in Figure 2.
Suppose the program contains the instruction lw \$t1, (\$t0). Unfortunately
the program is defective, and when the load instruction is reached, the address
in \$t0 is 0x1001_2468, which isn’t in the set of virtual addresses the program is
allowed to use. What happens, in detail, is as follows:
• The VPN of 0x10012 is generated from 0x1001_2468.
• There is a miss in the data TLB. (Make sure you understand why there cannot
possibly be a hit!)
• The TLB miss handler software uses the VPN as an index in the page table,
which produces an all-zero PTE in \$k1.
• Because the Alloc bit in the PTE is zero, the TLB miss handler decides that
the process has tried to make an illegal memory access, and shuts down the
process. Our novice programmer is left to figure out why the program died
with a “segmentation fault” error message.
5.2 What to Do, Part 1
• For the process whose page table is shown in Figure 2, is the data word at
virtual address 0x1001_1400 on disk or in physical memory? If it is in physical
memory, what is the physical address for the word?
• Repeat the previous question, but use 0xbfff_dffc as the virtual address.
• What is the combined size, in KB, of all the instruction, static data, and stack
pages in use by the process of Figure 2?
• What is the size, in KB, of the page table shown in Figure 2?
5.3
I hope you concluded from the last two answers you found in What to Do, Part 1,
that the page table was unreasonably huge compared to how much memory was
needed for the actual instructions and data of the process. This example illustrates
the fact that a flat page table is not a practical solution in the case of 32-bit addresses
and 4 KB pages.
In Linux on x86 systems, addresses really are 32 bits wide and pages really are
4 KB in size. Page table organization is close to what is shown in Figure 4 on
page 10. (I am omitting some complicated details, so I can’t say the organization
is exactly as shown in the figure.) Instead of one huge page table for each process,
there are a number of medium-size page tables. Part of the VPN is used to find a
pointer within an array of pointers; this array of pointers is called a page directory.
If a valid (non-null) pointer is found in the page directory, that pointer is assumed
to point at the base of the page table where the needed PTE is located. Details
about which bits of the VPN are used for what purposes are given in the caption
to Figure 4.
5.4 What to Do, Part 2
• For the process with the page table information shown in Figure 4, describe
how a TLB miss handler would determine that 0x10012 is not a legal VPN.
• For the process whose page table is shown in Figure 4, is the instruction word
at virtual address 0x0040_2820 on disk or in physical memory? If it is in
physical memory, what is the physical address for the word?
• Repeat the previous question, but consider the data word at virtual address
0x1001_1454.
• What is the combined size, in KB, of all the instruction, static data, and stack
pages in use by the process of Figure 4?
• What is the combined size, in KB, of the page directory and the page tables
shown in Figure 4? Does this seem reasonable compared to the size of the
page table in Figure 2?
• Consider the TLB miss assembly language code of Figure 3. Replace the sll,
addu, lw, and srl instructions with a sequence suitable for searching the page
table organization of Figure 4. Pretend that in addition to \$k0 and \$k1, one
more GPR called \$k2 is available to hold intermediate results. (Hint: Your
sequence should be about 8–10 instructions in length, and should include two
lw instructions.)
5.5 More notes about page table organization
• Real 32-bit MIPS systems use an efficient but somewhat hard-to-explain system for page tables that is not a two-level organization.
• 32-bit x86 systems have a two-level page table organization much like what
is presented in this exercise, but routine TLB miss handling is not done by
kernel software—instead special hardware is dedicated to fast searches for
PTEs within page tables. (However, if there is a miss in the page table after
a miss in a TLB, then kernel software will have to manage the problem.)
Figure 4: Two-level page table structure. PTEs have exactly the same format as in
Figure 2. Bits 31–22 of a virtual address (so, bits 19–10 of the VPN for that virtual
address) are used as an index into the page directory array. If a non-null pointer is found
in the page directory, that pointer is used as the base address of a 1024-word page table,
and bits 21–12 of the virtual address (so, bits 9–0 of the VPN for that virtual address) are
used as an index into that page table. (Note: Translations for the process of this figure
are not the same as translations for the process with the page table of Figure 2.)
Page directory (entries not listed are null, 0x0)
Index   Entry
0x2ff   pointer to a page table
0x040   pointer to a page table
0x001   pointer to a page table

Page table reached from page directory index 0x2ff (entries not listed are 0x0)
Index   status   PPN
0x3ff   0x00e    0x89772
0x3fe   0x00e    0x84dfe
0x3fd   0x00f    0x91f01

Page table reached from page directory index 0x040 (entries not listed are 0x0)
Index   status   PPN
0x011   0x00d    0x88a62
0x010   0x00f    0x9000e

Page table reached from page directory index 0x001 (entries not listed are 0x0)
Index   status   PPN
0x002   0x00d    0x81fff
0x001   0x008    0x94321
0x000   0x00d    0x8a8a5
6 Exercise E: Followup to Exercise D
6.1
This is an optional exercise, which you should do if you want to learn about the
effect of address size on page table organization.
6.2 What to do, Part 1: Back to the 1970’s
Consider a machine with 16-bit words, 16-bit virtual addresses, and 20-bit physical
addresses. If pages are 4 KB in size, what would the size of a “flat” page table be,
assuming that a single PTE could fit in a 16-bit word? Would there be any reason
to use a two-level table instead?
(This problem is based loosely on documentation of Digital Equipment Corporation PDP-11 computers, which I am just barely young enough not to have used.)
6.3 What to do, Part 2: Actual computers of 2015
In the x86-64 architecture, pointers are 64 bits wide. However, on Linux, using
current processor chips, the range of addresses available to a user process runs from
0 to 0x7fff_ffff_ffff—so bits 46–0 of the address can be a mix of 0’s and 1’s,
but bits 63–47 have to be 0. (I think this is also true for 64-bit Windows and 64-bit
Mac OS X, but I haven’t checked.)
That means that if pages are 4 KB in size, the meaningful part of a VPN is
47 − 12 = 35 bits wide.
Consider using a two-level page table structure for such a system. Suppose
that even the smallest process needs a page directory and at least two page tables
that would be found using pointers within the page directory. What would be the
minimum combined size of a page directory and two page tables? (Keep in mind
that pointers are 64 bits wide, and assume that a PTE is 64 bits in size, because
32 bits is now probably not enough room to hold a PPN and page status bits.)
What can you conclude about whether a two-level page table organization is
reasonably space-efficient for x86-64?
7 Exercise F: Loads and stores in a writeback cache
7.1
This is an optional exercise, which you should do if you want to get some insight
into the detailed operation of a write-back set-associative data cache.
7.2
This exercise is about an example writeback data cache design that was not presented in lectures this year. However, you can learn about the design by following
http://people.ucalgary.ca/~norman/encm369winter2015/section01/
. . . and reading slides 50–65 of Set 9, and the related notes.
A diagram of this cache circuit is given in Figure 5.
Suppose that this cache is used in a computer that does not have virtual memory.
Suppose also that the computer has only one level of caches—you do not have to
worry about whether this is an L1 or L2 cache.
Figure 5: 4 KB 2-way set-associative cache, with 2-word blocks and LRU replacement.
This diagram shows details of hit-detection and data-read logic, but does not show all the
hardware related to writes.
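[Circuit diagram: 256 sets (set 0 to set 255), each with two ways (way 1 and way 0)
and one U bit. Each way holds a D bit, a V bit, a 21-bit tag, and two 32-bit words
(word 1 and word 0). The 8 set bits of the address drive a decoder, the 21-bit search
tag is compared against the tag in each way to produce Hit1 and Hit0 (1 for hit, 0 for
miss), and the block offset bit selects the 32-bit word made available to the core.
The cache is connected to main memory.]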
Figure 6: State of the cache after a program has been running for a while, just before
it starts on the sequence of instructions listed in “What to Do” of Exercise F. Tags, data
words and set numbers are given in hex; 0x is left out to save space.
way 1                          way 0
D V tag    word 1   word 0     D V tag    word 1   word 0     U set
0 0 000000 00000000 00000000   1 1 0ffffb 00000063 00000058   1 ff
...
0 1 020021 00000062 fffffffe   0 1 0ffffc 0000002a 00000037   1 03
1 1 020020 fffffff1 fffffff5   1 1 0ffffc 004000b4 00000013   0 02
1 1 020021 00000065 64636261   1 1 020023 fffffff4 ffffffd8   1 01
0 1 020022 00000007 00000009   0 1 020021 00000171 00000161   1 00
Figure 7: State of some main memory words after a program has been running for a while,
just before it starts on the sequence of instructions listed in “What to Do” of Exercise F.
address       data
0x1001_0000   0xffff_ffff
0x1001_0004   0xffff_fff8
0x1001_0008   0xffff_fffc
0x1001_000c   0xffff_fff4
0x1001_0010   0xffff_fffa
0x1001_0014   0xffff_fff1
0x1001_0018   0xffff_ffe0
0x1001_001c   0xffff_ffe7
7.3 What to Do
Suppose a program has been running for a while, and is about to start on the
following sequence of instructions:
        lui     \$t0, 0x1001
        lw      \$t1, (\$t0)
        addiu   \$t2, \$t1, 1        # mnemonic missing in the transcription; addiu assumed
        sw      \$t1, (\$t0)
        lw      \$t3, 16(\$t0)
        lw      \$t4, 28(\$t0)
        ori     \$t5, \$zero, 0x345
        sw      \$t5, 12(\$t0)
At that moment, the cache is in the state shown in Figure 6, and some relevant
main memory contents are as shown in Figure 7.
Make a table similar to Figure 6 to show the state of the cache just after all
the listed instructions have finished. Also, make a list of the addresses and data
involved in writing dirty blocks back to main memory via a write buffer.
Hint: There is a useful program in encm369w14lab11/exF
```
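
The address arithmetic behind Figure 1 of Exercise A can be sketched in C. This is only a minimal illustration, assuming 32-bit virtual addresses and 4 KB pages as stated in the lab. The helper names (`vpn_of`, `offset_of`, `pages_spanned`, `first_index_in_next_page`) are hypothetical and do not come from the lab materials.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096u   /* 4 KB pages, as in the lab */
#define OFFSET_BITS 12u     /* log2(PAGE_SIZE) */

/* Virtual page number: bits 31-12 of a 32-bit virtual address. */
uint32_t vpn_of(uint32_t vaddr)
{
    return vaddr >> OFFSET_BITS;
}

/* Page offset: bits 11-0 of a 32-bit virtual address. */
uint32_t offset_of(uint32_t vaddr)
{
    return vaddr & (PAGE_SIZE - 1u);
}

/* Number of pages partly or completely occupied by an array of n elements,
 * each elem_size bytes, whose element 0 is at virtual address base.
 * Assumption: n > 0 and the array does not wrap past the top of the
 * 32-bit address space. */
uint32_t pages_spanned(uint32_t base, uint32_t n, uint32_t elem_size)
{
    assert(n > 0u && elem_size > 0u);
    uint32_t last_byte = base + n * elem_size - 1u;
    return vpn_of(last_byte) - vpn_of(base) + 1u;
}

/* Index of the first element that starts in the page following the one
 * that holds element 0. */
uint32_t first_index_in_next_page(uint32_t base, uint32_t elem_size)
{
    assert(elem_size > 0u);
    uint32_t bytes_left = PAGE_SIZE - offset_of(base);
    return (bytes_left + elem_size - 1u) / elem_size;   /* ceiling division */
}
```

For the Figure 1 scenario (element 0 at 0x10010320, 1300 four-byte ints), `vpn_of(0x10010320u)` is 0x10010, `pages_spanned(0x10010320u, 1300u, 4u)` is 2, and `first_index_in_next_page(0x10010320u, 4u)` is 824, which matches the split between elements 0 to 823 and elements 824 to 1299 shown in the figure.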