Week 7 out-of-class notes, discussions and sample problems

In these notes, we concentrate on the lower levels of the memory hierarchy: main memory (DRAM) and
virtual memory (swap space). We will also briefly look at flash memory, a form of removable storage.
We start with a look at memory technology and improvements.
SRAM – static RAM, built out of flip-flops. 6 transistors can be used to build 1 storage cell. SRAM is
used to make both registers and cache. Although we refer to cache as “on-chip” and “off-chip”, the latest
generation of processors has enough room for 2 to 3 caches on the chip (L1, L2 and optionally L3). The
largest L3 on-chip cache is 12 MB. Often we see smaller L2 caches in our personal computer and laptop
markets. As we already studied cache in class, we go on to the other forms of memory.
DRAM – dynamic RAM, using a single transistor and capacitor to store each bit. The “dynamic” name is
applied because the capacitor quickly loses any charge placed there requiring that each DRAM cell be
recharged fairly often. Additionally, DRAM offers only destructive reads – reading the contents of a cell
causes that cell to be discharged. Therefore, DRAM is set up so that an outgoing charge circles around
and goes back into the cell to recharge it. This same mechanism is used to refresh all memory locations.
Early DRAM may have been as much as 1MB of space, which would require 20 address bits. Because
microprocessors (CPUs on 1 chip) had few pins, the address had to be multiplexed – that is, delivered in
two or more bus transfers. Commonly, an address would be divided into two parts, a row and a column.
The two addresses (row number, column number) would be sent in two bus cycles. Decoders would be
used to take each address and select the byte (or word) within memory corresponding to that row and
column. More recently, memory has been divided into banks. This permits parallel accesses to each
bank. Consecutive memory locations can either reside in the same bank, so that different banks store
different regions of memory (high order interleave), or be spread across consecutive banks (low order
interleave).
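To make the two interleaving schemes concrete, here is a minimal C sketch (the 8-bank count and the 1MB memory size come from the discussion above; everything else is an illustrative assumption) showing how a bank number could be derived from an address under each scheme:

#include <stdio.h>

#define NUM_BANKS 8

/* Low order interleave: consecutive addresses rotate through the banks. */
static unsigned bank_low_order(unsigned addr) {
    return addr % NUM_BANKS;
}

/* High order interleave: each bank holds one contiguous region of memory. */
static unsigned bank_high_order(unsigned addr, unsigned bank_size) {
    return addr / bank_size;
}

int main(void) {
    unsigned bank_size = (1u << 20) / NUM_BANKS;   /* the 1MB memory from the example, split across 8 banks */
    for (unsigned addr = 0; addr < 16; addr++)
        printf("addr %2u -> low order bank %u, high order bank %u\n",
               addr, bank_low_order(addr), bank_high_order(addr, bank_size));
    return 0;
}

Note how consecutive addresses land in different banks under low order interleave but stay in the same bank under high order interleave.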
In order to recharge cells, DRAM uses a memory controller which would in essence read one entry from
each bank in each cycle. Even though the contents of the read would not be sent anywhere, it would
refresh each of those memory locations. The controller would then step through each address in the banks
one at a time until it reaches the last address and then cycle around to start again. Recharging the full
bank might take 8 ms, plenty of time before any memory is lost due to discharge. The refreshing
strategy would not be in use continuously or else DRAM would not be able to respond to requests from
other sources (cache or a DMA controller from the I/O subsystem). In fact, according to the authors, the
memory controller would be refreshing memory about 5% of the time.
DRAM has lagged behind SRAM/CPU speed (see the chart on the next page). Let’s take a look here at
how DRAM has improved over the years. First, DRAM capacity continues to grow at an exponential
rate. DRAM capacity has improved just as transistor count has. In CPU transistor count, we estimate a
doubling of transistors on a chip in roughly 18-24 months (doubling every 2 years for instance). For
DRAM, we see a quadrupling roughly every 3 years, which is a 55% increase per year. Thus, it is
common today for our personal computers to store as much as 8GB (and in some cases 16GB), whereas 10
years ago the size was not even 1GB.
Unfortunately, access time in DRAM has lagged behind. While we expect processors to be 20-30 times
faster today than those of 10 years ago, DRAM access speed has only increased by a factor of 2 or less in
the same time. The consequence is that processor speed improves much faster than DRAM speed, so there
is an ever widening gulf between CPU cycle time and DRAM access time. In fact, since
1980, DRAM has only increased by a factor of about 6 whereas processors are thousands of times faster
over the same time interval. See figure 2.13, p. 99.
Since DRAM itself seems to offer little speedup, what can we do about this ever increasing gap between
processor and DRAM? Architects and computer engineers have hit on a number of ideas. First,
architects added a buffer to retain the row address. This allowed successive accesses to memory to
require only sending one address (column). This is because the successive accesses would occur within
the same row (usually).
The next enhancement was to move DRAM from an asynchronous interface to one that was regulated by
the system bus. Thus, synchronous DRAM or SDRAM was created, which resulted in a faster access
response time because memory would be attuned to the system clock to start its response. SDRAM also
added a byte count register so that, in one CPU request, multiple consecutive bytes could be accessed.
While this does not impact access time, it permits overlapping accesses or a pipelining of memory reads
in which successive reads can be sent back over the bus to the CPU in successive clock cycles. For
instance, a DRAM access might take 80 cycles and a bus transfer 2. Thus, the first byte (or word) is
transferred back in 82 cycles from the start of the request, the next at cycle 84 (2 cycles later), the next at
cycle 86, etc. This is known as burst mode.
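As a quick check of the burst-mode timing just described, here is a small C sketch; the 80-cycle access and 2-cycle transfer are the example's numbers, while the 8-word burst length is an assumption for illustration:

#include <stdio.h>

int main(void) {
    const int access_cycles   = 80;  /* DRAM access latency from the example */
    const int transfer_cycles = 2;   /* bus transfer time per word */
    const int burst_length    = 8;   /* assumed number of words in one burst request */

    /* In burst mode only the first word pays the full access latency;
       each following word arrives one bus transfer later. */
    for (int i = 0; i < burst_length; i++)
        printf("word %d arrives at cycle %d\n", i, access_cycles + (i + 1) * transfer_cycles);
    return 0;
}

Running it reproduces the arrival times 82, 84, 86, ... described above.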
Another innovation is to increase the bus transfer rate. Today, we see DDR (double data rate) DRAMs.
The idea is that data can be transferred on both the leading edge and trailing edge of the clock cycle, thus
doubling data transfer rate. The introduction of banks helps support this by permitting parallel accesses.
Consider using low order interleave; we would see address patterns as follows (we will assume 8 banks, one column per bank):
Bank 0   Bank 1   Bank 2   Bank 3   Bank 4   Bank 5   Bank 6   Bank 7
n        n+1      n+2      n+3      n+4      n+5      n+6      n+7
n+8      n+9      n+10     n+11     n+12     n+13     n+14     n+15
n+16     n+17     n+18     n+19     n+20     n+21     n+22     n+23
So, if we want to retrieve word n, we can also initiate the transfers for n+1 through n+7 simultaneously.
After waiting 80 ns for the accesses to take place, pairs of words are sent back every clock cycle (using
DDR). So, we can retrieve 8 words in just 84 ns. If our bus is wider (say 64 bits instead of 32), we could
even retrieve the 8 words in just 82 ns. High order interleave would not permit this, but we could obtain a
different form of parallelism by letting multiple processes call upon memory simultaneously. For
statistics of various forms of speedup obtained by DDR SDRAMs, see figure 2.14 on page 101.
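Here is a similar sketch for the DDR-plus-interleaving example. The 80 ns access time comes from the example; the 1 ns bus clock is an assumption the notes imply but do not state (8 words over a 32-bit DDR bus take 4 bus cycles and arrive at 84 ns):

#include <stdio.h>

/* Total time (ns) to return 'words' words: one access latency plus enough
   bus cycles to stream them out, where DDR moves two transfers per cycle. */
static double burst_time_ns(int words, double access_ns, double bus_cycle_ns,
                            int words_per_transfer, int ddr) {
    int words_per_cycle = words_per_transfer * (ddr ? 2 : 1);
    int cycles = (words + words_per_cycle - 1) / words_per_cycle;  /* round up */
    return access_ns + cycles * bus_cycle_ns;
}

int main(void) {
    printf("32-bit bus, DDR: %.0f ns\n", burst_time_ns(8, 80.0, 1.0, 1, 1));  /* 84 ns */
    printf("64-bit bus, DDR: %.0f ns\n", burst_time_ns(8, 80.0, 1.0, 2, 1));  /* 82 ns */
    return 0;
}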
One last innovation of note is the GDRAM, or the Graphics Data RAM. This is an SDRAM tailored for
higher bandwidth demands that come about from heavy graphics usage. Among other things, these
DRAMs have both higher clock rates for their pins and wider buses to provide a speedup of at least 2 over
SDRAMs (at a higher cost of course).
Finally, let’s consider Flash memory. Although Flash memory is a form of removable storage and so
should be grouped with optical disc storage, we tend to use Flash more like main memory in many cases
especially with handheld devices which use flash memory in lieu of DRAM or hard disk. The flash
memory is a form of EEPROM (electrically erasable programmable ROM). The EPROM (erasable
programmable ROM) is a memory that can only be wholly erased, that is, to erase memory requires
erasing the entire contents of memory (much like a CD RW). The EEPROM allows piecemeal deletions
but unlike memory, you cannot delete at the level of individual words. Instead, you can only delete at the
level of blocks, much like hard disk deletion. Unlike SRAM and DRAM, Flash memory is non-volatile
so that the contents can be stored permanently.
Drawbacks of Flash memory are that it has a limited lifetime (early Flash memory would only stand up to
about 1000 erasures for any given block before that block failed, today the number is at least 100,000
erasures) and much slower access times. It is estimated that Flash memory is at least 4 times slower than
SDRAM in reads and at least 10 times slower for writes, possibly much slower. But with the demise of
floppy disks, flash drives have become a preferred method for porting data as flash memories can store as
much as 8GB.
We now turn to virtual memory. The need for virtual memory arose because of limited main memory
sizes back in the 1960s. At that time, programmers would have to ensure that their programs were small
enough to fit into the available memory space (perhaps one or a few megabytes). If not, they wrote their
programs and used a technique called overlays. The idea behind an overlay is that the program would
reside on disk and when the first half of the program was done, the program would load the second half
and overlay it over the first half, and then branch back to the “top” of the program (which would be the
start of the second half). This had to be done carefully to ensure that memory addresses were correctly
reflected throughout the entire program’s execution. Also, if the program were to branch back to the first
half, the first overlay would need to be loaded back into memory. Notice that if a program were moved
from one computer to another, part of it would probably have to be rewritten to reflect the new
computer’s memory size.
The idea behind virtual memory is similar to overlays – program code is loaded into memory when
needed. However, virtual memory turns the various chores of loading program units and memory
mapping over to the operating system rather than the programmer. We see a lot of benefits for VM:
- allows programs to be larger in size than main memory size without having to resort to overlays
- allows multiprogramming/multitasking as we can now place parts of several processes in memory
- avoids memory fragmentation (described below)
- provides an easy mechanism for relocation of code
- allows quicker initial load time from disk for the program (because only part of the program is initially loaded)
Prior to VM, memory was allocated to processes in contiguous blocks. Consider a multiprogramming
system (an OS which would switch off between processes when one process had to perform I/O or wait
on another process) in which there are currently five processes plus the OS loaded in memory. Assume
process P2 terminates and can be removed; this creates a gap. If P6 is selected to run next, it fills only part
of that gap leaving a fragment. Next, if P4 terminates and P7 starts, we have another gap. P8 cannot fit
into memory as is, but if we were to compress the processes (move the fragments together), we would
have enough space for P8.
To avoid fragmentation and improve performance as described in the list above, VM was created. Now,
all programs are decomposed into fixed-size blocks called pages, and main memory is decomposed into
equal-size frames. One page fits exactly into one frame. The operating system moves some initial program
pages into available memory frames and keeps track of where things are through page tables. The page
table is merely a record of where each page has been inserted into memory if it is resident in memory, or
invalid because it is not in memory. There is one page table per running process. Below is an example
mapping a 4-page process into 3 frames in memory and 1 on disk. The area of disk is called the swap
space.
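Since the original figure is not reproduced in this transcription, here is a minimal C sketch of the same idea: a 4-page process with three pages resident in (made-up) frames and one page out on the swap space, and the lookup a virtual address goes through. The 4KB page size, frame numbers and field names are illustrative assumptions.

#include <stdio.h>

#define PAGE_SIZE 4096   /* assumed 4KB pages */
#define NUM_PAGES 4

/* One page table entry per page of the process. */
struct pte {
    int valid;        /* 1 = page is resident in a memory frame */
    unsigned frame;   /* frame number, meaningful only when valid */
};

/* 4-page process: pages 0-2 resident in frames 7, 2 and 5; page 3 out on swap space. */
static struct pte page_table[NUM_PAGES] = {
    { 1, 7 }, { 1, 2 }, { 1, 5 }, { 0, 0 }
};

int main(void) {
    unsigned vaddr = 2 * PAGE_SIZE + 123;   /* an address inside page 2 */
    unsigned page  = vaddr / PAGE_SIZE;
    unsigned off   = vaddr % PAGE_SIZE;

    if (page_table[page].valid)
        printf("vaddr %u -> frame %u, physical address %u\n",
               vaddr, page_table[page].frame, page_table[page].frame * PAGE_SIZE + off);
    else
        printf("vaddr %u -> page fault: page %u must be brought in from swap space\n", vaddr, page);
    return 0;
}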
VM has a lot of advantages as noted above. What are the costs?
1. We use some of the hard disk space for swap space (this is pretty negligible today now that our
hard disks are so large).
2. We need to implement some form of protection to ensure that one process does not use the space
allocated to another process. In fact, this was also needed with contiguous allocation but the
mechanism for that form of protection was quite simple.
3. We need to perform address translation (mapping) so that, for instance, we know to map a
memory location in page A into the frame starting at 16K.
4. Swapping, the process of moving a new page from swap space to memory, will greatly impact
processor performance. With contiguous allocation, all disk access was “up front” and once the
process started, we no longer needed disk access (except to deal with data files). Now, we will
have to access the disk often. We have to perform swapping whenever an address is requested
that is not currently in memory; this is known as a page fault.
The 4 questions that we asked regarding cache access (slide 4 of the powerpoint notes) are also applicable
for VM:
1. Where can a block be placed? Our answer here is much simpler than in cache where it was based
on the block address which, based on the type of cache, dictated where blocks were placed. We
can insert a page in any free frame. However, within that answer, we have to balance two
additional things. First, we will use a replacement strategy to select a frame if none are free.
Second, we might have an OS policy that limits the number of pages allowed per process.
For instance, if a process is only given 1024 pages and all are used, then any new page will have
to replace one of this process’ frames and not a frame from another process.
2. How is a block found? As with cache, we need to map from the virtual address to the physical
location. This can be done easily by just exchanging a page number for a frame number. We will
use a page table for this.
3. Which block should be replaced? Again, we need a replacement strategy. While in cache we
had to make this choice in hardware and quickly, here, because disk access is so slow, we want to
make a wise choice. So we use the OS to make the decision, and it will use either an LRU (least
recently used) algorithm or an approximation of LRU. Another factor is that if we choose a page
that is dirty (modified), we have to write it back to disk, slowing down the swapping process. So
our LRU policy may favor clean pages over dirty pages as a clean page can be discarded without
being written back to disk.
4. What happens on a write? There is no efficient way to implement a write through policy as the
disk access is just too time consuming, so we will always use write back and include a dirty bit in
our page table to indicate whether a page, when it is removed, must be written back to disk or not
(a sketch of this replacement/write-back decision follows this list).
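Here is the sketch promised above: a simplified, illustrative view (not the textbook's exact algorithm; the structure and field names are made up) of what the OS does on a page fault when it uses LRU with a preference for clean pages and a dirty bit to decide on write back:

#include <stdio.h>

#define NUM_FRAMES 3

struct frame {
    int page;        /* which page currently occupies this frame */
    int dirty;       /* 1 = modified since it was loaded */
    long last_used;  /* timestamp used for the LRU decision */
};

/* Prefer the least recently used clean frame; if every frame is dirty,
   fall back to plain LRU. */
static int choose_victim(struct frame frames[]) {
    int victim = -1;
    for (int i = 0; i < NUM_FRAMES; i++)
        if (!frames[i].dirty && (victim < 0 || frames[i].last_used < frames[victim].last_used))
            victim = i;
    if (victim >= 0)
        return victim;
    victim = 0;
    for (int i = 1; i < NUM_FRAMES; i++)
        if (frames[i].last_used < frames[victim].last_used)
            victim = i;
    return victim;
}

static void handle_page_fault(struct frame frames[], int new_page, long now) {
    int v = choose_victim(frames);
    if (frames[v].dirty)
        printf("write page %d back to swap space (write back)\n", frames[v].page);
    printf("load page %d into frame %d\n", new_page, v);
    frames[v].page = new_page;
    frames[v].dirty = 0;
    frames[v].last_used = now;
}

int main(void) {
    struct frame frames[NUM_FRAMES] = { {0, 1, 100}, {1, 0, 105}, {2, 1, 200} };
    handle_page_fault(frames, 3, 300);   /* the clean frame holding page 1 is chosen, so nothing is written back */
    return 0;
}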
Notice in the above figure how every memory access requires first accessing the page table (in memory).
Even if the item we seek is in cache, we must first translate the address from virtual to physical requiring
a page table access. So we will copy the most recently accessed parts of the page table into a cache called
the translation lookaside buffer (TLB). TLBs tend to be very small and direct-mapped. For instance, the
Opteron used a 40 entry TLB. We may want 2 TLBs, one for instruction access and one for data access.
Also, recall one of our cache optimizations was to avoid address translation. This can be done by storing
entries in L1 caches using virtual addresses and not physical addresses, which would then allow us to
only have to perform virtual memory mapping if we missed the L1 cache. It is also possible to use virtual
addresses for other caches, although this is more problematic. There are a number of drawbacks in storing
cache items by virtual addresses as discussed in pages B-36 – B-40 of the textbook.
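To make the role of the TLB concrete, here is a minimal C sketch (the sizes, the direct-mapped placement and the field names are illustrative assumptions; a real TLB is a hardware structure) of how address translation consults a small TLB first and accesses the in-memory page table only on a TLB miss:

#include <stdio.h>

#define PAGE_SIZE 4096   /* assumed 4KB pages */
#define TLB_SIZE  4      /* tiny for illustration; the Opteron example above holds 40 entries */

struct tlb_entry { int valid; unsigned page, frame; };
static struct tlb_entry tlb[TLB_SIZE];

/* Stand-in for the slow path: a page table access in main memory. */
static unsigned page_table_lookup(unsigned page) {
    printf("  TLB miss: accessing the page table for page %u\n", page);
    return page + 100;   /* made-up frame number */
}

static unsigned translate(unsigned vaddr) {
    unsigned page = vaddr / PAGE_SIZE, off = vaddr % PAGE_SIZE;
    unsigned slot = page % TLB_SIZE;   /* simple direct-mapped placement */
    if (tlb[slot].valid && tlb[slot].page == page)
        return tlb[slot].frame * PAGE_SIZE + off;   /* TLB hit: no page table access needed */
    unsigned frame = page_table_lookup(page);
    tlb[slot].valid = 1;
    tlb[slot].page  = page;
    tlb[slot].frame = frame;
    return frame * PAGE_SIZE + off;
}

int main(void) {
    printf("physical = %u\n", translate(2 * PAGE_SIZE + 8));   /* first access misses the TLB */
    printf("physical = %u\n", translate(2 * PAGE_SIZE + 16));  /* second access to the same page hits */
    return 0;
}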
About the only issues that need addressing in VM are the page size and how to implement protection. For
page sizes, notice that the size of a page has a direct impact on the size of a page table. Imagine a process
of 1MB and a page size of 1KB. This gives us 1K pages for this process, so our page table will have 1K
entries. If we use a page size of 4KB, we would only have 256 pages. So the larger the page size, the
smaller the page table. This is not much of an issue today because main memory sizes are definitely large
enough to store even large page tables of dozens of processes. However, the larger page size has a few
other significant impacts:
- Page transfer time is longer because there is more to transfer, so the miss penalty to swap space is greater. However, this also has a benefit: a larger transfer is worthwhile because you are already invested in accessing the disk. The slowest part of a disk access is relocating the read/write head to the proper location on disk; the actual disk read and data transfer take less time. So once you begin the lengthy access, spending a little more time to bring more from disk into memory might be worthwhile, as it is more efficient than two disk accesses.
- The larger the page size, the longer a process may go before a page fault because there is already more of the process in memory. This may lead to a lower memory miss rate. On the other hand, the larger the page size, the fewer pages can be kept in memory. Imagine that we have a policy that says that a process only gets 1/32 of memory. If memory is 4GB and the page size is 4KB, then there are 1M frames, so a process gets to move 32K pages into memory. But if the page size were 1KB, each process would get 128K pages instead.
- The larger the page size, the smaller the page table, which means a larger proportion of the page table can be kept in the TLB at any time. For instance, if a process has 1KB pages and the TLB stores only 40 entries, then only 4% of the process’ page information is in the TLB. If the page size were 4 times greater, the process would only have 256 pages and 16% of the process’ page information could be kept in the TLB.
- Larger page sizes mean larger frame sizes which in turn can permit caches to be both larger and faster.
For protection, we need to ensure that any process which generates an address generates a legal address.
This address must lie within the process’ own memory space or within shared space. Additionally, if space is shared,
we need proper synchronization techniques. This becomes complicated in the Pentium architecture which
used segmentation instead of paging. In paging, we just have to ensure that the page table is up-to-date.
Since only the OS can modify the page table, we are guaranteed that a process cannot change page table
information so that it now has pages that point to frames owned by other processes. In addition to this
restriction, processors may include additional bits in the page table to indicate a variety of information
such as:
o Valid – page is in memory
o Read/write – page is read only or readable/writable
o User/supervisor – page is owned by the user and accessible or page is owned by the OS
and has limited or no access to the user program
o Dirty – page has been modified
o Accessed – page has been accessed (whether read or write) in the recent past (used to
help decide if the page should be replaced or not in the near future)
o No execute – prevents code from executing on some pages, for instance if the page stores
data only
o Page level cache disable – can this page be cached?
o Page level write through – can this page, in cache, be written through or does it have to
use write back?
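As a rough illustration of how these flags live alongside the mapping itself, here is a sketch of a page table entry with one bit per flag, extending the minimal entry sketched earlier; the layout and names are simplified assumptions, not any particular processor's format:

struct pte {
    unsigned int valid           : 1;   /* page is in memory */
    unsigned int read_write      : 1;   /* 0 = read only, 1 = readable/writable */
    unsigned int user_supervisor : 1;   /* 0 = OS only, 1 = accessible to the user program */
    unsigned int dirty           : 1;   /* page has been modified */
    unsigned int accessed        : 1;   /* recently read or written */
    unsigned int no_execute      : 1;   /* code may not execute from this page */
    unsigned int cache_disable   : 1;   /* page level cache disable */
    unsigned int write_through   : 1;   /* page level write through vs. write back */
    unsigned frame;                     /* frame number the page maps to when valid */
};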
For more information on protection and also a discussion of virtual machines, read section 2.4. We will
skip the examination of two memory hierarchies, the ARM Cortex A-8 and the Intel Core i7 (2.6).
Sample problems:
Let’s examine pseudo-associativity as a solution to improve hit rate of a direct-mapped cache. Which
provides faster avg memory access time for 4KB and 256 KB caches: direct-mapped, 2-way associative
or pseudo-associative (PAC)? Assume hit time of 1 cycle for direct mapped, 1.36 for 2-way set
associative, and miss penalty of 50 cycles. The PAC may have two attempts at accesses, the first to the
address as generated by the CPU and a second on a miss by altering the address. Assume the second
address takes 3 cycles. We alter our formula for PAC because a miss does not necessarily accrue a 50
cycle penalty but instead a 3 cycle penalty if the item is in the other position in cache. We need two hit
rates: the normal hit rate and the hit rate of finding the item in the second position (we will call this the
alternative hit rate).
alternative hit rate = hit rate(2-way) - hit rate(1-way) = (1 - miss rate(2-way)) - (1 - miss rate(1-way))
                     = miss rate(1-way) - miss rate(2-way)
Avg mem access time PAC = 1 + (miss rate(1-way) - miss rate(2-way)) * 3 + miss rate(2-way) * miss penalty
PAC: 4 KB = 1 + (.098 - .076) * 3 + (.076 * 50) = 4.866
PAC: 256 KB = 1 + (.013 - .012) * 3 + (.012 * 50) = 1.603
Direct-mapped: 4 KB = 1 + .098 * 50 = 5.9
Direct-mapped: 256 KB = 1 + .013 * 50 = 1.65
2-way: 4 KB = 1.36 + .076 * 50 = 5.16
2-way: 256 KB = 1.36 + .012 * 50 = 1.96
So, pseudo-associative cache outperforms both!
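For convenience, here is a small C sketch that just plugs the quoted miss rates into the three formulas and reproduces the numbers above:

#include <stdio.h>

static double amat_direct(double miss1)            { return 1.0 + miss1 * 50.0; }
static double amat_2way(double miss2)              { return 1.36 + miss2 * 50.0; }
static double amat_pac(double miss1, double miss2) { return 1.0 + (miss1 - miss2) * 3.0 + miss2 * 50.0; }

int main(void) {
    /* miss rates: .098/.076 for the 4KB cache, .013/.012 for the 256KB cache */
    printf("4 KB:   DM %.3f   2-way %.3f   PAC %.3f\n",
           amat_direct(.098), amat_2way(.076), amat_pac(.098, .076));
    printf("256 KB: DM %.3f   2-way %.3f   PAC %.3f\n",
           amat_direct(.013), amat_2way(.012), amat_pac(.013, .012));
    return 0;
}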
Let’s look at the impact of pipelining cache access. The advantage is that it allows us to reduce clock
cycle time, the disadvantage is that with a shorter clock, cache misses have a larger impact. Compare the
MIPS 5-stage pipeline vs. the MIPS R4000 8-stage pipeline, assuming clock rates of 1 GHz for MIPS and
1.8 GHz for MIPS R4000, a main memory access time of 50 ns (we will assume no second level cache)
and a cache miss rate of 5%. Assuming no other source of stalls, which machine is faster?
First, we have to convert main memory access time into clock cycles as the two machines have two
different clock cycle rates.
MIPS miss penalty = 50 ns / (1 / 1) ns = 50 cycles
MIPS R4000 miss penalty = 50 ns / (1 / 1.8) ns = 90 cycles
CPU time MIPS = (1 + .05 * 50) * 1 ns = 3.5 ns per instruction
CPU time MIPS R4000 = (1 + .05 * 90) * (1 / 1.8) ns = 3.06 ns per instruction
So the gain of increased clock speed by pipelining cache accesses more than offsets the increased miss
penalty. To truly see if this is advantageous, we would also have to factor in the impact of structural
hazards and branch penalties. The longer the pipeline, the greater the impact is.
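The same comparison can be expressed as a short C sketch; the clock rates, miss rate and memory access time are the example's figures:

#include <stdio.h>

/* Time per instruction (ns) = (1 + miss rate * miss penalty in cycles) * cycle time,
   where the miss penalty in cycles is the 50 ns memory access divided by the cycle time. */
static double time_per_instr(double clock_ghz, double mem_ns, double miss_rate) {
    double cycle_ns = 1.0 / clock_ghz;
    double penalty_cycles = mem_ns / cycle_ns;
    return (1.0 + miss_rate * penalty_cycles) * cycle_ns;
}

int main(void) {
    printf("MIPS (1 GHz):         %.2f ns per instruction\n", time_per_instr(1.0, 50.0, 0.05));  /* 3.50 */
    printf("MIPS R4000 (1.8 GHz): %.2f ns per instruction\n", time_per_instr(1.8, 50.0, 0.05));  /* 3.06 */
    return 0;
}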
Another form of compiler optimization to support cache access is to merge loops. Consider the following
loop: for(i=0;i<n;i++) a[i]=b[i]*c[i]; If we assume n is a fairly large value, then it is likely that we
would be loading a[i], b[i] and c[i] into cache in blocks, but as i increases, we would be discarding
previous blocks of the three arrays. For instance, let’s assume a data cache of 256 blocks where each
block stores 16 words. We will also assume that a, b, c are doubles and that n=1024. Thus, the memory
space required for all of a, b, and c is 1024 * 8 * 3 bytes = 24KB. Our cache is 256 * 16 * 4 = 16 KB.
Half of our cache would be replaced before the loop terminates. Now this problem is further exacerbated
if we have a later loop like this: for(i=0;i<n;i++) d[i]=b[i]+c[i]; because we will have already discarded
half of b and c (at least). So a compiler might rearrange the code:
for(i=0;i<n;i++)
    a[i]=b[i]*c[i];
for(i=0;i<n;i++)
    d[i]=b[i]+c[i];
into
for(i=0;i<n;i++) {
    a[i]=b[i]*c[i];
    d[i]=b[i]+c[i];
}
Can you figure out the improvement in terms of the number of cache misses that we would remove
assuming that none of a, b, c or d are in cache initially and assuming no other data is stored in the data
cache?
Assume memory is organized as follows:
– two L1 caches (one data, one instruction)
– one L2 cache
– main memory
– disk cache
– disk (swap space)
Assume miss rates and access times of
– data cache: 5%, 1 clock cycle
– instruction cache: 1%, 1 clock cycle
– L2 cache: 10%, 10 clock cycles
– main memory: 0.2%, 100 clock cycles
– disk cache: 20%, 1000 clock cycles
– swap space: 0%, 250000 clock cycles
If 40% of all instructions are loads or stores, what is the effective memory access time for this machine?
Average memory access time =
%instruction * (hit time instr cache + miss rate instr cache * (hit time second level cache + miss rate second level cache * (hit time main memory + miss rate main memory * (hit time disk cache + miss rate disk cache * hit time disk))))
+ %data * (hit time data cache + miss rate data cache * (hit time second level cache + miss rate second level cache * (hit time main memory + miss rate main memory * (hit time disk cache + miss rate disk cache * hit time disk))))
With 1.4 memory accesses per instruction, the % of instruction accesses = 1.0 / 1.4 = 71.4%, and % of
data accesses is 0.4 / 1.4 = 28.6%
Average memory access time = 71.4% * (1 + .01 * (10 + .10 * (100 + .002 * (1000 + .20 * 250000)))) +
28.6% * (1 + .05 * (10 + .10 * (100 + .002 * (1000 + .20 * 250000)))) = 1.647.
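A short C sketch of the same calculation, nesting each level's miss rate times the cost of going one level further down, reproduces the 1.647 figure:

#include <stdio.h>

int main(void) {
    /* cost of going below L1, common to both instruction and data accesses */
    double disk       = 250000.0;                  /* swap space, 0% miss */
    double disk_cache = 1000.0 + 0.20 * disk;      /* 20% of disk cache accesses go on to disk */
    double memory     = 100.0 + 0.002 * disk_cache;
    double l2         = 10.0 + 0.10 * memory;

    double instr = 1.0 + 0.01 * l2;                /* instruction cache: 1% miss */
    double data  = 1.0 + 0.05 * l2;                /* data cache: 5% miss */
    double frac_instr = 1.0 / 1.4, frac_data = 0.4 / 1.4;

    printf("effective memory access time = %.3f cycles\n",
           frac_instr * instr + frac_data * data);   /* about 1.647 */
    return 0;
}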