Rechnerarchitektur 1 Content - Computer Architecture Group

Transcription

Vorlesung Rechnerarchitektur
Seite 1
Rechnerarchitektur 1
Content
1) Introduction, Technology, von Neumann Architecture
2) Pipeline Principle
a) Dependencies
b) Predication
3) Memory Hierarchy
a) Registers
b) Caches
c) Main Memory
d) Memory Management Unit (MMU)
4) Processor Architectures
a) Complex Instruction Set Computer (CISC)
Microprogramming
b) Reduced Instruction Set Computer (RISC)
c) Instruction Set Architecture x86
5) Software Hardware Interface
a) Instruction Level Parallelism
b) Processes, Threads
c) Synchronization
6) Simple Interconnections (Bus)
7) Modern Processor Architectures
a) Superscalar Processors
b) Very Long Instruction Word Machine (VLIW, EPIC)
8) I/O Systems
a) Processor, System & Peripheral Buses
b) I/O Components
c) Device Structure
d) DMA Controller
Vorlesung Rechnerarchitektur
Seite 2
Rechnerarchitektur 1
Literature
Dezsö Sima, Terence Fountain, Péter Kacsuk
Advanced Computer Architectures: A Design Space Approach
Addison-Wesley, 1997

Hennessy, John L.; Patterson, David A. (standard reference)
Computer Architecture: A Quantitative Approach
2nd ed., 1995, hardback
Morgan Kaufmann
ISBN 1-55860-329-8

Giloi, Wolfgang K. (recommended overview, unfortunately out of print at present)
Rechnerarchitektur
2nd, completely revised ed., 1993
Springer-Verlag
ISBN 3-540-56355-5

Hwang, Kai (advanced, much material on parallel architectures)
Advanced Computer Architecture: Parallelism, Scalability and Programmability
1993, 672 pp., paperback
McGraw-Hill ISE
ISBN 0-07-113342-9

Patterson, David A.; Hennessy, John L. (easy introduction)
Computer Organization and Design: The Hardware/Software Interface
2nd ed., 1994, hardback
Morgan Kaufmann
ISBN 1-55860-281-X

Tanenbaum, Andrew S. (optional, standard reference for operating systems)
Modern Operating Systems
1992, 1055 pp., paperback
Prentice Hall US
ISBN 0-13-595752-4
Vorlesung Rechnerarchitektur
Seite 3
History of Computer Generations
First generation (1945-54)
Technology & architecture: vacuum tubes and relay memories; CPU driven by PC and accumulator; fixed-point arithmetic
Software & applications: machine/assembly languages, single user, no subroutine linkage, programmed I/O using CPU
Systems: ENIAC, Princeton IAS, IBM 701

Second generation (1955-64)
Technology & architecture: discrete transistors and core memories; floating-point arithmetic; I/O processors; multiplexed memory access
Software & applications: HLL used with compilers, subroutine libraries, batch processing monitor
Systems: IBM 7090, CDC 1604, Univac LARC

Third generation (1965-74)
Technology & architecture: integrated circuits (S/MSI); microprogramming; pipelining; cache; lookahead processors
Software & applications: multiprogramming and timesharing OS, multiuser applications
Systems: IBM 360/70, CDC 6600, TI-ASC, PDP-8

Fourth generation (1975-90)
Technology & architecture: LSI/VLSI and semiconductor memory; multiprocessors; vector supercomputers; multicomputers
Software & applications: multiprocessor OS, languages, compilers and environments for parallel processing
Systems: VAX 9000, Cray X-MP, IBM 3090

Fifth generation (1991-96)
Technology & architecture: ULSI/VHSIC processors, memory and switches; high-density packaging
Software & applications: massively parallel processing, grand challenge applications
Systems: Cray MPP, CM-5, Intel Paragon

Sixth generation (present)
Technology & architecture: scalable off-the-shelf architectures, workstation clusters, high-speed interconnection networks
Software & applications: heterogeneous processing, fine-grain message transfer
Systems: Gigabit Ethernet, Myrinet
[Hwang, Advanced Computer Architecture]
Vorlesung Rechnerarchitektur
Seite 4
Technology
Introduction
First integrated microprocessor: Intel 4004 (1971)
approx. 2300 transistors
Vorlesung Rechnerarchitektur
Seite 5
Technology
Introduction
year 1998 microprocessor
PowerPC 620
7.5 million transistors
Vorlesung Rechnerarchitektur
Seite 6
Technology
Introduction
year 2003 microprocessor
[AMD]
AMD Opteron
~106 million transistors
Vorlesung Rechnerarchitektur
Seite 7
Technology
Technology forecast
The 19th April 1965 issue of Electronics magazine contained an article with the title
"Cramming more components onto integrated circuits". Its author, Gordon E. Moore, Director of Research and Development at Fairchild Semiconductor, had been asked to predict what
would happen over the next ten years in the semiconductor component industry. His article
speculated that by 1975 it would be possible to cram as many as 65 000 components onto a
single silicon chip of about 6 mm2. [Spectrum, June ’97]
Moore based his forecast on a log-linear plot of device complexity over time.
The complexity for minimum component costs has increased at a rate of
roughly a factor of two per year. Certainly over the short term this rate can
be expected to continue, if not to increase. Over the longer term, the rate of
increase is a bit more uncertain, although there is no reason to believe it will
not remain constant for at least 10 years.
Moore’s astonishing prediction was based on empirical data from just three Fairchild data
points:
• first planar transistor in 1959
• a few ICs in 1960, with ICs with 32 components in production in 1964
• IC with 64 components in 1965
In 1975 he revised the slope of his function to a doubling of the transistor count every 18 months, and
this has held true until today.
[Figure: number of components per integrated function (log2) versus year, 1959-1985, showing Moore's log-linear extrapolation through the few early Fairchild data points.]
Vorlesung Rechnerarchitektur
Seite 8
Technology
Andrew (Andy) Grove
Robert Noyce
Gordon Moore
Vorlesung Rechnerarchitektur
Seite 9
Technology
Cost reduction due to mass fabrication
red curve for logic chips
(no memory chips)
[Figure: cost per on-chip transistor (log10 US$) versus year, 1959-2002; values from Fairchild (Gordon Moore), falling from roughly 10^-1 US$ to ~6.6 * 10^-6 US$ per transistor (4.5 million transistors for 30 US$) by 2002.]
year 2002: 4.5 million transistors in 0.18 μm CMOS technology on a 5x5 mm die with BGA package for ~30 US$ = 6.6 * 10^-6 US$ per transistor; standard cell design
Vorlesung Rechnerarchitektur
Seite 10
Technology
Modern Chip Technology
Using copper for the 6-layer metal interconnect structure of a CMOS chip delivers lower resistance of the wires and thus increases the performance.
Gate structures can be found in the lower right part of the picture (arrow).
Vorlesung Rechnerarchitektur
Seite 11
ATOLL - ATOmic Low Latency
- approx. 5 million transistors (1.4 million logic gates)
- die area: 5.78 mm x 5.78 mm
- 0.18 μm CMOS (UMC, Taiwan)
- 385 staggered I/O pads
- 3 years of development
- year 2002
Vorlesung Rechnerarchitektur
Seite 12
Speed Trends
Processor, Memory and I/O Speed Trends
[Figure: speed (log MHz) versus year, 1990-2002. Internal CPU clocks (i860, MC68030, DEC Alpha, Pentium III, 2.2 GHz Pentium 4; the first 1 GHz CPU appeared in research around 2000) rise far faster than external bus clocks (PCI, PCI-X 133 MHz, PC133, PC266) and DRAM access rates (1/tacc: ~60 ns random, ~25-40 ns in-page for SDRAM), opening an ever-growing Processor-Memory Gap and an even larger Processor-I/O Gap.]
The steady increase in the computing power of modern processors leads to an ever-widening gap between the processing speed of the processor and the
access speed of main memory. The memory capacity of DRAMs grows by a factor of 4 from generation to
generation; because of the 4-fold number of memory cells this yields only small speed gains, despite the shrinking VLSI structures.
(not included: new DRAM technology RAMBUS)
Vorlesung Rechnerarchitektur
Seite 13
Speed Trends
Processor, Memory and I/O Speed Trends
Results
- a high Processor-Memory Performance Gap (increasing)
- a very high Processor-I/O Performance Gap (increasing)
- Single Processor Performance will increase
- Number of processors in SMPs will increase moderately
Bus Systems will become the main bottleneck
Solution
Transition to "switched" Systems on all system levels
- Memory
- I/O
- Network
Switched System Interconnect
Vorlesung Rechnerarchitektur
Seite 14
What is Computer Architecture?
Rechnerarchitektur <- ’computer architecture’
The types of architecture are established not by architects but by society, according to the
needs of the different institutions. Society sets the goals and assigns to the architect the job
of finding the means of achieving them.
Correspondingly, in the art of building one distinguishes types of architectures according to their purpose .....
The work of the computer architect:
- finding a type of architecture that fulfils the given purpose
- the architect must satisfy certain requirements:
- performance
- cost
- fault tolerance
- extensibility
Materials of the building architect: wood, stone, concrete, glass ...
Materials of the computer architect: integrated semiconductor devices ...
The following components constitute the essential hardware resources of a computer:
processors, memories, interconnection facilities
operational principle (rule for assembling the components)
+
structural arrangement (structure of the arrangement of the components)
= computer architecture
Vorlesung Rechnerarchitektur
Seite 15
Constituents of a Computer Architecture
A computer architecture is determined by an operational principle for the hardware and the
structure of its construction from the individual hardware resources. [Giloi 93]
The operational principle
The operational principle defines the functional behaviour of the architecture by specifying an information structure and a control structure.
The (hardware) structure
The structure of a computer architecture is given by the type and number of the hardware resources as well as the communication facilities connecting them.
- Control structure: the control structure of a computer architecture is determined by
the specification of the algorithms for the interpretation and transformation of the
information components of the machine.
- Information structure: the information structure of a computer architecture is determined
by the types of the information components in the machine, the
representation of these components, and the operations applicable to them.
The information structure can be specified as a set of ’abstract’ data types.
- Hardware resources: the hardware resources of a computer architecture
are processors, memories, interconnection facilities, and peripherals.
- Communication facilities: communication facilities are the interconnection
facilities, e.g. buses, channels, interconnection networks, and the protocols that
define the rules of communication between the hardware resources.
Vorlesung Rechnerarchitektur
Seite 16
von Neumann Architecture
Structure:
Describes an abstract machine of minimal hardware effort, consisting of:
• a central processing unit (CPU), which is divided into a data processor and an instruction processor
• a memory for data and instructions
• an input/output processor for attaching peripheral devices
• an interconnection facility (bus) between these components
[Block diagram: the instruction processor exchanges commands & status information with the data processor and the I/O processor; instructions flow from the data & instruction memory to the instruction processor, while data flows over the bus between the memory, the data processor, and the I/O processor.]
Vorlesung Rechnerarchitektur
Seite 17
von Neumann Architecture
Processing of instructions:
Program start
1. Fetch the first instruction from memory.
2. Move the instruction into the instruction register.
3. Perform any address modifications and, if required, evaluate further fields of the instruction.
4. Fetch operands from memory if necessary.
5. Translate the operation code into control directives.
6. Execute the operation; increment the program counter by 1, or insert the branch target address.
7. End of program? If yes: stop. If no: fetch the next instruction from memory and continue with step 2.
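The same instruction cycle can be sketched in C; this is only an illustrative model (instruction set, memory size and field layout are assumptions, not part of the slides):

#include <stdint.h>
#include <stdbool.h>

/* Illustrative model of the von Neumann instruction cycle (sketch only). */
enum { OP_HALT = 0, OP_ADD = 1, OP_JMP = 2 };

uint32_t memory[1024];          /* common memory for data and instructions */
uint32_t acc;                   /* accumulator                             */
uint32_t pc;                    /* program counter (Befehlszaehler)        */

void run(void)
{
    bool running = true;
    while (running) {
        uint32_t instr  = memory[pc];        /* 1. fetch instruction         */
        uint32_t opcode = instr >> 24;       /* 2./5. decode operation code  */
        uint32_t addr   = instr & 0xFFFFFF;  /* 3. evaluate the address field*/
        switch (opcode) {
        case OP_ADD:  acc += memory[addr];   /* 4./6. fetch operand, execute */
                      pc  += 1;              /*       increment the PC       */
                      break;
        case OP_JMP:  pc = addr;             /* 6. insert branch target      */
                      break;
        case OP_HALT:                        /* 7. end of program            */
        default:      running = false;
                      break;
        }
    }
}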
Vorlesung Rechnerarchitektur
Seite 18
von Neumann Computer
(Burks, Goldstine, von Neumann)
The architecture of minimal hardware effort!
The high cost of computer hardware in the early days of computer development forced the smallest possible hardware effort.
- fetch instruction
- fetch operands (2x)
- interpret instruction
- execute instruction
- index operation, address calculation
- arithmetic operation (large effort in time + hardware)
If the arithmetic operation is the dominant effort, the other substeps
can be executed sequentially without greatly lengthening the processing time.
Time-sequential execution of the instructions on minimal hardware resources.
Changed boundary conditions:
1. Reduction of hardware costs
(high integration, mass production)
2. The effort within the individual operations has shifted.
Wanted:
Ideas to achieve, with additional hardware, performance gains
that were previously impossible.
Ideas = operational principles
Vorlesung Rechnerarchitektur
Seite 19
Operational Principle - Pipeline
The operational principle is the functional behaviour of the architecture, which is based on the underlying information and control structure.
Processing of multiple data elements with only one instruction:
- pipeline principle -> vector computer
- parallel processing elements (PEs) -> array computer (’array of processing elements’)
Pipeline principle
Example: automobile manufacturing
Components of a ’very’ simple car:
- body
- paint
- chassis
- engine
- wheels
Vorlesung Rechnerarchitektur
Seite 20
Operational Principle - Pipeline
Example: automobile manufacturing
Which ways of assembling are there?
- assembly line (PIPELINE)
- workgroup
What has to be taken care of? Dependencies!
Assembly line:
sheet metal -> body assembly -> paint shop -> chassis installation -> engine installation -> mounting of the wheels
(the stages are supplied with: body, paint, chassis, engine, wheels)
Production of different models (e.g. order 41):
3 colours: R(ed), G(reen), B(lue)
2 engines: N(ormal), I(njection)
2 bodies: L (sedan), K (estate)
Vorlesung Rechnerarchitektur
Seite 21
Operational Principle - Pipeline
Pipeline of the manufacturing process: the painting stage takes 20 min, while the other stages (body, chassis, engine, wheels) take 10 min each, so the slowest stage determines the rate.
Optimization of the painting stage: splitting it into two substages L1 and L2 of 10 min each balances the pipeline.
[Stage-time diagram of the pipeline: orders 41, 42, 43 flow through the stages K, L1, L2, F, M, R in successive time steps, with the pipeline filling, processing, and draining.]
Vorlesung Rechnerarchitektur
Seite 22
Pipeline - Register
By a register we mean a hardware structure that can store one or more bits. Registers are (normally) D flip-flops (see PI2).
[Figure: a 32-bit register built from D flip-flops, input D, output Q, clocked by clk.]
Important characteristics of a register:
Clock-to-output time (tco):
time between the clock edge and the instant at which a change of the input becomes visible at the output.
Setup time (tsu):
time before the clock edge during which the input value must already be stable (must not change any more; reason -> digital circuit design, keyword: metastable states).
Hold time (th):
time after the clock edge during which the input value must not yet change (reason as for tsu).
[Timing diagram: within one cycle time tcyc, the data at input D must be stable for tsu before and th after the clock edge; the output Q becomes valid with the new data tco after the edge.]
Vorlesung Rechnerarchitektur
Seite 23
Pipelining
The performance gain achieved by pipelining is accomplished by partitioning an operation
F into multiple suboperations f1 to fk and overlapping the execution of the suboperations of
multiple operations F1 to Fn in successive stages of the pipeline [Ram77]
Assumptions for Pipelining
1 the operation F can be partitioned
2 all suboperations fi require approximately the same amount of time
3 there are several operations F the execution of which can be overlapped
Technology requirement for Pipelining
4 the execution time of the suboperations is long compared with the register delay time
Linear Pipeline with k Stages
[Figure: instructions & operands enter stage f1 and pass through stages f2, f3, ..., fk; the result(s) leave the last stage; F denotes the complete operation.]
[RAM77] Ramamoorthy, C.V., Pipeline Architecture, in: Computing Surveys, Vol. 9, No. 1, 1977, pp. 61-102.
Vorlesung Rechnerarchitektur
Seite 24
Pipelined Operation Time
tp(n,k) = k + (n-1)
(k cycles to fill the pipeline, (n-1) cycles to process the remaining operations)
for this example: tp(10,5) = 5 + (10-1) = 14
[Stage-time diagram: a 5-stage pipeline processing n operations passes through the phases start-up (fill), processing, and drain; total time k + (n-1).]
Durchsatz / Throughput
TP(n,k) = n / tp(n,k)    [operations per time unit; determined by the initiation rate and latency]
Gewinn / Gain (Speedup)
S(n,k) = scalar execution time / pipelined execution time = n*k / (k + (n-1))
lim (n -> infinity) S(n,k) = k
Effizienz / Efficiency
E(n,k) = S(n,k) / k = n / (k + (n-1))
Pipeline Interrupts
- data dependencies
- control-flow dependencies
- resource dependencies
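As a quick illustration, a minimal C sketch (not part of the slides) that evaluates these formulas for the example values n = 10, k = 5:

#include <stdio.h>

/* Pipeline timing formulas from the slide above:
   tp(n,k) = k + (n-1), TP = n/tp, S = n*k/(k+n-1), E = S/k */
int main(void)
{
    int n = 10, k = 5;                      /* example values from the slide */
    double tp = k + (n - 1);                /* fill time + processing time   */
    double TP = n / tp;                     /* throughput [operations/cycle] */
    double S  = (double)(n * k) / tp;       /* speedup over scalar execution */
    double E  = S / k;                      /* efficiency, approaches 1      */

    printf("tp=%.0f cycles, TP=%.3f ops/cycle, S=%.3f, E=%.3f\n", tp, TP, S, E);
    return 0;
}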
Vorlesung Rechnerarchitektur
Seite 25
Assumptions for Pipelining
1 the operation F can be partitioned
[Figure: an operation F taking time tf is split into suboperations f1 and f2 of about tf/2 each, so that two operations can be in flight at the same time.]
2 all suboperations fi require approximately the same amount of time
[Figure: if f1 << f2 and f3 << f2, the long suboperation f2 dominates the cycle time. Version 1: split f2 into substages f2a, f2b, f2c of about t2/3 each. Version 2: replicate f2 so that several instances of f2 work in parallel between f1 and f3.]
Vorlesung Rechnerarchitektur
Seite 26
Assumptions for Pipelining
3
there are several operations F
the execution of which can be overlapped
If there is a discontinuous stream of operations F owing to a conflict, bubbles are inserted
into the pipeline. This reduces the performance gain significantly.
A typical example of this is the control dependency of the instruction pipeline of a processor.
Here, each conditional branch instruction may disrupt the instruction stream and cause (k-1)
bubbles (no-operations) in the pipeline, if the control flow is directed to the nonpredicted
path.
4
the execution time of the suboperations is long
compared with the register delay time
Item 4 is a technological requirement for the utilization of a pipeline. Assuming a partitioning of the operation F into three suboperations f1, f2, f3, and also no pipelining, the operation F can be executed in the time:
t (F) = tf1 + tf2 + tf3
Introduction of registers
[Figure: registers are inserted after each suboperation, forming stage 1 (tf1), stage 2 (tf2) and stage 3 (tf3); each register adds tco at its output and requires tsu at its input, and all stages are driven by the common clock.]
With pipeline registers the cycle time becomes
tcyc = max(tfi) + tco + tsu          fcyc = 1 / tcyc
and one operation F now needs k cycles (here k = 3):
t(F) = k * ( max(tfi) + tco + tsu )
where (tco + tsu) is the register delay time.
The registers are introduced behind each function of the suboperation and this creates the
pipeline stages. Placing the register at the output (not at the input!!!) makes suboperation stages compatible with the definition of state machines, which are used to control the pipeline.
Vorlesung Rechnerarchitektur
Seite 27
Data Flow Dependencies
Three different kinds of data flow dependency hazards can occur between two instructions.
These data dependencies can be distinguished by the definition-use relation.
- flow dependency: read after write (RAW)
- anti-dependency: write after read (WAR)
- output dependency: write after write (WAW)
Basically, the only real dependencies in a program are the data flow dependencies, which
define the logical sequence of operations transforming a set of input values to a set of output
values (see also Data Flow Architectures). Changing the order of instructions must respect
these data flow dependencies to keep the semantics of a program.
The Flow Dependency
read after write (RAW)
This data dependency hazard can occur e.g. in a pipelined processor where values for a second instruction i+1 are needed in the earlier stages and have not yet been computed by the
later stages of instruction i. The hazardous condition arises if the dependent instructions are
closer together in the pipeline than the distance between the conflicting stages.
definition-use relation: destination(i) = source(i+1)
X <- A + B    instruction i
Y <- X + C    instruction i+1
[Stage-time diagram: instruction i reads A and B from the register file (2 read ports), executes X := A op B in the ALU, and writes X back (1 write port); instruction i+1, issued one cycle later, already wants to read X and C before X has been written back.]
To avoid the hazard, the two instructions must be separated in time within the pipeline. This
can be accomplished by inserting bubbles (NOPs) into the pipeline (simply speaking: by
waiting for i to complete) or by special hardware (a hardware interlock), which inserts NOPs automatically.
Vorlesung Rechnerarchitektur
Seite 28
Data Flow Dependencies
The Anti-Dependency
write after read (WAR)
This dependency can only arise in the case of out-of-order instruction issue or out-of-order
completion. The write back phase of instruction i+1 may be earlier than the read phase of
instruction i. Typical high-performance processors do not use out-of-order issue. This case,
then, is of less importance. If the compiler reorders instructions, this reordering must preserve the semantics of the program and take care of such data dependencies.
source ( i ) = destination ( i + 1 )
X <- Y + B
instruction i
Y <- A + C
instruction i+1
The Output Dependency write after write (WAW)
The result y of the first instruction i will be written back to the register file later than y of
instruction i+1 because of a longer pipeline for the division. This reverses the sequence of
writes i and i+1 and is called out-of-order completion.
destination ( i ) = destination ( i + 1 )
Y <- A / B
instruction i
Y <- C + D
instruction i+1
[Stage-time diagram: instruction i (Y <- A / B) occupies the long functional unit FU1 for several execute cycles (1. A op B, 2. A op B, 3. A op B), while instruction i+1 (Y <- C + D) uses the short unit FU2 and writes Y first; the later write of i then overwrites the newer value - out-of-order completion.]
Vorlesung Rechnerarchitektur
Seite 29
Data Flow Dependencies
Inserting Bubbles
We must avoid the RAW hazard in the pipeline, because reading X before it is written back
by the previous instruction reads an old value of X and thus destroys the semantics of the
two sequential instructions. The sequential execution model (von Neumann architecture) assumes that the previous instruction is completed before execution advances to the next
instruction.
[Stage-time diagram: the conflicting stages are RF read and RF write, 3 pipeline stages apart. Issuing two NOPs (w1, w2) after instruction i delays the read of X and C by instruction i+1 until after the write of X by instruction i.]
The conflicting stages are RF read and RF write, which are 3 pipeline stages apart. Reading the correct value for X from the register file requires the insertion of 2 NOPs
between the two conflicting instructions. This delays instruction i+1 by 2 clocks, which
then removes the RAW hazard. The compiler (or programmer) is responsible for detecting
the hazard and inserting NOPs into the instruction stream.
The 2 bubbles in the pipeline can be utilized by other useful instructions independent of
i and i+1 (see instruction scheduling).
Vorlesung Rechnerarchitektur
Seite 30
Data Flow Dependencies
Hardware Interlock
A hardware mechanism for detecting and avoiding the RAW-Hazard is the interlock hardware. The RAW-Hazard condition is detected by comparing the source fields of instruction
i+1 (and i+2) with the destination field of instruction i, until i is written back to the register
file and thus the instruction is completely executed.
[Stage-time diagrams: without a delay, i+1 would read X too early. With the interlock, the issue check before the issue point compares the source registers of i+1 (and i+2) with the destination register of i; the issue of i+1 is delayed (bubbles in the pipeline) until i has written X back, while instruction fetch and decode of the following instructions stall.]
The hardware interlock detects the RAW hazard and delays the issue of instruction i+1 until
the write back of instruction i is completed.
The hardware mechanism does not need additional NOPs in the instruction stream; the
bubbles are inserted by the hardware.
Nevertheless, the produced bubbles can be avoided by scheduling useful and independent instructions in between i and i+1 (see also instruction scheduling).
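The comparison described above can be sketched in C; the register numbers and structure fields are assumptions chosen only for the illustration:

/* Minimal sketch (not from the slides) of the interlock check performed
   before the issue point: compare the source registers of the next
   instruction with the destinations of instructions still in flight. */
#include <stdbool.h>

typedef struct { int dest, src1, src2; } Instr;   /* register numbers */

/* true -> RAW hazard, the issue of 'next' must be delayed (bubble) */
bool must_stall(const Instr *next, const Instr *in_flight, int num_in_flight)
{
    for (int j = 0; j < num_in_flight; j++) {
        int d = in_flight[j].dest;
        if (d == next->src1 || d == next->src2)
            return true;                          /* operand not yet written back */
    }
    return false;
}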
Vorlesung Rechnerarchitektur
Seite 31
Data Forwarding
Data forwarding
Data forwarding is a hardware structure which helps to reduce the number of bubbles in the
pipeline. As can be seen in the stage-time diagram, the result of a calculation in the execute
stage is ready just behind the ALU register. Using this value directly as the next source operand, instead of waiting for the write-back stage and then reading it again from the RF, saves
the two bubbles.
[Figure and stage-time diagram: a data forwarding path leads from the ALU output register (R) back to forwarding multiplexers in front of the ALU inputs (S1, S2), controlled by the forwarding control; a separate load data path brings data from cache/memory. With forwarding, instruction i+1 can execute Y := X op C in the cycle directly after i computed X := A op B - no bubble in the pipeline.]
The forwarding data path takes the result from the output of the execute stage and sends it
directly to the input of the ALU. There, a data forwarding multiplexer is switched to the forwarding path in order to replace the invalid source operand read from the register file.
The forwarding control logic detects that a specific register (e.g. R7) is under new definition
and at the same time is used in the following instruction as a source operand. In this case,
the corresponding mux is switched to replace S1 or S2, which would otherwise hold the old value of register R7.
The scoreboard logic implements this checking in the stage before the issue point.
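A hedged C sketch of the forwarding decision for one source operand (names and arguments are illustrative only, not from the slides):

/* Selecting between the register-file value and the forwarded ALU result.
   'ex_dest'/'ex_result' describe the instruction currently in the EX stage. */
#include <stdint.h>

uint32_t select_operand(int src_reg, uint32_t rf_value,
                        int ex_dest, uint32_t ex_result, int ex_valid)
{
    /* forwarding mux: if the operand register is being redefined by the
       instruction in EX, take the value from the forwarding path */
    if (ex_valid && ex_dest == src_reg)
        return ex_result;
    return rf_value;     /* otherwise the register-file read is correct */
}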
Vorlesung Rechnerarchitektur
Seite 32
Control Flow Dependencies
The use of pipelining causes difficulties with respect to control-flow dependencies. Every
change of control flow potentially interrupts the smooth execution of a pipeline and may produce bubbles in it. One of the following instructions may be responsible for them:
- conditional branch instruction
- jump instruction
- jump to subroutine
- return from subroutine
The bubbles (no operations) in the pipeline reduce the gain of the pipeline thus reducing performance significantly. There are two causes of bubbles. The first one is a data dependency
in the pipeline itself, the branch condition being evaluated a pipeline stage later than needed
by the instruction fetch stage to determine the correct instruction path. The second one is the
latency for accessing the instruction and the new destination for the instruction stream.
Pipeline Utilization
U = n / (n + m) = 1 / (1 + m/n)
with m = b (1-p) Nb + b q No
n    number of useful instructions
m    number of no-operations (bubbles)
b    number of branches
p    probability of correct guesses
Nb   penalty for a wrong guess (no-ops caused by branches)
q    frequency of other causes (jumps)
No   penalty for other causes (no-ops caused by the latency of branch target fetches)
U = 1 / ( 1 + (b/n) (1-p) Nb + (b/n) q No )
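A small worked example (the values are chosen freely for illustration): with n = 100 useful instructions, b = 20 branches, p = 0.8, Nb = 2, q = 0.05 and No = 4, the number of bubbles is m = 20 * 0.2 * 2 + 20 * 0.05 * 4 = 8 + 4 = 12, so the utilization drops to U = 100 / 112, which is approximately 0.89.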
Vorlesung Rechnerarchitektur
Seite 33
Control Flow Dependencies
Reduction of Branch Penalty
The ’no’ operations (NOPs) in the pipeline can be reduced by the following techniques:
reduction of branch penalty Nb
- forwarding of the condition code
- fast compare
- use of delay slots
- delayed branch
- delayed branch with squashing
increase of p
- static branch prediction
- dynamic branch prediction
- profiled branch prediction
reduction of instruction fetch penalty No
- instruction cache
- improvement of instruction prefetch
- branch target buffer
- alternate path following
avoid branches
- predication of instructions
Branch architectures
The design of control flow and especially conditional branches in ISAs is very complex. The
design space is large and advantages and disadvantages of various branch architectures are
difficult to analyze. The following discussion of branches shows only a small area of the design space.
For further reading please refer to:
Sima et al., Advanced Computer Architectures, A Design Space Approach
Vorlesung Rechnerarchitektur
Seite 34
Control Flow Dependencies
Effect of Branches
Before we start to analyse the effects of branching, we should introduce the basic terms and
have a closer look at the internals of the instruction fetch unit.
Definition:
Branch successor: The next sequential instruction after a (conditional) branch is named the branch successor or fall-through instruction.
In this case, it is assumed the branch is not taken. The operation of fetching the next instruction in the instruction stream requires adding the distance from the actual instruction to the
next instruction to the value of the actual program counter (PC).
Definition:
Branch target: The instruction to be executed after a branch is taken
is called the branch target.
The operation required to take the branch is to add the branch offset from the instruction
word to the PC and then fetch the branch target instruction from the instruction memory.
The pipeline cycles wasted between a branch taken and the resumption of instruction execution at the branch target are called delay slots. Depending on the pipeline, the number of delay slots may vary between 0 and k-1.
In the following stage-time diagram we assume that the instruction fetch can immediately
continue when the branch direction is determined by the selection of the condition code, which
is forwarded to the instruction fetch stage.
[Stage-time diagram: after the cmp (i-1) and the branch ’cc’ (i) are fetched, delay slots 1-3 pass before the compare result (condition codes) and the CC selection are forwarded back to the fetch stage; only then can the correct next instructions - branch successor or branch target - be fetched.]
Vorlesung Rechnerarchitektur
Seite 35
Control Flow Dependencies
Effect of Branches
In the following we use the term condition code CC as the result of a compare instruction,
which might be stored in a special CC-register or stored in a general purpose integer register
Rx. Using Rx as destination allows multiple cmp-instructions to be overlapped. The corresponding bcc has to read its register and decode the actual cc from the bits of Rx.
Feeding the result of this bit selection to the instruction fetch stage through the forwarding
path of cc selection instructs the IF stage to switch the control flow to the new direction.
cmp R1,R2 -> CC      calculate all relevant condition bits and store them in the condition code register
bcc CC, offset       change the control flow depending on the selected CC bits; calculate the next PC
Using general-purpose registers as CC destinations allows several compares to overlap:
cmp R1,R2 -> R4
cmp R6,R7 -> R5
bcc R4, offset
bcc R5, offset
[Pipeline figure: IF, DEC, RF read, ALU, RF write; a forwarding data path carries the CC result backwards, and a forwarding path of the cc selection feeds the IF stage. Instruction numbering: branch predecessor i-1, branch i, delay slot i+1, branch successors i+2, i+3, branch target t, t+1; the numbering is used to identify the instructions in the flow of processor code.]
The calculation of the branch PC might be performed in the ALU, but is typically shifted to
earlier stages with an additional adder. The PC is held in the IF-stage and not stored in the
RF!
Vorlesung Rechnerarchitektur
Seite 36
Forwarding of the Condition Code
[Pipeline figure and stage-time diagram: the compare instruction i-1 produces the CC in the execute stage; the CC is forwarded to the branch i and the cc selection is forwarded to the fetch stage, so that after one delay slot (i+1) the correct next instructions (i+2, i+3) can be fetched.]
Vorlesung Rechnerarchitektur
Seite 37
Control Flow Dependencies
Fast Compare
Simple tests for equal, unequal, <0, <=0, >0, >=0 can be decided by dedicated fast compare logic directly behind the register file read stage, without waiting for the ALU.
[Pipeline figure and stage-time diagram: the fast compare result is forwarded to the fetch stage, so the delay slots after the branch are reduced before the correct next instructions can be fetched.]
Vorlesung Rechnerarchitektur
Seite 38
Delayed Branch
Idea: Reducing the branch penalty by allowing <d> useful instructions to be executed before
the control transfer takes place.
67% of all branches of a program are taken (loops!). Therefore, it is wise to use the prediction "branch taken".
[Stage-time diagrams: (a) branch taken (cc = true) - the branch instruction i is followed by the delay slot instruction i+1 and, after the control transfer, by the target instructions t and t+1; the delay slot is filled, no bubble occurs. (b) branch not taken (cc = false) - the target instructions already fetched after the delay slot must be discarded, leaving bubbles in the pipeline, and execution continues with the successor instruction i+2.]
Goal: zero branch penalty
Technique: moving independent instructions into the delay slots.
Probability of being able to fill a delay slot: 1st slot ~ 0.6; 2nd slot ~ 0.2; 3rd slot ~ 0.1.
Vorlesung Rechnerarchitektur
Seite 39
Delayed Branch with Squashing
Branch instruction format (bits 31..0): bcc opcode, static branch prediction bit, annulment bit, branch offset.
[Stage-time diagrams of a 4-stage pipeline (fetch, decode & read, execute, write), with the CC forwarded to control flow: with the branch taken, the delay slot instruction i+1 and the target instructions t, t+1, t+2 follow the branch i; with the branch NOT taken, the already fetched target instruction t is annulled and execution continues with the successors i+2, i+3.]
If the annulment bit is asserted, the delay slot instruction will be annulled when the branch
direction was mispredicted.
This enables the compiler to schedule instructions from before and after the branch instruction into the delay slot.
Vorlesung Rechnerarchitektur
Seite 40
Branch Prediction
Two different approaches can be distinguished here:
- static-branch prediction
- dynamic-branch prediction
The static-branch information associated with each branch does not change as the program
executes. The static-branch prediction utilizes compile time knowledge about the behaviour of a branch. The compiler can try to optimize the ordering of instructions for the correct path.
Four prediction schemes can be distinguished:
- branch not taken
- branch taken
- backward branch taken
- branch direction bit
[Figure: built-in static prediction strategies - after the branch and its delay slot, either the branch successor is followed ("branch not taken"), the branch target is followed ("branch taken"), backward branches are taken, or a static prediction bit in the instruction selects the path (=0 successor, =1 target).]
Vorlesung Rechnerarchitektur
Seite 41
Static Branch Prediction
The hardware resources required for static branch prediction are a forwarding path for the
prediction bit, and a logic for changing the instruction fetch strategy in the IF stage. This prediction bit supplied by the output of the instruction decode stage decides which path is followed. The prediction bit allows execution of instructions from the branch target or from the
successor path directly after the delay slot instruction.
[Pipeline figure: IF, DEC, RF read, ALU, RF write; the static branch prediction bit supplied by the decode stage selects, after the delay slot, either the branch successor path (=0) or the branch target path (=1) in the IF stage.]
The static branch prediction is controlled by the programmer/compiler (e.g. a prediction bit
within the branch instruction).
For example, GCC will try to use static branch prediction if it is available in the architecture and optimizations are turned on. If the programmer wants to retain control, GCC provides a builtin function for this purpose ( __builtin_expect(condition, c) ):
if (__builtin_expect (a == b, 0))
    f();
This means we expect a == b to evaluate to false (= 0 in C) and therefore the
function f() not to be executed.
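In practice the builtin is often hidden behind wrapper macros (as, for example, in the Linux kernel sources); a minimal sketch:

/* Common wrapper macros around __builtin_expect (sketch, not from the slides) */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void handle_error(void);

void process(int status)
{
    if (unlikely(status < 0))   /* hint to the compiler: the error path is rarely taken */
        handle_error();
}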
Vorlesung Rechnerarchitektur
Seite 42
Profiled Branch Prediction
The profiled branch prediction is a special case of static prediction. The guess of the
branch direction is based on a test run of the program. This profile information on branch
behaviour is used to insert a static branch prediction bit into the branch instruction that has
a higher proportion of correct guesses.
from the GCC 3.x manual
‘-fprofile-arcs’
Instrument "arcs" during compilation to generate coverage data or
for profile-directed block ordering. During execution the program
records how many times each branch is executed and how many times
it is taken. When the compiled program exits it saves this data
to a file called ‘AUXNAME.da’ for each source file. AUXNAME is
generated from the name of the output file, if explicitly
specified and it is not the final executable, otherwise it is the
basename of the source file. In both cases any suffix is removed
(e.g. ‘foo.da’ for input file ‘dir/foo.c’, or ‘dir/foo.da’ for
output file specified as ‘-o dir/foo.o’).
For profile-directed block ordering, compile the program with
‘-fprofile-arcs’ plus optimization and code generation options,
generate the arc profile information by running the program on a
selected workload, and then compile the program again with the same
optimization and code generation options plus
‘-fbranch-probabilities’ (*note Options that Control Optimization:
Optimize Options.).
The other use of ‘-fprofile-arcs’ is for use with ‘gcov’, when it
is used with the ‘-ftest-coverage’ option.
With ‘-fprofile-arcs’, for each function of your program GCC
creates a program flow graph, then finds a spanning tree for the
graph. Only arcs that are not on the spanning tree have to be
instrumented: the compiler adds code to count the number of times
that these arcs are executed. When an arc is the only exit or
only entrance to a block, the instrumentation code can be added to
the block; otherwise, a new basic block must be created to hold
the instrumentation code.
Vorlesung Rechnerarchitektur
Seite 43
Dynamic Branch Prediction
Dynamic branch prediction uses information on branch behaviour as the program executes. No initial dynamic information exists before the program starts executing.
The dynamic information is normally associated with a specific branch. The association can
be realized by using the instruction address of the branch as the unique identifier of the
branch.
The dynamic prediction depends on the past behaviour of this branch and is stored in a table
addressed by the branch instruction address. A unique addressing would need the whole
address as an index, usually 32 bit in length. This length prohibits direct use of this address
as an index to the state table.
Simple branch predictor using only one history bit
X'_Y' = f(a, X_Y)    the new state (prediction bit X', history bit Y') is a function of the current branch behaviour a and the old state X_Y
a  current branch behaviour: taken (a), not taken (/a)
X  prediction bit: take (1), do not take (0)
Y  history bit: last branch taken (1), last branch not taken (0)
[Figure: the low-order bits of the branch address index a predictor state memory holding the history and prediction bits; the predictor state machine moves between the states 00, 01, 10, 11 depending on a and /a and returns the prediction bit to the instruction sequencer.]
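A possible software model of such a predictor with one history bit per entry, indexed by the low-order bits of the branch address (table size and index function are assumptions for the sketch):

/* Sketch (not from the slides) of a 1-bit dynamic branch predictor:
   the table simply remembers whether the branch was taken last time. */
#include <stdbool.h>
#include <stdint.h>

#define PRED_BITS 10
#define PRED_SIZE (1u << PRED_BITS)

static uint8_t history[PRED_SIZE];               /* 1 history bit per entry */

static unsigned index_of(uint32_t branch_pc)
{
    return (branch_pc >> 2) & (PRED_SIZE - 1);   /* drop byte offset, use low-order bits */
}

bool predict_taken(uint32_t branch_pc)
{
    return history[index_of(branch_pc)];
}

void update_predictor(uint32_t branch_pc, bool taken)  /* taken = current branch behaviour a */
{
    history[index_of(branch_pc)] = taken;
}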
Vorlesung Rechnerarchitektur
Seite 44
Dynamic Branch Prediction
Integration of the dynamic branch predictor into the pipeline
[Figure: the branch address from the IF stage addresses the dynamic branch predictor; its prediction bit X steers the instruction fetch, and the actual branch behaviour a, derived from the CC in the execute stage, updates the predictor.]
Vorlesung Rechnerarchitektur
Seite 45
Reduction of Instruction Fetch Penalty
Instruction Cache
Instruction Buffer
Idea: holding a number of instructions (e.g. 32) in a fast buffer organized as an "instruction
window".
[Figure: a prefetch PC fills the instruction buffer from memory between a bottom and a top pointer; the fetch PC supplies the DEC stage. Hit condition: bottom <= fetch PC <= top.]
Branch Target Buffer
Optimizes instruction fetch in the case of a correctly predicted branch.
[Figure: the low-order bits of the branch address index a combined predictor state memory and branch PC memory; the high-order bits are compared against the tag field to detect a hit. On a hit, the stored branch target PC is sent to the IF stage (hiding Nb) and the stored branch target instructions t, t+1, t+2, t+3 are delivered to the DEC stage (hiding No); the predictor state machine is updated with the current branch behaviour.]
Vorlesung Rechnerarchitektur
Seite 46
Predication
Minimization of branches
• "guarded" or conditional instructions
- the instruction executes, if a given condition is true.
If the condition is not true the instruction does not execute
- example: CMOVcc instructions (conditional move) from the IA-32 architecture
(available since P6)
• the test is embedded in the instruction itself
- no branch involved
• predicated instructions
- predicates are precomputed values stored in registers
- instruction contains a field to select a predicate from the predicate RF
- example: IA-64 architecture
’guarded’ or conditional instruction
A compare computes the predicates: p1 = "true" if the condition holds, p2 = not p1.
<p1> a = a + 1
<p2> b = b + 1
[Figure: the branching control flow (cmp, then either a = a + 1 or b = b + 1) is replaced by a straight-line sequence of predicated instructions guarded by p1 and p2.]
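The if-conversion idea can be sketched in C; whether a compiler actually emits CMOVcc or predicated instructions for the second version depends on the target architecture and optimization level (the function names are illustrative):

/* Sketch (not from the slides): if-conversion of a small branch in C. */
void step_branching(int *a, int *b)
{
    if (*a < *b)
        *a = *a + 1;
    else
        *b = *b + 1;
}

void step_predicated(int *a, int *b)
{
    int p1 = (*a < *b);   /* predicate p1 */
    int p2 = !p1;         /* predicate p2 = not p1 */
    *a += p1;             /* contributes only when p1 is true */
    *b += p2;             /* contributes only when p2 is true */
}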
Vorlesung Rechnerarchitektur
Seite 47
Predication
How Predication works
1. The branch has two possible outcomes.
2. The compiler assigns a predicate register to each following instruction, according to its path.
3. All instructions along one path point to predicate register P1.
4. All instructions along the other path point to predicate register P2.
5. The CPU begins executing instructions from both paths.
6. The CPU can execute instructions from different paths in parallel because they have no mutual dependencies.
7. When the CPU knows the compare outcome, it discards the results from the invalid path.
The compiler might rearrange the instructions (1, 2, 3 (branch), 4 (P1), 7 (P2), 5 (P1), 8 (P2), 6 (P1), 9 (P2)) into 128-bit long instruction words, pairing instructions 4 and 7, 5 and 8, and 6 and 9 for parallel execution.
Predication replaces branch prediction by allowing the CPU to
execute all possible branch paths in parallel.
Vorlesung Rechnerarchitektur
Seite 48
Memory Access
Load/Store Data Path
Using data directly from the register file or forwarding result data from the output of the
ALU of the EX stage is straightforward. But how is the data initially placed in the
register file?
There are two different approaches:
• Instructions like ADD which can access a memory location by using effective address
generation (memory-to-register architecture like the MC68020, see also 1-, 2- or 3-address
machines), e.g. ADD D3, (A4)
• Instructions like LOAD (LD) or STORE (ST) which are specialized for memory accesses,
e.g. LD R3, (R4)
In the following we will focus on LD/ST architectures, because the decoupling of LD and
data processing instructions like ADD is a very advantageous feature of high-performance
processors. The latency of the memory (Processor-Memory Gap!) can be handled independently of the ADD instruction. Scheduling the LD instruction early enough before the use of
the data value can hide the memory latency (see ILP).
[Figure and stage-time diagram: the LD functional unit sits next to the ALU in the EX stage (the ST unit is omitted for simplicity); a load data path with a latency of more than 10 clock ticks brings data from cache/memory to the forwarding data mux. An instruction i+1 that uses R3 directly after "LD R3,(R4)" is stalled at the issue check for about 10 clock ticks until the load data can be forwarded and R3 is written.]
This simple LD/ST architecture does not include address calculation in the instructions.
Other processors of the LD/ST type add another stage in front of LD/ST stage to perform
address calculations.
Vorlesung Rechnerarchitektur
Seite 49
Memory System Architecture
The two basic principles that are used to increase
the performance of a processor are:
- pipelining and parallel execution
- the optimization of the memory hierarchy.
Applying a high-performance memory system to modern microprocessors is another step towards improving performance. Whenever off-chip accesses have to be performed, throughput is reduced because of the delay involved and the latency of external storage devices. The
memory system typically consists of a hierarchy of memories.
Definition:
A memory hierarchy is the result of an optimization process with respect to technological and economic constraints. The implementation
of the memory hierarchy consists of multiple levels, which differ in size and speed. It is used for storing the working set of the ‘process in execution’ as near as possible to the execution unit.
The memory hierarchy consists of the following levels:
- registers
- primary caches
- local memory
- secondary caches
- main memory
Going from the registers towards main memory, each level gets slower, cheaper per bit, denser and larger; the upper levels are on chip, the lower levels off chip.
The mechanisms for the data movement between levels may be explicit (for registers, by means of load instructions) or implicit (for caches, by means of memory addressing).
[Figure: on the CPU chip, the CPU with its register files and the 1st-level cache; via the external processor interface, the 2nd-level cache, the main memory, and the disk storage.]
Vorlesung Rechnerarchitektur
Seite 50
Registers
Registers are the fastest storage elements within a processor. Hence, they are used to keep
values locally on-chip, and to supply operands to the execution unit(s) and store the results
for further processing. The read-write cycle time of registers must be equal to the cycle time
of the execution unit and the rest of the processor in order to allow pipelining of these units.
Data is moved from and into the registers by explicit software control.
Registers can be grouped in a wide variety of ways:
- accumulator
- evaluation stack
- register file
- multiple register windows
- register rotation
- register banks
Evaluation Stack
A very simple register structure is the evaluation stack. The addressing of values is done implicitly by most of the operations. Push and pop operations are used to enter new values on
the stack, or to store the top of the stack back to the memory. Dyadic operations, like add,
sub, mul, consume the top two registers as source operands, leaving the result on the top of
the stack. Implicit addressing of the stack reduces the need for instructions to specify the location of their operands, which in turn reduces the size of instructions and results in a very
compact code.
[Figure: stack cells A, B, C initially holding 10, 5, 3. push(7) places 7 on top and the bottom value 3 is lost; pop removes the 10 from the top; add consumes the top two values 10 and 5 and leaves 15 on top (A := A + B); "add + const(7)" turns the top value 10 into 17 (A := A + const). Each column shows the stack before and after the operation.]
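A minimal C sketch of such an evaluation stack (names and stack depth are assumptions, not from the slides):

/* Evaluation stack with implicit operand addressing: dyadic operations
   consume the two top entries and leave the result on top. */
#include <stdio.h>

#define STACK_DEPTH 16
static int stack[STACK_DEPTH];
static int top = -1;                        /* index of the top-of-stack entry */

static void push(int v)    { stack[++top] = v; }
static int  pop(void)      { return stack[top--]; }
static void add(void)      { int b = pop(), a = pop(); push(a + b); }

int main(void)
{
    push(5);                                /* B */
    push(10);                               /* A */
    add();                                  /* A := A + B, result on top */
    printf("top of stack: %d\n", pop());    /* prints 15 */
    return 0;
}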
Vorlesung Rechnerarchitektur
Seite 51
Register File
The collection of multiple registers into a register file, directly accessible from the instruction word, provides faster access to data that will have a high ‘hit rate’ within a short period of time. The compiler must identify those variables and allocate them into registers.
This task is named register allocation. The reuse of these variables held in registers reduces
main memory accesses and therefore speeds up execution.
The register file (RF) is the commonly used grouping of registers in a CPU. The number of
registers depends on many constraints, among them:
- the length of the instruction word containing the addresses of the registers combined in an operation
- the speed of the register file, which is inversely proportional to the number of register cells and the number of ports
- the number of registers needed for software optimization,
parameter passing for function call and return (32 is more than enough)
- the penalty for saving and restoring the RF in case of a process switch
- the number of ports of the RF for parallel operations
The addressing of the registers in the RF is performed by the register address fields of an instruction. The number of register address fields in the instruction divides processors into
classes called two-, 2 1/2-, or three-address machines. Because dyadic operations are very frequent, most modern processors are three-address machines and have two source address
fields and one destination address field. The following figure presents a typical 32-bit instruction format of a RISC processor.
bits 31-26: instruction class (6 bits)
bits 25-15: opcode (11 bits)
bits 14-10: source 1 (5-bit register address)
bits  9-5:  source 2 (5-bit register address)
bits  4-0:  destination (5-bit register address)
5bit register address
The limitation of the instruction word width restricts the addressable number of registers to
32, caused by the 5-bit register address fields.
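Decoding this format can be sketched with shifts and masks in C (struct and function names are illustrative only):

/* Sketch (not from the slides): decoding the 32-bit three-address
   instruction format shown above. */
#include <stdint.h>

typedef struct {
    unsigned iclass, opcode, src1, src2, dest;
} DecodedInstr;

DecodedInstr decode(uint32_t word)
{
    DecodedInstr d;
    d.iclass = (word >> 26) & 0x3F;   /* bits 31-26: instruction class (6 bits) */
    d.opcode = (word >> 15) & 0x7FF;  /* bits 25-15: opcode (11 bits)           */
    d.src1   = (word >> 10) & 0x1F;   /* bits 14-10: source 1 (5 bits)          */
    d.src2   = (word >>  5) & 0x1F;   /* bits  9-5 : source 2 (5 bits)          */
    d.dest   =  word        & 0x1F;   /* bits  4-0 : destination (5 bits)       */
    return d;
}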
The typical VLSI realization of a RF is a static random access memory with a 6-transistor
cell, 4 for the storage cell and two transistors for every read/write access port, which means
two simultaneous reads (two read-ports) or one write (one write-port).
Vorlesung Rechnerarchitektur
Seite 52
Register File
[Figure: register file cell with 1 write port (the result is driven via line drivers onto BIT LINE A/B, selected by the destination decoder driving WORD LINE A and B) and 2 read ports (operand 1 and operand 2 are read out via sense amplifiers on BIT LINE A and BIT LINE B, selected by WORD LINE A and WORD LINE B). Figure b: an additional read port adds WORD LINE C, an extra pass transistor, and BIT LINE C.]
The register file features two read ports and one write port, which shares the two bit lines.
The read and the write cycle are time-multiplexed.
The read operation activates the two source address decoders and they drive the corresponding word line A and B. Operand 1 and operand 2 can be read on the two bit lines. Special
sense amplifiers and charge circuits control the data read-out. The write cycle activates the
destination decoder which must drive both word lines. The result data is forced to the storage
cell by applying data and negated data on both bit lines to one cell. The geometry of the transistors is chosen so that the flip-flop is overruled by the external bit line information.
An additional read port (3-read/1-write port) requires one extra word line (from an additional
source 3 decoder), a pass transistor, and an extra bit line, as shown in figure b. Every extra port
of the register file increases the area consumed by the RF and thus reduces the speed of the
RF.
Vorlesung Rechnerarchitektur
Seite 53
Register Windows
The desire to optimize the frequent function calls and returns led to the structure of overlapping register windows. The save and restore of local variables to and from the procedure
stack is avoided if the called procedure can get an empty set of registers at the procedure
entry point. The passing of parameters to the procedure can be performed by overlapping
the actual with the new window. For global variables a number of global registers can be
reserved and accessed from all procedure levels. This structure optimizes the procedure call
mechanism of high-level languages, and an optimizing compiler can allocate many variables
and parameters directly into registers.
[Figure: a large register file (here registers 0-135) divided into overlapping windows for procedures A, B, C. Each procedure sees 8 global registers (R0-R7) shared by all levels, 8 parameter registers (R8-R15), 8 local-variable registers (R16-R23), and 8 parameter registers (R24-R31), i.e. 24 window registers plus the 8 global registers; the parameter registers of one procedure overlap with those of the next (between A and B, and between B and C), so that parameters are passed without copying.]
Due to the fixed size of the instruction word the register select fields have the same width (5
bits) as in the register file structure. The large RF requires more address bits and therefore
the addressing is done relative to the window pointer. The window pointer is kept in a special register called current window pointer (cwp) and provides the base address of the actual
window. An addition of the cwp with the content of the register select field provides the physical address to the large register file. The following figure presents the logical structure of
the addressing scheme. The addition and the global register multiplexer can be incorporated
in the register file decoder by using the addressing truth table as the basis for the decoder
implementation. Nevertheless, this address translation slows down the access of the RF.
The cwp is controlled directly by the instruction ‘call’, which decrements the pointer, and by
the ‘return’, which increments the cwp.
This scheme has been implemented by the SPARC family of processors. Note that the register window allocated by a call instruction has a fixed size regardless of the size necessary
to accommodate the called function. The Itanium family by Intel (IA-64) uses an optimized
version enabling variable-size windows.
Vorlesung Rechnerarchitektur
Seite 54
Register Rotation
Consider a typical loop for a numerical calculation: X=V+U where V,U,X are vectors.
loop:
ld U[lc], R32
ld V[lc], R33
add R32,R33,R34
st R34, X[lc]
dec lc
cmp lc, #0
bne loop
;lc = loop counter
;R32,R33,R34 = register used by loop
Problem: dependency between the ld/add/st operations even across loop iterations, although
the individual iterations are logically independent! The dependency arises only because the same registers are used.
One can use loop unrolling together with usage of more registers to solve this problem (see
ILP, loop unrolling).
Different solution: Register Rotation
[Figure: the register file region used by the loop (here around registers 112-135) is rotated for each iteration: the same logical loop registers (e.g. R32, R33, R34) map to different physical registers in iterations 0, 1 and 2, so the registers of successive iterations are independent of each other; the global registers R0-R7 stay fixed.]
A special loop counter register is used as the base address for the register selection (address calculation
required). The same register address (from the programmer's point of view) addresses a different (physical) register in each loop iteration.
It is possible to fully pipeline the loop.
The Itanium processor features register rotation to facilitate loop pipelining.
Vorlesung Rechnerarchitektur
Seite 55
Caches
Caches are the next level of the memory hierarchy. They are small high speed memories employed to hold small blocks of main memory that are currently in use.
The principle of locality, which justifies the use of cache memories, has two aspects:
- locality in space or spatial locality
- locality in time or temporal locality
Most programs exhibit this locality in space in cases where subsequent instructions or data
objects are referenced from near the current reference. Programs also exhibit locality in time,
where objects located in a small region will be referenced again within a short period. Instructions are stored in sequence, and data objects are normally stored in the order of their
use. The following figure is an idealized space/time diagram of address references, representing the actual working set w in the time interval Δτ.
[Figure: address space versus time; the references to data and instruction regions within the interval (T, T + Δτ) form the working set w(T, T + Δτ).]
Caches are transparent to the software. Usually, no explicit control of cache entries is possible. Data is allocated automatically by cache control in the cache, when a load instruction
references the main memory.
Some processors feature explicit control over the caching of data. Four types of user mode
instructions can improve hit rate significantly (cache bypassing on stores, cache preloading,
forced dirty-line flush, line allocation without line fill).
Vorlesung Rechnerarchitektur
Seite 56
Caches
Cache memory design aims to make the slow, large main memory appear to the processor
as a fast memory by optimizing the following aspects:
- maximizing the hit ratio
- minimizing the access time
- minimizing the miss penalty
- minimizing the overhead for maintaining cache consistency
The performance gain of a cache can be described by the following formula:
Gcache = Tm / ( (1-H) Tm + H Tc )
       = 1 / ( (1-H) + H Tc/Tm )
       = 1 / ( 1 - H (1 - Tc/Tm) )
Tm = tacc of main memory
Tc = tacc of cache memory
H  = hit ratio [0 ... 1],  (1-H) = miss ratio
[Plot: gain G versus hit ratio H for the example Tm/Tc = 5; the gain grows slowly up to H ~ 0.9 and only approaches Gcache(H=1) = Tm/Tc = 5 for hit ratios very close to 1.]
The hit ratio of the cache (in %) is the ratio of accesses that find a valid entry (hits) to the total number of
accesses. The miss ratio is 100% minus the hit ratio.
The access time to a cache should be significantly shorter than that to the main memory. On-chip caches (L1) normally need one clock tick to fetch an entry.
Access to off-chip caches (L3) depends on the chip-to-chip delay, the control signal protocol, and the access time of the external memory chips used.
Vorlesung Rechnerarchitektur
Seite 57
Cache Terms
Cache block size:
Cache block size is the size of the addressable unit within a cache, which is at the same time
the size of the data chunk to be fetched from main memory.
Cache line:
A cache line is the data of a cache block, typically a power of 2 number of words. The term
is mainly used for the unit of transport between main memory and cache.
Cache entry:
A cache entry is the cache line together with all required management information, e.g. the
tag field and the control field.
Cache frame:
The cache frame refers to the addressable unit within a cache which can be populated with
data. It defines the empty cache memory space, where a cache line can be placed.
[Plot: hit ratio H versus cache size (1-64 KB, log2 scale); fully associative caches reach higher hit ratios than directly mapped ones of the same size, and a 16-32 KB L1 cache already achieves H > 0.9.]
Set:
A set is the group of cache blocks which can be reached by the same index value,
i.e. only these pairs (2-way) or quadruples (4-way) of cache blocks, not the whole part
of one way of the set-associative cache memory.
Unfortunately, there is no established term for that part of a set-associative cache
memory; therefore ’set’ is frequently also used as the name for this part of the memory.
Vorlesung Rechnerarchitektur
Seite 58
Cache Organizations
Five important features of the cache, with their various possible implementations, are:
- mapping: direct, set associative, fully associative
- organization: cache block size, entry format, split cache, unified cache
- addressing: logically indexed, physically indexed, logically indexed/physically tagged
- management: consistency protocol, control bits, cache update policy
- placement: random, fifo, least recently used
One of the most important features is the mapping principle. Three different strategies can
be distinguished, but the range of caches - from directly mapped to set associative to fully
associative - can be viewed as a continuum of levels of set associativity.
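As an illustration of the mapping principle, a sketch of the address decomposition of a directly mapped cache in C (the bit widths are example assumptions, not values fixed by the lecture):

#include <stdint.h>
#include <stdbool.h>

#define Z_BITS 2    /* byte select */
#define X_BITS 2    /* word select */
#define N_BITS 10   /* index: 2^10 cache entries */

typedef struct { bool valid; uint32_t tag; } entry_t;
static entry_t cache[1u << N_BITS];

/* returns true on a cache hit for the given address */
static bool lookup(uint32_t addr)
{
    uint32_t index = (addr >> (Z_BITS + X_BITS)) & ((1u << N_BITS) - 1);
    uint32_t tag   =  addr >> (Z_BITS + X_BITS + N_BITS);
    return cache[index].valid && cache[index].tag == tag;
}

A set-associative cache uses the same split but performs the tag compare in parallel for every way of the selected set, as the following figures show.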
Cache Mapping
[Figure: Directly Mapped Cache — the address is split into tag (m bits), index (n bits), word select (x bits) and byte select (z bits); the index selects one of the 2^n cache entries, the stored tag is compared with the address tag, and equality signals a hit; the main memory consists of 2^m blocks of cache size. Hardware structure, address path only.]
Vorlesung Rechnerarchitektur
Seite 59
Cache Mapping
[Figure: 2-way Set Associative Cache — the address is split into tag (m+1 bits), index (n-1 bits), word select (x bits) and byte select (z bits); the index selects one entry in set 0 and one in set 1 (2^(n-1) entries each), both tag memories (Tag Mem 0, Tag Mem 1) are compared in parallel, and the compare results are ORed into the hit signal; the main memory consists of 2^(m+1) blocks of cache size. Hardware structure, address path only.]
[Figure: Fully Associative Cache — the address is split into tag (m+n bits), word select (x bits) and byte select (z bits); all 2^n stored tags (Tag 0 ... Tag 2^n-1) are compared in parallel with the address tag and the results are ORed into the hit signal; the main memory consists of 2^(m+n) blocks. Hardware structure, address path only.]
Vorlesung Rechnerarchitektur
Seite 60
Cache Mapping
[Figure: Data Paths of Differently Mapped Caches — directly mapped: the index selects the entry and the word select drives a word mux; set associative: the index selects one entry per set, the tag compare result drives a set mux followed by the word mux; fully associative: the tag compare selects the entry, followed by the word mux.]
Cache Organization
The basic elements of a cache organization are:
- the entry format
- the cache block size
- the kind of objects stored in the cache
- special control bits
[Figure: Entry format — a cache entry consists of a tag field (physical or logical address part, optionally with a pid for placement), a control field with the MESI state bits (modified, exclusive, shared, invalid), and a data field, e.g. word 0 ... word 3.]
Vorlesung Rechnerarchitektur
Seite 61
Cache Line Fetch Order
Always fetch required word first. This keeps the memory access latency to a minimum.
access sequence: interleaved mode (Intel mode), byte offsets for 64-bit words:

start address 0:   0 -  8 - 10 - 18
start address 8:   8 -  0 - 18 - 10
start address 10: 10 - 18 -  0 -  8
start address 18: 18 - 10 -  8 -  0

[Figure: the memory delivers two 64-bit words per read (1st read: 0,8 or 8,0 or 10,18 or 18,10; 2nd read: the remaining pair); fast multiplexing onto the CPU bus with tri-state drivers or a multiplexer, controlled by EN_L/EN_H.]

see also: DRAM burst mode for further explanations
- interleaved mode
- sequential mode
- programmable burst length
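A small sketch in C that generates this interleaved fetch order (assuming a 4-word line of 64-bit words; the start word is the critical word requested by the processor):

#include <stdio.h>
#include <stdint.h>

/* interleaved (Intel mode) burst order: word i of the burst is start_word XOR i */
static void burst_order(uint32_t start_byte_offset)
{
    uint32_t start_word = (start_byte_offset >> 3) & 0x3;   /* 64-bit words, 4-word line */
    for (uint32_t i = 0; i < 4; i++)
        printf("0x%02x ", (start_word ^ i) << 3);           /* back to byte offsets */
    printf("\n");
}

int main(void)
{
    burst_order(0x00);   /* 0x00 0x08 0x10 0x18 */
    burst_order(0x08);   /* 0x08 0x00 0x18 0x10 */
    burst_order(0x10);   /* 0x10 0x18 0x00 0x08 */
    burst_order(0x18);   /* 0x18 0x10 0x08 0x00 */
    return 0;
}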
Vorlesung Rechnerarchitektur
Seite 62
Cache Consistency
The use of caches in shared-memory multiprocessor systems gives rise to the problem of cache consistency. Inconsistent states may occur, when two processors keep a copy of the
same memory cell in their caches, and one processor modifies the cache contents or the main
memory by a write.
Two memory-update strategies can be distinguished:
- the write back (WB), sometimes also known as copy back,
- and the write through (WT).
The WT strategy is the simplest one. Whenever a processor starts a write cycle, the cache is
updated and the same value is written to the main memory. The cache is said to be written through. Nevertheless, this write must inform all other caches of the new value at this
address. While the active bus master (CPU or DMA) is placing its write address on to the
address bus, all other caches in the other CPUs must check this address against their cache
entries so as to invalidate or update the cache line.
The WB strategy is more efficient, because the main memory is not updated for each store
instruction. The modified data is stored in the cache data field only, the line being marked
as modified in the cache control field. The write to the main memory is performed only on
request, and then whole cache lines are written back (WB). This memory update strategy is
called write back or copy back and allows the cache to hold newer values than the main memory. Information must be available in the cache line, which keeps track of the state of a
cache entry. The MESI states and MESI consistency protocol are widely used and are therefore given here as an example of cache consistency protocols. Four possible states of a cache
line are used by the MESI protocol:
- Modified: one or more data items of the cache line are written by a store operation, the modified or dirty bit being set
- Exclusive unmodified: the cache line belongs to this CPU only, the contents
is not modified
- Shared unmodified: the cache line is stored in more than one cache and can be
read by all CPUs. A store to this address must invalidate all other copies and
update the main memory
- Invalid: the cache entry is invalid; this is the initial state of a cache line that
does not contain any valid data.
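A compact sketch of the bus-master side of the MESI protocol in C; the event names and the restriction to local processor accesses are simplifying assumptions for illustration, the full transition diagrams follow on the next pages:

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { READ_HIT, WRITE_HIT, READ_MISS_SHARED,
               READ_MISS_EXCLUSIVE, WRITE_MISS } event_t;

/* next state of a cache line as seen from the bus-master CPU */
static mesi_t master_next(mesi_t s, event_t e)
{
    switch (e) {
    case READ_HIT:            return s;           /* no bus transaction needed              */
    case READ_MISS_SHARED:    return SHARED;      /* another cache also holds the line      */
    case READ_MISS_EXCLUSIVE: return EXCLUSIVE;   /* line comes from memory, no other copy  */
    case WRITE_MISS:          return MODIFIED;    /* read with intent to modify             */
    case WRITE_HIT:           return MODIFIED;    /* from S an invalidation bus transaction
                                                     is issued first; from E/M purely local */
    }
    return s;
}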
Vorlesung Rechnerarchitektur
Seite 63
Cache Consistency
The states of the cache consistency control bits and the transitions between them are illustrated in two figures, the first one showing all transitions of the cache of a bus master CPU, and
the second state diagram showing the transitions of a slave CPU cache (snooping mode).
Definition :
A processor of a shared-memory multiprocessor system (bus-interconnected) is
called bus master if it has gained bus mastership from the arbitration logic and is
in the process of performing active bus transactions.
Processors of a shared-memory multiprocessor system are called bus slaves if these processors can not currently access the bus for active bus transactions. They
have to listen passively (snooping) for the active transactions of the bus master.
[Figure: Cache Consistency State Transitions for the Bus Master CPU — MESI state diagram; read hits and write hits on a modified line stay in M, a write hit in E goes to M, a write hit in S issues an invalidation bus transaction [2] and goes to M, shared read misses [3] lead to S, exclusive read misses [3] lead to E, and write misses [1] lead to M; a separate arrow style marks the snoop response.
[1] = read with intent to modify
[2] = invalidation bus transaction
[3] = address tag miss]
[Figure: State Transitions for the Snooping CPU (Slave) — a snoop hit on a read moves E to S and M to S with a copy back of the modified data [4]; a snoop hit on a write, on a read with intent to modify or on an invalidation moves S, E and M to I (from M with copy back [4]).
[4] = copy back of modified data]
Vorlesung Rechnerarchitektur
Seite 64
Cache Addressing Modes
The logically addressed cache is indexed by the logical (or virtual) address of the process
currently running on the CPU. The address translation need not be performed for the cache
access, and the cache can therefore be accessed very fast, normally within one clock cycle.
Only the valid bit must be checked to see whether or not the data item is in the cache line.
The difficulty with the logically addressed cache is that no snooping of external physical
addresses is possible.
[Figure: logically addressed cache versus physically addressed cache — in the logically addressed cache the logical address indexes the cache directly, and the MMU/ATC translation is only needed on a miss; in the physically addressed cache the MMU/ATC (TLB) first translates the logical address, and the cache is then indexed with the physical address.]
The physically addressed cache is able to snoop and need not be invalidated on a process
switch, because it represents a storage with a one-to-one mapping of addresses to main memory cells. The address translation from logical to physical address must be performed in
advance. This normally slows down the cache access to two clock cycles, the address translation and the cache access.
[Figure: logically indexed/physically tagged cache — the logical address indexes the cache while the MMU/ATC translates in parallel; an address compare of the translated physical address with the physical tag stored in the cache entry validates the access.]
The logically indexed/physically tagged cache scheme avoids both disadvantages by
storing the high-order part of the physical address in the cache line as additional tag information. If the size of the index part of the logical address is chosen to match the page size,
then the indexing can be performed without MMU translation.
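A tiny sketch of this size constraint in C (the concrete numbers in the comment are example assumptions): the index and the line-offset bits must fit into the untranslated page offset.

#include <stdbool.h>

/* true if the cache can be indexed with the logical address
   before the MMU translation has finished                    */
static bool can_index_before_translation(unsigned page_offset_bits,  /* e.g. 12 for 4 KB pages */
                                         unsigned index_bits,        /* log2(#entries)          */
                                         unsigned line_offset_bits)  /* word + byte select      */
{
    return index_bits + line_offset_bits <= page_offset_bits;
}
/* example: 4 KB pages, 32-byte lines (5 bits), 128 entries (7 bits): 7 + 5 <= 12 -> true,
   i.e. one 4 KB way of the cache can be indexed without MMU translation */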
Vorlesung Rechnerarchitektur
Seite 65
Cache Consistency
[Figure (animation): the cache consistency state transitions for the bus master CPU and for the bus slave CPU (snooping) shown side by side and applied to a two-processor example — processor 0 and processor 1 each have an L1 cache (tag, MESI bits, data) and the snoop response lines Hit and HitM; an arbiter (BR0/BR1, BG0/BG1) grants the shared bus to the main memory; the animated instruction sequence LD R1 <- ($A004); ADD R1, #1, R1; ST R1 -> ($A004) shows how the cache line changes its MESI state in both caches. A small table defines the encoding of the snoop response signals Hit/HitM for the states I, E/S and M.]
Vorlesung Rechnerarchitektur
Seite 66
Cache Placement
- directly mapped cache: a single entry per index, no placement choice
- set associative / fully associative cache:
  - random replacement (selection e.g. by a free-running random flip-flop)
  - FIFO replacement (first in - first out; a circular FIFO pointer per index, one per set)
  - least recently used (LRU): the entry whose last access lies furthest in the past is overwritten; requires an aging count or algorithm (see the LRU method in the memory management chapter)
Vorlesung Rechnerarchitektur
Seite 67
More Cache Variants
There are a couple more cache-related terms one encounters with today's processors:
• split cache versus unified cache
• inclusive/exclusive caches
• trace caches
Split Cache / Unified Cache
Unified caches serve both as instruction and data cache; in a split cache architecture, a dedicated instruction cache and a dedicated data cache exist. Split caches are very often found
as L1 caches of processors (internal Harvard architecture).
A processor having a dedicated instruction memory and a dedicated data memory is called a
Harvard architecture.
[Figure: processor with split L1 I-cache and L1 D-cache, both connected to a unified L2 cache, which connects to memory.]
Inclusive/Exclusive Caches
Inclusive: data is held in both L1 and L2 on a fetch; if it is evicted from L1, it remains in L2, giving fast access later on; effective cache capacity is lost because L1 contents are duplicated in L2.
Exclusive: data is only fetched into L1; if it is evicted from L1, it is written to L2; the effective cache size is L1 + L2, but the copy from L1 to L2 costs time; this can be optimized using a victim buffer.
Trace Cache
A trace cache is a special case of an instruction cache. A trace cache does not hold instructions exactly as they are found in main memory but rather in a decoded form, and possibly
a sequential entry of trace cache corresponds to an instruction stream across a branch.
Vorlesung Rechnerarchitektur
Seite 68
Memory Technology
Development
[Figure: cost per transistor (log10 $US) versus year, 1959-2002, with on-chip values from Fairchild (Gordon Moore); the cost falls from about 10^-1 $US per transistor around 1960 to about 6.6 x 10^-6 $US in 2002 (4.5 million transistors for 30 $US).]
year 2002: 4.5 million transistors in 0.18 μm CMOS technology on a 5x5 mm die with BGA package ~ 30 $US = 6.6 x 10^-6 $US per transistor; standard cell design
Vorlesung Rechnerarchitektur
Seite 69
Memory Technology
Cost development of DRAMS
Stone, Harold S., High-Performance Computer Architecture:
"Memory chips have been quadrupling capacity every two to three years. The manufacturing
cost per chip is usually constant per chip, regardless of the memory capacity per chip. When
a new memory chip that has four times the capacity of its predecessor is introduced, a typical
strategy is to sell it at four or five times the price of its predecessor. Although the price per
bit is about equal for new and old technologies, the newer technology leads to less expensive
systems because of having only one fourth the number of memory chips."
[Figure: cost per MByte of DRAM memory (log10 $US) versus year, 1960-2002; from about 10^6 $US per MByte around 1960 via 2500 $US and 12 $US down to about 3.5 $US for a 512 MB module in 2002 (~5.4 x 10^-8 $US/bit).]

some data:
1982  16 kbit SRAM  ~80 $US  (chip with 16 Kbit, 16K x 1)
1982  64 kbit DRAM  ~20 $US  (chip with 64 Kbit, 64K x 1)
1995   1 Mbit SRAM  ~50 $US  (chip with 1 Mbit, 1M x 1 or 256K x 4)
1995   4 Mbit DRAM   ~6 $US  (chips with 4 Mbit, 4M x 1 or 1M x 4)
Vorlesung Rechnerarchitektur
Seite 70
Main Memory
The main memory is the lowest level in the semiconductor memory hierarchy. Normally all
data objects must be present in this memory for processing by the CPU. In the case of a demand-paging memory, they are fetched from the disk storage to the memory pages on demand before processing is started. Its organization and performance are extremely important
for the system’s execution speed.
The advantages of DRAMs with respect to cost and integration density are much more important than the higher speed provided by SRAMs.
                      SRAM                          DRAM
access time           1-10 ns                       trac 60-100 ns, tcac 20-30 ns, tcyc 110-180 ns
power consumption     similar @ high MHz,           2000 mW / 256 Mbit
                      few when idle
memory capacity       10 MByte (cache),             4-32 GByte (MM),
                      16 Mbit per chip              4 Gbit per chip
price                 $ 1/Mbit                      $ 10/Gbit

trac = row-address access time, tcac = column-address access time, tcyc = cycle time
Important parameters of the DRAM are:
- RAS access time - tRAC; typ. 80ns
the time from row address strobe (RAS) to the delivery of data at the outputs on
a read cycle
- CAS access time - tCAC; typ. 20ns
the time from column address strobe (CAS) to the delivery of data at the outputs
on a read cycle
- RAS recovery time - tREC; typ. 80ns
after the application of a RAS pulse, the DRAM needs some time to restore the
value of the memory cell (destructive read) and to charge up the sense lines
again, until the next RAS pulse can be applied, this time allows the chip to ‘recover’ from an access
- RAS cycle time - tCYC; typ. 160ns
is the sum of the RAS access time and the recovery time and defines the minimum time for a single data access cycle
- CAS cycle time - tPCYC; typ. 45ns
the time from CAS being activated to the point at which CAS can be reactivated
to start a new access within a page
Vorlesung Rechnerarchitektur
Seite 71
Main Memory
The address lines of DRAMs are multiplexed to save pins and keep the package of the chip
small. The partitioning of the address into the Row Address RA and the Column Address CA
necessitates a sequential protocol for the addressing phase. The names ‘row’ and ‘column’
stem from the arrangement of the memory cells on the chip. The RA is sampled into the
DRAM at the falling edge of RAS*, then the address is switched to the CA and sampled at the
falling edge of CAS* (CAS*↓).
[Timing diagrams for a 1 Mbit x 4 DRAM: single read cycle — RAS* falls with the row address on A9-0, CAS* falls with the column address, valid data appears on I/O4-1 after tRAC/tCAC, followed by the recovery time tREC within the cycle time tCYC; page mode cycle — RAS* stays low while CAS* is toggled with successive column addresses 1..3, delivering data-out 1..3 at the page-mode cycle time tPCYC.]
Vorlesung Rechnerarchitektur
Seite 72
Word-Wide Memories
The simplest form of memory organization is the word-wide memory, matching the bus
width of the external processor interface.
[Figure: word-wide memory — the CPU (32-bit address, 64-bit data) is connected via a request/acknowledge handshake to a memory control logic that generates RAS#, CAS#, WE#, OE#, compares the address and multiplexes it into row address (RA) and column address (CA); clock cycle 20 ns. Timing: after the memory cycle start (AS*/DS*), the controller sequences IDLE, RW1, READ1..READ5 and back to IDLE; the processor samples the data with DTACK*; address stable period, bus transfer time, data transfer time and recovery time make up the cycle.]
Vorlesung Rechnerarchitektur
Seite 73
State Machine of Simple DRAM
[Figure: state machine of a simple DRAM controller — from IDLE, /AS* & mem sel & REFREQ* starts a read/write sequence (RW1, then READ1..READ5 asserting /RAS*, /MUX*, /CAS*, /OE* and /DTACK*, or WRITE1..WRITE5 asserting /RAS*, /MUX*, /WE*, /CAS* and /DTACK*), while /REFREQ* starts a refresh sequence REF1..REF7 that asserts /CAS* before /RAS*; each sequence waits for DS* and returns to IDLE.]
Vorlesung Rechnerarchitektur
Seite 74
Word-Wide Registered Memories
High-performance processors need more memory bandwidth than a simple one-word memory can provide. The access and cycle times of highly integrated dynamic RAMs are not
keeping up with the clock speed of the CPUs.
Therefore, special architectures and organizations must be used to speed up the main memory system.
The design goal is to transport one data word per clock via the external bus interface of the
CPU to or from the main memory.
This sort of performance cannot be obtained by a one-word memory.
The two basic principles for enhancing speed - pipelining and parallel processing - must
also be applied to the main memory. Pipelining of a one-word memory involves attempting
to divide the memory access operation into suboperations and to overlap their execution in
the different stages of the pipeline. The subdivision of the memory cycle into two suboperations - the addressing phase and the data phase - allows pipelining of the bus interface and
the memory system. If there are separate handshake signals for address and data, several
transfers can be active at different phases of execution.
[Figure: word-wide registered memory — address and data registers (ADR REG, DATA REGin/REGout) decouple the CPU bus (32-bit address, 64-bit data) from the memory; a page comparator (PAGE COMP) detects page hits so that consecutive accesses can stay in page; the memory control logic drives RAS#, CAS#, WE#, OE#, the address multiplexer and the handshake signals NA*/DACK*; tCYC = 20 ns. Timing: addresses 0, 1, 2 are clocked into the address register, RA0 and CA0/CA1 are presented to the DRAM, and data 0/data 1 are clocked through the data registers while the next address phase overlaps the current data phase.]
Vorlesung Rechnerarchitektur
Seite 75
New DRAM Architectures
New generations of special DRAM devices have been designed to support fast block-oriented data transfers and page-mode accesses. Four different types of devices are listed in the
table below from various manufacturers. Only the synchronous DRAM has survived.
Type of dynamic RAM            Enhanced        Cache          Synchronous          Rambus
I/O width, bits                x1, x4          x4             x4, x8, x9           x8, x9
Data rate, single hit, MHz     67              50-100         50-100               500
First-access latency
  cache/bank hit, ns           15-20           10-20          30-40                36
  cache/bank miss, ns          35-45           70-80          60-80                112
Cache-fill bandwidth, MB/s     7314            114            8533 (a)             9143 (a)
Cache/bank size, bits          2048            8192           4096 (a)             8192 (a)
Area penalty, percent (b)      5               7              5-10                 10-20
Output level                   CMOS/TTL        CMOS/TTL       CMOS/TTL, GTL/CTT    600 mV swing, terminated
Access method                  asynchronous,   synchronous,   synchronous,         synchronous,
                               DRAM-like       proprietary    pulsed RAS           proprietary
Access during refresh          Yes             Yes            Undecided            No
Pin count/package              28/SOJ          44/TSOP        44/TSOP              32/VSMP
Density, bits                  4M              4M             16M                  4M

SOJ = small-outline J-lead package; TSOP = thin small-outline package;
VSMP = nonstandard vertically mounted package
(a) Synchronous and Rambus DRAMs store data in sense amplifier latches, not in separate synchronous RAM caches.
(b) Area penalty is relative to the manufacturer's standard die size, so that the figures are not directly comparable.
The SDRAM has been developed over some generations: SDRAM, DDR SDRAM, DDR2
SDRAM to DDR3 SDRAM.
All are incremental improvements of the previous generation, optimizing the data transfer rate and the termination principle of the signaling interface. Complex initialization sequences and data strobe trimming are included.
A complementary DRAM interface was designed by Intel. It has a much smaller, more message-oriented signaling interface, but requires a special buffer chip on every DIMM. Search for more information under the keywords fully buffered DIMM and AMB.
Vorlesung Rechnerarchitektur
Seite 76
Synchronous DRAM
The SDRAM device latches information in and out under the control of the system clock.
The information required for a memory cycle is stored in registers; the SDRAM can perform
the request without leaving the CPU idle at the interface. The device responds after a programmed number of clock cycles by supplying the data. With registers on all input and output signals, the chip allows pipelining of the accesses. This shortens its average access time
and is well suited to the pipelined interfaces of modern high-performance processors like the
i860XP. The interface signals are common CMOS levels, which appear to restrict the data
rate to 100Mbit/s (1bit devices). A JEDEC approval procedure is currently in progress.
Two internal memory banks support interleaving (see Section 4.3.2) and allow the precharge
time of one bank to be hidden in an access of the other bank. In the same way, the refresh
can be directed to the second bank while accessing the first one. The built-in timer for the
refresh period and the refresh controller can hide the refresh inside the chip. The 512 x 4 sense amplifier can hold the data of one page like a cache, and all accesses to the data within
the page are very fast.
[Figure: Functional Block Diagram of a 16 Mbit Synchronous DRAM — clock signals (CLK, CKE#) and control signals (CS#, RAS#, CAS#, WE#, DQM) feed a synchronous control logic; row address buffers/latches and row decoders address two memory arrays of 2048 rows (bank 0 and bank 1), each with a 512 x 4 sense amplifier with I/O gating and latches; a column address buffer/latch with burst counters, a refresh counter and a refresh controller with self-refresh oscillator and timer complete the device; data-in and data-out buffers connect DQ0-DQ3.]
The 16Mbit SDRAM contains 4 banks which are not explicitly marked !!!
Vorlesung Rechnerarchitektur
Seite 77
Burst Mode Memory
On an access to main memory, more data is fetched than the word width of the memory delivers. In such a "burst transfer", several (2^n; typically n = 2 or 3) values are read or written in succession.
Advantage: by announcing such a transfer (burst mode control signal), the remaining data can be fetched ahead of time and in a pipelined fashion, yielding a considerably higher data transfer rate. The memory's page mode is exploited for this. The transfer is frequently described with the syntax (L:B:B:B), e.g. (5:1:1:1), meaning a start latency of 5 clocks followed by further data in every subsequent clock.
There are different definitions of the address sequence within a burst access:
- linear burst: ABCD
- modulo burst:
  - upcounting: ABCD | BCDA | CDAB | DABC
  - interleaved: ABCD | BADC | CDAB | DCBA
Problems arise when the start address of the burst cycle is not aligned, or when the burst crosses a page boundary of the memory (or of the MMU). For this reason, restrictions are often imposed on the address sequence.
[Figure: address layout for burst accesses — the byte address (bits 31..0) is split into MemSelect/MemAdr (m bits, block start address), 2 bits word select and 2 bits byte select, i.e. memory size [bytes] = 2^(m+2+2); a cache line of four 32-bit words A, B, C, D occupies one memory frame (e.g. bytes 128, 132, 136, 140 above memory_frame_base).]
For a cache line fill, the start address should preferably be the word that the processor needs first. This produces burst cycles with misaligned start addresses, for which a modulo burst access is then usually used.
Vorlesung Rechnerarchitektur
Seite 78
Interleaved Memories
The next step is to apply parallel processing to the main memory. This solution has long been
employed in high-performance vector supercomputers such as the CRAY series in the form
of memory interleaving.
Parallel processing by means of interleaving requires partitioning of the memory system into parallel memory banks, each controlled by a local bank controller. A global interleave controller checks and controls the interaction between the CPU and the memory banks.
The number of memory banks defines the order of interleaving (the CRAY-1 memory system, for example, contains 32 banks and is therefore described as 32-way interleaved).
Usually the number of banks for interleaving is a power of 2.
The one-word memory with data and address registers forms the basic hardware structure of
each bank.
Two basic forms of interleaving can be distinguished:
- low-order interleaving
- high-order interleaving
Low-order interleaving assumes that the least significant bits of the address A are used to
distinguish the banks. The selection of a bank B is performed by the modulo function B = A
mod n, where n is the number of banks. The performance gain achieved by a low-order interleaved memory depends on the address pattern applied to the memory and on the number
of banks. A linear sequence of the addresses, selecting one bank only for every nth access,
increases the available bandwidth by n, compared with a word-wide memory. However, if
the access function references the same bank, the bandwidth is equal to that of the word-wide
memory. Depending on the access function, the performance gain lies between these two extremes. The burst-mode access fetching four consecutive (or specially sequenced) data values also fits in well with low-order interleaving. The fetch can be performed as one access
to all banks in parallel, and the sequential data transport from the registers to the external bus
interface is controlled by the data-path controller of the memory system. The memory can
execute a new request in the addressing phase while the data phase is active. This requires
that the microprocessor overlaps or pipelines the address and data phase on the bus.
High-order interleaving uses the most significant bits of the memory address to select the
banks. For this structure, the next memory address must be ‘very far away’ from the previous
one and should have distinct high-order bits, so that the access can be scheduled to different
banks. An address pattern of this sort is highly application-dependent, and this makes the utilization of high-order interleaving rather difficult.
NUMA architectures can profit from high-order interleaving by placing process context and data in the address space in a way that they are close to the processor operating on this data.
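A sketch of both bank-select functions in C; the bit positions correspond to the 4-bank examples in the figures below, with 8-byte words and a 64 KB address space as assumptions:

#include <stdint.h>

#define J_BITS 3    /* byte select: 8-byte words   */
#define N_BITS 2    /* 2^2 = 4 banks                */
#define K_BITS 16   /* 64 KB total address space    */

/* low-order interleaving: B = A mod n, taken from the bits just above the byte select */
static unsigned bank_low_order(uint32_t addr)
{
    return (addr >> J_BITS) & ((1u << N_BITS) - 1);              /* A4..A3  */
}

/* high-order interleaving: the most significant address bits select the bank */
static unsigned bank_high_order(uint32_t addr)
{
    return (addr >> (K_BITS - N_BITS)) & ((1u << N_BITS) - 1);   /* A15..A14 */
}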
Vorlesung Rechnerarchitektur
Seite 79
Low-Order Interleaved Memories
[Figure: low-order interleaved memory — the CPU address (k bits) is split into index address (k-n-j bits), bank select (n bits) and byte select (j bits); consecutive words map to banks 0, 1, 2, ..., 2^n-1 in turn (example with 4 banks and 8-byte words: bank 0 holds addresses 0, 32, 64, ..., bank 1 holds 8, 40, 72, ..., bank 2 holds 16, 48, 80, ..., bank 3 holds 24, 56, 88, ... up to 65528); the interleave controller decodes A4, A3 into SEL0..SEL3 and collects the ACK* signals. Hardware structure of the addressing path (no data path).]
Vorlesung Rechnerarchitektur
Seite 80
High-Order Interleaved Memories
[Figure: high-order interleaved memory — the CPU address is split into bank select (the n most significant bits), index address (k-n-j bits) and byte select (j bits); each bank covers a contiguous part of the address space (example with 4 banks: bank 0 starts at 0, bank 1 at 16384, bank 2 at 32768, bank 3 at 49152); the interleave controller decodes A15, A14 into SEL0..SEL3. Hardware structure of the addressing path (no data path).]
Vorlesung Rechnerarchitektur
Seite 81
Four-Bank Low-Order Interleaved Memories
This figure presents the address and data paths of a four-way low-order interleaved memory.
[Figure: four-bank low-order interleaved memory — the CPU (32-bit address, 64-bit data) is connected through an address phase control and a data phase control to four memory banks (bank 0..3), each with its own memory bank control and a 64-bit data path.]
Modern microprocessors include memory controllers with two or more banks, organized in low- or high-order interleaving schemes.
Vorlesung Rechnerarchitektur
Seite 82
DRAMs
Speed Trends: DRAMs & Processors
The constant increase in the computing power of modern processors leads to an ever-widening gap between the processing speed of the processor and the access speed of the main memory. The memory capacity of DRAMs grows by a factor of 4 from generation to generation; because of the 4-fold number of memory cells, this yields only small speed improvements despite the shrinking VLSI structures.
[Figure: speed trends of DRAMs and processors, log MHz versus year (1990-2002) — internal CPU clock (e.g. DEC Alpha, with marks for the first 1 GHz research CPU and 2 GHz) compared with the external bus clock and the SDRAM clock, and with the DRAM access rates 1/tacc of about 25 ns in page mode and about 60 ns for random access.]
Vorlesung Rechnerarchitektur
Seite 83
EDO-DRAM
DRAM example: EDO DRAM
Extended Data-Out (EDO) DRAMs feature a fast page mode access cycle. The EDO DRAM separates the three-state control of the data bus from the column address strobe, so strobing in the address sequence is independent of the data output enable.
This page access allows faster data operation within a row-address-defined page boundary. The PAGE cycle is always initiated with a row address strobed in by the falling edge of RAS_, followed by a column address strobed in by the falling edge of CAS_. CAS_ may be toggled while holding RAS_ low and strobing in different column addresses, thus executing faster memory cycles.
[Figure: functional block diagram of an EDO DRAM (2 x 2048 x 1024 x 4 memory array, 24/26-pin SOJ) — row address latch and row decoder, column address buffer/latch and column decoder, sense amplifier with I/O gating and latch (1024 x 4), refresh counter and refresh controller, and separate data-in and data-out buffers for DQ0-DQ3; in contrast to fast page mode there is no CAS control of the data output.]
EDO Operation: Improvement in page mode cycle time was achieved by converting the normal sequential fast page mode operation into a two-stage pipeline. A page address is presented to the EDO-Dram, and the data at that selected address is amplified and latched at the
data output drivers. While the output buffers are driving the data off-chip, the address decode
and data path circuitry is reset and able to initiate access to the next page address.
Refer to Datasheet for EDO DRAM at: ext_infos
Vorlesung Rechnerarchitektur
Seite 84
EDO-DRAM
Timing EDO-DRAM
The primary advantage of EDO is the availability of data-out even after CAS_ goes high. EDO allows the CAS precharge time to occur without the output data going invalid. This elimination of CAS output control allows "pipelining" of reads. [Micron, Data Sheet MT4LC4ME8(L), 1995]
The term pipelining is misused in this context; it is simply an overlapping of addressing and data transfer.
[Timing diagram: hyperpage mode (EDO) cycle — RAS* stays low while CAS* is toggled with column addresses CA1..CA4; data-out 1..4 remain valid across the CAS* precharge (tCOH) and are controlled only by OE* (tOEA, tOEZ); relevant parameters: tRAC, tCAC, tAA, tPCYC.]
In contrast to the fast page mode DRAM, CAS_ does not have to be low to drive the data output buffer. The next CA can therefore be latched into the CA latch earlier, independently of the data output phase. Driving the data is controlled exclusively by the signal OE_ (tOEA and tOEZ). The outputs only switch to the new data of the next CA once the new CA takes effect inside the DRAM (after tCOH).
Vorlesung Rechnerarchitektur
Seite 85
Memory Interface Signaling
High speed interfaces
[Figure: single-ended transmission — a driver and a receiver with reference voltage UREF share one signal line; the level ranges VOH, VIH, VREF, VIL, VOL between VDD(Q) and VSS define the switching region and the noise margins high/low.]

[Table: typical levels (VOH, VIH, VREF, VIL, VOL, VDD(Q), noise margins NM(H)/NM(L) and output swing) for the single-ended signaling standards LVTTL, CTT, BTL, GTL, HSTL and SSTL; e.g. LVTTL: VOH = 2.4 V, VIH = 2.0 V, VREF = 1.5 V, VIL = 0.8 V, VOL = 0.4 V at VDD = 3.3 V, swing 2 V, noise margins 0.4 V; SSTL references all levels to Vtt = 1.5 V (VOH = Vtt + 0.4, VIH = Vtt + 0.2, VIL = Vtt - 0.2, VOL = Vtt - 0.4), swing 0.8 V.]

[Figure: differential transmission — the driver drives a positive and an inverted signal line; the receiver evaluates the differential voltage UDIFF around a common-mode offset UOFFSET.]
Vorlesung Rechnerarchitektur
Seite 86
Memory Management
Goals of memory management
• protection => access rights
• memory management for processes
• extension of the limited physical main memory
A very simple way to organize the memory is to split it into a read-only memory (ROM) for the program and a read/write memory for the data.
- R/W memory
- ROM - cannot be overwritten
Such a fixed partitioning only makes sense for single-tasking applications, as used for example in embedded systems with microcontrollers.
[Figure: the address space is split into a ROM region (instructions/data, read-only) and a RAM region (read/write), both accessed by the CPU.]
Literature:
Andrew S. Tanenbaum,
"Structured Computer Organization", 4th edition,
Prentice-Hall, p. 403ff.
For multi-processing or multi-tasking systems, the above way of organizing the memory is not sufficient. Many processes exist that must be processed quasi-simultaneously.
Problems:
- moving objects in memory - relocation
- protecting objects in memory - protection
Solution:
Introduction of one logical address space per process and a mapping of the logical address spaces onto the physical address space (main memory), the address translation.
=> many processes compete for the physical memory!
=> to be executed, all segments (.text / .data / .bss) must be loaded into memory
Segmentation is the division of the logical address space into contiguous regions of different sizes for storing e.g. instructions (.text), data (.data) or stack data (.bss), etc.
Each segment can now be protected with access rights.
Vorlesung Rechnerarchitektur
Seite 87
Memory Management
Memory Segmentation
[Figure: memory segmentation — at run time the MMU translates the virtual address of the running process (pid) by adding the base of the segment and comparing it against the upper bound; access rights (R/W, I/D, S/U) are checked as well; violations raise an address access trap or a protection violation trap; the example shows the .text and .data segments of two processes (pid = 1, pid = 2) placed at different physical addresses in main memory.]
Memory fragmentation
To be able to execute a process, all segments of the process must be in main memory.
- The space is already occupied by other processes
=> as many segments of other (dormant) processes must be swapped out from memory to disk as the new process needs space. Constraint imposed by segmentation: it must be a contiguous memory region!
• Swapping: moving whole process segments out of and into memory (cache flush!)
- There is still space available, but the gaps are not large enough for the required new segments. Repeated swapping out and in leads to a fragmentation of the memory - 'memory fragmentation'
=> segments must be moved, i.e. copied within memory, to close the gaps (cache flush!).
Common architectures use n = 32 or 64 address bits. The size of the virtual address space is therefore 2^n bytes (2^32 = 4 GByte; 2^64 = 16 ExaByte = 16.77 million TBytes).
- The main memory is simply not large enough for the new process
=> the main memory is extended by the concept of virtual memory, which incorporates the secondary storage (disk).
Vorlesung Rechnerarchitektur
Seite 88
Memory Management
Literatur:
Hwang, Kai; Advanced Computer Architecture, Mc Graw Hill, 1993.
Stone, Harold S.; High Performance Computer Architecture, Addison Wesley, 1993
Giloi: Rechnerarchitektur
Tannenbaum, Andrew S.: Modern Operating Systems, Prentice-Hall, 1992.
Intel: i860XP Microprocessor Programmers Reference Manual
Motorola: PowerPC 601 RISC Microprocessor User’s Manual
Fundamentals
The Memory Management Unit (MMU) is a hardware unit that supports the operating software of a computer in managing the main memory for the various processes to be executed.
- address space protection
- virtual memory - demand paging
- segmentation
Each word/byte in the physical memory (PM) is identified by a unique physical address.
All memory words in the main memory form the physical address space (PAS).
All program-generated (i.e. generated by a software process) addresses are called virtual addresses (VA) and form the virtual address space (VAS).
When address translation is enabled, the MMU maps instruction and data virtual addresses into physical addresses before referencing memory. Address translation maps a set of virtual addresses V uniquely to a set of physical addresses M.
Virtual memory systems attempt to make optimum use of main memory while using an auxiliary memory (disk) for backup. VM tries to keep active items in the main memory and, as items become inactive, migrates them back to the lower-speed disk. If the management algorithms are successful, the performance will tend to be close to that of the higher-speed main memory, and the cost of the system will tend to be close to the cost per bit of the lower-speed memory (optimization of the memory hierarchy between main memory and disk).
Most virtual memory systems use a technique called paging. Here, the physical address space is divided up into equally sized units called page frames. A page is a collection of data that occupies a page frame when that data is present in memory. The pages and the page frames are always of the same fixed size (e.g. 4 KBytes). There might be larger (e.g. 8 KB) or smaller page sizes defined in computer systems, depending on the compromise between access control and the number of translations.
Vorlesung Rechnerarchitektur
Seite 89
Memory Management
Virtual Memory / Paging
The logical and the physical address space are divided into pages of fixed size, usually 4 or 8 KByte. Logical pages are mapped onto physical page frames by a page table; in general, the logical address space is much larger than the physically available memory. Only part of the pages are actually in main memory; all others are swapped out to secondary storage (disk).
- programs can be larger than the main memory
- programs can be loaded at arbitrary physical addresses, independent of the partitioning of physical memory
- simple management in hardware due to the fixed page size
- access rights (read/write, user/supervisor) can be defined per page and checked on every access
- virtual memory gives the illusion of a large, inexpensive and sufficiently fast main memory (similar to a cache)
For every entry, the page table records whether the page is present in main memory (P bit / present). Swapped-out pages must be loaded into main memory when they are referenced; if necessary, another page is evicted. Modified pages (M bit / modify) must then be written back to the secondary storage. A further bit is introduced that is set on every access to the page (R bit / referenced).
With paging, the mapping of the virtual address space onto the physical one is done by defining association pairs (VA-PA).
The n low-order bits of the address (n = 12 for 4 KB or 13 for 8 KB pages) are passed through unchanged from the VA to the PA and are not translated.
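A minimal sketch of this one-level mapping in C (4 KB pages; the entry layout and field names are illustrative assumptions, not a particular processor's format):

#include <stdint.h>

#define PAGE_BITS 12u                 /* 4 KB pages: the offset is passed through untranslated */

typedef struct {
    uint32_t frame;                   /* physical page frame number */
    unsigned present    : 1;          /* P bit */
    unsigned referenced : 1;          /* R bit */
    unsigned modified   : 1;          /* M bit */
} pte_t;

/* translate a virtual address with a one-level page table; returns 0 on a page fault */
static int translate(pte_t *pt, uint32_t va, uint32_t *pa)
{
    uint32_t vpn    = va >> PAGE_BITS;
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);
    if (!pt[vpn].present)
        return 0;                     /* page fault: the OS must load the page from disk */
    pt[vpn].referenced = 1;           /* R bit is set on every access */
    *pa = (pt[vpn].frame << PAGE_BITS) | offset;
    return 1;
}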
Vorlesung Rechnerarchitektur
Virtual Memory / Paging
Replacement strategies:
- not recently used - NRU
  using the bits R and M, four classes of pages are formed:
  0: not referenced, not modified
  1: not referenced, modified (empty class in single-processor systems!)
  2: referenced, not modified
  3: referenced, modified
  an arbitrary page from the lowest non-empty class is removed
- FIFO
  the oldest page is removed (possibly the most heavily used one)
- second chance / clock
  like FIFO, but if the oldest entry has been used, its R bit is cleared first and the next page is examined; only when all pages have been tested unsuccessfully is the oldest entry actually removed
- least recently used - LRU
  the page that has not been used for the longest time is removed; requires an aging mechanism
Seite 90
Vorlesung Rechnerarchitektur
Seite 91
Memory Management
Methods of address translation
Address Translation
- Page Address Translation (PAT)
  - direct mapping: one level, multi level
  - inverted mapping: associative PT, inverted PT
- Segment Address Translation: as for PAT; one-level mapping; base-bound checking
- Block Address Translation (BAT)
PAT: a mapping from VA to PA is performed, assuming a fixed page size. The offset within a page is therefore a fixed number of bits (the least significant bits, LSB) of the VA that are copied directly into the PA.
The offset is thus not changed!
The mapping of the higher-order address bits follows the mapping methods listed above.
BAT: provides a way to map ranges of VA larger than a single page into a contiguous area of physical memory (typically without paging)
• Used for memory mapped display buffers or large arrays of MMU data.
• base-bound mapping scheme
• block sizes 128 KB (217) to 256 MB (228)
• fully associative BAT registers on chip
small number of BAT entries (4 ... 8)
+ BAT entries have priority over PATs
Vorlesung Rechnerarchitektur
Seite 92
Memory Management
Direct Page Table
For every used VA frame there exists one PT entry which maps the VA to a PA. Many more entries are required than physical memory is available.
[Figure: direct page table — the VA page table index selects one entry of the direct PT, which delivers the page frame number (PA) of the corresponding virtual page in the VAS; used entries point into the PAS.]
Inverted Page Table
For every PA there exists one PT entry. Indexing with the VA does not work any more!
To find the right mapping entry, a search with the VA as key must be performed, either by linear search or by associative match. Inverted page tables are used for large virtual address spaces (VAS), e.g. 2^64, and to keep the PT small.
[Figure: inverted page table — the high part of the VA is used as a search key (associative match or linear search) over the inverted PT, which has one entry per page frame; the matching entry delivers the page frame (PA) for the referenced virtual page.]
Vorlesung Rechnerarchitektur
Seite 93
Memory Management
One-Level Paging
[Figure: one-level paging — the dir_base register points to a page table with 2^20 entries; the upper 20 bits of the VA (page table index) select one entry, which delivers the upper 20 bits of the PA plus control bits; the 12-bit offset is passed through unchanged.]
The virtual address VA is translated into a physical address PA in a single step. A higher-order part of the VA is used as an index into the page table PT. Under each index, the PT holds exactly one entry with the higher-order part of the PA. The low-order part of the address, e.g. 12 bits, is passed through directly to the physical address as the page offset. This one-level mapping can only be used for small VA-high parts (table size 4 MB with 32-bit entries).
Vorlesung Rechnerarchitektur
Seite 94
Memory Management
Multi-Level Paging
[Figure: two-level paging on the i486 CPU — the 32-bit linear address is split into directory (bits 31..22), table (bits 21..12) and offset (bits 11..0); control register CR3 (dir_base) points to the page directory, whose entry (10-bit index) points to a page table, whose entry (10-bit index) delivers the 20-bit base of the page frame in physical memory; the 12-bit offset selects the byte within the page.]
With 32-bit processors and a page size of e.g. 4 KByte, the page table becomes very large, e.g. 4 MByte with 32-bit page table entries. Since usually not all entries of a table are actually used, a multi-level translation is introduced. For example, the topmost address bits reference a table, while the middle bits select the entry in that table.
- entries in the page directory can be marked as unused
- fewer tables are then needed in the second level
- the second-level tables can themselves be swapped out
Because of their size, page tables can only be kept in main memory. Page tables are in principle cachable, but because they are used relatively rarely (compared with normal data), their entries are quickly evicted from the general cache.
[Figure: a fully associative TLB (e.g. 32 entries) sits beside the page directory / page table walk and delivers the physical address directly for recently translated linear addresses.]
To speed up the address translation, particularly with multi-level tables, a cache is used. This Translation Lookaside Buffer (TLB), also called Address Translation Cache (ATC), holds the most recent address translations. It is usually fully associative and holds e.g. 64 entries. More recently, a set-associative second-level TLB is added in front of the table walk as well.
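A sketch of the two-level translation in C (10+10+12 address split as on the i486; the entry encoding with the frame address in the upper 20 bits, and the assumption that the tables are reachable at their physical addresses, are simplifications for illustration):

#include <stdint.h>

#define PRESENT 0x1u

/* two-level page walk: dir_base points to the page directory (1024 entries) */
static int walk(const uint32_t *dir_base, uint32_t lin, uint32_t *phys)
{
    uint32_t dir_idx = (lin >> 22) & 0x3FF;
    uint32_t tbl_idx = (lin >> 12) & 0x3FF;
    uint32_t offset  =  lin        & 0xFFF;

    uint32_t pde = dir_base[dir_idx];
    if (!(pde & PRESENT)) return 0;                 /* directory entry not present        */

    /* sketch: the page table is assumed to be accessible at its physical address */
    const uint32_t *table = (const uint32_t *)(uintptr_t)(pde & 0xFFFFF000u);
    uint32_t pte = table[tbl_idx];
    if (!(pte & PRESENT)) return 0;                 /* page not present -> page fault     */

    *phys = (pte & 0xFFFFF000u) | offset;
    return 1;
}

A TLB simply caches the final (linear page, frame) pairs so that this two-memory-access walk is skipped on a hit.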
Vorlesung Rechnerarchitektur
Seite 95
Memory Management
i860XP
A virtual address refers indirectly to a physical address by specifying a page table through a
directory page, a page within that table, and an offset within that page.
Format of a Virtual Address
[Figure: format of a virtual address on the i860XP — DIR (bits 31..22), PAGE (bits 21..12) and OFFSET (bits 11..0, including a 2-bit byte select); the dirbase register points to the page directory (1024 entries), a DIR entry selects a page table (1024 entries), and the page table entry plus the offset form the physical address of the word within a 4 KByte page frame; two-level page address translation.]
A page table is simply an array of 32-bit page specifiers. A page table is itself a page (1024 entries of 4 bytes = 4 KBytes). Two levels of tables are used to address a page frame in main memory. Page tables can occupy a significant part of memory space (2^10 x 2^10 words = 2^22 bytes; 4 MBytes).
The physical address of the current page directory is stored in the DTB (Directory table base)
field of the dirbase register.
Vorlesung Rechnerarchitektur
Seite 96
Memory Management
i860XP
A page table entry contains the page frame address and a number of management
bits.
The present bit can be used to implement demand paging. If P = 0, then that page is not present in main memory. An access to this page generates a trap to the operating system, which
has to fetch the page from disk, set the P bit to 1 and restart the instruction.
Format of a Page Table Entry (i860XP)
[Figure: bits 31..12 hold the page frame address; bits 11..9 are available for system programmer use (AVAIL); the remaining low-order bits are management bits: Dirty (D), Accessed (A), Cache Disable (CD), Write-Through (WT), User (U), Writable (W), Present (P).]
Definition: Page
The virtual address space is divided up into equal-sized units called pages. [Tanenbaum]
A page (in the context of the MMU) results from dividing the virtual and the physical memory into parts of equal size.

Definition: Page frame
The region of main memory into which exactly one page fits.
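A sketch in C of decoding such a page table entry with masks; the bit positions follow the usual x86/i860 layout shown above, and the macro names are my own:

#include <stdint.h>

#define PTE_P   (1u << 0)   /* Present        */
#define PTE_W   (1u << 1)   /* Writable       */
#define PTE_U   (1u << 2)   /* User           */
#define PTE_WT  (1u << 3)   /* Write-through  */
#define PTE_CD  (1u << 4)   /* Cache disable  */
#define PTE_A   (1u << 5)   /* Accessed       */
#define PTE_D   (1u << 6)   /* Dirty          */

static inline uint32_t pte_frame_addr(uint32_t pte) { return pte & 0xFFFFF000u; }
static inline int      pte_present(uint32_t pte)    { return (pte & PTE_P) != 0; }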
Vorlesung Rechnerarchitektur
Seite 97
Memory Management
Hashing
Literatur: Sedgewick, Robert, Algorithmen, Addison-Wesley, 1991, pp.273-287
Hashing is a method for finding records in tables
• a compromise between time and space requirements
Hashing is done in two steps:
1. Computing the hash function
   It transforms the search key (here the VA) into a table index. The index is considerably shorter than the search key, so the required table is smaller.
   Ideally, the keys should be distributed as evenly as possible over the indices.
   - problem: the mapping is ambiguous
2. Resolving the collisions that result from this ambiguity,
   a) by subsequent linear search
   b) by re-hashing
[Figure: hashing of the VA-high part — a simple hash function (e.g. folding the n bits of VA-high down to n/2 index bits) produces a hash index into the inverted PT; on a collision, a linear search over the following entries compares the stored VA with the search key; the matching entry delivers the PA and control bits.]
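A sketch of the lookup in a hashed inverted page table in C; the hash function (XOR folding of VA-high) and the table size are example assumptions, and collisions are resolved by linear search as in variant a):

#include <stdint.h>

#define PT_BITS    10u                      /* 2^10 entries: one per page frame */
#define PT_ENTRIES (1u << PT_BITS)

typedef struct { uint32_t va_high; uint32_t frame; int valid; } ipte_t;
static ipte_t ipt[PT_ENTRIES];

/* simple hash: fold the VA-high part onto the index width by XOR */
static uint32_t hash(uint32_t va_high)
{
    return (va_high ^ (va_high >> PT_BITS)) & (PT_ENTRIES - 1);
}

/* returns the page frame for va_high, or -1 if no mapping exists */
static int32_t ipt_lookup(uint32_t va_high)
{
    uint32_t idx = hash(va_high);
    for (uint32_t probe = 0; probe < PT_ENTRIES; probe++) {        /* linear search */
        uint32_t i = (idx + probe) & (PT_ENTRIES - 1);
        if (!ipt[i].valid) return -1;                  /* empty slot: key not present */
        if (ipt[i].va_high == va_high) return (int32_t)ipt[i].frame;
    }
    return -1;
}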
Vorlesung Rechnerarchitektur
Seite 98
Memory Management
The LRU method
As an example of an algorithm that implements the LRU method, consider a scheme that was used in some models of the IBM/370 family.
Let CAM (Content Addressable Memory) be an associative memory with k cells, e.g. the cache. In addition, a (k x k) matrix AM of Boolean storage elements is provided. Each of the entries of the CAM is assigned exactly one row and one column of this matrix.
When an entry of the CAM is accessed, first the one-vector e(k) is written into the corresponding row of the Boolean matrix and then the zero-vector n(k) into the corresponding column (e(k) is a vector of k ones; n(k) is a vector of k zeros). This is repeated on every new access to the CAM. Once all k cells have been accessed, in any order, the row that belongs to the cell of the CAM that has not been accessed for the longest time is the only one containing the zero-vector n(k).
[Figure: the (k x k) aging matrix AM next to the CAM with k cache entries — accessing entry i writes ones into row i and zeros into column i; an OR over each row and an encoder identify the row containing only zeros, i.e. the LRU entry.]
Let i be the index of the first CAM cell accessed, and let AM be the (k x k) aging matrix. After accessing cell i, the column AM_i = n(k), while all elements of the row AM^i are one, except for the element (AM^i)_i, which is zero (since first e(k) is written into the row and then n(k) into the column). Here we denote the rows of a storage matrix by a superscript index and the columns by a subscript index.
On each reference to another cell of the CAM, writing e(k) into the corresponding row and subsequently n(k) into the corresponding column of AM replaces one of the ones in AM^i by zero and fills another row with ones (except for the element on the main diagonal, which stays zero). Thus, one by one, all elements of AM^i are replaced by zeros if cell i is not accessed again in the meantime. But since after k steps only one of the k cells can have all its ones overwritten with zeros, all other rows of AM must still contain at least one one. The row of AM that contains only zeros therefore indicates the LRU cell (entry) of the associative memory. [Giloi, Rechnerarchitektur, Springer, 1993, pp. 130]
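The aging-matrix scheme described above, as a short sketch in C for k = 8 entries (packing one row of Boolean flags into one byte is my own choice):

#include <stdint.h>

#define K 8
static uint8_t am[K];                  /* am[row] holds the k Boolean elements of one row as bits */

/* access to CAM entry i: write ones into row i, zeros into column i */
static void am_access(unsigned i)
{
    am[i] = 0xFF;                      /* e(k) into row i    */
    for (unsigned r = 0; r < K; r++)   /* n(k) into column i */
        am[r] &= (uint8_t)~(1u << i);
}

/* the LRU entry is the row that contains only zeros */
static unsigned am_lru(void)
{
    for (unsigned r = 0; r < K; r++)
        if (am[r] == 0) return r;
    return 0;                          /* an all-zero row always exists once all entries were used */
}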
Vorlesung Rechnerarchitektur
Seite 99
Memory Management
The LRU method: example
A CAM with 8 entries is assumed. This can be a fully associative cache with 8 entries or one set of a cache with 8 ways. The aging matrix is initialized with '0'. As an example of how the values in AM change, consider the following access sequence.
Access sequence: 0, 1, 3, 4, 7, 6, 2, 5, 3, 2, 0, 4
[Figure (animation): the 8 x 8 aging matrix after each access of the sequence 0, 1, 3, 4, 7, 6, 2, 5, 3, 2, 0, 4 — each access fills the accessed entry's row with ones and its column with zeros; entries whose row has become all zeros are marked 'x' as replacement candidates.]
At the start, all entries are marked as equally old. The first write into row and column 0 ages entry 0. After that, only the entries marked with x may be used for replacement. Once all entries have been referenced once, each further access step leaves exactly one entry marked as the oldest. [Tanenbaum, Modern Operating Systems, Prentice Hall, 1992, pp. 111-112]
Vorlesung Rechnerarchitektur
Seite 100
Main Memory
Some definitions ...
.... mode:

page mode:
Operating mode in which all bits within the currently open page (RA) can be accessed considerably faster (12-25 ns tCAC). Each access requires a new CA and CAS.

[Figure: a 1 Mbit x 1 DRAM — a 10-bit RA (row decoder) and a 10-bit CA (column decoder) select one bit of a 1024-bit page; with a 32-bit word the address splits into Mem base (bits 31..22), RA (bits 21..12), CA and byte select => 4 MByte.]

burst mode:
Fetches 2^n values from the memory without a new CA. CAS must still be toggled to increment the internal counter.

memory model:
latency in clocks, written as L - B - B - B, e.g. 6 - 2 - 2 - 2 (the start latency is governed by tRAC, the following burst beats by tCAC).
Vorlesung Rechnerarchitektur
Seite 101
Main Memory
Some definitions ...
Hardware view:
asynchronous:
Without a relation to a global clock signal.
Signals propagating through a net of gates behave asynchronously because of the delay within each gate level.
synchronous:
All signals and signal changes are referenced to a global clock signal.
All locations are supplied by this clock signal, it is assumed that the edge of the clock signal defines the
same time at all locations
Caution : simplified view
- clock jitter
- clock skew
- clock network delay
bandwidth:
Term derived from electrical/RF engineering; means the available frequency range Δf = f2 - f1 of a transmission channel; unit [MHz].
Used in computer engineering as: the possible amount of data that can be carried through an interface, e.g. a bus or a memory; unit [MBytes/s].

transfer rate:
Number of data items that are moved in one second; unit [MBytes/s].
Vorlesung Rechnerarchitektur
Seite 102
Development Trend- RISC versus CISC
[Figure: development trend of processor families — the Intel x86 line (8086, '186, '286, '386, '486, Pentium, Pentium II, Pentium III, Pentium 4 as internally RISC + superscalar, plus the numeric coprocessor, i860XL/i860XP VLIW, and Itanium/Itanium 2 VLIW with predicated instructions, Intel + HP), the Motorola 68k and 88k lines (MC 68000, '010, '020, '030, '040, '060, 88100, 88110), Digital Equipment (PDP11, VAX, Alpha 21064 at 150 MHz, 21164 at 300 MHz, 21264 at 800 MHz, 21364 with memory control, system I/O and switch), and influences from CDC 6600, Cray/Alliant and MIPS; the entries are classified as CISC, RISC + superscalar, superpipelined or VLIW.]
Vorlesung Rechnerarchitektur
Seite 103
Genealogy of RISC Processors
[Figure: genealogy of RISC processors — mainframe influences (CDC 6600, Cray-1) and research projects (IBM 801, microcode compaction and the Bulldog compiler, the ELI VLIW project, RISC I/II and SOAR at UC Berkeley, MIPS and MIPS-X at Stanford) lead to the product lines of IBM (PC/RT, RS6000, PowerPC 601/604/620 with Motorola), Hewlett-Packard (PA Spectrum, PA 7100, 8500), Multiflow (TRACE VLIW), AMD (AM 29000, K5, K6, K7 Athlon, K8 Hammer: Athlon 64/Opteron with 64-bit extension), mips Corp./Silicon Graphics (R2000, R3000, R4000, R4400, R10000, R12000), SUN & TI (SPARC, SuperSPARC, UltraSPARC with 'VIS' multimedia ops, UltraSPARC III) and Motorola (MC88110); annotations mark superscalar, superpipelined and VLIW variants, vector/graphics extensions (AltiVec) and clock rates up to about 1 GHz.]
Vorlesung Rechnerarchitektur
Seite 104
CISC - Prozessoren
Computer Performance Equation
P [MIPS] = (fc [MHz] x Ci) / (Ni x Nm)

fc = clock frequency
Ci = instruction count per clock cycle (< 1 for scalar RISC, > 1 for superscalar and VLIW machines)
Ni = average number of clock cycles per (micro)instruction
Nm = memory access factor
CISC - Complex Instruction Set Computer
Ci = 1      one instruction per clock tick
            = one operation executed by one execution unit
Ni = 5-7    5-7 clock ticks required for the execution of one instruction by a number
            of microinstructions (Microprogram A.)
Nm = 2-5    dependent on the memory system and on the memory hierarchy
Goal: closing the gap between high-level languages and processor instructions (semantic gap)
- manifold addressing modes
- microprogrammed complex instructions require a variable instruction length
  => variable execution time
- orthogonal instruction set: every operation can be combined with every addressing mode
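As a hedged numerical illustration of the equation (the 50 MHz clock is assumed, the other values are taken from the ranges above): a CISC processor with Ci = 1, Ni = 6 and Nm = 3 reaches P = 50 × 1 / (6 × 3) ≈ 2.8 MIPS, i.e. far less than one executed instruction per clock cycle.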
Vorlesung Rechnerarchitektur
Seite 105
CISC Processors
CISC Architectures
Disadvantages
- pipelining is difficult
- variable instruction format -> instruction fetch is complex
- instruction execution times differ
- memory hierarchy
- memory access is contained in the operation
- sequential instruction processing
- no prefetch
- compilation -> the instruction set is not fully exploited!
Advantages
- compact code
- smaller instruction cache
- lower instruction-fetch transfer rate, about 1/2 the code size of RISC (fixed 32-bit format)
- assembly language -> closer to the high-level language
Vorlesung Rechnerarchitektur
Seite 106
CISC Processors
MC 68020
typical CISC member
16-bit instruction format with variable length of n times 16-bit units, 1 ≤ n ≤ 6
8 × data registers, e.g.: addition of data values, ADD
8 × address registers, e.g.: address computation, ADDA
Instruction set extendable by coprocessor instruction encoding Fxxx
ADD instruction format (first 16-bit word): bits 15-12 opcode (ADD), bits 11-9 data register Dn, bits 8-6 op-mode, bits 5-0 effective address (3-bit addressing mode, 3-bit register).
Operation mode selects:
- operand size: byte, word (16 bit), long (32 bit), double (64 bit)
- direction: <ea> + <Dn> -> <Dn>  or  <Dn> + <ea> -> <ea>     (ea = effective address)
Effective Address Encoding Summary
- Data Register Direct
- Address Register Direct
- Address Register Indirect
- Address Register Indirect with Postincrement
- Address Register Indirect with Predecrement
- Address Register Indirect with Displacement
- Address Register and Memory Indirect with Index
- Absolute Short (16 bit)
- Absolute Long (32 bit)
- PC Indirect with Displacement
- PC and Memory Indirect with Index
- .....
⇒ 8 registers
Vorlesung Rechnerarchitektur
Seite 107
CISC Processors
MC 68020
Example of an addressing mode
Address register indirect with index (base displacement)
EA = (An) + (Xn) + d8
[Diagram: the 8-bit displacement d8 is sign-extended to 32 bits; the 32-bit index register Xn (sign-extended value, multiplied by the scale factor) and the 32-bit memory address in An are added to it; the resulting 32-bit memory address selects the operand.]
Example of a 'complex instruction'
MOVEM instruction format: the first 16-bit word holds the MOVEM opcode, the dr bit (direction), the SZ bit (word/long) and the effective-address field (mode, register); the second word is a 16-bit register mask list A7 A6 A5 A4 A3 A2 A1 A0 D7 D6 D5 D4 D3 D2 D1 D0.
MOVEM is an instruction for saving the register contents:
- uses the predecrement addressing mode for save (e.g. push onto the stack)
- uses the postincrement addressing mode for restore
Vorlesung Rechnerarchitektur
Seite 108
CISC Processors
DBcc instructions
(see the DBcc instruction in the CPU32 User Manual)
Don’t branch on condition !!!
REPEAT
(body of loop)
UNTIL (condition is true)
Assembler syntax:
DBcc Dn, <label>
The DBcc instruction can cause a loop to be terminated when either the specified condition
CC is true or when the count held in Dn reaches -1. Each time the instruction is executed,
the value in Dn is decremented by 1. This instruction is a looping primitive with three parameters: a condition, a counter (data register), and a displacement. The instruction first tests the
condition to determine whether the termination condition for the loop has been met, and if so, no
operation is performed. If the termination condition is not true, the low-order 16 bits of the
counter data register are decremented by one. If the result is -1, the counter is exhausted and
execution continues with the next instruction. If the result is not equal to -1, execution continues at the location indicated by the current value of the PC plus the sign-extended 16-bit
displacement. The value in the PC is the current instruction location plus two.
IF (CC == true)
    THEN PC++
    ELSE {
        Dn--
        IF (Dn == -1)
            THEN PC++
            ELSE PC <- PC + d
    }
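A minimal C sketch of these loop semantics (everything here is illustrative: the loop body, the condition and the initial count are made up; a real compiler emits the generate-CC/DBcc pair itself):

#include <stdio.h>

/* REPEAT ... UNTIL loop as realized by a "generate CC" / DBcc pair.
   cc models the condition code of the preceding instruction,
   dn models the 16-bit loop counter held in a data register.             */
int main(void)
{
    short dn = 99;              /* counter Dn: at most 100 passes          */
    int sum = 0, cc;
    do {
        sum += 1;               /* body of the loop                        */
        cc = (sum >= 50);       /* instruction generating the condition CC */
        if (cc)
            break;              /* CC true -> fall through (PC++)          */
        dn--;                   /* DBcc: decrement Dn ...                  */
    } while (dn != -1);         /* ... branch back (PC + d) unless Dn == -1 */
    printf("sum=%d dn=%d\n", sum, dn);
    return 0;
}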
[Flow diagram: the loop consists of a pair of instructions - a SUB (generation of the condition code, e.g. Dn - 1) followed by DBcc (decrement and branch on condition code). If CC is false (1) and Dn != -1, DBcc branches back to the label at PC + d; if CC is true (0) or Dn == -1, execution falls through to the successor (PC++).]
Instruction format (first word): bits 15-12 = 0101, bits 11-8 = condition, bits 7-3 = 11001, bits 2-0 = register Dn; the second word holds the 16-bit displacement.
Vorlesung Rechnerarchitektur
Seite 109
Microprogramming
μ-Programming
Definition:
Microprogramming is a technique for the design and implementation of the control unit of a computer using a sequence of control signals to interpret fixed and dynamically changeable data-processing functions. These control signals, which are organized on a word basis (microinstructions) and are held in a fixed or dynamically changeable memory (microprogram memory, control memory, or writable control store), represent the states of those signals that control the information flow between the executing elements (hardware) and provide clocked transitions between these signal states. [Oberschelp / Vossen]
The concept of microprogramming was proposed by M.V. Wilkes in 1951.
Two kinds of microprogramming are distinguished:
• horizontal: the individual bits of a microprogram word (one line of the control store) correspond to particular micro-operations. These operations can then be triggered in parallel.
• vertical: the association between the bits of a microprogram word and the corresponding operations is determined (encoded) by a so-called micro-operation code.
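For illustration only (all control signals and field widths are invented), the difference can be pictured in C as two word layouts - one bit per control signal in the horizontal case, an encoded micro-opcode in the vertical case:

#include <stdint.h>

/* Horizontal microinstruction: every bit drives one control signal directly,
   so several micro-operations can be triggered in the same clock.           */
struct horizontal_uword {
    uint32_t alu_add   : 1;
    uint32_t alu_sub   : 1;
    uint32_t reg_write : 1;
    uint32_t mem_read  : 1;
    uint32_t mem_write : 1;
    uint32_t next_addr : 12;   /* next-address field in the control store    */
};

/* Vertical microinstruction: the micro-operation is encoded and must be
   decoded before it can drive the control signals (narrower, but slower).   */
struct vertical_uword {
    uint16_t uopcode : 4;      /* encoded micro-operation                    */
    uint16_t operand : 12;
};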
μ-programming is the design of a control sequence for the μ-program control unit.
The μ-program control unit can in general be regarded as a finite automaton, which can be described formally as follows:
A = (Q, Σ, Δ, q0, F, δ)
Its general transition function is:
δ: Q × (Σ ∪ {ε}) → Q × (Δ ∪ {ε})
and takes the following special form for the control unit:
δ2: Bs × (Bi ∪ {ε}) → Bs × (Bo ∪ {ε})
Bs state vector
Bi input vector
Bo output vector
The successor state of the current state is computed by the so-called next-state logic, the output by the output logic.
The abstract notion of the finite automaton is an essential tool for understanding the operation and the modelling of hardware structures. It is frequently described as a graph (FSM, 'bubble diagram') to visualize the states and the transitions.
Vorlesung Rechnerarchitektur
Seite 110
Microprogramming
μ-Programming
The following μ-program control unit shows the basic structure. The control vector represents the control function of the execution unit.
[Block diagram: the μ-program control unit contains the μPC, loaded through a MUX (Load) from the next-address field; it addresses the control store (CS ROM or WCS RAM, read/write enables RE/WE). The horizontal μ-program word delivers the state vector Bs, the next-address field and the control vector Bo, which steers the execution unit (ALU +/-, logical unit & or not, MUL *, register file RF). The next-PC logic combines the next-PC value with the condition codes CC (input vector Bi) coming back from the execution unit; an End signal terminates the μ-program.]
Microprogramming offers the following advantages:
• implementation of an extensive instruction set (CISC) with few microinstructions at considerably lower cost
• the microinstructions, and with them the machine instruction set, can be changed by exchanging the contents of the control memory (writable control store, WCS)
• simplified development and maintenance of a computer due to the significantly lower hardware effort
It has the following disadvantage:
• the execution of a microprogrammed operation takes longer, because the microprogram must be read step by step from the control memory. For each instruction, several ROM accesses are usually required.
Vorlesung Rechnerarchitektur
Seite 111
Microprogramming
μ-Programming
If the control unit is to react quickly (in the immediately following clock) to the CC signals, the next-PC logic is not implemented by an adder, as is otherwise usual, but by a simple OR logic that modifies the LSBs of the next address. If a next address with its four LSBs set to "0" is used, the condition codes 0..3 can turn these bits into "1", which realizes a 16-way branch.
As a consequence, the addresses of the μ-instructions are no longer arranged linearly in memory but can be distributed arbitrarily, since every μ-instruction carries its successor address with it in the WCS.
If a next address with, e.g., 0010 as its LSBs is used, CC1 can thereby be masked out, since it can no longer take effect through the OR function.
[Diagram: the next-PC logic ORs the condition codes CC3, CC2, CC1, CC0 into the four LSBs of the next-address field (MSB ... LSB). With the four LSBs at 0000 all condition codes take effect (16-way branch); with LSBs such as 0010, the bit already set to 1 masks the corresponding condition code (here CC1).]
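A small C sketch of this next-address computation (illustrative only; a 12-bit control-store address with the four condition codes OR-ed into its LSBs is assumed):

#include <stdint.h>
#include <stdio.h>

/* OR the condition codes into the LSBs of the next-address field.
   A base address ending in 0000 gives a 16-way branch; a base ending in
   e.g. 0010 masks CC1 out, because OR can never clear a bit that is 1.   */
static uint16_t next_upc(uint16_t next_addr_field,
                         int cc3, int cc2, int cc1, int cc0)
{
    uint16_t cc_bits = (uint16_t)((cc3 << 3) | (cc2 << 2) | (cc1 << 1) | cc0);
    return next_addr_field | cc_bits;
}

int main(void)
{
    printf("%03x\n", next_upc(0x120, 0, 1, 1, 0));  /* base ...0000: branch to 0x126   */
    printf("%03x\n", next_upc(0x122, 0, 0, 1, 0));  /* base ...0010: CC1 has no effect */
    return 0;
}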
μ-programming of such control units at the bit level is of course far too complex to be done by hand. Usually a symbolic notation similar to an assembler is used; this notation is then translated into the control words of the WCS by a micro-assembler.
The explicit visibility of the control word of the execution hardware is the basis of the VLIW architectures.
Testing and verifying μ-programs is very laborious, since they trigger control functions directly on the hardware, and these can usually only be observed by attaching LSAs (logic state analyzers).
Because of the higher speed, μ-programming was replaced by hard-wired control units (PLAs, programmable logic arrays) with the introduction of RISC computers.
Further information in: W. Oberschelp, G. Vossen, Rechneraufbau und Rechnerstrukturen, 7. Aufl., R. Oldenbourg Verlag, 1998
Vorlesung Rechnerarchitektur
Seite 112
RISC
Computer Performance Equation

P[MIPS] = ( fc [MHz] × Ci ) / ( Ni × Nm )

fc : clock frequency
Ci : instruction count per clock cycle (≤ 1 for scalar RISC, > 1 for superscalar and VLIW)
Ni : average number of clock cycles per instruction
Nm : memory access factor
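For comparison with the CISC example, a hedged estimate with the same assumed 50 MHz clock: a pipelined RISC with Ci = 1, Ni ≈ 1.3 (occasional pipeline stalls) and Nm ≈ 1 (load/store architecture, cache hits) reaches P ≈ 50 / 1.3 ≈ 38 MIPS, more than a factor of ten over the microprogrammed CISC at the same clock frequency.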
The RISC design philosophy can be summarized as follows:
- pipelining of instructions to achieve an effective single cycle execution (small Ni)
- simple fixed format instructions (typically 32 bits) with only a few addressing
modes
- hardwired control of instruction decoding and execution to decrease the cycle time
- load-store architectures keep the memory access factor Nm small
- Migration of functions to software (Compiler)
The goal of the reduced instruction set is to maximize the speed of the processor by getting
the software to perform infrequent functions and by including only those functions that yield
a net performance gain. Such an instruction set provides the building blocks from which
high-level functions can be synthesized by the compiler without the overhead of general but
complex instructions [fujitsu]
The two basic principles that are used to increase
the performance of a RISC processor are:
pipelining and
the optimization of the memory hierarchy.
Vorlesung Rechnerarchitektur
Seite 113
ILP
Definition :
Instruction level parallelism
It is the parallelism among instructions from ’small’ code areas which are independent of one another.
Exploitation of ILP
- one form: overlapping of instructions in a pipeline
  ⇒ RISC processor with nearly one instruction per clock
- another form: parallel execution in multiple functional units
  ⇒ superscalar processor: dynamic instruction scheduling at run time
  ⇒ very long instruction word (VLIW) processor: static instruction scheduling at compile time
Vorlesung Rechnerarchitektur
Seite 114
ILP
Instruction level parallelism
Availability of ILP
'small' code area
1. basic block: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
   ILP is small! ~ 3-4, < 6
2. multiple basic blocks
   a) loop iterations
      loop-level parallelism, stems from 'structured data type' parallelism, vector processing
   b) speculative execution
      at compile time: trace scheduling
      at run time: dynamic branch prediction + dynamic speculative execution
Techniques to improve C i (instruction count per clock cycle)
• loop unrolling
• pipeline scheduling
• dynamic instruction scheduling; scoreboarding
• register renaming
• dynamic branch prediction
• multiple instruction issue
• dependence analysis; compiler
• instruction reordering
• software pipelining
• trace scheduling
• speculative execution
• memory disambiguation; dynamic - static
Vorlesung Rechnerarchitektur
Seite 115
ILP
Example of Instruction Scheduling and Loop Unrolling
Calculate the sum of the elements of a vector with length 100:
sum = Σ (i = 0 .. 99) a_i
C-Program
extern void f(double);    // prototype assumed; f is defined elsewhere
int main() {
    double a[100];
    double sum = 0.0;
    int i;
    for (i = 0; i < 100; i++)
        sum += a[i];
    f(sum);               // using sum in f avoids heavy optimization
                          // which would result in "do nothing"
}
Suppose a basic 4-stage pipeline similar to Exercise 2. The pipeline is interlocked by hardware to avoid flow hazards. Forwarding paths for data and the branch condition are included.
The two stages for cache access are only active for LD/ST instructions. The cache is non-blocking; misses queue up at the bus interface unit, which is not shown in the block diagram.
The cache holds four doubles in one cache entry. A cache entry is filled from main memory using a burst cycle with 5:1:1:1 timing. Integer instructions execute in one clock; the fp operations ADD and MUL require 3 clocks. The bus clock is half of the processor clock.
[Block diagram of a simple pipelined RISC: single instruction issue; integer instructions flow through IF - RFR - EX - WB, LD/ST instructions additionally through the memory-access stages MA1 and MA2, and fp instructions through a three-stage fp pipeline EX1 - EX2 - EX3.]
Vorlesung Rechnerarchitektur
Seite 116
ILP
Example of Instruction Scheduling - Assembler Code
// Initialization
R0 = 0;       // always zero
R5 = 99;      // endcount 100-1
R4 = 0;       // loop index
R3 = 1000;    // base address of A
R2 = 800;     // address of sum
              // (fct parameters are normally stored on the stack !!)
F4 = 0.0;     // accumulator for sum
F2 = 0.0;     // register for loading of Ai
// Assembler Code
Lloop:  CMP  R4, R5 -> R6;     // generate CC and store in R6
        BEQ  R6, Lend;         // branch pred. "do not branch"
        LD   (R3) -> F2;       // load Ai
        ADDF F2, F4 -> F4;     // accumulation of sum
        ADDI R4, #1 -> R4;     // loop index increment
        ADDI R3, #8 -> R3;     // addr computation for next element
        JMP  Lloop;
Lend:   ST   F4 -> (R2);
It should be mentioned here that local variables of procedures are normally stored in the activation frame on the stack. The addressing of local variables can be found in the assembler
code of the compiled C-routines. For this example, a simplified indirect memory addressing
is used.
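For illustration, a hedged C-level sketch of what four-fold unrolling of this loop looks like before scheduling (the gcc listing further below unrolls by ten; the function name is invented):

/* Four-fold unrolled accumulation; 100 is divisible by 4, so no
   clean-up loop is needed in this sketch.                         */
double sum_unrolled4(const double a[100])
{
    double sum = 0.0;
    int i;
    for (i = 0; i < 100; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    return sum;
}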
Vorlesung Rechnerarchitektur
Seite 117
ILP
Example of Instruction Scheduling without Loop Unrolling
[Pipeline timing diagram (instruction slots 1-37 versus stages IF, RFR, EX/MA1-ADDR, MA2-DATA, EX1-EX3, WB): the loop instructions CMP, BEQ, LD, ADDI (index), ADDI (address), ADDF, JMP enter the pipeline one per clock. The first LD of a cache entry misses; its 5:1:1:1 burst refill from memory stalls the dependent ADDF f4,f2 and the following instructions until the first data become available, while the LDs of the next three iterations hit in the cache and their ADDFs follow with only the fp latency.]
The successor of a Bcc instruction is statically predicted and speculatively executed and can
be annulled if a wrong prediction path is taken.
The execution time of the 100-element vector accumulation can be calculated as
100 / 4 × (17 + 3 × 7) = 950 CPU clocks
Vorlesung Rechnerarchitektur
Seite 118
ILP
Example of Instruction Scheduling with Loop Unrolling
[Pipeline timing diagram (instruction slots versus stages IF, RFR, EX/MA1-ADDR, MA2-DATA, EX1-EX3, WB) for the loop unrolled four times: each unrolled iteration executes CMP, BEQ, LD f2,#0 / #8 / #16 / #24, ADDF f4,f2 (four times), ADD i,#4, ADD a,#32, JMP. The first LD of each cache entry still misses (5:1:1:1 burst) and its ADDF stalls until the first data become available; the remaining three LDs hit in the cache.]
Load instructions are not scheduled and iterations do not overlap.
The execution time of the 100-element vector accumulation can be calculated as
100 / 4 × 32 = 800 CPU clocks
Vorlesung Rechnerarchitektur
Seite 119
ILP
Software - Pipelining
Software pipelining overlaps the execution of consecutive iterations in a similar way to a hardware pipeline (instructions I1, I2, I3 of the loop body).
Software pipelining exploits the parallelism between independent loop iterations.
Notation: Inm = instruction n of the loop, m = index of the loop iteration.
[Diagram: arrangement of the instructions - the iterations are shifted against each other so that in the steady state I13, I22 and I31 (instruction 1 of iteration 3, instruction 2 of iteration 2, instruction 3 of iteration 1) are executed in the same time slot, analogous to I1, I2, I3 occupying different stages of a hardware pipeline.]
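A hedged C sketch of the idea applied to the accumulation loop above: the load for the next iteration is issued before the add that consumes the previously loaded value, so the load latency overlaps with independent work (prologue and epilogue handle the first and last element; the function name is invented):

/* Software-pipelined accumulation (illustrative only). */
double sum_swp(const double a[100])
{
    double sum = 0.0;
    double loaded = a[0];          /* prologue: first load              */
    int i;
    for (i = 1; i < 100; i++) {
        double next = a[i];        /* load for the NEXT accumulation    */
        sum += loaded;             /* add consumes the earlier load     */
        loaded = next;
    }
    sum += loaded;                 /* epilogue: last pending add        */
    return sum;
}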
Vorlesung Rechnerarchitektur
Seite 120
ILP
Example of Instruction Scheduling with Loop Unrolling and Software Pipelining
[Pipeline timing diagram (instruction slots versus stages IF, RFR, EX/ADDR, DATA, EX1-EX3, WB) for the loop unrolled four times with software pipelining: in the start-up phase the loads for iterations 0 and 1 (LD f2,#0 ... LD f9,#56, plus a prefetch load to f2,#64) are issued early; the first access misses the cache (5:1:1:1 burst, the following lines are fetched in page mode). In the steady state ("load iteration 0 and 1" / "calculate iteration 0 and 1") the ADDF accumulations into f10 and f11 of earlier iterations overlap with the loads of later iterations; a final ADDF f10,f11 forms the global sum before ADD a, ADD i and JMP close the loop.]
The loop is unrolled four times and instructions from different iterations are scheduled to avoid memory and cache access latencies. Scheduling of load instructions requires a non-blocking cache and a BIU which can handle multiple outstanding memory requests. The latency of the first start-up phase can only be avoided by scheduling the loads into code before the loop start (difficult). Execution time: ~ (100/4 / 2 × 29) + (14 + 7) + 6 = 277
Vorlesung Rechnerarchitektur
Seite 121
ILP
Assembler without optimization
compiled by gcc -S <fn.c>
.file
"unroll.c"
gcc2_compiled.:
.section
".rodata"
.align 8
.LLC0:
.long
0x0
.long
0x0
.section
".text"
.align 4
.global main
.type
main,#function
.proc
04
main:
!#PROLOGUE# 0
save %sp,-928,%sp
!#PROLOGUE# 1
sethi %hi(.LLC0),%o0
ldd [%o0+%lo(.LLC0)],%o2
std %o2,[%fp-824]
st %g0,[%fp-828]
.LL2:
ld [%fp-828],%o0
cmp %o0,99
ble .LL5
nop
b .LL3
nop
.LL5:
ld [%fp-828],%o0
mov %o0,%o1
sll %o1,3,%o0
add %fp,-16,%o1
add %o0,%o1,%o0
ldd [%fp-824],%f2
ldd [%o0-800],%f4
faddd %f2,%f4,%f2
std %f2,[%fp-824]
.LL4:
ld [%fp-828],%o1
add %o1,1,%o0
mov %o0,%o1
st %o1,[%fp-828]
b .LL2
nop
.LL3:
.LL1:
ret
restore
.LLfe1:
.size
main,.LLfe1-main
.ident "GCC: (GNU) 2.7.2.2"
Vorlesung Rechnerarchitektur
Seite 122
ILP
Assembler with optimization
compiled by gcc -S -O3 -funroll-loops <fn.c>
.file
"unroll.c"
gcc2_compiled.:
.section
".rodata"
.align 8
.LLC1:
.long
0x0
.long
0x0
.section
".text"
.align 4
.global main
.type
main,#function
.proc
04
main:
!#PROLOGUE# 0
save %sp,-912,%sp
!#PROLOGUE# 1
sethi %hi(.LLC1),%o2
ldd [%o2+%lo(.LLC1)],%f4
mov 0,%o1
add %fp,-16,%o0
.LL11:
ldd [%o0-800],%f2
faddd %f4,%f2,%f4
ldd [%o0-792],%f2
faddd %f4,%f2,%f4
ldd [%o0-784],%f2
faddd %f4,%f2,%f4
ldd [%o0-776],%f2
faddd %f4,%f2,%f4
ldd [%o0-768],%f2
faddd %f4,%f2,%f4
ldd [%o0-760],%f2
faddd %f4,%f2,%f4
ldd [%o0-752],%f2
faddd %f4,%f2,%f4
ldd [%o0-744],%f2
faddd %f4,%f2,%f4
ldd [%o0-736],%f2
faddd %f4,%f2,%f4
ldd [%o0-728],%f2
faddd %f4,%f2,%f4
add %o1,10,%o1
cmp %o1,99
ble .LL11
add %o0,80,%o0
std %f4,[%fp-16]
call f,0
ldd [%fp-16],%o0
ret
restore
.LLfe1:
.size
main,.LLfe1-main
.ident "GCC: (GNU) 2.7.2.2"
Vorlesung Rechnerarchitektur
Seite 123
VLIW Concepts
• long (multiple) instruction issue
• static instruction scheduling (at compile time)
  - highly optimizing compiler (use of multiple basic blocks)
• simple instruction format
  => simple hardware structure
  - synchronous operation (global clock)
  - pipelined function units (fixed pipeline length!)
  - resolution of resource hazards and data hazards at compile time
• control flow dependencies => limit of performance
• operations with unpredictable latency
  - memory references ----> fixed response
  - DMA, interrupts (external)
• numerical processors
[Diagram: memory banks M0, M1, M2, M3 with a 'front door' and a 'back door' access path.]
Advantages
• max. n times the performance using n functional units
• simple hardware architecture
Disadvantages
• the mix of functional units is application dependent
• multiple read/write ports
• code explosion
• static order
• stalls on latency operations with variable response (worst case)
Vorlesung Rechnerarchitektur
Seite 124
VLIW Concepts
VLIW :
Trace Scheduling
Speculative Execution
[Diagram: trace scheduling and speculative execution - instructions 1-4 of the basic blocks following branch 1 and branch 2 (controlled by cc1 and cc2) are moved upward across the branches and executed speculatively; store operations may not be moved above the controlling branches, and only resource-independent and data-independent instructions can be packed into one VLIW word. Below: a VLIW machine with five functional units (among them int, ld/st, fp and logical/branch units) connected to a common register file with 128 registers.]
Vorlesung Rechnerarchitektur
Seite 125
Superscalar Processors
Introduction
• more than one instruction per clock cycle ( c i ≥ 1 )
• Basis is provided by pipelining (just as in the classic RISC case)
• additional usage of ILP
Basic Tasks of superscalar Processing:
• parallel decoding
• superscalar instruction issue
• parallel instruction execution
• preserving the sequential consistency of execution
• preserving the sequential consistency of exception processing [ACA]
Use several Functional Units (FUs) to execute instructions in parallel:
• several FUs of the same type possible
• different FUs for different classes of instructions (ALU, load/store, fp,...)
• necessary to find enough instructions for these FUs
Prerequisites:
• fetch, dispatch & issue of enough instructions per cycle
• hardware resources for dynamic instruction scheduling (at run time!)
• dependency analysis necessary
• completion of several instructions
Problems:
• instruction fetch
• finding independent instructions - analysis prior to instruction issue; dynamic, speculative scheduling
• removing "false" dependencies - "single assignment", re-order buffer, renamed registers, etc.
• out-of-order execution is problematic - necessary to complete instructions, maintain the architectural state, RAW
                 in-order     out-of-order
Issue/Dispatch   RISC         Tomasulo, reservation stations
Execution        RISC         "single assignment"
Completion       RISC         ROB etc.
Vorlesung Rechnerarchitektur
Seite 126
Scoreboard - CDC6600
With the multiplicity of functional units, and of operand registers and with a simple and
highly efficient addressing scheme, a generalized queue and reservation scheme is practical.
This is called a scoreboard.
The scoreboard maintains a running file of each central register, of each functional unit, and
of each of the three operand trunks (source and destination busses) to and from each unit.
[Th64]
Hazard detection and resolution
• construction of data dependencies
• check of data dependencies
• check of resource dependencies
Hazard avoidance by stalling the later Instructions
Vorlesung Rechnerarchitektur
Seite 127
Superscalar Processors
Example: Dynamic Scheduling with a Scoreboard
Pipelined Processor with multiple FUs
[Block diagram: IFetch -> I-Issue (in-order issue, issue check) -> RF read (read operands, out of order) -> FU0 ... FU3 (out-of-order execution) -> RF write (write back, in-order write); all steps are supervised by the scoreboard.]
Scoreboard
1. I Issue: If a functional unit for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the FU and updates its internal data structure. By ensuring that no other active FU wants to write its result into the destination register, WAW hazards are avoided. Instruction issue is stalled in the case of a WAW hazard or a busy FU.
2. Read operands: The scoreboard monitors the availability of the source operands. A source operand is available if no earlier issued active instruction is going to write this register. If the operands are available, the instruction can proceed. This scheme resolves all RAW hazards dynamically. It allows instructions to execute 'out of order'.
3. Execution: The required FU starts the execution of the instruction. When the result is ready, it notifies the scoreboard that it has completed execution. If another instruction is waiting on this result, it can be forwarded to the stalled FU.
4. Write back: If a WAR hazard exists, the write back of the instruction is stalled until the source operand has been read by the dependent instruction (an instruction preceding it in issue order).
[Hennessy, Patterson, Computer Architecture: A Quantitative Approach, 2. Ed. 1996]
Stage: I Issue (activate instruction)
  Wait until / checks: FU available and no more than one result assigned to Rdest*
                       (FU ? available; Fd ? not busy)
  Bookkeeping:         (FU) <-- busy; (Fd) <-- busy; Fd(FU) <-- 'D'; Fs1(FU) <-- 'S1';
                       Fs2(FU) <-- 'S2'; Op(FU) <-- 'opc'
Stage: read operands
  Wait until / checks: source operands available (Fs1(FU) ? valid; Fs2(FU) ? valid)
  Bookkeeping:         (Fs1) <-- read; (Fs2) <-- read; s1(FU) <-- (Fs1); s2(FU) <-- (Fs2)
Stage: execute
  Operations:          R <-- s1(FU) op s2(FU)
Stage: write back (deactivate instruction)
  Wait until / checks: result ready and no WAR hazard (Fd ? read)
  Bookkeeping:         (FU) <-- free; (Fd) <-- valid; Fd <-- R
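A compact C sketch of the bookkeeping state this table manipulates (the field names follow the table; widths and types are illustrative):

#include <stdbool.h>

/* One scoreboard row per functional unit. */
struct fu_status {
    bool busy;           /* (FU) busy / free                          */
    int  op;             /* Op(FU): operation issued to this unit     */
    int  fd, fs1, fs2;   /* Fd/Fs1/Fs2(FU): register numbers D,S1,S2  */
    bool fs1_valid;      /* Fs1(FU) valid: source 1 may be read       */
    bool fs2_valid;      /* Fs2(FU) valid: source 2 may be read       */
    long s1, s2;         /* s1/s2(FU): operand values once read       */
};

/* One scoreboard entry per architectural register. */
struct reg_status {
    bool busy;           /* (Fd) busy: a result is still pending      */
    int  fu;             /* index of the FU that will produce it      */
};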
Vorlesung Rechnerarchitektur
Seite 128
Superscalar Processors
Instruction Fetch, Issue & Dispatch
To provide the processor with enough instruction bandwidth, use techniques such as
• I-Cache
• I-Buffer
• Trace Cache
Necessary to provide the processors with several instructions per cycle!
Possibilities:
• direct issue (see Scoreboard); resource conflicts and blocking are probable
• indirect issue, i.e. issue & dispatch using reservation stations
[Block diagram: IFetch -> issue unit (in-order issue, dispatch check) -> several reservation stations (RS) -> out-of-order dispatch to the FUs.]
Instruction issue
Instruction issue is normally in order.
If instructions entered a FU immediately, head-of-queue blocking could occur
-> use reservation stations to avoid this.
Instruction dispatch
Instructions are dispatched from the reservation stations out of order. Any instruction
within a reservation station that is "ready" may be dispatched.
Vorlesung Rechnerarchitektur
Seite 129
Superscalar Processors
Dynamic Scheduling with Tomasulo (out-of-order execution)
The main idea behind the Tomasulo algorithm is the introduction of reservation stations.
They are filled with instructions from the decode/issue unit without a check for source operand availability. This check is performed in the reservation station logic itself. This allows issuing instructions whose operands have not yet been computed. Issuing such an instruction (which would normally block the issue stage) to a reservation station moves it aside and gives the issue unit the possibility to issue the following instruction to another free reservation station.
A reservation station is a register and a control logic in front of a FU. The register contains the register file source operand addresses, the register file destination address, the instruction for this FU and the source operands. If the source operands are not available, the instruction waits in the reservation station for its source operands. The control logic compares the destination tags of all result buses from all FUs. If a destination tag matches a missing operand address, this data value is taken from the result bus into the reservation station's operand data field. A reservation station can forward its instruction to the execution FU when all its operands are available (data flow synchronization).
A structural view of the data paths can be found in the PowerPC 620 part.
[Reservation station entry: opc (instruction for the FU), destination operand address d (5 bit, used as tag), source 1 operand address s1 (5 bit) with operand data D1, source 2 operand address s2 (5 bit) with operand data D2.]
• Reservation stations decouple instruction issue from dependency checking.
• An actual reservation station is several entries deep.
• Any of its instructions may be dispatched to the FU if all dependencies have been met (i.e. fully associative).
• Data forwarding occurs outside of the FUs.
• The result buses from the FUs lead back to the inputs of the reservation stations.
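A C sketch of one reservation-station entry with the fields named above (tag width, data types and the helper function are assumptions):

#include <stdbool.h>
#include <stdint.h>

/* One reservation-station entry in front of a functional unit.
   If a source operand is missing, its valid flag is false and the entry
   snoops the result buses for a matching destination tag.               */
struct rs_entry {
    bool    busy;        /* entry holds an instruction                 */
    uint8_t opc;         /* operation to be performed by this FU       */
    uint8_t d;           /* destination register address (tag)         */
    uint8_t s1, s2;      /* source register addresses (5 bit each)     */
    bool    d1_valid;    /* operand 1 already captured?                */
    bool    d2_valid;    /* operand 2 already captured?                */
    int64_t d1, d2;      /* operand data, once available               */
};

/* Dispatch condition (data-flow synchronization). */
static bool rs_ready(const struct rs_entry *e)
{
    return e->busy && e->d1_valid && e->d2_valid;
}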
Vorlesung Rechnerarchitektur
Seite 130
Superscalar Processors
Removing "false dependencies" - Register Renaming
Use register renaming to enforce single assignement. Thus, false dependecies are removed.
Type of rename buffers:
• merged architectural and rename register file
architectural and rename register are allocated from a single register file; usage of a mapping table; reclaiming complex
• seperate rename and architectureal register files
deallocation with retirement or reuse
• Holding renamed values in the ROB (Re-order Buffer)
Maintain sequential consistency
• order in which instructions are completed
• order in which memory is accessed
Usage of an ROB for sequential consistency:
• Instruction move from the FUs to the ROB.
• Instructions retire from the ROB to the architectural state if and ony if they are finished
and all previous instructions have been retired.
• ROB implemented as a ring buffer.
• ROB can also be elegantly used for register renaming (see above)
[Diagram: the reorder buffer as a ring buffer - entries are allocated at the tail for issued instructions and freed at the head for retired instructions; entry states: i = issued, x = in execution, f = finished.]
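A minimal C sketch of this ring buffer and its retirement rule (size and state encoding are assumptions):

#include <stdbool.h>

#define ROB_SIZE 16

enum rob_state { ROB_FREE, ROB_ISSUED, ROB_EXECUTING, ROB_FINISHED };

struct rob {
    enum rob_state e[ROB_SIZE];
    int head;                  /* oldest entry: next candidate to retire   */
    int tail;                  /* next free slot for an issued instruction */
};

/* Retire at most one instruction from the head: it may leave the ROB and
   update the architectural state only if it is finished and all older
   instructions have already retired - which the head position guarantees. */
static bool rob_retire_one(struct rob *r)
{
    if (r->e[r->head] != ROB_FINISHED)
        return false;                      /* preserve sequential order    */
    r->e[r->head] = ROB_FREE;
    r->head = (r->head + 1) % ROB_SIZE;
    return true;
}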
Vorlesung Rechnerarchitektur
Seite 131
Superscalar Processors
FU Synchronization
Definition
A functional unit is a processing element (PE) that computes some output value based
on its input [Ell86]. It is a part of the CPU that must carry out the instructions (operations)
assigned to it by the instruction unit. Examples of FUs are adders, multipliers, ALUs, register files, memory units, load/store units, and also very complex units like the communication unit, the vector unit or the instruction unit. FUs with internal state information
are carriers of data objects.
In general, we can distinguish five different types of FUs:
1. FU with a single clock tick of execution time
2. FU with n clock ticks of execution time, nonpipelined
3. FU with n clock ticks of execution time, pipelined
4. FU with variable execution time, nonoverlapped
5. FU with variable execution time, overlapped
The example of two FUs given in the following figure serves to illustrate the difference between the terms pipelined and overlapping.
synchr. input X
instr.
f1
FU
f2
f3
instr.
result
type 3 pipelined
execution time: 3 (3 stages)
instr. rate: 1
overlapping factor: none
FU
input is synchronized
by the availability of X
f1
f2
type 5
with overlapping
f3
result
execution time: variable
instr. rate: 1
overlapping factor: 3 (3 input register)
In the FU of type 3, instructions are processed in a pipelined fashion. The unit guarantees a rate of one result per clock tick; it empties the pipeline as fast as it can be filled with instructions. None of the processing stages can be stopped for synchronization; only the whole FU can be frozen by disabling its clock signal.
The FU of type 5 has a synchronization input in stage f2. If the synchronization input is not ready, processing in f2 must stop immediately. This does not necessarily stop the instruction input to the instruction input queue. A load/store unit featuring reservation stations is a good example of such an FU: only a full queue stops instruction issue, and this event halts the whole CPU. FU type 5 can therefore model a load/store unit that synchronizes to the external memory control and halts the CPU only when the number of outstanding instructions exceeds the overlapping factor.
An example of an FU of type 4 is the iterative floating-point division unit.
Vorlesung Rechnerarchitektur
Seite 132
Superscalar Processors
FU Synchronization: Example of FDIV-Unit
When an instruction is issued to an iterative execution unit, this unit requires a number of
clocks to perform the function (e.g. 27 clocks for fdiv). During this time, the FU is marked
as "busy" and no further instruction can be issued to this FU. When the result becomes
available, the result register must be clocked and the FU becomes "ready" again.
[Block diagram: I_issue loads the instruction register and source operand registers of the iterative FDIV execution unit (internal_I, operands); a BUSY flip-flop is set on issue (set_busy) and cleared (set_ready) when result_ready clocks the result into the result register.]
Equation for the busy check:
I_issue = I_for_div_FU & ( /busy + busy & result_ready )
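The same check written as a C expression (signal names as above; '/busy' denotes negation):

/* Issue is permitted if the FU is free, or if it is busy but its result
   becomes ready in this very clock - the result register is loaded at the
   same edge at which the next instruction is accepted.                    */
int can_issue(int instr_is_for_div_fu, int busy, int result_ready)
{
    return instr_is_for_div_fu && (!busy || (busy && result_ready));
}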
[Timing diagram (CLK, Instr, I_issue, internal_I, busy, result_ready, result): div 1 is issued and loaded, and busy stays set during execution clocks 1 ... 27; result_ready marks the clock in which the result is transferred to the result register, and div 2 can be issued at that same clock edge (its load instruction / execution start overlaps with "result to reg" of div 1).]
A new instruction can be issued to the unit at the same clock edge at which the internal result is transferred to the result register (advancement of data in the pipeline stages!). The instruction issue logic checks not only the busy signal but also the result_ready signal and can thus identify an FU that is just becoming ready ("not busy").
Vorlesung Rechnerarchitektur
Seite 133
Superscalar Processors - Literature
• Tomasulo algorithm (IBM 360/91)
  [Tomasulo, R.M., An Efficient Algorithm for Exploiting Multiple Arithmetic Units, IBM Journal, Vol. 11, 1967]
• Scoreboard (CDC 6600)
  [Th64] Thornton, James E., Parallel Operation in the Control Data 6600, Proc. AFIPS, 1964
• Advanced Computer Architecture - A Design Space Approach; Sima D., Fountain T. & Kacsuk P.; 1997
Vorlesung Rechnerarchitektur
Seite 134
PowerPC 620
Overview of the 620 Processor
The 620 Processor is the first 64-bit Implementation of the PowerPC-Architecture.
-
200 (300) MHz clock frequency
7 million transistor count, power dissipation < 17-23 W @ 150MHz, 3.3V
0.5 μm CMOS technology, 625 pin BGA
max. 400 MFLOPS floating-point performance (dp)
800 MIPS peak instruction execution
four instructions issued per clock, instruction dispatch in order
out-of-order execution, in-order completion
multiple independent functional units, int, ld/st, branch, fp
reservation stations, register renaming with rename buffers
five stages of master instruction pipeline
full instruction hazard detection
separate register files for integer (GPR) and floating point (FPR) data types
branch prediction with speculative execution
static and dynamic branch prediction and branch target instruction buffer
4 pending predicted branches
32k data and 32k instruction cache, 8 set associative, physically addressed
non-blocking load access, 64-bytes cache block size
separate address ports for snooping and CPU accesses
level-2 cache interface to fast SSRAM (1MB - 128MB)
MMU, 64-bit effective address, 40-bit physical address
64-entry fully associative ATC
multiprocessor support, bus snooping for cache coherency (MESI)
pipelined snoop response
dynamic reordering of loads and stores
explicit address and data bus tagging, split-read protocol
128-bit data bus crossbar compatible, cpu id tagged
bus clock = processor clock/n (n = 2,3,4)
Vorlesung Rechnerarchitektur
Seite 135
PowerPC 620
Pipeline Structure
The 620 master instruction pipeline has five stages. Each instruction executed by the processor will flow through at least these stages. Some instructions flow through additional pipeline stages.
The five basic stages are:
-
Fetch
Dispatch
Execute
Complete "in order completion"
Writeback
Integer instructions: Fetch - Dispatch/Decode - Execute - Complete - Writeback
Load instructions:    Fetch - Dispatch/Decode - EA - DCache - Align - Complete - Writeback
Store instructions:   Fetch - Dispatch/Decode - EA - DCache Lookup - Complete - Store
Branch instructions:  Fetch - Predict/Resolve - Complete
FP instructions:      Fetch - Dispatch/Decode - FPR Access - FP Mul - FP Add - FP Norm - Complete - Writeback
Vorlesung Rechnerarchitektur
Seite 136
PowerPC 620
[Block diagram of the 620: a fetch unit with instruction cache and branch prediction unit (BPU, branch correction) feeds the dispatch unit with its 8-entry instruction queue; instruction dispatch buses lead to reservation stations in front of the execution units XSU0, XSU1, MCFSU, FPU and LSU (with data cache). Register numbers address the GPR and FPR files and the int/fp rename registers; GP and FP operand buses supply the units, GP/FP result buses and result status buses return the results; the completion unit with its reorder buffer (reorder buffer information) completes instructions in order.]
Vorlesung Rechnerarchitektur
Seite 137
PowerPC 620
Data Cache Management Instructions
A very interesting feature of the PowerPC 620 is the availability of five user-mode instructions which allow the user some explicit control over the caching of data. Similar mechanisms and a compiler implementation are proposed in [Lam, M.; Software Pipelining: An
Effective Scheduling Technique for VLIW Machines; in: Proc. SIGPLAN '88, Conf. on
Programming Language Design and Implementation, Jun. 1988, pp. 318-328]. To improve
the cache hit rate by compiler-generated instructions, five instructions are implemented in
the CPU, which control data allocation and data write-back in the cache.
DCBT - Data Cache Block Touch
DCBTST - Data Cache Block Touch for Store
DCBZ - Data Cache Block Zero
DCBST - Data Cache Block Store
DCBF - Data Cache Block Flush
EIEIO - Enforce in-Order Execution of I/O
A DCBT is treated like a load without moving data to a register and is a programming hint
intended to bring the addressed block into the data cache for loading. When the block is
loaded, it is marked shared or exclusive in all cache levels. It allows the load of a cache line
to be scheduled ahead of its actual use and can hide the latency of the memory system behind useful computation. This reduces the stalls due to long-latency external accesses.
A DCBTST is treated like a load without moving data to a register and is a programming hint intended to bring the addressed block into the data cache for storing. When the
block is loaded, it is marked modified in all cache levels. This can be helpful if it is known
in advance that the whole cache line is overwritten, as it usually is in stack accesses.
If the storage mode is write-back cache-enabled, then the addressed block is established in
the data cache by DCBZ without fetching the block from main storage, and all bytes of
the block are set to zero. The block is marked modified in all cache levels. This instruction
supports for example an OS creating new pages initialized to zero.
The DCBST, or data cache block clean, will write modified cache data to memory and leave the final cache state marked shared or exclusive. It is treated like a load with respect to
address translation and protection.
The DCBF, or data cache block flush, instruction is defined to write modified cache data to
the memory and leave the final cache state marked invalid.
Vorlesung Rechnerarchitektur
Seite 138
PowerPC 620
Storage Synchronization Instructions
The 620 Processor supports atomic operations (read/modify/write) with a pair of instructions. The Load and Reserve (LWARX) and the Store Conditional (STCX) instructions form a conditional sequence and provide the effect of an atomic RMW cycle, but not as a single atomic instruction. The conditional sequence begins with a Load and Reserve instruction; may be followed by memory accesses and/or computation that include neither a Load and Reserve nor a Store Conditional instruction; and ends with a Store Conditional instruction with the same target address as the initial Load and Reserve.
These instructions can be used to emulate various synchronization primitives and to provide more complex forms of synchronization.
The reservation (only one exists per processor!!) is set by the lwarx instruction for the address EA. The conditionality of the Store Conditional instruction's store is based only on whether a reservation exists, not on a match between the address associated with the reservation and the address computed from the EA of the Store Conditional instruction. If the store operation was successful, the CR field is set and can be tested by a conditional branch.
A reservation is cleared if any of the following events occurs:
- The processor holding the reservation executes another Load and Reserve instruction; this clears the first reservation and establishes a new one.
- The processor holding the reservation executes a Store Conditional instruction
to any address.
- Another processor or bus master device executes any Store instruction to the
address associated with the reservation.
The reservation granule for the 620 is 1 aligned cache block (64 bytes).
Due to the ability to reorder load and store instructions in the bus interface unit (BIU), the sync instruction must be used to ensure that the results of all stores into a data structure, performed in a critical section of a program, are seen by other processors before the data structure is seen as unlocked.
Vorlesung Rechnerarchitektur
Seite 139
PowerPC 620
Storage Synchronization Instructions
This implementation is a clever solution, because the external bus system is not blocked for
the test-and-set sequence. All other bus masters can continue using the bus and the memory
system for other memory accesses. The probability that another processor is accessing the
same address in between lwarx and stwcx is very low, because of the small number of instructions used for the modify phase (number depends on the synchronization primitive).
If the lwarx instruction finds a semaphore which was already set by another processor or process, the next access of the semaphore within the reservation loop is serviced from the cache,
so that no external bus activity is performed by the busy waiting test loop. The release of this
semaphore shows up in the cache by the coherency snoop logic, which usually invalidates
(or updates) the cache line. The next lwarx gets the released semaphore and tries to set it with
the stwcx instruction. The execution of instructions from the critical region can be started
after testing the monitor_flag by the bne instruction. Each critical region must be locked by
such a binary semaphore or by a higher-level construct. The semaphore is released by storing
a "0" to the semaphore address. The coherence protocol guarantees that this store is presented on the bus, although the cache may be in copy-back mode. The following instruction sequence implements a binary semaphore sequence (fetch-and-store). It is assumed that the
address of the semaphore is in Ra, the value to set the semaphore in R2.
L1:   lwarx   R1 <- (Ra)      // read semaphore, set reservation (link)
      stwcx.  R2 -> (Ra)      // store new semaphore value from R2 to memory,
                              // only if there was no external cycle to this address
      bne     L1              // branch on SC false: set not successful,
                              // broken reservation -> loop
      cmpi    R1, $0          // test semaphore value
      bne     L1              // branch on semaphore value: semaphore set (1) -> loop,
                              // semaphore unset (0) -> enter critical region
      ...                     // critical region
      isync                   // isync completes all outstanding instructions
      st      R1 -> (Ra)      // leave critical region: reset to old semaphore value
Vorlesung Rechnerarchitektur
Seite 140
PowerPC 620
Storage Synchronization Instructions
Hardware resources for the supervision of the reservation.
[Block diagram: on lwarx (load word and reserve) the physical address is loaded into the BM register (LD_BMR); a comparator (EQ_BMR) checks both the CPU's own stwcx/lwarx address path and the external address bus (32 bit, address check path) against this register. The link-monitor FSM sets the monitor flag on lwarx (monitor flag set: execute stwcx) and clears it when an external cycle to the reserved address or another store disturbs the lwarx-stwcx sequence (monitor flag cleared: don't execute stwcx).]
These functional components are mapped to the on-chip cache functions. This causes the reservation granule to be one cache line. The snooping of the cache provides the address comparison and the address tag entry contains the address of the reservation.
The following instruction sequence implements an optimized version of the binary semaphore instruction sequence, which increases the probability of running an undisturbed lwarx-stwcx sequence.
L1:   ld      R1 <- (Ra)      // read semaphore using a normal load into R1
      cmp     R1, R2          // compare semaphore value to R2
      beq     L1              // loop back if semaphore not free (1)
      lwarx   R1 <- (Ra)      // atomic by reservation: set the reservation to start
                              // the atomic operation when there is a high
                              // probability to succeed
      stwcx.  R2 -> (Ra)      // store conditional semaphore from R2, only if there
                              // was no external cycle to this address
      bne     L1              // branch back if reservation failed (likely not taken)
      cmp     R1, $0          // test semaphore value again
      bne     L1              // branch back on 'semaphore set' due to an
                              // intermediate access from another processor
      ...                     // critical region
      st      R0 = 0 -> (Ra)  // reset semaphore value (0)
It is assumed that the address of the semaphore is in Ra, the value to compare with in R2
Vorlesung Rechnerarchitektur
Seite 141
Synchronization
Definition :
Synchronization is the enforcement of a defined logical order between events.
This establishes a defined time-relation between distinct places, thus defining
their behaviour in time.
Definition :
A process is a unit of activity which executes programs or parts thereof in a strictly sequential manner, requiring exactly one processor [Gil93]. It consists of the
program code, the associated data, an address space, and the actual internal state.
Definition :
A thread or lightweight process is a strictly sequential thread of control (program
code) like a little mini-process. It consists of a part of the program code, featuring
only one entry and one exit point, the associated local data and the actual internal
state. In contrast to a process, a thread shares its address space with other threads
[Tan92].
There are two different situations that require synchronization between processes:
- the use of shared resources and shared data structures
  • The processes must be able to claim the shared resources and use them without interfering with one another.
  • mutual exclusion
- the cooperation of processes while working on one task
  • The processes must be brought into a temporally correct order while working on the subtasks. The data dependence between processes that produce data (producer) and those that consume data (consumer) must be resolved.
  • process synchronization -> RA-2
Mutual exclusion
A particular object can be occupied by at most one process at any point in time.
The following example serves to illustrate the problem.
Several processes use one output channel. The number of output data items is to be counted in the variable count.
For this purpose, each process contains the statement
count := count + 1
after its data output.
Vorlesung Rechnerarchitektur
Seite 142
Synchronization
This statement is translated into the following machine instructions:
ld count, R1
add R1, #1
store R1, count
This sequence is contained in each of the processes, which execute their instructions in parallel. As a result, the following sequence of instructions could arise:
P1: ld count, R1
P2: ld count, R1
P2: add R1, #1
P2: store R1, count
P1: add R1, #1
P1: store R1, count
Thus count has been incremented by only 1, although the instruction sequence was executed twice and count should have been incremented by 2.
To avoid this problem, the notion of the critical region is introduced.
Definition:
A "critical region" is a sequence of statements that is executed by one process without interruption. Only one process at a time is allowed to enter such a critical region.
Other processes that also want to use this region must wait until the occupying process leaves the region again and thereby releases it for use.
Possible ways to realize the critical region:
- software solution (too complex!)
- disabling interrupts (single processor)
- binary semaphores
- simple critical regions
- monitors
Vorlesung Rechnerarchitektur
Seite 143
Synchronization
Disabling Interrupts
If there is only one processor in a system that executes the processes, mutual exclusion can be achieved by disabling the interrupts. They are the only source by which an instruction sequence can be interrupted.
Practical realization: the critical region is written as an interrupt handler.
Entry by raising a trap; disable interrupts; process the critical region; re-enable interrupts; return from the trap.
Indivisible operations (ATOMIC OPERATIONS)
The basic mechanism for synchronization is the indivisible (atomic) operation.
The atomic operation for manipulating a shared variable forms the basis for the correct implementation of mutual exclusion.
Goal: generating a logical sequence of operations of different
• threads
• processes
• processors
Implementation: Atomic operations
• read-modify-write
• test-and-set (machine instruction)
• load-incr-store
• load-decr-store
• load with store-conditional
• reservation
• lock/unlock
• fetch-and-add
Vorlesung Rechnerarchitektur
Seite 144
Synchronization
Atomic operations
Sequential consistency assumes that all globally observable architectural state changes appear in the same order when observed by all other processors.
The sequential consistency requires a mechanism for implementation.
synchronization instructions
[Diagram: a read-modify-write sequence - load (read), modify, write of a semaphore located in memory, accessed over the "bus" - must appear atomic. Ways to prevent intervening accesses: reserve the bus (the whole memory is blocked), or cache-line snooping (only accesses to this address are prevented (?)).]
Vorlesung Rechnerarchitektur
Seite 145
Synchronization
Atomic operations (Example 1)
Multiprocessor systems using a bus as the processor-memory interconnection network can use a very simple mechanism to guarantee atomic (indivisible) operations on semaphore variables residing in shared memory.
The bus, as the only path to the memory, can be blocked for the duration of a read-modify-write (RMW) sequence against intervening transactions from other processors. The CAS2 instruction of the MC68020 is an example of such a simple mechanism to implement the test-and-set sequence as a non-interruptible machine instruction. Sequence of operations of the CAS2 instruction:
1. read of the semaphore value from Mem(Ra) into register Rtemp
2. compare the value in Rtemp to register Rb
3. conditional store of the new semaphore to Mem(Ra) from register Rc
Atomic operations (Example 2)
Complex instruction with memory reference and lock prefix (since the i486)
lock ; incl %0; sete %1
Atomically increments the variable by 1 and returns true if the result is zero, or false in all other cases.
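A hedged, self-contained form of this idiom as a GCC inline-assembly function for x86 (the wrapper, constraints and function name are assumptions; only the lock; incl; sete sequence is taken from the slide):

/* Atomically increment *v; return 1 if the new value is zero, else 0. */
static int atomic_inc_and_test(int *v)
{
    unsigned char zero;
    __asm__ __volatile__(
        "lock; incl %0; sete %1"
        : "+m" (*v), "=q" (zero)
        :
        : "memory", "cc");
    return zero;
}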
Vorlesung Rechnerarchitektur
Seite 146
Synchronization
Atomic operations (Example 3)
lock
begin interlocked sequence
Causes the next data load or store that appears on the bus to assert the LOCK # signal,
directing the external system to lock that location by preventing locked reads, locked
writes, and unlocked writes to it from other processors. External interrupts are disabled
from the first instruction after the lock until the location is unlocked.
unlock
end interlocked sequence
The next load or store (regardless of whether it hits in the cache) deasserts the LOCK #
signal, directing the external system to unlock the location, interrupts are enabled when
the load or store executes.
These instructions allow programs running either user or supervisor mode to perform atomic
read-modify-write sequences in multiprocessor and multithread systems. The lock protocol
requires the following sequence of activities:
• lock
• Any load instruction that happens on the bus starts the atomic cycle. This load does not
have to miss the cache; it is forced to the bus.
• unlock
• Any store instruction terminates the atomic cycle.
There may be other instructions between any of these steps. The bus is locked after step 2
and remains locked until step 4. Step 4 must follow step 1 by 30 instructions or less: otherwise an instruction trap occurs.
The sequence must be restartable from the lock instruction in case a trap occurs. Simple
read-modify-write sequences are automatically restartable. For sequences with more than
one store, the software must ensure that no traps occur after the first non-reexecutable store.
Vorlesung Rechnerarchitektur
Seite 147
Synchronization
Atomic operations (Example 4)
Starting with the memory read operation the RMW_ signal from the CPU tells the external
arbitration logic not to rearbitrate the bus to any other CPU. Completing the RMW-instruction by the write to memory releases the RMW_ signal permitting general access to the bus
(and to the memory).
- test-and-set (two paired machine instructions)
These two instructions are paired together to form the atomic operation in the sense that if
the second instruction (the store conditional) was successful, the sequence was atomic. The
processor bus is not blocked between the two instructions. The address of the semaphore is
supervised by the "reservation", set by the first instruction (load-and-reserve). [see also
PowerPC620 Storage Synchronization]
Vorlesung Rechnerarchitektur
Seite 148
Synchronization
Binary Semaphores (Dijkstra 1965)
Definition:
The binary semaphore is a variable that can take only the two values '0' or '1'. There are only the two operations P and V on this variable.
Here '0' corresponds to a set semaphore, which forbids entry into the critical region.
1. the P operation - proberen te verlangen (wait) [P]
   the value of the variable is tested
   - if it is '1', it is set to '0' and the critical region may be entered.
   - if it is '0', the process is placed into a queue containing all processes that are waiting for this event (binary semaphore = '1').
2. the V operation - verhogen (post), to increase a semaphore [V]
   the queue is tested
   - if a process is in the queue, it is started. If there are several processes, exactly one is selected (e.g. the first).
   - if the queue is empty, the variable is set to '1'.
P acts as a gate to limit access to tasks.
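A C sketch of P and V on a binary semaphore (illustrative only: the GCC __sync builtins stand in for one of the atomic operations listed above, and busy waiting replaces the wait queue; 1 = free, 0 = set, as in the definition):

typedef struct { volatile int value; } bsem_t;

static void P(bsem_t *s)                   /* proberen / wait              */
{
    /* try to change 1 -> 0 atomically; spin while the semaphore is 0      */
    while (!__sync_bool_compare_and_swap(&s->value, 1, 0))
        ;                                  /* a real OS would enqueue here */
}

static void V(bsem_t *s)                   /* verhogen / post              */
{
    /* a real implementation would first wake a waiting process, if any    */
    __sync_synchronize();                  /* make prior writes visible    */
    s->value = 1;
}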
Simple critical regions
To make the correct use of semaphores easier, Hoare introduced the simple critical region:
with Resource do S
The critical statements S are bracketed by P and V operations and thus guarantee mutual exclusion on the data grouped together in Resource (shared data).
Control over the entry and exit points of the mutual exclusion is thereby taken over by the compiler. Because the shared variables are declared in the Resource, the compiler is also able to detect access violations and thus protect the shared data.
This considerably reduces the errors that a programmer could build into programs by using the P and V operations alone.
Vorlesung Rechnerarchitektur
Seite 149
Synchronization
Monitors
The operations on shared data are in general spread over the whole program. This distribution makes the use of the shared data structures hard to follow and error-prone. If the shared variables and the operations possible on them are combined into one construct, and mutual exclusion is ensured during the execution of that construct, one obtains a
monitor
also called a secretary.
The basic concepts of monitors were developed by C.A.R. Hoare.
Definition:
A "monitor" is a grouping of critical regions into one procedure. Only one process at a time may call the procedure.
Thus entering the monitor is equivalent to the P operation and leaving it to the V operation.
Example:
monitor MONITORNAME;
    /* declaration of local data */
    procedure PROCNAME(parameter list);
    begin
        /* procedure body */
    end
begin
    /* init of local data */
end
Java provides a few simple structures for synchronizing the activities of threads. They are
all based on the concepts of monitors. A monitor is essentially a lock. The lock is attached
to a resource that many threads may need to access, but that should be accessed by only one
thread at a time.
The synchronized keyword marks places (variables) where a thread must acquire the
lock before proceeding. [Patrick Niemeyer, Joshua Peck, Exploring JAVA, O’Reilly, 1996]
class Spreadsheet {
int cellA1, cellA2, cellA3;
synchronized int sumRow() {
// synchronized method
return cellA1 + cellA2 + cellA3;
}
synchronized void setRow( int a1, int a2, int a3 ) {
cellA1 = a1;
cellA2 = a2;
cellA3 = a3;
}
...
}
In this example, synchronized methods are used to avoid race conditions on variables cellAx.
Vorlesung Rechnerarchitektur
Seite 150
Interconnection Networks (Bus)
Bus Systems
A special case of a dynamic (switched) interconnection network is a bus. It consists of a bundle of signal lines used for the transmission of information between different places in a computer system. Signals are typically bundled according to their functionality, e.g.:
- Address Bus, Data Bus, Synchronization Bus, Interrupt Bus ...
or according to the hardware unit connected to the bus:
- Processor Bus, Memory Bus, I/O Bus, Peripheral Bus ...
[Figure: bus structure with three-state drivers - a master (e.g. a processor) drives the n bus signal lines (e.g. the address bus) through a TS-driver enabled by EN0; master/slave units use a bidirectional transceiver port (EN1, EN2); pure slaves (e.g. memories) attach with receivers; a pull-up resistor to VCC holds the floating lines. Key: three-state switched connection vs. fixed input from the bus connection.]
Three-state drivers can be used as the dynamic switch. As the name three-state suggests, a TS-driver has 3 output states:
- drive high "1";
- drive low "0";
- no drive - high-Z output.
If all drivers of a bus signal line are disabled (high Z), the signal line is floating (this should be avoided by a pull-up resistor). More about the technology can be learned in the lecture "Digitale Schaltungstechnik".
At any time, only one master can be active on the bus; only one three-state (tri-state) driver is allowed to drive the bus signal lines. Enabling more than one driver is called 'bus contention'; it may damage the system and must be avoided under all conditions. Therefore an access mechanism for a bus with multiple masters is required, called 'arbitration'.
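The rule can be illustrated with a small software model (a sketch, not any particular HDL): each driver either drives '0'/'1' or is in high-Z, the pull-up determines the value of a floating line, and enabling more than one driver is flagged as contention.

/* Sketch: resolving one bus signal line driven by several three-state drivers. */
#include <stdio.h>

typedef enum { DRIVE_0, DRIVE_1, HIGH_Z } drv_t;

/* Returns the resolved line value; *contention is set when more than
   one driver is enabled at the same time (forbidden).                */
static int resolve(const drv_t drv[], int n, int *contention)
{
    int enabled = 0, value = 1;          /* pull-up: a floating line reads '1' */
    for (int i = 0; i < n; i++) {
        if (drv[i] == HIGH_Z) continue;
        enabled++;
        value = (drv[i] == DRIVE_1);
    }
    *contention = (enabled > 1);
    return value;
}

int main(void)
{
    drv_t bus[3] = { HIGH_Z, DRIVE_0, HIGH_Z };   /* only master 1 drives */
    int contention;
    printf("line = %d\n", resolve(bus, 3, &contention));
    printf("contention = %d\n", contention);
    return 0;
}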
Vorlesung Rechnerarchitektur
Seite 151
Interconnection Networks (Bus)
Bus Arbitration
When a processor wants to access the bus, it sets its BREQ (bus request signal) and waits for
the arbiter to grant access to the bus, signaled by BG (bus grant).
The hardware unit (arbiter) samples all BREQx signals from all clients and then generates a
single token which is signaled to one bus client. The ownership of the token defines a client
to be bus master. This token permits the master to enable the TS-Driver and become the active master of the bus. At this time, all other units are slaves on the bus and can receive the
information driven by the master. Because all slaves can take in the current bus data with their receivers, broadcast communication can be performed in every bus cycle (the most important advantage of a bus, besides its simplicity).
[Figure: two processors (Master 0 and Master 1) drive the 32 address bus signal lines (ADDR_out) through their enables EN0/EN1; each requests the bus with BREQ0/BREQ1 and is granted by the arbiter with BG0/BG1; the memories attach as slaves; each processor also snoops the address bus via snoop_ADDR_in.]
The arbiter gets the bus request signals from all masters and decides which master is granted
access to the bus. Simultaneous requests will be served one after the other. A synchronous
(clocked) arbiter can be realized by a simple finite state machine (FSM).
[Figure: arbiter FSM with the states Idle (no grant), Grant0 (asserting BG0) and Grant1 (asserting BG1); the transition conditions are combinations of BREQ0 and BREQ1 (e.g. BREQ0 & ~BREQ1 leads to Grant0, BREQ1 to Grant1, ~BREQ0 & ~BREQ1 back to Idle); on all other inputs (default) the state is held.]
Metastable behavior of the arbiter FFs can (and should !!) be avoided by deriving the request
signals in a synchronous manner (using the same clock).
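A software model of such a clocked arbiter might look like the following sketch (one FSM step per bus clock; the exact transition conditions are one plausible reading of the state diagram, not a normative specification):

/* Sketch of the two-master arbiter FSM: BREQ0/BREQ1 are the sampled
   request signals, BG0/BG1 the grant outputs.                        */
typedef enum { IDLE, GRANT0, GRANT1 } arb_state_t;

typedef struct { int bg0, bg1; } arb_out_t;

static arb_state_t arbiter_step(arb_state_t s, int breq0, int breq1,
                                arb_out_t *out)
{
    arb_state_t next = s;                  /* default: hold the state */
    switch (s) {
    case IDLE:
        if (breq0 && !breq1)       next = GRANT0;
        else if (breq1)            next = GRANT1;
        break;
    case GRANT0:
        if (!breq0 && breq1)       next = GRANT1;
        else if (!breq0 && !breq1) next = IDLE;
        break;
    case GRANT1:
        if (breq0 && !breq1)       next = GRANT0;
        else if (!breq0 && !breq1) next = IDLE;
        break;
    }
    out->bg0 = (next == GRANT0);           /* grants follow the state */
    out->bg1 = (next == GRANT1);
    return next;
}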
Vorlesung Rechnerarchitektur
Seite 152
Bus
Basic Functions
Bus systems can serve very different purposes in a computer system:
- address bus
- data bus
- I/O bus - memory bus - processor bus
- interrupt bus
- synchronization bus
- error analysis/handling
[Figure: example of a synchronization bus line - the open-drain (O.D.) outputs Aout and Bout of two ports share one line with a 500 Ω pull-up resistor to +5 V; the line is read back at Ain and Bin (wired-AND).]
The implementation of bus systems depends on their application, since they are normally optimized for their application.
Important characteristics of a bus system are:
- word width
- data transfer rate
- transfer scheme + protocol
  - synchronous
  - asynchronous
- hierarchy levels
  - cache bus (operand/result buses), processor bus, L2 (bus) interface
  - memory bus
  - I/O bus
  - peripherals (USB)
  - LAN / communication buses
Vorlesung Rechnerarchitektur
Seite 153
Bus
Protocol
- Framing
- Command, Address, burst, length
- Type
- Transaction-based
- Split-phase transactions
- Packet-based
- Flow control
- asynchronous: handshake
- synchronous: valid/stop, wait/disconnect, credit-based
- Data integrity and reliability
- Detection, Correction, Hamming, parity
- Cyclic Redundancy Check (CRC), re-transmission
- Advanced Features
- Embedded clock (8b/10b)
- DC-free (8b/10b)
- Virtual channels
- Quality of service (QoS)
[Figure: handshake protocol timing - the sender places data X on the bus and asserts DAV_ (data valid); after the propagation delay tpd the receiver answers with DACK_ (data accept); both signals are then released in turn.]
Handshake protocol
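A sequential sketch of this asynchronous handshake (DAV_ and DACK_ are active-low as in the diagram; in hardware the two sides are independent agents, here they are shown as plain functions on a shared wire structure):

/* Sketch of the four-phase handshake over DAV_/DACK_. */
#include <stdio.h>

typedef struct {
    int data;      /* bus value X                  */
    int dav_n;     /* data valid, active low       */
    int dack_n;    /* data acknowledge, active low */
} link_t;

static void sender_put(link_t *l, int value)
{
    l->data  = value;   /* 1. drive the data onto the bus      */
    l->dav_n = 0;       /* 2. assert DAV_ (data is valid)      */
}

static int receiver_get(link_t *l)
{
    int v = l->data;    /* 3. latch the data                   */
    l->dack_n = 0;      /* 4. assert DACK_ (data accepted)     */
    return v;
}

static void finish_cycle(link_t *l)
{
    l->dav_n  = 1;      /* 5. sender releases DAV_             */
    l->dack_n = 1;      /* 6. receiver releases DACK_          */
}

int main(void)
{
    link_t l = { 0, 1, 1 };
    sender_put(&l, 0x55);
    printf("received 0x%x\n", receiver_get(&l));
    finish_cycle(&l);
    return 0;
}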
Vorlesung Rechnerarchitektur
Seite 154
Bus
Buses are used for:
- connecting several boards with each other: backplane wiring, global (VME, Futurebus+, XD-BUS)
- connecting components within one board: peripheral bus, processor bus, local (PCI bus, S-Bus, M-Bus)
- connecting systems with each other: workstation networking (SCI interface, Ethernet ...)
The speed of a bus system is limited by several factors:
• signal propagation delay on the lines
• input capacitance of the ports
• delay time of the ports
• bus cycle time / bus clock
• overhead (protocol)
[Notes/figure: bus systems - backplane bus systems with passive or active backplane and BTL drivers (VME: TTL/CMOS, 2 x 96 pins, 2 x 52 I/O, around 1 Gbit/s) versus chip interconnects (Peripheral Component Interconnect: PCI bus at 33 MHz, later 66 MHz; no termination, one slot + one chip, short stubs, I/O).]
Vorlesung Rechnerarchitektur
Seite 155
Bus
Bus systems can be implemented as a pipeline in order to increase the data transfer rate (i860XP, PowerPC 620, XD-Bus, ...).
A possible division into phases is:
- arbitration
- addressing
- data transport
- status reply
Since these phases can be executed in parallel on different groups of lines, the data transfer rate can increase by up to a factor of 4.
Vorlesung Rechnerarchitektur
Seite 156
PCI - Peripheral Component Interconnect
The most important properties of the Peripheral Component Interconnect (PCI) bus:
• 32-bit data and address bus.
• Low cost through ASIC implementation.
• Transparent extension from a 32-bit data path (132 MB/s peak) to a 64-bit data path (264 MB/s peak).
• Variable burst length.
• Synchronous bus operation up to 33 MHz.
• Overlapped arbitration by a central arbiter.
• Data and address bus multiplexed to reduce the number of pins.
• Allows self-configuration of the PCI components through predefined configuration registers.
• Plug-and-play capable.
• Processor independent; supports future processor families (through a host bridge or direct implementation).
• Supports 64-bit addressing.
• Specified for 5 V and 3.3 V signalling.
• Multi-master capable; allows peer-to-peer accesses from any PCI master to any PCI master/slave.
• Hierarchical structure of several PCI bus levels.
• Parity for addresses and data.
• PCI components are compatible with existing driver and application software.
After revision 2.0 of the PCI specification, presented in April 1993, an extension to revision 2.1 is already in preparation; its most important change is the increase of the maximum clock frequency from 33 MHz to 66 MHz, which once more doubles the maximum transfer rate, to 528 MB/s with a 64-bit data path.
Vorlesung Rechnerarchitektur
Seite 157
PCI - Peripheral Component Interconnect
Required pins (of a PCI device)
- Addresses and data: A/D[31:0], C/BE[3:0]#, PAR
- Interface control: FRAME#, TRDY#, IRDY#, STOP#, DEVSEL#, IDSEL
- Error reporting: PERR#, SERR#
- Arbitration (masters only): REQ#, GNT#
- System: CLK, RST#
Optional pins
- 64-bit extension: A/D[63:32], C/BE[7:4]#, PAR64, REQ64#, ACK64#
- Interface control: LOCK
- Interrupts: INTA#, INTB#, INTC#, INTD#
- Cache support: SBO#, SDONE
- JTAG (IEEE 1149.1): TDI, TDO, TCK, TMS, TRST#
[Figure: pin diagram of a PCI device with the signal groups above; system block diagram - several CPUs and the memory on the host bus, a host bridge containing the PCI arbiter, and below it the PCI bus with LAN, graphics, SCSI and an I/O subsystem.]
Vorlesung Rechnerarchitektur
Seite 158
PCI - Peripheral Component Interconnect
Addressing
The physical address space of the PCI bus consists of three address ranges:
- Memory Address Space
- I/O Address Space
- Configuration Address Space
- Memory Address Space: A/D[31:0] = 00000000h-FFFFFFFFh (4 GB), selected by C/BE[3:0]# = 0110, 0111, 1100, 1110, 1111
- I/O Address Space: A/D[31:0] = 00000000h-FFFFFFFFh (4 GB), selected by C/BE[3:0]# = 0010, 0011
- Configuration Address Space: A/D[31:0] = 00000000h-FFFFFFFFh (4 GB), selected by C/BE[3:0]# = 1010, 1011
Vorlesung Rechnerarchitektur
Seite 159
PCI - Peripheral Component Interconnect
Bus Commands
Die Bus Commands zeigen dem Target die Art des Zugriffes an, die der Master anfordert
und bestimmen den Adreßraum, in den die Adresse fällt. Sie werden während der Adreßphase auf den C/BE[3:0]# Leitungen codiert und gelten für die gesamte nachfolgende Transaktion. Die Codes der einzelnen Bus Commands stehen in Tabelle 1.
Table 1: Definition of the bus commands
C/BE[3:0]# | Command Type
0000 | Interrupt Acknowledge
0001 | Special Cycle
0010 | I/O Read
0011 | I/O Write
0100 | Reserved
0101 | Reserved
0110 | Memory Read
0111 | Memory Write
1000 | Reserved
1001 | Reserved
1010 | Configuration Read
1011 | Configuration Write
1100 | Memory Read Multiple
1101 | Dual Address Cycle
1110 | Memory Read Line
1111 | Memory Write and Invalidate
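A small sketch collecting these command codes as a C enumeration (the identifier names are illustrative, not taken from the PCI specification headers):

/* PCI bus command encodings on C/BE[3:0]# during the address phase. */
enum pci_command {
    PCI_CMD_INT_ACK       = 0x0,
    PCI_CMD_SPECIAL       = 0x1,
    PCI_CMD_IO_READ       = 0x2,
    PCI_CMD_IO_WRITE      = 0x3,
    PCI_CMD_MEM_READ      = 0x6,
    PCI_CMD_MEM_WRITE     = 0x7,
    PCI_CMD_CFG_READ      = 0xA,
    PCI_CMD_CFG_WRITE     = 0xB,
    PCI_CMD_MEM_READ_MUL  = 0xC,
    PCI_CMD_DUAL_ADDR     = 0xD,
    PCI_CMD_MEM_READ_LINE = 0xE,
    PCI_CMD_MEM_WRITE_INV = 0xF,
};

/* Which commands select the configuration address space? */
static inline int is_config_cmd(enum pci_command c)
{
    return c == PCI_CMD_CFG_READ || c == PCI_CMD_CFG_WRITE;
}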
Vorlesung Rechnerarchitektur
Seite 160
PCI - Peripheral Component Interconnect
Bus cycles
[Figure: two PCI bus cycle timing diagrams over clock cycles 1-9 with the signals CLK, FRAME#, IRDY#, TRDY#, DEVSEL#, A/D and C/BE#; after the address phase the data phases (data1 ... data4) of a burst follow; wait states are inserted by deasserting IRDY# or TRDY#.]
Vorlesung Rechnerarchitektur
Seite 161
PCI - Peripheral Component Interconnect
Unlike accesses in the memory or I/O address space, which are addressed unambiguously by the address on A/D[31:0] and the bus command, a device is addressed in a configuration cycle by an additional signal: IDSEL, which has the function of a chip select. Each device has its own IDSEL, which is sampled only during the address phase when the bus commands on C/BE[3:0]# signal a Configuration Read or Write. For all other accesses the level of IDSEL is irrelevant. The configuration registers are addressed doubleword-wise; a doubleword corresponds to 32 bits. (The address bits A/D[1:0] are therefore not needed for address decoding.) The bytes within the addressed doubleword are selected by means of the byte enables C/BE[3:0]#.
[Figure: configuration cycle timing over clock cycles 1-5 - IDSEL is asserted during the address phase while C/BE# carries a configuration command (101x); FRAME#, IRDY#, TRDY# and DEVSEL# complete the transfer as in a normal bus cycle.]
The PCI bus distinguishes two types of configuration cycles, identified by the bit combination in A/D[1:0].
• Type 0 configuration cycles (A/D[1:0] = '00') are all cycles with which the bridge addresses devices located on the bus assigned to that bridge. A bridge is a device that establishes the connection between different bus levels (or bus systems).
• Type 1 (A/D[1:0] = '01') applies to configuration cycles that concern devices located in subordinate PCI bus hierarchies.
Vorlesung Rechnerarchitektur
Seite 162
PCI - Peripheral Component Interconnect
The information contained in the doubleword address A/D[31:2] depends on the type of the configuration cycle.
Type 0: A/D[31:11] reserved | A/D[10:8] Function Number | A/D[7:2] Register Number | A/D[1:0] = 00
Type 1: A/D[31:24] reserved | A/D[23:16] Bus Number | A/D[15:11] Device Number | A/D[10:8] Function Number | A/D[7:2] Register Number | A/D[1:0] = 01
Address formats of configuration cycles
The bit combinations in A/D[31:11] have no meaning for Type 0, and those in A/D[31:24] have no meaning for Type 1.
Bus Number
Specifies the number of the bus on which the device to be configured is located. By a hierarchically staggered arrangement of several PCI buses, PCI allows up to 256 bus levels.
Device Number
Selects one of the 32 possible target devices on each bus level for which the configuration cycle is intended.
Function Number
Addresses one of the at most 8 different functions of a multifunction device.
Register Number
Doubleword address within the configuration register space of 64 doublewords.
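As a small illustration, the doubleword address of a Type 1 configuration cycle can be assembled from these fields like this (a sketch of the field packing only; how the host bridge actually issues the cycle, e.g. via the CONFIG_ADDRESS/CONFIG_DATA ports on x86, is outside this picture):

/* Sketch: composing the address of a Type 1 configuration cycle. */
#include <stdint.h>

static uint32_t pci_cfg_type1_addr(unsigned bus, unsigned dev,
                                   unsigned fn, unsigned reg)
{
    return ((uint32_t)(bus & 0xFF) << 16) |   /* [23:16] bus number      */
           ((uint32_t)(dev & 0x1F) << 11) |   /* [15:11] device number   */
           ((uint32_t)(fn  & 0x07) <<  8) |   /* [10:8]  function number */
           ((uint32_t)(reg & 0x3F) <<  2) |   /* [7:2]   register number */
           0x1;                               /* [1:0] = 01 -> Type 1    */
}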
Vorlesung Rechnerarchitektur
Seite 163
PCI - Peripheral Component Interconnect
[Figure: hierarchical PCI bus structure - CPU and memory sit on the host/system bus; a Host-to-PCI bridge leads to PCI bus number x; a PCI-to-PCI bridge on that bus leads to the subordinate PCI bus number y; each device on every level is selected by bus number, device number, function number and register number.]
Hierarchical PCI bus structure
Vorlesung Rechnerarchitektur
Seite 164
PCI - Peripheral Component Interconnect
[Figure: generation of IDSEL - (a) separate IDSEL lines: a decoder in the PCI bridge decodes the device number and drives a dedicated IDSEL line to each PCI slot (slots 1-3); (b) IDSEL from A/D[13:11]: each slot's IDSEL pin is wired to one of the upper address lines A/D[x], A/D[y], A/D[z], so that the device number of a configuration access is decoded 1-of-21 onto the upper address bits of the Type 0 cycle.]
Mapping of IDSEL onto the upper address bits
Vorlesung Rechnerarchitektur
Seite 165
PCI - Peripheral Component Interconnect
Configuration
The first 64 bytes (offsets 00h-3Fh) form the configuration space header defined by the PCI specification (revision 2.0); the registers from 40h to FFh are vendor-defined configuration registers.
Offset | Contents (bits 31 ... 0)
00h | Device ID | Vendor ID
04h | Status | Command
08h | Class Code | Revision ID
0Ch | BIST | Header Type | Latency Timer | Cache Line Size
10h - 24h | Base Address Registers
28h | Reserved
2Ch | Reserved
30h | Expansion ROM Base Address
34h | Reserved
38h | Reserved
3Ch | Max_Lat | Min_Gnt | Interrupt Pin | Interrupt Line
Configuration Space Header
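A sketch of this header as a C structure (the field order assumes the little-endian byte layout of the configuration space; a real driver would additionally mark the structure as packed):

/* Sketch of the predefined configuration space header (offsets 00h-3Fh). */
#include <stdint.h>

struct pci_config_header {
    uint16_t vendor_id;            /* 00h */
    uint16_t device_id;
    uint16_t command;              /* 04h */
    uint16_t status;
    uint8_t  revision_id;          /* 08h */
    uint8_t  class_code[3];
    uint8_t  cache_line_size;      /* 0Ch */
    uint8_t  latency_timer;
    uint8_t  header_type;
    uint8_t  bist;
    uint32_t base_address[6];      /* 10h - 24h */
    uint32_t reserved1[2];         /* 28h, 2Ch  */
    uint32_t expansion_rom_base;   /* 30h */
    uint32_t reserved2[2];         /* 34h, 38h  */
    uint8_t  interrupt_line;       /* 3Ch */
    uint8_t  interrupt_pin;
    uint8_t  min_gnt;
    uint8_t  max_lat;
};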
Vorlesung Rechnerarchitektur
Seite 166
PCI - Peripheral Component Interconnect
Base Class | Meaning
00h | Devices built before the base class code definition was completed
01h | Mass storage controller
02h | Network controller
03h | Display controller
04h | Multimedia controller
05h | Memory controller
06h | Bridge device
07h-FEh | Reserved
FFh | Devices that fit into none of the base classes above
Base Classes
Base Class 06h
Sub Class | Prog. If. | Meaning
00h | 00h | Host bridge
01h | 00h | ISA bridge
02h | 00h | EISA bridge
03h | 00h | MC bridge
04h | 00h | PCI-to-PCI bridge
05h | 00h | PCMCIA bridge
80h | 00h | Other bridge devices
Base Class 06h and its Sub Classes
Vorlesung Rechnerarchitektur
Seite 167
PCI - Peripheral Component Interconnect
[Figure: layout of the base address registers - memory BAR: bit 0 = 0 (memory space indicator), bits [2:1] = type, bit 3 = prefetchable, base address in the upper bits [31/63:4]; I/O BAR: bit 0 = 1 (I/O space indicator), bit 1 reserved, base address in bits [31:2].]
Layout of the Base Address Registers
[Figure: register model of a base address register - the low address bits corresponding to the size of the address range (here 1 MB, read mask 11111111111100000000000000000000) are hardwired to 0; the register bits are set by writing FFFFFFFFh and the implemented bits are read back, so the size of the requested address range can be determined.]
Register model of the Base Address Registers
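A sketch of how configuration software uses this register model to determine the size of a BAR (the cfg_read32/cfg_write32 accessors for configuration space are hypothetical helpers, not part of any standard API):

/* Sketch: BAR sizing by writing all ones and reading back. */
#include <stdint.h>

extern uint32_t cfg_read32 (unsigned bus, unsigned dev, unsigned fn, unsigned reg);
extern void     cfg_write32(unsigned bus, unsigned dev, unsigned fn, unsigned reg,
                            uint32_t value);

static uint32_t bar_size(unsigned bus, unsigned dev, unsigned fn, unsigned bar_reg)
{
    uint32_t saved = cfg_read32(bus, dev, fn, bar_reg);

    cfg_write32(bus, dev, fn, bar_reg, 0xFFFFFFFFu);    /* set on write FFFFFFFFh */
    uint32_t probe = cfg_read32(bus, dev, fn, bar_reg); /* read back implemented bits */
    cfg_write32(bus, dev, fn, bar_reg, saved);          /* restore the original base  */

    probe &= ~0xFu;            /* mask the type/prefetchable bits (memory BAR) */
    return ~probe + 1;         /* lowest writable bit gives the range size     */
}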
Interrupt Pin
read: This register indicates which interrupt pin the device uses. The decimal value 1 means INTA#, 2 INTB#, 3 INTC# and 4 INTD#. With the decimal value 0 the device indicates that it does not use interrupts.
Interrupt Line
r/w: The value of this register indicates to which system interrupt pin the device's interrupt pin is connected. The configuration software can use this value, for example, to assign priorities. The values of this register are system dependent.
Vorlesung Rechnerarchitektur
Seite 168
PCI - Peripheral Component Interconnect
Arbitration
[Figure: PCI bus arbitration timing over clock cycles 1-7 - while master 1 performs its access (address and data phases framed by FRAME# on A/D), master 2 already asserts REQ#2; the arbiter removes GNT#1 and asserts GNT#2, so master 2 can start its access immediately afterwards (overlapped arbitration).]
[Figure: central PCI arbiter with a grant timer (gtimer) - it receives REQ0_, REQ1_, REQ2_ from masters 0-2 and returns GNT0_, GNT1_, GNT2_; FRAME# and IRDY# are observed to detect the end of the current transaction on the PCI bus.]
Vorlesung Rechnerarchitektur
Seite 169
Modern Peripheral Interfaces
What are the available interfaces for peripheral devices?
standard:
• PCI-X, PCIe, HT, (cHT)
proprietary:
• system bus, integrated solutions
Features | PCI-X | PCI-Express (PCIe) | HyperTransport (HT)
- number of signal lines: 64 + 39 | 4, 8, 16, 32, 64 | 26, 36, 57, 105, 199
- multiplexed operation: yes, Address/Data | yes | yes, message oriented
- data width (link width): 32/64 bit | 2, 4, 8, 16, 32 bit | 2, 4, 8, 16, 32 bit
- usage: I/O bus, peripheral extension | I/O bus, peripheral extension | I/O bus**, peripheral extension
- operation mode: fully synchronous, clocked | source synchronous, 8B/10B coded data | source synchronous, 1 clock per byte
- clock frequency / data transmission: 0 - 33/66 MHz, 100/133 (266*) MHz | 2.5 GHz | 200 - 800 MHz (1 - 1.6 GHz)
- signal transmission: CMOS level, clock synchronous, reflective wave signalling | CML level, serial, differential, coded, embedded clock | 600 mV differential levels, NRZ, serial, DDR (double data rate), packetized
- termination: no | 100-110 Ohm | 100 Ohm, on chip, overdamped
- burst transfers: yes, many modes, 4x burst, arbitrary length | yes, message transfers | yes, command + message transfers
- split transactions: yes | yes | yes
- max. bandwidth: 533 MB/s @ 66 MHz, 64 bit; 1 GB/s @ 133 MHz, 64 bit | 2 x 2.5 Gbit/s @ 2 bit; 10 GB/s @ 32 bit | 0.2 GB/s @ 200 MHz, 2 bit; 12.8 GB/s @ 1600 MHz, 32 bit
- max. no. of devices: bridge + 4/2/1 devices, 1 I/O device @ 133 MHz | point to point | point to point, bidirectional
- max. length of signal lines: approx. 10 cm | approx. 3-10 cm on FR4*** | approx. 3-10 cm on FR4
- standard: industry (Intel) + IEEE | industry (Intel) + consortium | industry (AMD) + consortium
- spec. page count: approx. 220 | approx. 420 | approx. 330
- web info: www.pcisig.org | www.intel.com | www.hypertransport.org
*) DDR double data rate transfer
**) extended version for CPU interconnect with cache coherency protocol
***) PCB material
Vorlesung Rechnerarchitektur
Seite 170
PCI-X
Peripheral Component Interconnect
Features:
Available in many node computers; servers use switched architectures. Synchronous interface controlled by a single-ended clock.
In the 133 MHz mode, only one I/O device is allowed on the "bus" (bridge-to-device).
In the future it will be replaced by PCIe because of the reduced pin count and higher bandwidth.
The PCI-X bus cycle shows the overhead associated with a burst transfer without target wait states: 2 clk arbitration + 2 clk address/attribute + 2 clk target response and turnaround + n data phases of 8 B each; at n = 4 this gives a ratio of 6 overhead clocks to 4 data clocks for a data size of 32 B.
At 133 MHz one clock cycle is 7.5 ns.
n1/2 (the burst length at which half of the peak bandwidth is reached) corresponds to 6 data transfer cycles of 8 B each, i.e. n1/2 = 48 B.
The peak bandwidth is 1 GB/s; the 'real' bandwidth is around 900 MB/s for long bursts.
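A small sketch of this overhead calculation - effective bandwidth as a function of the burst length, using the numbers above (6 overhead clocks, 8 B per data phase, 7.5 ns cycle time):

/* Sketch: effective PCI-X bandwidth versus burst length. */
#include <stdio.h>

int main(void)
{
    const double clk_ns = 7.5;       /* 133 MHz clock            */
    const int overhead_clk = 6;      /* arb + addr + resp/turn   */
    const int bytes_per_phase = 8;   /* 64-bit data path         */

    for (int n = 1; n <= 64; n *= 2) {
        double bytes   = (double)n * bytes_per_phase;
        double time_ns = (overhead_clk + n) * clk_ns;
        /* bytes per ns equals GB/s; multiply by 1000 for MB/s */
        printf("burst %4.0f B: %6.1f MB/s\n", bytes, bytes / time_ns * 1000.0);
    }
    return 0;
}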
Vorlesung Rechnerarchitektur
Seite 171
PCI-Express (PCIe)
SERDES based peripheral interconnects
Performance:
• Low-overhead, low-latency communications to maximize application payload bandwidth
and link efficiency
• High-bandwidth per pin to minimize pin count per device and connector interface
• Scalable performance via aggregated lanes and signaling frequency:
x1, x2, x4, x8 and x16 links; Gen1 2.5 Gb/s, Gen2 5 Gb/s, Gen3 8 Gb/s per lane (introduced in 2010)
The fundamental PCI Express link consists of two low-voltage, differentially driven signal pairs: a transmit pair and a receive pair. Combining many lanes (one differential pair per direction each) provides high bandwidth; e.g. x16 is a bidirectional link with 2.5 Gb/s per lane, delivering a raw data rate of 2 x 40 Gb/s = 10 GB/s.
PCIe shows a lower latency than PCI-X because typically it comes directly from the ’root
complex’ (north bridge). Serializer latency is very implementation dependent.
[PCIeSpec]
[Figure: a x16 PCIe link.]
A switch is defined as a logical assembly of multiple virtual PCI-to-PCI bridge devices.
Advanced Switching Interconnect (ASI) is based on the physical layer of PCIe and was aimed at the interconnect 'in the rack/cabinet'. ASI did not get any real market share. :-(
Vorlesung Rechnerarchitektur
Seite 172
Hypertransport (HT)
(1)
AMD's IO-HT is an open standard; cHT (coherent HyperTransport) is available only to large customers. HyperTransport is intended to support 'in-the-box' connectivity (motherboard).
The architecture of the HyperTransport I/O link can be mapped into five different layers
[HT-WP]:
The physical layer defines the physical and electrical characteristics of the protocol. This
layer interfaces to the physical world and includes data, control, and clock lines.
The data link layer includes the initialization and configuration sequence, periodic cyclic
redundancy check (CRC), disconnect/reconnect sequence, information packets for flow control and error management, and doubleword (32bits) framing for other packets (data packet
sizes from 4 - 64 Bytes).
The protocol layer includes the commands, the virtual channels in which they run, and the
ordering rules that govern their flow.
The transaction layer uses the elements provided by the protocol layer to perform actions,
such as reads and writes.
The session layer includes rules for negotiating power management state changes, as well
as interrupt and system management activities.
[Figure: HyperTransport I/O chain - two processors attach to a north bridge/memory controller (with AGP and memory); an HT-IO link runs downstream through a bridge and tunnel devices (NIC to a system area network, LAN, disk) and ends in a cave device (super IO); responses travel upstream.]
HT supports several methods of data transfer between devices: PIO, DMA, and peer-to-peer (the latter involves the bridge device, which connects the upstream with the downstream link). Interrupts are signalled by sending interrupt messages [HT-MS].
Vorlesung Rechnerarchitektur
Seite 173
Hypertransport
(2)
An HT bus uses coupon-based flow control to avoid receiver overrun. Coupons (credits) flow back with NOP control packets (idle link signaling) of 4 bytes.
Control and data packets are distinguished by the CTL signal line. Control packets are separated into request, response and information packets. Read requests carry a 40-bit address and are executed as split transactions. A sized read request is an 8 B packet which is answered with a read response packet of 4 B followed by a data packet of 4-64 B (an overhead of 12 B for 4 to 64 B of data in the best case).
Packet types and sizes:
- control packets: information (4 B), request (4 or 8 B), response (4 B)
- data packets: 4-64 B
The physical layer uses a modified LVDS differential signaling.
[Figure: LVDS signal transmission and termination; control and data packets are multiplexed on the bidirectional HT bus.]
Control packets may be inserted into data packets at any 4B boundary. Only one data packet
is active at a time. The CTL signal distinguishes between control and data packets.
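The coupon mechanism can be sketched as follows (a sketch of the principle only; packet formats, credit counts per virtual channel and the NOP encoding are defined by the HT specification and are not modeled here):

/* Sketch of coupon (credit) based flow control. */
#include <stdbool.h>

typedef struct {
    int credits;     /* free receiver buffer entries announced so far */
} tx_state_t;

static bool can_send(const tx_state_t *tx)
{
    return tx->credits > 0;          /* no coupon -> sender must wait */
}

static void send_packet(tx_state_t *tx)
{
    if (can_send(tx))
        tx->credits--;               /* one coupon consumed per packet */
}

static void receive_nop(tx_state_t *tx, int returned_credits)
{
    /* NOP control packets carry freed coupons back to the sender */
    tx->credits += returned_credits;
}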
Vorlesung Rechnerarchitektur 1
Seite 174
I/O-devices
I/O-devices: Basics
An I/O-device is a resource of a computer system, which implements the function of a specific I/O-interface.
An I/O-interface is used to establish communication between a computer system and the outside world (may be another computer system or a peripheral device like a printer).
[Figure: generic I/O device block diagram - the CPU with cache sits on the system bus (Addr/Data); an I/O bridge connects it to the I/O bus (typically no cache coherency), on which the I/O device presents its I/O interface registers: device command register CR, device status register SR, device data register (read) DRR and device data register (write) DRW.]
The minimal set of registers is:
• control register CR,
• status register SR,
• data register read DRR,
• data register write DRW.
The CR is used to bring the device into a specific operating mode. The content of the register
is very device specific. An enable bit for the activation of the output register and the input
register is normally included.
The status register signals the internal state of the device, e.g. if the transmitter register is
empty (TX_EMPTY) or the receiver register is full (RX_FULL).
Vorlesung Rechnerarchitektur 1
Seite 175
I/O-devices
The data registers are used to transfer data into and out of the device typically using programmed I/O (PIO).
In order to access the device it must be placed in the address space of the system.
• memory mapped device
• I/O space for access
Using special I/O instructions to access the device directly selects a predefined address space which is not accessible to other instruction types => restrictions in use.
As the name suggests, a memory mapped device is placed in the memory address space of
the processor. All instructions can be used to access the device (typically load/store). Special
care must be taken if the processor can reorder load/store instructions. Caching of this
address space should be turned off.
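A sketch of programmed I/O to such a memory mapped device (the register layout follows the generic block diagram above; the base address 0xF000_0000 and the status bits are illustrative assumptions, not a real device):

/* Sketch: PIO access to a memory mapped device. */
#include <stdint.h>

typedef struct {
    volatile uint32_t cr;    /* control register        */
    volatile uint32_t sr;    /* status register         */
    volatile uint32_t drr;   /* data register (read)    */
    volatile uint32_t drw;   /* data register (write)   */
} io_device_t;               /* volatile: no caching or reordering of accesses */

#define DEV          ((io_device_t *)0xF0000000u)  /* assumed base address */
#define SR_TX_EMPTY  (1u << 0)
#define SR_RX_FULL   (1u << 1)

static void pio_putc(uint8_t c)
{
    while (!(DEV->sr & SR_TX_EMPTY))   /* poll until the transmitter is free */
        ;
    DEV->drw = c;                      /* write data register                */
}

static uint8_t pio_getc(void)
{
    while (!(DEV->sr & SR_RX_FULL))    /* poll until data has arrived        */
        ;
    return (uint8_t)DEV->drr;          /* read data register                 */
}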
[Figure: address space partitioning for memory mapped devices (example!) - the processor address space from 0x0000_0000 to 0xFFFF_FFFF is split into the memory space and an I/O space from 0xF000_0000 to 0xFFFF_FFFF; within the I/O space the device registers of device A and device B are mapped, with free space in between.]
Vorlesung Rechnerarchitektur 2
Seite 176
Direct Memory Access (DMA)
DMA Basics
Definition :
A direct memory access (DMA) is an operation in which data is copied
(transported) from one resource to another resource in a computer system without the involvement of the CPU.
The task of a DMA-controller (DMAC) is to execute the copy operation of data from one
resource location to another. The copy of data can be performed from:
- I/O-device to memory
- memory to I/O-device
- memory to memory
- I/O-device to I/O-device
A DMAC is a resource of a computer system that is independent of the CPU and is added for the concurrent execution of DMA operations. The first two operation modes are 'read from' and 'write to' transfers between an I/O device and the main memory, which are the common operations of a DMA controller. The other two operations are slightly more difficult to implement, and most DMA controllers do not implement device-to-device transfers.
[Figure: simplified logical structure of a system with DMA - CPU, DMA controller, memory and I/O device share the address/data bus under control of an arbiter; the I/O device requests a transfer with REQ and receives ACK from the DMA controller.]
The DMAC replaces the CPU for the transfer task of data from the I/O-device to the main
memory (or vice versa) which otherwise would have been executed by the CPU using the
programmed input output (PIO) mode. PIO is realized by a small instruction sequence executed by the processor to copy data. The ’memcpy’ function supplied by the system is such
a PIO operation.
The DMAC is a master/slave resource on the system bus, because it must supply the addresses for the resources being involved in a DMA transfer. It requests the bus whenever a data
value is available for transport, which is signaled from the device by the REQ signal.
The functional unit DMAC may be integrated into other functional units in a computer system, e.g. the memory controller, the south bridge, or directly into an I/O-device.
Vorlesung Rechnerarchitektur 2
Seite 177
Direct Memory Access (DMA)
DMA Operations
A lot of different operating modes exist for DMACs. The simplest one is the single block transfer, copying a block of data from a device to memory. For the more complex operations please refer to the literature [Mot81]. Here, only a short list of operating modes is given:
- single block transfer
- chained block transfers
- linked block transfers
- fly-by transfers
All these operations normally access the block of data in a linear sequence. Nevertheless, more useful access functions are possible, such as:
constant stride, constant stride with offset, incremental stride, ...
[Figure: execution of a DMA operation (single block transfer) - the CPU builds a descriptor in memory (1, 1a, 1b) and writes the command either to the DMAC's DMA command register (2a) or to a command area in memory (2b); the DMAC, containing device base register, memory base register, block length register and temporary data register, addresses the device data register (3), reads the data into the temporary data register (4), addresses the memory block (5) and writes the data there (6).]
Execution of a DMA-operation (single block transfer)
The CPU prepares the DMA operation by constructing a descriptor (1) containing all necessary information for the DMAC to perform the DMA operation independently (an offload engine for data transfer). It initializes the operation by writing a command to a register in the DMAC (2a) or to a specially assigned memory area (command area), where the DMAC can poll for the command and/or the descriptor (2b). Then the DMAC addresses the device data register (3) and reads the data into a temporary data register (4). In another bus transfer cycle, it addresses the memory block (5) and writes the data from the temporary data register to the memory block (6).
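A sketch of such a single block transfer descriptor and of the steps the CPU performs to start the operation (register and field names follow the figure above; the concrete bit layout of the command word is an illustrative assumption):

/* Sketch: descriptor and start sequence for a single block transfer. */
#include <stdint.h>

typedef struct {
    uint32_t device_base;   /* address of the device data register */
    uint32_t mem_base;      /* start address of the memory block   */
    uint32_t block_length;  /* number of bytes to copy             */
    uint32_t command;       /* direction, mode, start bit, ...     */
} dma_descriptor_t;

typedef struct {
    volatile uint32_t command;       /* DMA command register   */
    volatile uint32_t device_base;   /* device base register   */
    volatile uint32_t mem_base;      /* memory base register   */
    volatile uint32_t block_length;  /* block length register  */
} dmac_regs_t;

#define DMA_CMD_START   (1u << 0)    /* assumed command bits */
#define DMA_CMD_DEV2MEM (1u << 1)

/* (1) the CPU builds the descriptor, (2a) writes it into the DMAC registers */
static void dma_start(dmac_regs_t *dmac, const dma_descriptor_t *d)
{
    dmac->device_base  = d->device_base;
    dmac->mem_base     = d->mem_base;
    dmac->block_length = d->block_length;
    dmac->command      = d->command | DMA_CMD_START;  /* kick off the transfer */
}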
Vorlesung Rechnerarchitektur 2
Seite 178
Direct Memory Access (DMA)
DMA Operations
The DMAC increments the memory block address and continues with this loop until the block length is reached. The completion of the DMA operation is signaled to the processor by sending an IRQ signal or by setting a memory semaphore variable, which can be tested by the CPU.
Further aspects:
- multiple channels
- physical addressing, address translation
- snooping for cache coherency
DMA control signals (REQ, ACK) are used to signal the availability of values in the I/O device for transportation.
The DMAC uses bus bandwidth, which may slow down processor execution through bus conflicts (solution for high performance systems: use a crossbar (xbar) as interconnect!).
[Figure: a DMA descriptor in memory for a memory-to-memory transfer - source memory base pointer, destination memory base pointer, command and block length; the descriptor points to the source memory block and to the (padded) destination memory block.]
[Flik] Thomas Flik: Mikroprozessortechnik - CISC, RISC, Systemaufbau, Assembler und C, Springer-Verlag, 6. Aufl., 2001.
Vorlesung Rechnerarchitektur 2
Seite 179
Completion signaling of an operation
Completion Signaling
For all communication functions it is important to know when an operation is completed. Signalling this event to the process interested in this information is surprisingly difficult.
The most common way is to raise an interrupt, which stops the normal processing of the CPU and activates the interrupt handler. Even though interrupt processing has sped up significantly in recent years, it still needs to save the CPU state of the currently running process. This produces overhead, and what makes it worse is that in the newest processors the register file is larger than ever.
Design decisions:
• IRQ | polling at a device register | replication/mirroring in main memory | notification queue | thread scheduling
• application processor - communication processor model, active messages
NICs like InfiniBand use the concept of a notification queue: for every communication instruction a corresponding entry is written into the completion notification queue when the operation has finished. This entry can be tested by the user process owning the queue.
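A sketch of such a completion notification queue polled by the user process (a ring buffer in main memory written by the NIC or DMAC; the layout and field names are illustrative assumptions, not the InfiniBand verbs API):

/* Sketch: polling a completion notification queue. */
#include <stdint.h>
#include <stdbool.h>

#define CQ_ENTRIES 256

typedef struct {
    volatile uint32_t done;   /* set by the device when the operation finished */
    uint32_t op_id;           /* which communication instruction it belongs to */
    uint32_t status;          /* success / error code                          */
} cq_entry_t;

typedef struct {
    cq_entry_t entry[CQ_ENTRIES];
    uint32_t   head;          /* next entry the owning process will consume    */
} completion_queue_t;

/* Non-blocking poll: returns true and copies the entry if a completion arrived. */
static bool cq_poll(completion_queue_t *cq, cq_entry_t *out)
{
    cq_entry_t *e = &cq->entry[cq->head % CQ_ENTRIES];
    if (!e->done)
        return false;         /* nothing completed yet, keep computing */
    *out    = *e;
    e->done = 0;              /* hand the slot back to the device      */
    cq->head++;
    return true;
}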
Vorlesung Rechnerarchitektur 2
Seite 180
Direct Memory Access (DMA)
left intentionally blank