Multicore processors

Dezső, Sima
Copyright © 2013 Typotex Kiadó
Abstract
One of the characteristic trends in computing is the rapid development of processors and processor architectures.
Starting in 2005-2006, multicore processors gained ground completely and became the typical processor type
in today's notebooks, desktops, and servers. Intel plays the leading role in this processor category, with a
world market share of about 80%, while AMD is the second-ranked player in the field. In view of this, the
course material focuses on the basic architectures of Intel and AMD, presenting the evolution of the
instruction sets, microarchitectures, power dissipation management techniques, and system architectures
of successive processor families, as well as the range of processors released. The basic aim of the material
is to highlight the developments in the field of multicore processors and to familiarize students with them.
Gacsal József, Intel Hungary, Business Development Director
Contents
Part I. Introduction .............................................................................................................................. 1
1. Introduction ........................................................................................................................... 3
1. Foreword ...................................................................................................................... 3
2. The mobile boom and its consequences to computer architectures ............................. 3
3. Consequences of the low power requirement of mobile devices for Intel and AMD .. 5
4. Foreseeable market situation ....................................................................................... 5
5. Intel’s response to the mobile challenge ...................................................................... 5
6. Evolution of Intel’s basic architectures [2] .................................................................. 5
7. AMD’s response to the mobile challenge .................................................................... 6
8. Evolution of AMD’s basic architectures ..................................................................... 6
9. Overview of Intel’s and AMD’s actual processor lines ............................................... 6
10. Scope of these slides .................................................................................................. 7
11. Reasons for this decision .......................................................................................... 7
2. References ............................................................................................................................. 8
Part II. Intel’s Core 2-based processor lines ....................................................................................... 9
1. Introduction ......................................................................................................................... 15
1. The evolution of Intel’s basic microarchitectures ...................................................... 15
2. Intel’s Tick-Tock model ............................................................................................ 15
3. Basic architectures and their related shrinks .............................................................. 16
2. The Core 2 line .................................................................................................................... 19
1. 2.1 Introduction ......................................................................................................... 19
2. 2.2 Major innovations of the Core 2 line ................................................................... 20
2.1. 2.2.1 Wide execution ..................................................................................... 20
2.1.1. 4-wide core ....................................................................................... 20
2.1.2. Enhanced execution resources .......................................................... 22
2.1.3. Performance leadership changes between Intel and AMD ............... 23
2.1.4. Example 1: DP web-server performance comparison (2003) ........... 24
2.1.5. Example 2: Summary assessment of extensive benchmark tests contrasting dual Opterons vs dual Xeons (2003) [7] ..... 24
2.1.6. Example: DP web-server performance comparison (2006) .............. 25
2.2. 2.2.2 Smart L2 cache ..................................................................................... 25
2.2.1. Shared L2 cache ................................................................................ 25
2.2.2. Benefits of shared caches .................................................................. 26
2.2.3. Drawbacks of shared caches ............................................................. 26
2.3. 2.2.3 Smart memory accesses ....................................................................... 26
2.3.1. Hardware prefetchers [9] .................................................................. 26
2.3.2. Intensive use of hardware prefetchers [11] ....................................... 27
2.3.3. Hardware prefetchers within the Core 2 microarchitecture .............. 27
2.4. 2.2.4 Enhanced digital media support ........................................................... 27
2.4.1. Widening the width of FP/SSE Execution units from 64-bit to 128-bit 27
2.4.2. Overview of the x86 ISA extensions in Intel’s processor lines .......... 28
2.4.3. Achieved performance boost in Core 2 for gaming apps .................. 30
2.5. 2.2.5 Intelligent Power management ............................................................. 31
2.5.1. Ultra fine grained power control ....................................................... 31
2.5.2. Platform Thermal Control ................................................................ 31
2.5.3. Possible solution for the Platform Thermal Control Manager [88] .. 32
3. 2.3 Overview of Core 2 based processor lines ........................................................... 32
3. The Penryn line ................................................................................................................... 34
1. 3.1 Introduction ......................................................................................................... 34
1.1. Penryn ........................................................................................................... 34
2. 3.2 Key enhancements of Penryn line ....................................................................... 35
2.1. 3.2.2 More advanced power management ..................................................... 36
2.1.1. Deep Power Down technology (DPD) .............................................. 36
2.1.2. Enhanced Dynamic Acceleration Technology (EDAT) (for mobiles) 38
2.1.3. Overall performance achievements with Penryn (1) ........................ 39
3. 3.3 Overview of Penryn based processor lines .......................................................... 40
4. The Nehalem line ................................................................................................................ 42
1. 4.1 Introduction to the 1. generation Nehalem line (Bloomfield) ............................. 42
1.1. Die shot of the 1. generation Nehalem desktop processor (Bloomfield) [45] 43
2. 4.2 Major innovations of the 1. generation Nehalem line [54] .................................. 44
2.1. 4.2.1 Simultaneous Multithreading (SMT) ................................................... 45
2.1.1. Performance gains of SMT ............................................................... 46
2.2. 4.2.2 New cache architecture ........................................................................ 47
2.2.1. Distinguished features of Nehalem’s cache architecture .................. 47
2.3. 4.2.3 Integrated memory controller ............................................................... 48
2.3.1. Main features .................................................................................... 49
2.3.2. Benefit of integrated memory controllers ......................................... 49
2.3.3. Drawback of integrated memory controllers .................................... 49
2.3.4. Non Uniform Memory Access (NUMA) ......................................... 49
2.3.5. Memory latency comparison: Nehalem vs Penryn ........................... 50
2.4. 4.2.4 QuickPath Interconnect bus (QPI) ...................................................... 51
2.4.1. Signals of the QuickPath Interconnect bus (QPI bus) ...................... 52
2.4.2. QuickPath Interconnect bus (QPI) .................................................... 52
2.4.3. QPI based DP and MP server system architectures ......................... 52
2.4.4. Comparison of the transfer rates of the QPI, FSB and HT buses .... 53
2.4.5. The notion of “Uncore” .................................................................... 53
2.5. 4.2.5 Enhanced power management .............................................................. 54
2.5.1. Nehalem’s Turbo Mode .................................................................... 55
2.5.2. ACPI states [26] ............................................................................... 55
2.6. 4.2.6 New socket ........................................................................................... 58
3. 4.3 Major innovations of the 2. generation Nehalem line (Lynnfield) (1) [46] ......... 58
3.1. Major innovations of the 2. generation Nehalem line (Lynnfield) (2) [46] ... 58
3.2. Evolution of providing PCIe lanes for graphics ............................................ 59
3.3. Evolution of the topology and type of available PCIe lanes for graphics cards 59
3.4. Major innovations of the 2. generation Nehalem line (Lynnfield) (3) [46] ... 60
3.5. Die photos of the 1. and 2. gen. Nehalem desktop chips ............................... 60
5. The Nehalem-EX line ......................................................................................................... 62
1. 5.1 Introduction ......................................................................................................... 62
1.1. Overview of the Nehalem-EX based processor lines (based on [44]) ........... 62
2. 5.2 Major innovations of the Nehalem-EX processors .............................................. 62
2.1. 5.2.1 Overview ............................................................................................. 62
2.2. 5.2.2 Native 8 cores with 24 MB L3 cache (LLC) [55] ................................ 62
2.2.1. Die micrograph of the 8 core Nehalem-EX (Xeon 7500/Beckton) MP server [71], [72] ..... 63
2.3. 5.2.3 On-die ring interconnect bus [56] ........................................................ 63
2.4. 5.2.4 Serial memory channels [55] ................................................................ 64
2.5. 5.2.5 Scalable platform configurations [55] .................................................. 65
3. 5.3 Performance features of the 8-core Nehalem-EX based Xeon 7500 vs the Penryn based 6-core Xeon 7400 [67] ..... 66
6. The Westmere line .............................................................................................................. 67
1. 6.1 Introduction ......................................................................................................... 67
1.1. Westmere 2-core and 6-core die plots [57] ................................................... 67
2. 6.2 Key enhancements of the Westmere lines vs. the Nehalem lines [44] ................ 68
2.1. Overview of the Westmere lines ................................................................... 68
3. 6.3 Dual-core Westmere-based mobile/desktop lines ............................................... 68
3.1. 6.3.1 Overview .............................................................................................. 68
3.2. 6.3.2 Innovations and enhancements of the dual-core mobile/desktop lines . 69
3.2.1. 6.3.2.1 Overview .............................................................................. 69
3.2.2. 6.3.2.2 In-package integrated CPU/GPU for the 2 core mobile and desktop segments ..... 69
3.2.3. 6.3.2.3 Enhanced Turbo Boost technology in the mobile Arrandale line [57] ..... 73
4. 6.4 The six core Westmere-based desktop line .......................................................... 73
4.1. Platform and main features of the six core Westmere-based desktop line [93] 74
5. 6.5 Six core Westmere-EP server lines ...................................................................... 74
5.1. Native 6 cores with 12 MB L3 cache (LLC) for UP/DP servers [58] ........... 74
5.2. Overview of the models of the Westmere-EP based Xeon 5600 family [94] 74
5.3. Example Westmere-EP DP server platform [57] .......................................... 75
7. The Westmere-EX line ....................................................................................................... 76
1. 7.1 Introduction ......................................................................................................... 76
2. 7.2 Key enhancement of the Westmere-EX line vs. the Nehalem-EX server line [95] 76
3. 7.3 Selected details of the Westmere-EX processors ................................................. 77
3.1. 7.3.1 Native 10 cores with 30 MB L3 cache (LLC) [60] .............................. 77
3.2. 7.3.2 Basic building blocks of the Westmere-EX processor (10 cores/30 MB L3 cache) (LLC) [60] ..... 77
3.3. 7.3.3 Interconnection of the basic building blocks of the Westmere-EX processors [60] ..... 78
8. The Sandy Bridge line ......................................................................................................... 79
1. 8.1 Introduction ......................................................................................................... 79
1.1. Overview of the Sandy Bridge family ........................................................... 79
1.2. Overview of the Sandy Bridge based processor lines ................................... 79
1.3. Main functional units of Sandy Bridge [96] .................................................. 80
2. 8.2 Major innovations of the Sandy Bridge line vs. the 1. generation Nehalem line [61] ..... 81
2.1. 8.2.1 Overview .............................................................................................. 81
2.2. 8.2.2 Extension of the ISA (of the cores) by the AVX instruction set .......... 82
2.3. 8.2.2 Extension of the ISA (of the cores) by the AVX instruction set (Based on [18]) ..... 82
2.3.1. The AVX extension includes [97]: ................................................... 83
2.3.2. Implementation of AVX ................................................................... 84
2.3.3. Subsequent evolution of AVX [97] .................................................. 84
2.4. 8.2.3 New microarchitecture of the cores ..................................................... 85
2.5. 8.2.4 On die ring interconnect bus [66] ......................................................... 85
2.5.1. Main features of the on-die interconnect bus [64] ............................ 86
2.6. 8.2.5 On die graphics unit [99] ...................................................................... 86
2.6.1. Support of both media and graphics processing by the graphics unit [99] ..... 87
2.6.2. Main features of the on die graphics unit [99] .................................. 87
2.6.3. Specification data of the HD 2000 and HD 3000 graphics [100] .... 88
2.6.4. Performance comparison of the Sandy Bridge’s graphics: gaming [101] ..... 88
2.7. 8.2.6 Enhanced Turbo Boost technology [64] ............................................... 89
2.7.1. Intelligent power sharing between the cores and the integrated graphics [64] ..... 90
3. 8.3 Example for a Sandy Bridge based desktop platform with the H67 chipset [102] 91
4. 8.4 The E3-1200 UP server line [103] ....................................................................... 92
9. The Sandy Bridge-E line ..................................................................................................... 93
1. 9.1 Introduction ......................................................................................................... 93
1.1. Overview of the Sandy Bridge-E based processor lines ............................... 93
1.2. Comparison of die parameters of recent DT processors [77] ........................ 93
2. 9.2 Differences to the original Sandy Bridge line ...................................................... 93
2.1. 9.2.1 Overview ............................................................................................. 93
2.1.1. 9.2.2 6 cores, no integrated graphics ............................................... 94
2.1.2. 9.2.3 4 parallel memory channels instead of 2 available in the Sandy Bridge lines ..... 95
2.1.3. 9.2.4 40 PCIe 2. gen. lanes to connect multiple graphics cards to the processor ..... 96
2.2. 9.2.2 LGA-2011 socket instead of the LGA-1155 used in the original Sandy Bridge line ..... 98
2.2.1. Main features of the Sandy Bridge-E line vs the Sandy Bridge line [77] ..... 98
2.2.2. Example for a Sandy Bridge-E/X79 based 4-way SLI multi graphics card configuration ..... 99
10. The Sandy Bridge-EN/EP line ........................................................................................ 100
1. 10.1 Introduction ..................................................................................................... 100
1.1. Overview of the Sandy Bridge-EN/EP lines ............................................... 100
1.2. Improvements of the microarchitecture of the Sandy Bridge-EN/EP processors [107] ..... 100
1.3. Die shot of the Xeon E5-2600 [107] ........................................................... 101
1.4. The interconnection ring connecting main units of the processor [107] ..... 101
2. 10.2 Main enhancements of the Sandy Bridge-EN line over the previous Westmere-EP Xeon 5600 line [108] ..... 102
3. 10.3 Main enhancements of the Sandy Bridge-EP line over the Sandy Bridge-EN line [108] ..... 102
3.1. Feature comparison Westmere-EP 5600, Sandy Bridge-EN (E5-2400) and Sandy Bridge-EP (E5-2600) [108] ..... 102
3.2. Comparison of the dual socket (DP) Sandy Bridge-EN and Sandy Bridge-EP platforms [109] ..... 103
3.3. The dual socket (DP) Xeon E5-2600 (Sandy Bridge-EP) Romley platform [110] ..... 104
3.4. The quad socket (MP) Xeon E5-2600 (Sandy Bridge-EP) Romley platform [110] ..... 105
4. 10.4 Main features of selected E5-EP models [111] ................................................ 105
5. 10.5 More details on the Romley server platform ................................................... 106
5.1. The Patsburg (C600) chipset ...................................................................... 106
6. 10.6 Performance comparison Sandy Bridge-EP vs. Westmere-EP X5680 [112] ... 107
6.1. Summary assessment of the performance comparison ................................ 108
6.2. Historical increase of the integer performance of 2 Socket (2S) configurations [113] ..... 108
7. 10.7 Intel’s Xeon E5 family server roadmap [114] ................................................ 109
11. The Ivy Bridge line ......................................................................................................... 110
1. 11.1 Introduction ..................................................................................................... 110
1.1. Overview of the Ivy Bridge family-1 .......................................................... 110
1.2. Overview of the Ivy Bridge family-2 .......................................................... 110
1.3. Contrasting the Sandy Bridge and Ivy Bridge dies [81] .............................. 112
1.4. Main implementation parameters of recent processors [81] ....................... 112
1.5. Overview of the Ivy Bridge based processor lines ...................................... 112
2. 11.2 Major innovations of Ivy Bridge [80] .............................................................. 113
2.1. 11.2.1 Overview .......................................................................................... 113
2.2. 11.2.2 The 22 nm tri-gate process technology within Intel’s technology roadmap [82] ..... 114
2.2.1. The traditional planar transistor [82] .............................................. 114
2.2.2. The 22 nm Tri-Gate transistor-1 [82] ............................................. 115
2.2.3. The 22 nm Tri-Gate transistor-2 [82] ............................................. 116
2.2.4. Transistor characteristics [82] ......................................................... 117
2.2.5. Transistor gate delay [82] ............................................................... 118
2.2.6. Intel’s 22 nm manufacturing fabs [82] ........................................... 119
2.2.7. Ivy Bridge chips on a 300 mm wafer [82] ..................................... 120
2.3. 11.2.3 Supervisory Mode Execute Protection [83] ..................................... 121
2.4. 11.2.4 Next generation processor graphics and media [81] ........................ 121
2.4.1. Overview of video interfaces of computing devices to external displays ..... 122
3. 11.3 Main features of Ivy Bridge-based first introduced processors ....................... 122
3.1. 11.3.1 Main features of the first introduced Ivy Bridge-based desktop models [116] ..... 122
3.2. 11.3.2 Main features of the first introduced Ivy Bridge-based mobile models [116] ..... 123
4. 11.4 Ivy Bridge-based desktop platform [81] .......................................................... 123
5. 11.5 Performance assessment of the desktop models .............................................. 124
5.1. 11.5.1 CPU performance of the highest clocked Ivy Bridge model Core i7-3770K [81] (Higher is better) ..... 124
5.2. 11.5.2 Relative GPU performance (with games DX9/DX10/DX11) of the highest performance Ivy Bridge DT model Core i7-3770K (Resolutions 1440x900, 1680x1050) [81] (Higher is better) ..... 125
5.2.1. Increasing performance of Intel’s integrated graphics [117] ......... 125
6. 11.6 Main features of first introduced Ivy Bridge-based Xeon E3-12xx v2 models [118] ..... 126
7. 11.7 The Ivy Bridge-based Xeon E3-1200 v2 platform (called the Bromolow refresh server platform) [119] ..... 126
12. The Ivy Bridge-E line ...................................................................................................... 128
1. 12.1 Introduction ..................................................................................................... 128
1.1. Overview of the Ivy Bridge-E based processor lines .................................. 128
2. 12.2 Differences to the previous Sandy Bridge-E line [132] ................................... 128
2.1. Overview of providing PCIe lanes on Intel desktop processors .................. 128
2.2. Die plot of an Ivy Bridge-E processor [133] ............................................... 129
2.3. Main features of Ivy Bridge-E models [131] .............................................. 129
3. 12.3 Example for an Ivy Bridge-E based desktop platform with the X79 chipset [134] 130
4. 12.4 Performance increase achieved by the Ivy Bridge-E line vs. the previous Sandy Bridge-E line [135] ..... 130
13. The Ivy Bridge-EN/EP lines ........................................................................................... 132
1. 13.1 Introduction ..................................................................................................... 132
1.1. Overview of the Ivy Bridge-EN/EP lines .................................................... 132
1.2. Die layouts [137] ......................................................................................... 132
1.3. Die shot of the ten-core Ivy Bridge-EP processor [138] ............................. 133
2. 13.2 Main enhancements of the Ivy Bridge-EP-based Xeon E5-2600 v2 line vs. the Sandy Bridge-EP-based Xeon E5-2600 line [138] ..... 134
2.1. Comparison of main features of the Ivy Bridge-EP-based Xeon E5-2600 v2 line vs. the Sandy Bridge-EP-based Xeon E5-2600 line [138] ..... 134
3. 13.3 Main features of specific models of the Xeon E5-2600 v2 series [139] .......... 135
4. 13.4 Main features of specific models of the Xeon E5-1600 v2 series [140] .......... 135
5. 13.5 The Romley server platform [138] .................................................................. 136
5.1. Intel Xeon E5 family server roadmap [136] ................................................ 136
14. The Ivy Bridge-EX line ................................................................................................... 138
1. 14.1 Introduction ..................................................................................................... 138
1.1. Ivy Bridge-EX ............................................................................................. 138
2. 14.2 Main features of the Ivy Bridge-EX line [142] ................................................ 138
15. The Haswell line ............................................................................................................. 139
1. 15.1 Introduction ...................................................................................................... 139
1.1. Overview of Haswell-based processor lines (Based on [120]) .................... 139
1.2. Die plot of a Haswell processor [121] ......................................................... 139
1.3. Sub-families of Haswell [144] ................................................................... 140
2. 15.2 Key enhancements of the Haswell cores [80] .................................................. 141
2.1. Buffer sizes of subsequent generations of Core processors [80] ................. 141
2.2. Cache sizes, latencies and bandwidth values of subsequent Core generations [122] ..... 142
2.3. Issue rate and execution unit enhancements of Haswell [80] ...................... 142
3. 15.3 ISA enhancements of the Haswell cores [80] .................................................. 143
3.1. Evolution of the AVX ISA extension [97] .................................................. 143
3.2. Enhancements of AVX2 [97] ...................................................................... 144
3.3. FMA and peak FLOPs of Haswell [97] ...................................................... 144
4. 15.4 Main innovations of the Haswell processor .................................................... 145
4.1. 15.4.1 Overview .......................................................................................... 145
4.2. 15.4.2 Enhanced graphics ........................................................................... 145
4.2.1. Main enhancements of the Iris Pro and Iris graphics units [123] ... 145
4.2.2. Performance boost provided by the Iris Pro/Iris graphics vs. the previous generation [123] ..... 146
4.2.3. Graphics performance increase of subsequent Core generations [117] 147
4.3. 15.4.3 On-package eDRAM cache [117] .................................................... 147
4.3.1. Principle of operation [117] ............................................................ 148
4.3.2. Implemented on-package eDRAM [124] ....................................... 148
4.3.3. Memory latency vs. access range in a memory system with eDRAM cache (L4) [117] ..... 149
5. 15.5 Main features of the Haswell line of mobile and desktop processors ............. 150
5.1. 15.5.1 Example 1: Main features of Haswell-based mobile Core i7 M-Series processors [125] ..... 150
5.2. 15.5.2 Example 2: Main features of Haswell-based Core i7 desktop processors [126] ..... 150
6. 15.6 Haswell-based desktop platform [145] ............................................................ 151
7. 15.7 Integer and FP performance of subsequent generations of Core processors [127] 152
8. 15.8 Graphics performance of subsequent generations of Core processors [127] .. 152
9. 15.9 Main features of Haswell-based Xeon E3-12xx v3 line of server processors [128] 153
9.1. Main features of subsequent generations of E3-1200 Xeon processors [129] 153
10. 15.10 Haswell-based Xeon E3-1200 v3 server platform [130] .............................. 154
16. The Haswell-E line .......................................................................................................... 155
1. 16.1 Introduction ..................................................................................................... 155
2. 16.2 Differences to the previous Ivy Bridge-E line [143] ........................................ 155
2.1. 16.2.1 Overview .......................................................................................... 155
2.2. 16.2.2 The Haswell-E processor [143] ........................................................ 155
2.3. 16.2.3 The Wellsburg-X PCH [143] ........................................................... 156
2.4. 16.2.4 DDR4 memory [143] ....................................................................... 156
2.5. 16.2.4 DDR4 memory [143] ....................................................................... 157
17. 17. References ................................................................................................................ 158
Part III. AMD’s high performance oriented Family 15h (Bulldozer-based) processor lines .......... 166
1. Overview of AMD’s high performance oriented Family 15h (Bulldozer-based) processor lines
173
1. Overview of AMD’s high performance oriented Family 15h processor lines (based on [1])
173
2. Performance increase of AMD’s DP servers up to the Bulldozer-based Interlagos [18] 174
3. AMD’s projection to increase performance in post Bulldozer architectures [19] ... 174
4. Recent roadmaps of AMD’s basic lines [2] ............................................................. 174
5. Introduction to the Family 15h lines of processors, designated also as the Bulldozer lines
175
6. The compute module of the Family 15h processors ................................................ 176
7. Shared and dedicated components of the Bulldozer cores ....................................... 178
8. Design philosophy of using compute modules in Bulldozer-based designs ............ 178
8.1. Main design aspects-1 [3] ........................................................................... 178
9. Design philosophy of using compute modules ........................................................ 179
9.1. Main design aspects-2 [3] ........................................................................... 179
10. Example: Clock speed gain achieved by the 1. generation Bulldozer design vs. the
previous K10.5 design-1 .............................................................................................. 179
11. a) Servers ............................................................................................................... 179
12. Main operational parameters of AMD’s K10.5 Istanbul-based DP servers (Lisbon) [13]
180
13. Main operational parameters of AMD’s Family 15h-based DP servers (Valencia) [13]
180
14. b) Desktops ............................................................................................................ 181
15. Main features of AMD’s K10.5-based Phenom™ II X6 desktop processors [14] 181
16. Main features of AMD’s 1. generation Bulldozer-based FX desktop processors [14] 181
17. Example: Clock speed gain achieved by the 1. generation Bulldozer design vs. the
previous K10.5 design - Summary .............................................................................. 182
18. The width of the Bulldozer cores ........................................................................... 182
2. First generation Family 15h Bulldozer-based processor lines ........................................... 183
1. 2.1 Overview of Family 15h Bulldozer-based processor lines [3] ........................... 183
1.1. AMD’s Bulldozer-based server and desktop lines – Overview-1 (based on [1]) 183
1.2. Brand names of AMD’s Bulldozer-based server and desktop lines ............ 184
1.3. Positioning AMD’s Bulldozer-based server lines ....................................... 184
1.4. Positioning AMD’s Bulldozer-based desktop lines .................................... 185
2. 2.2 The Bulldozer Compute Module ....................................................................... 185
2.1. 2.2.1 Overview of the Bulldozer Compute Module .................................... 185
2.1.1. The Bulldozer Compute module ..................................................... 185
2.1.2. Principle of operation of a Bulldozer module [4] ........................... 186
2.2. 2.2.2 ISA extensions introduced in the Bulldozer design ........................... 186
2.2.1. New Bulldozer instructions and their possible use [15] ................. 186
2.2.2. Introduction of ISA x86 extensions by Intel vs. AMD ................... 187
2.2.3. Comparison of FP-capabilities of Bulldozer, Magny-Cours and Sandy
Bridge [16] .............................................................................................. 188
2.2.4. Compiler support of Bulldozer’s new instructions [15] ................... 189
2.3. 2.2.3 The microarchitecture of the Bulldozer Compute Module ................. 189
2.3.1. AMD’s Bulldozer module contrasted with two cores of Magny-Cours [4]
190
2.3.2. The microarchitecture of a Bulldozer core [10] .............................. 190
2.3.3. Block diagram of Intel’s Core 2 microarchitecture [11] ................ 191
2.3.4. Block diagram of AMD’s K8 microarchitecture [11] ..................... 191
2.3.5. The microarchitecture of a Bulldozer core [10] .............................. 192
2.3.6. The microarchitecture of Intel’s Sandy Bridge cores [17] ......... 193
2.3.7. The microarchitecture of Intel’s Westmere cores [10] ................... 194
2.4. 2.2.4 Assessing the performance potential of the Bulldozer module-1 [3] . 195
2.4.1. Contrasting the execution resources of the Bulldozer core with previous
designs ...................................................................................................... 196
2.5. 2.2.4 Assessing the performance potential of the Bulldozer module-2 [3] . 196
2.5.1. Contrasting the FP execution resources of the Bulldozer core with previous
designs ...................................................................................................... 197
2.5.2. Contrasting the FP execution resources of the Bulldozer core with previous
designs ...................................................................................................... 197
2.5.3. Comparing Bulldozer’s per module and Sandy Bridge’s per core available
256-bit execution resources-1 [17] .......................................................... 199
2.5.4. Comparing Bulldozer’s per module and Sandy Bridge’s per core available
256-bit execution resources-1 [17] .......................................................... 200
2.6. 2.2.4 Assessing the performance potential of the Bulldozer module-3 [3] . 200
2.6.1. Cache/main memory latencies of K10/K10.5, Bulldozer and Sandy Bridge
processors [3] ........................................................................................... 200
2.6.2. Cache sizes of K10/K10.5, Bulldozer and Sandy Bridge processors 201
2.6.3. AMD’s projection to increase performance in post Bulldozer architectures
[19] ........................................................................................................... 201
3. 2.3 The Orochi die ................................................................................................... 202
3.1. The floorplan of the Orochi die ................................................................... 202
3.2. The North Bridge of Orochi [21] ................................................................ 203
3.3. Block diagram of the Orochi die ................................................................. 204
4. 2.4 New power management features of the Bulldozer design ................................ 204
4.1. AMD’s power management techniques K8 – 1. gen. Family 15h (Bulldozer) (based
on [4]) ............................................................................................................... 204
4.2. New power management features of the Bulldozer design ......................... 205
4.3. TDP Power Cap [23] ................................................................................... 205
4.4. Module C6 state [24], [6] ............................................................................ 205
4.5. Module level VSS power gating ................................................................. 206
4.6. Benefit of module level power gating (C6) vs. C1E state [7] ..................... 207
4.7. Contrasting the Smart Fetch technique with entering the Module C6 state [7] 208
4.8. LV-DDR3 support ....................................................................................... 208
5. 2.5 Bulldozer-based server lines .............................................................................. 209
5.1. 2.5.1 Overview of the Bulldozer-based server lines .................................... 209
5.1.1. Overview of the Bulldozer-based server lines-1 (Based on [1]) .... 209
5.1.2. Overview of the Bulldozer-based server lines-2 (Based on [1]) .... 210
5.2. 2.5.2 The Bulldozer-based Interlagos MP server line ................................. 211
5.2.1. Positioning the Bulldozer-based Interlagos MP server line ............ 211
5.2.2. Block diagram of Interlagos [6] ...................................................... 211
5.2.3. Example: Interlagos-based MP system [6] .................................... 212
5.2.4. Performance increase of AMD’s MP servers up to the Bulldozer-based
Interlagos [18] .......................................................................................... 213
5.2.5. Performance/Watt evolution of AMD’s server lines [2] ................. 213
5.2.6. Main features of Bulldozer-based Interlagos MP server lines [13] 214
5.2.7. Comparing main features of Bulldozer-based lines with the previous
generation [4] ........................................................................................... 214
5.2.8. Performance assessment of Family 15h Bulldozer-based MP servers [13]
214
5.2.9. Throughput results of the Open Source server workload runs [26] 215
5.2.10. Response time results of the Open Source server workload runs [26] 215
5.2.11. Power consumption results of the Open Source server workload runs [26]
216
5.2.12. Assessing the benchmark results gained for the Bulldozer-based Interlagos
6276 server .............................................................................................. 216
5.3. 2.5.3 The Turbo core technology of Bulldozer-based MP servers .............. 216
5.3.1. Principle of operation [6] ................................................................ 217
5.3.2. Full and half load turbo frequencies of Family 15h Bulldozer-based
Interlagos MP servers [13] ....................................................................... 217
5.4. 2.5.4 Bulldozer-based DP (Valencia) and UP (Zurich) server lines ............ 218
5.4.1. AMD’s 2012 – 2013 server roadmap [2] ........................................ 218
5.4.2. The Family 15h Bulldozer-based DP system (Valencia) [6] .......... 218
5.4.3. Example Family 15h Bulldozer-based DP system (Valencia) [6] .. 219
5.4.4. Main parameters of the Family 15h Bulldozer-based Valencia DP server
line [13] .................................................................................................... 219
5.4.5. Main parameters of the Family 15h Bulldozer-based Zurich UP server line
[13] ........................................................................................................... 220
5.4.6. AMD’s 2012 – 2013 server roadmap [2] ........................................ 220
5.4.7. Recent roadmaps of AMD’s basic lines [27] .................................. 221
6. 2.6 The Bulldozer-based Zambezi DT line .............................................................. 221
6.1. 2.6.1 Overview of the Bulldozer-based Zambezi high performance desktop line [1]
............................................................................................................................ 221
6.1.1. Brand name of the Bulldozer-based high performance Zambezi desktop line
................................................................................................................... 222
6.1.2. Positioning the Bulldozer-based Zambezi high performance desktop line
222
6.1.3. The Family 15h Bulldozer-based high performance Zambezi desktop line
[6] ............................................................................................................. 223
6.1.4. Die plot of Zambezi [28] ................................................................ 223
6.1.5. Key parameters of the Family 15h Bulldozer-based Zambezi desktop line
[29] ........................................................................................................... 224
6.1.6. System example of a Zambezi desktop system (Scorpius platform) [30]
224
6.2. 2.6.2 The Turbo core technology of the Bulldozer-based Zambezi desktop line 225
6.2.1. Contrasting AMD’s 1. and 2. gen. Turbo core implementations [36] 225
6.2.2. AMD’s 2. generation Turbo core technology ................................. 226
6.2.3. Principle of operation [6] ................................................................ 226
6.2.4. Nominal, 8-core Turbo, and 4-core max. Turbo frequencies of the Zambezi
DT [29] ..................................................................................................... 227
6.2.5. Example for the operation of AMD’s 2. generation Turbo core technology
[37] ........................................................................................................... 227
6.2.6. Example: Running a single threaded workload on the 8150 Zambezi DT
with Turbo core enabled [36] ................................................................... 228
6.2.7. Run time reduction achieved by enabling Turbo core for a single threaded
workload running on an FX-8150 (Zambezi) [38] ................................... 228
6.2.8. Run time reduction achieved by enabling Turbo core for a multi-threaded
workload running on an FX-8150 (Zambezi) [38] ................................... 229
6.2.9. Contrasting the operation of AMD’s 2. gen. Turbo core with that of Intel’s
Turbo Boost technology, as implemented in Sandy Bridge-based desktops (i5-2500K) [36] .............................................................................................. 230
6.2.10. Principle of operation of Intel’s Deep Power Down technology [39] 231
6.2.11. a) Precursor of Intel’s Turbo Boost: EDAT-2 .............................. 231
6.2.12. b) Intel’s 1. gen. Turbo Boost ....................................................... 231
6.2.13. c) Intel’s enhanced 1. gen. Turbo Boost ....................................... 232
6.2.14. Available Turbo Boost bins (133 MHz) for the 1. and 2. gen. Nehalem
processors [38] .......................................................................................... 232
6.2.15. d) Intel’s 2. gen. (Next gen.) Turbo Boost (Dynamic Turbo Boost) 232
6.2.16. Contrasting the introduction of Intel’s and AMD’s Turbo and Power gating
technologies .............................................................................................. 233
6.2.17. Evolution of Intel’s Turbo technology [34] .................................. 234
6.3. 2.6.3 Performance assessment of the Bulldozer-based Zambezi desktop line 234
6.3.1. Summary benchmark results including all tests excl. games [32] .. 234
6.3.2. Summary performance assessment of Zambezi-1 .......................... 235
6.3.3. Summary benchmark results including all tests excl. games [32] .. 235
6.3.4. Summary performance assessment of Zambezi-2 .......................... 236
6.3.5. Summary benchmark results including all tests excl. games [32] .. 236
6.3.6. Example: Impact of Windows 7’s scheduling policy on the activation of
Max. Turbo mode [9] ............................................................................... 237
6.3.7. Summary assessment of the benchmark results of the Zambezi FX 8150 line
[32] ........................................................................................................... 239
6.3.8. Summary assessment of all Bulldozer based designs ..................... 239
6.3.9. Remark – AMD’s reorganization after the Bulldozer disaster ....... 239
3. Second generation Family 15h Piledriver-based processor lines ...................................... 240
1. 3.1 Overview of the Piledriver-based processor lines (based on [1]) .................... 240
1.1. Brand names of Piledriver-based processor lines ........................................ 240
2. Piledriver-based processor lines .............................................................................. 241
3. 3.2 The Piledriver Compute Module ....................................................................... 241
3.1. 3.2.1 Overview of the Piledriver Compute Module .................................... 241
3.2. 3.2.1 Piledriver’s performance enhancements vs. Bulldozer [54] ............... 242
3.2.1. Piledriver’s performance enhancements vs. the (Fam. 12h) Husky and
Bulldozer cores [55] ................................................................................. 242
3.3. 3.2.3 Piledriver’s power management enhancement vs. Bulldozer – The RCM
technology [63] .................................................................................................. 243
3.3.1. 3.2.2.1 A brief introduction into clock distribution networks [57] . 243
3.3.2. 3.2.3.2 Principle of the Resonant Clock Mesh (RCM) technology 247
3.3.3. 3.2.3.3 The evolution of implementing RCM ................................. 254
3.3.4. Main features of AMD’s Bulldozer- and Piledriver-based Opteron server
lines [65] ................................................................................................... 255
3.3.5. Plans to implement Cyclos’s RCM in ARM Cortex-A15 [66] ....... 256
4. 3.3 Piledriver-based GPU-less processor lines ........................................................ 256
4.1. 3.3.1 Overview of the Piledriver-based GPU-less processor lines-1 ........... 256
4.1.1. Comparing the Bulldozer-based and Piledriver-based 4-module (8 cores)
dies [6], [54] ............................................................................................ 257
4.1.2. Main functional blocks of a Piledriver-based GPU-less processor die [54]
258
4.2. 3.3.2 The Abu Dhabi Opteron 6300 server line .......................................... 258
4.2.1. Main functional blocks of the dual-chip Opteron 6300 (Abu Dhabi) 4P
server processor [67] ................................................................................ 259
4.2.2. Die plot of the dual-chip Opteron 6300 (Abu Dhabi) server processor [68]
260
4.2.3. Model numbers and main features of the Opteron 6300 (Abu Dhabi) 4P line
[69] ........................................................................................................... 260
4.2.4. Comparison of the Bulldozer-based Opteron 6200 and the Piledriver-based
Opteron 6300 server lines [67] ................................................................. 261
4.3. 3.3.3 The Vishera high performance FX desktop line ................................. 261
4.3.1. Main functional blocks of the high performance Vishera FX desktop line
[54] ........................................................................................................... 262
4.3.2. Die plot of the high performance Vishera FX desktop line [54] ..... 262
4.3.3. Model numbers and main features of the high performance Vishera FX
desktop line [60] ....................................................................................... 263
4.3.4. Comparing main features of AMD’s Vishera and Zambezi FX desktop lines
[49] ........................................................................................................... 263
4.3.5. Main features of the 9-Series chipset supporting the high performance
Vishera DT [70] ........................................................................................ 264
4.3.6. AMD’s high-performance processor roadmap from 10/2011 [44] . 264
5. 3.4 Piledriver-based Trinity APU lines .................................................................... 265
5.1. 3.4.1 Overview of the Piledriver-based Trinity APU lines ......................... 265
5.1.1. Piledriver-based Trinity APU lines ................................................ 265
5.2. 3.4.2 The Trinity APU die .......................................................................... 265
5.2.1. AMD’s Trinity APU die [71] ........................................................ 266
5.2.2. Comparing die plots of AMD’s Llano and Trinity dies [72] .......... 266
5.2.3. Improvements of the Piledriver APU family over the Llano APU family
267
5.2.4. a) Enhancements of the microarchitecture of the Trinity APU [73] 267
5.2.5. b) Improvement of the power management ................................... 267
5.2.6. The Turbo Core technology of the Llano APU [74], [75] .............. 268
5.2.7. Illustration of the operation of the Turbo Core Technology 3.0 of the Trinity
APU [77] .................................................................................................. 270
5.3. 3.4.3 The Trinity mainstream desktop APU line ......................................... 271
5.3.1. Positioning the Trinity mainstream desktop APU line [51] ........... 272
5.3.2. Main components of the Trinity mainstream desktop APU [78] .... 272
5.3.3. Model numbers and main features of the mainstream Trinity desktop APU
line [78] (Virgo platform) ......................................................................... 273
5.3.4. The new FM2 socket of the Trinity mainstream desktop APU line [78]
273
5.3.5. System architecture of the mainstream Trinity desktop APU with the A85X
FCH [79] ................................................................................................... 274
5.3.6. Performance increase achieved over the previous A-Series Llano APU line
[78] .......................................................................................................... 274
5.4. 3.4.4 The Trinity mobile APU line ............................................................. 275
5.4.1. Positioning the Trinity mobile APU line-1 [51] ............................ 275
5.4.2. Positioning the Trinity mobile APU line-2 [52] ............................ 276
5.4.3. Model numbers and main features of the Trinity mobile APU line [80]
(Comal platform) ...................................................................................... 276
5.4.4. The Comal mobile platform including the (Piledriver-based) Trinity APU
and the A70M/A60M FCH [52] ............................................................... 277
6. 3.5 Piledriver-based Richland APU lines ................................................................ 277
6.1. 3.5.1 Overview of the Piledriver-based Trinity APU lines ......................... 277
6.1.1. Positioning the Trinity mainstream desktop and mobile APU lines [52]
278
6.1.2. Die shot of the Richland APU [81] ................................................. 278
6.1.3. Key features of the Richland mobile APU line as exposed by AMD [82]
279
6.1.4. Major improvements of the Richland mobile APU line discussed [83], [84]
279
6.1.5. Principle of operation of the Temperature Smart Turbo Core (TSTC)
technique-1 ............................................................................................... 280
6.1.6. Principle of operation of the Temperature Smart Turbo Core (TSTC)
technique-2 [85] ....................................................................................... 280
6.1.7. Comparing clock frequencies of the Richland and the Trinity APU lines
[86] ........................................................................................................... 281
6.1.8. Principle of operation of the Temperature Smart Turbo Core (TSTC)
technique-3 [85] ....................................................................................... 281
6.1.9. Introducing additional frequency/voltage operating points ............ 281
6.1.10. An innovative suite of apps. available typically on the Richland A8 and
A10 models [87] ....................................................................................... 282
6.1.11. AMD Face Login [88] .................................................................. 282
6.1.12. AMD Gesture Control [88] ........................................................... 283
6.1.13. AMD Screen Mirror [88] .............................................................. 283
6.1.14. AMD optimized games [88] ......................................................... 283
6.2. 3.5.2 The Richland mainstream desktop APU line ..................................... 283
6.2.1. Overview of the Richland mainstream desktop APU line .............. 283
6.2.2. Positioning the Richland mainstream desktop and mobile APU lines [52]
284
6.2.3. Model numbers and expected key features of the Richland desktop APU
line [89] (Elite Performance platform) ..................................................... 284
6.3. 3.5.3 The Richland mobile APU line .......................................................... 285
6.3.1. Positioning the Richland mobile APU line [52] ............................ 285
6.3.2. Model numbers and expected main features of the Richland mobile APU
line [84] (Elite performance APU platform) ............................................. 286
6.3.3. AMD’s graphics performance figures of the Richland mobile APU line vs.
Intel’s Ivy Bridge-based mobile processors [83] ..................................... 286
4. Third generation Family 15h Steamroller-based processor lines ...................................... 288
1. 4.1 Overview of Family 15h Steamroller-based processor lines (based on [1]) ..... 288
1.1. Brand names of Family 15h Steamroller-based processor lines .................. 288
1.2. Overview of AMD’s Family 15h Steamroller-based processor lines .......... 289
2. 4.2 The Steamroller Compute Module .................................................................... 289
2.1. Planned introduction of the Steamroller compute module .......................... 289
2.2. Preview of the Steamroller compute module (CM) ................................... 290
2.3. Block diagram of the Steamroller compute module [45] ............................ 290
2.4. Improvements of the front-end part of the Steamroller compute module [45] 290
2.5. Improving integer scheduling, integer execution and reducing average load latency
in the Steamroller compute module [45] ............................................................ 291
2.6. Improving the power efficiency (performance/Watt figure) of the Steamroller
compute module [45] ......................................................................................... 291
2.7. Comparing the block diagrams of three generations of the Family 15h Bulldozer
design-1 .............................................................................................................. 292
2.8. Improvements made in the microarchitecture of the Steamroller compute module
292
3. 4.3 Steamroller-based Opteron server lines ............................................................. 293
3.1. Overview of AMD’s Family 15h Steamroller-based processor lines .......... 294
3.2. 4.3.1 Overview of Steamroller-based server lines (based on [1]) .............. 294
3.2.1. Bringing forward the introduction of the Steamroller based server line 294
3.2.2. AMD’s server roadmap from 2/2012 [27] ...................................... 294
3.2.3. AMD’s indication of introducing the Steamroller based server line already
in 2013 [50] .............................................................................................. 295
4. 4.4 Overview of Steamroller-based Kaveri desktop and mobile APU lines (based on [1])
295
4.1. AMD’s Family 15h Steamroller-based mobile APU lines (based on [1]) .. 296
4.2. Positioning the Steamroller-based Kaveri APU line as mainstream desktop line
[51] .................................................................................................................... 297
4.3. Positioning the Steamroller-based Kaveri APU as performance/mainstream mobile
line [51] ............................................................................................................. 297
4.4. Revised positioning the Steamroller-based Kaveri APU line [52] ............. 298
4.5. Overview of AMD’s Family 15h Steamroller-based APU lines ................. 298
4.6. Main components of Kaveri APUs ............................................................. 299
4.7. Architectural integration of the CPU and the GPU in Kaveri APU lines .... 299
4.8. Evolution of HSA in subsequent mobile APU lines [48] ............................ 299
4.9. GPU co-processing without pointers and data sharing – Without HSA [91] 299
4.10. GPU co-processing with pointers and data sharing – With HSA [91] ...... 300
4.11. Data transfers in the memory hierarchy of the Llano APU [53] ............... 301
5. References ......................................................................................................................... 302
List of figures
1.1. ..................................................................................................................................................... 3
1.2. ..................................................................................................................................................... 3
1.3. ..................................................................................................................................................... 4
1.4. ..................................................................................................................................................... 4
1.5. ..................................................................................................................................................... 5
1.6. ..................................................................................................................................................... 5
1.7. ..................................................................................................................................................... 5
1.8. ..................................................................................................................................................... 6
1.9. ..................................................................................................................................................... 6
1.10. ................................................................................................................................................... 7
1.1. Intel’s Tick-Tock development model (Based on [1]) ............................................................... 15
1.2. Overview of Intel’s Tick-Tock model (Based on [3]) ............................................................... 15
1.3. ................................................................................................................................................... 16
1.4. Intel’s plan to develop their manufacturing technology and processor lines revealed at a shareholder’s
meeting back in 4/2006 [74] ............................................................................................................. 16
1.5. Intel’s plan to develop their manufacturing technology and processor lines revealed at the IDF Spring
2007 in 4/2007 [75] .......................................................................................................................... 17
1.6. Intel’s design principles for developing microprocessors, revealed at their shareholder’s meeting in
4/2006 [74] ....................................................................................................................................... 18
2.1. ................................................................................................................................................... 19
2.2. Key features of the Core 2 microarchitecture [16] .................................................................... 19
2.3. Block diagram of Intel’s Core 2 microarchitecture [4] .............................................................. 20
2.4. Block diagram of Intel’s Pentium 4 microarchitecture [5] ........................................................ 20
2.5. Block diagram of AMD’s K8 microarchitecture [4] .................................................................. 21
2.6. Issue ports and execution units of the Core 2 [4] ....................................................................... 22
2.7. Issue ports and execution units of the Pentium 4 [9] .................................................................. 22
2.8. Block diagram of AMD’s K8 microarchitecture [4] .................................................................. 23
2.9. DP web server performance comparison: AMD Opteron 248 vs. Intel Xeon 2.8 [6] ................ 24
2.10. ................................................................................................................................................. 24
2.11. DP web server performance comparison: AMD Opteron 275/280 vs. Intel Xeon 5160 [8] .... 25
2.12. Core’s shared L2 cache vs. Pentium 4’s private L2 caches ..................................................... 25
2.13. ................................................................................................................................................. 26
2.14. Hardware prefetchers within the Core 2 microarchitecture [11] .............................................. 27
2.15. Widening the FP/SSE Execution Units from 64-bit to 128-bit [12] ........................................ 28
2.16. Intel’s x86 ISA extensions - the SIMD register space (based on [18]) BMA .......................... 28
2.17. SIMD execution resources in Intel’s basic processors (based on [18]) ................................... 28
2.18. Overview of Intel’s x86 ISA extensions (based on [18]) ......................................................... 29
2.19. Intel’s x86 ISA extensions - the operations introduced (based on [17]) .................................. 29
2.20. ................................................................................................................................................. 30
2.21. Achieved performance boost in Core2 for gaming vs AMD’s Athlon 64 FX60 [13] .............. 30
2.22. The operation of the Ultra fine grained power control – an example [11] ............................... 31
2.23. Principle of the Platform Thermal Control [11], [20] .............................................................. 31
2.24. The aSC7621 hardware monitor with fan control and PECI from Andigilog .......................... 32
2.25. ................................................................................................................................................. 32
3.1. Dynamic and static power dissipation trends in chips [21] ........................................................ 34
3.2. Structure of a high-k + metal transistor [23] .............................................................................. 34
3.3. Benefits of high-k + metal gate transistors [23], [24] ................................................................ 35
3.4. The 45 nm Penryn is a shrink of the 65 nm Core 2 with a few enhancements [25] .................. 35
3.5. Key enhancements introduced into Penryn’s microarchitecture vs. the Core (based on [25]) .. 35
3.6. Intel’s x86 ISA extensions - the operations introduced (based on [17]) .................................... 36
3.7. Intel’s Deep Power Down technology [26] ............................................................................... 36
3.8. Operation of Intel’s Deep Power Down technology [27] .......................................................... 37
3.9. Power reduction achieved by the Deep Power Down Technology [27] .................................... 37
3.10. Principle of the Enhanced Dynamic Acceleration Technology [27] ........................................ 38
3.11. Performance improvements of Penryn vs. Core at the same clock frequency [26] ................. 39
3.12. ................................................................................................................................................. 40
4.1. ................................................................................................................................................... 42
4.2. Design objective of Nehalem [1] ............................................................................................... 42
4.3. ................................................................................................................................................... 42
4.4. ................................................................................................................................................... 43
4.5. Die photo of the Bloomfield/Gainestown chip .......................................................................... 45
4.6. Simultaneous Multithreading (SMT) of Nehalem [1] ................................................................ 45
4.7. Performance gains achieved by Nehalem’s SMT [1] ................................................................. 46
4.8. The 3-level cache architecture of Nehalem (based on [1]) ........................................................ 47
4.9. ................................................................................................................................................... 47
4.10. ................................................................................................................................................. 48
4.11. Integrated memory controller of Nehalem [33] ....................................................................... 48
4.12. ................................................................................................................................................. 49
4.13. Non Uniform Memory Access (NUMA) in multi-socket servers [1] ...................................... 49
4.14. Memory latency comparison: Nehalem vs. Penryn [1] ............................................................ 50
4.15. The low cost (<600 $) Timna PC [40] ..................................................................................... 51
4.16. Point of attaching memory ....................................................................................................... 51
4.17. Signals of the QuickPath Interconnect bus (QPI-bus) [22] ...................................................... 52
4.18. QPI based DP and MP server system architectures [31], [33] ................................................. 52
4.19. ................................................................................................................................................. 53
4.20. Interpretation of the notion “Uncore” [1] ................................................................................ 53
4.21. Use of integrated power gates [32] .......................................................................................... 54
4.22. Overview of the Power Control unit [32] ................................................................................ 54
4.23. ................................................................................................................................................. 55
4.24. Turbo mode uses the available power headroom in processor package power limits [52] ...... 57
4.25. New LGA sockets .................................................................................................................... 58
4.26. ................................................................................................................................................. 58
4.27. ................................................................................................................................................. 59
4.28. Main options of providing PCIe lanes on the processor for graphics cards in DT systems ..... 59
4.29. ................................................................................................................................................. 59
4.30. ................................................................................................................................................. 60
4.31. ................................................................................................................................................. 60
5.1. ................................................................................................................................................... 62
5.2. ................................................................................................................................................... 62
5.3. ................................................................................................................................................... 63
5.4. ................................................................................................................................................... 64
5.5. ................................................................................................................................................... 64
5.6. ................................................................................................................................................... 64
5.7. ................................................................................................................................................... 65
5.8. ................................................................................................................................................... 66
6.1. ................................................................................................................................................... 67
6.2. ................................................................................................................................................... 67
6.3. ................................................................................................................................................... 68
6.4. ................................................................................................................................................... 68
6.5. ................................................................................................................................................... 69
6.6. Westmere-based dual-core mobile and desktop platform .......................................................... 70
6.7. ................................................................................................................................................... 71
6.8. The Clarksdale processor with in-package integrated graphics along with the H57 chipset [91] 71
6.9. ................................................................................................................................................... 72
6.10. ................................................................................................................................................. 72
6.11. ................................................................................................................................................. 73
6.12. ................................................................................................................................................. 74
6.13. ................................................................................................................................................. 74
6.14. ................................................................................................................................................. 75
6.15. Westmere-EP 6-core DP server platform ................................................................................ 75
7.1. ................................................................................................................................................... 76
7.2. ................................................................................................................................................... 76
7.3. ................................................................................................................................................... 77
7.4. ................................................................................................................................................... 78
8.1. Intel’s Tick-Tock development model (Based on [1]) ............................................................... 79
8.2. ................................................................................................................................................... 79
8.3. ................................................................................................................................................... 79
8.4. ................................................................................................................................................... 80
8.5. ................................................................................................................................................... 81
8.6. ................................................................................................................................................... 82
8.7. ................................................................................................................................................... 83
8.8. ................................................................................................................................................... 83
8.9. ................................................................................................................................................... 84
8.10. Microarchitecture of the cores of Sandy Bridge [64] ............................................................... 85
8.11. ................................................................................................................................................. 85
8.12. ................................................................................................................................................. 86
8.13. Evolution of graphics implementation from Westmere to Sandy Bridge [99] ......................... 86
8.14. ................................................................................................................................................. 87
8.15. ................................................................................................................................................. 87
8.16. ................................................................................................................................................. 88
8.17. ................................................................................................................................................. 88
8.18. ................................................................................................................................................. 89
8.19. ................................................................................................................................................. 89
8.20. ................................................................................................................................................. 90
8.21. ................................................................................................................................................. 91
8.22. ................................................................................................................................................. 91
8.23. ................................................................................................................................................. 92
9.1. ................................................................................................................................................... 93
9.2. ................................................................................................................................................... 93
9.3. ................................................................................................................................................... 94
9.4. ................................................................................................................................................... 94
9.5. ................................................................................................................................................... 95
9.6. ................................................................................................................................................... 95
9.7. The Sandy Bridge-E platform with the X79 chipset [78] .......................................................... 96
9.8. ................................................................................................................................................... 97
9.9. ................................................................................................................................................... 98
9.10. ................................................................................................................................................. 98
9.11. ................................................................................................................................................. 98
9.12. ................................................................................................................................................. 99
10.1. ............................................................................................................................................... 100
10.2. ............................................................................................................................................... 100
10.3. ............................................................................................................................................... 101
10.4. ............................................................................................................................................... 101
10.5. ............................................................................................................................................... 102
10.6. ............................................................................................................................................... 103
10.7. The original Sandy Bridge processor [109] ........................................................................... 104
10.8. ............................................................................................................................................... 104
10.9. ............................................................................................................................................... 105
10.10. ............................................................................................................................................. 105
10.11. The Romley server platform [107] ...................................................................................... 106
10.12. Intel's Patsburg chipset diagram [107] ................................................................................ 106
10.13. ............................................................................................................................................. 107
10.14. ............................................................................................................................................. 109
10.15. ............................................................................................................................................. 109
11.1. ............................................................................................................................................... 110
11.2. ............................................................................................................................................... 110
11.3. ............................................................................................................................................... 110
11.4. ............................................................................................................................................... 112
11.5. ............................................................................................................................................... 112
11.6. ............................................................................................................................................... 112
11.7. ............................................................................................................................................... 113
11.8. ............................................................................................................................................... 114
11.9. ............................................................................................................................................... 114
11.10. ............................................................................................................................................. 115
11.11. ............................................................................................................................................. 116
11.12. ............................................................................................................................................. 117
11.13. .............................................................................................................................................
11.14. .............................................................................................................................................
11.15. .............................................................................................................................................
11.16. .............................................................................................................................................
11.17. .............................................................................................................................................
11.18. .............................................................................................................................................
11.19. .............................................................................................................................................
11.20. .............................................................................................................................................
11.21. .............................................................................................................................................
11.22. .............................................................................................................................................
11.23. .............................................................................................................................................
11.24. .............................................................................................................................................
11.25. .............................................................................................................................................
11.26. .............................................................................................................................................
12.1. ...............................................................................................................................................
12.2. ...............................................................................................................................................
12.3. ...............................................................................................................................................
12.4. ...............................................................................................................................................
12.5. ...............................................................................................................................................
12.6. ...............................................................................................................................................
13.1. ...............................................................................................................................................
13.2. ...............................................................................................................................................
13.3. ...............................................................................................................................................
13.4. ...............................................................................................................................................
13.5. ...............................................................................................................................................
13.6. ...............................................................................................................................................
13.7. ...............................................................................................................................................
13.8. ...............................................................................................................................................
14.1. ...............................................................................................................................................
15.1. Intel’s Tick-Tock development model (Based on [1]) ...........................................................
15.2. ...............................................................................................................................................
15.3. ...............................................................................................................................................
15.4. ...............................................................................................................................................
15.5. ...............................................................................................................................................
15.6. ...............................................................................................................................................
15.7. ...............................................................................................................................................
15.8. ...............................................................................................................................................
15.9. ...............................................................................................................................................
15.10. .............................................................................................................................................
15.11. .............................................................................................................................................
15.12. .............................................................................................................................................
15.13. .............................................................................................................................................
15.14. .............................................................................................................................................
15.15. .............................................................................................................................................
15.16. .............................................................................................................................................
15.17. .............................................................................................................................................
15.18. .............................................................................................................................................
15.19. .............................................................................................................................................
15.20. .............................................................................................................................................
15.21. .............................................................................................................................................
15.22. .............................................................................................................................................
15.23. .............................................................................................................................................
15.24. .............................................................................................................................................
15.25. .............................................................................................................................................
15.26. .............................................................................................................................................
15.27. .............................................................................................................................................
16.1. ...............................................................................................................................................
16.2. ...............................................................................................................................................
16.3. ...............................................................................................................................................
16.4. ...............................................................................................................................................
xvii
Created by XMLmind XSL-FO Converter.
118
119
120
121
121
122
123
123
123
124
125
125
126
126
128
128
129
130
130
131
132
132
133
134
135
135
136
136
138
139
139
139
140
141
141
142
142
143
143
144
144
145
145
146
147
148
148
149
150
150
151
152
152
153
154
154
155
155
156
156
16.5. ...............................................................................................................................................
1.1. .................................................................................................................................................
1.2. .................................................................................................................................................
1.3. .................................................................................................................................................
1.4. .................................................................................................................................................
1.5. .................................................................................................................................................
1.6. .................................................................................................................................................
1.7. .................................................................................................................................................
1.8. .................................................................................................................................................
1.9. .................................................................................................................................................
1.10. ...............................................................................................................................................
1.11. ...............................................................................................................................................
1.12. ...............................................................................................................................................
1.13. ...............................................................................................................................................
1.14. ...............................................................................................................................................
1.15. ...............................................................................................................................................
2.1. .................................................................................................................................................
2.2. .................................................................................................................................................
2.3. .................................................................................................................................................
2.4. .................................................................................................................................................
2.5. .................................................................................................................................................
2.6. .................................................................................................................................................
2.7. .................................................................................................................................................
2.8. .................................................................................................................................................
2.9. .................................................................................................................................................
2.10. Overview of Intel’s x86 ISA extensions (based on [44]) .......................................................
2.11. ...............................................................................................................................................
2.12. ...............................................................................................................................................
2.13. ...............................................................................................................................................
2.14. ...............................................................................................................................................
2.15. ...............................................................................................................................................
2.16. ...............................................................................................................................................
2.17. ...............................................................................................................................................
2.18. ...............................................................................................................................................
2.19. ...............................................................................................................................................
2.20. ...............................................................................................................................................
2.21. ...............................................................................................................................................
2.22. ...............................................................................................................................................
2.23. ...............................................................................................................................................
2.24. ...............................................................................................................................................
2.25. ...............................................................................................................................................
2.26. ...............................................................................................................................................
2.27. ...............................................................................................................................................
2.28. ...............................................................................................................................................
2.29. ...............................................................................................................................................
2.30. ...............................................................................................................................................
2.31. ...............................................................................................................................................
2.32. ...............................................................................................................................................
2.33. ...............................................................................................................................................
2.34. ...............................................................................................................................................
2.35. ...............................................................................................................................................
2.36. ...............................................................................................................................................
2.37. ...............................................................................................................................................
2.38. ...............................................................................................................................................
2.39. ...............................................................................................................................................
2.40. ...............................................................................................................................................
2.41. ...............................................................................................................................................
2.42. ...............................................................................................................................................
2.43. ...............................................................................................................................................
2.44. ...............................................................................................................................................
2.45. ...............................................................................................................................................
2.46. ...............................................................................................................................................
2.47. ...............................................................................................................................................
2.48. ...............................................................................................................................................
2.49. ...............................................................................................................................................
2.50. ...............................................................................................................................................
2.51. ...............................................................................................................................................
2.52. ...............................................................................................................................................
2.53. ...............................................................................................................................................
2.54. ...............................................................................................................................................
2.55. ...............................................................................................................................................
2.56. ...............................................................................................................................................
2.57. ...............................................................................................................................................
2.58. ...............................................................................................................................................
2.59. ...............................................................................................................................................
2.60. ...............................................................................................................................................
2.61. ...............................................................................................................................................
2.62. ...............................................................................................................................................
2.63. ...............................................................................................................................................
2.64. ...............................................................................................................................................
2.65. ...............................................................................................................................................
2.66. ...............................................................................................................................................
2.67. ...............................................................................................................................................
2.68. ...............................................................................................................................................
2.69. ...............................................................................................................................................
2.70. ...............................................................................................................................................
2.71. ...............................................................................................................................................
2.72. ...............................................................................................................................................
2.73. ...............................................................................................................................................
2.74. ...............................................................................................................................................
2.75. ...............................................................................................................................................
2.76. ...............................................................................................................................................
2.77. ...............................................................................................................................................
2.78. ...............................................................................................................................................
2.79. ...............................................................................................................................................
2.80. ...............................................................................................................................................
2.81. ...............................................................................................................................................
2.82. ...............................................................................................................................................
2.83. ...............................................................................................................................................
2.84. ...............................................................................................................................................
2.85. ...............................................................................................................................................
3.1. .................................................................................................................................................
3.2. .................................................................................................................................................
3.3. .................................................................................................................................................
3.4. .................................................................................................................................................
3.5. .................................................................................................................................................
3.6. .................................................................................................................................................
3.7. .................................................................................................................................................
3.8. .................................................................................................................................................
3.9. .................................................................................................................................................
3.10. ...............................................................................................................................................
3.11. Distribution of power consumption in a Bulldozer processor [60] ........................................
3.12. ...............................................................................................................................................
3.13. Use of clock gating to switch off temporarily not used units in a grid-based clock distribution network [57] ........
3.14. ...............................................................................................................................................
3.15. ...............................................................................................................................................
3.16. ...............................................................................................................................................
3.17. ...............................................................................................................................................
3.18. ...............................................................................................................................................
3.19. ...............................................................................................................................................
3.20. ...............................................................................................................................................
3.21. ...............................................................................................................................................
3.22. ...............................................................................................................................................
3.23. ...............................................................................................................................................
3.24. ...............................................................................................................................................
3.25. ...............................................................................................................................................
3.26. ...............................................................................................................................................
3.27. ...............................................................................................................................................
3.28. ...............................................................................................................................................
3.29. ...............................................................................................................................................
3.30. Sub-families of the Opteron 6300 (Abu Dhabi) server line [51] ...........................................
3.31. ...............................................................................................................................................
3.32. ...............................................................................................................................................
3.33. ...............................................................................................................................................
3.34. ...............................................................................................................................................
3.35. ...............................................................................................................................................
3.36. ...............................................................................................................................................
3.37. ...............................................................................................................................................
3.38. ...............................................................................................................................................
3.39. ...............................................................................................................................................
3.40. ...............................................................................................................................................
3.41. ...............................................................................................................................................
3.42. ...............................................................................................................................................
3.43. ...............................................................................................................................................
3.44. ...............................................................................................................................................
3.45. ...............................................................................................................................................
3.46. Simplified layout of the digital power monitoring system of the Llano APU [75] ................
3.47. Simplified layout of the digital power monitoring system of the Trinity APU [76] ..............
3.48. Example for the operation of the AMD Turbo Core Technology 3.0 [55] ............................
3.49. ...............................................................................................................................................
3.50. ...............................................................................................................................................
3.51. ...............................................................................................................................................
3.52. ...............................................................................................................................................
3.53. ...............................................................................................................................................
3.54. ...............................................................................................................................................
3.55. ...............................................................................................................................................
3.56. ...............................................................................................................................................
3.57. ...............................................................................................................................................
3.58. ...............................................................................................................................................
3.59. ...............................................................................................................................................
3.60. ...............................................................................................................................................
3.61. ...............................................................................................................................................
3.62. ...............................................................................................................................................
3.63. ...............................................................................................................................................
3.64. ...............................................................................................................................................
3.65. ...............................................................................................................................................
3.66. ...............................................................................................................................................
3.67. ...............................................................................................................................................
3.68. Additional frequency/voltage points (P points) introduced in the Richland APU [85] .........
3.69. ...............................................................................................................................................
3.70. ...............................................................................................................................................
3.71. ...............................................................................................................................................
3.72. ...............................................................................................................................................
3.73. ...............................................................................................................................................
3.74. ...............................................................................................................................................
3.75. ...............................................................................................................................................
3.76. ...............................................................................................................................................
3.77. ...............................................................................................................................................
3.78. ...............................................................................................................................................
4.1. .................................................................................................................................................
4.2. .................................................................................................................................................
4.3. .................................................................................................................................................
4.4. .................................................................................................................................................
4.5. .................................................................................................................................................
4.6. .................................................................................................................................................
4.7. .................................................................................................................................................
4.8. .................................................................................................................................................
4.9. .................................................................................................................................................
4.10. ...............................................................................................................................................
4.11. ...............................................................................................................................................
4.12. ...............................................................................................................................................
4.13. ...............................................................................................................................................
4.14. ...............................................................................................................................................
4.15. ...............................................................................................................................................
4.16. ...............................................................................................................................................
4.17. ...............................................................................................................................................
4.18. ...............................................................................................................................................
4.19. ...............................................................................................................................................
4.20. ...............................................................................................................................................
4.21. ...............................................................................................................................................
4.22. ...............................................................................................................................................
4.23. ...............................................................................................................................................
Part I - Introduction
Contents
1. Introduction
1. Foreword
2. The mobile boom and its consequences to computer architectures
3. Consequences of the low power requirement of mobile devices for Intel and AMD
4. Foreseeable market situation
5. Intel’s response to the mobile challenge
6. Evolution of Intel’s basic architectures [2]
7. AMD’s response to the mobile challenge
8. Evolution of AMD’s basic architectures
9. Overview of Intel’s and AMD’s actual processor lines
10. Scope of these slides
11. Reasons for this decision
2. References
Chapter 1 - Introduction
1. Foreword
The course "Multicore processors" presents the basic multicore architectures of the leading processor
manufacturers, Intel and AMD, which are widely used in engineering.
It focuses mainly on the microarchitecture of the dominant multicore processor families, emphasizing the
incentives for and implications of the major steps of their evolution.
Note that a section on Intel’s many-core Xeon Phi family can be found in the Outlook chapter of the course
on "GPGPUs and their programming".
2. The mobile boom and its consequences to
computer architectures
In the second half of the 2000s, mobile devices (smartphones, tablets) emerged very rapidly.
For mobile devices, however, low-power operation is the overriding design paradigm.
This differs sharply from the design paradigm of conventional devices, such as desktops or servers, as depicted
below.
Figure 1.1
For low-power CPU microarchitectures (CPU cores), low-power operation raises two basic
requirements:
a) Low-power CPUs need to have “narrow” microarchitectures
b) Low-power CPUs need to have relatively low base clock frequencies
as briefly discussed next.
a) Low-power CPUs need to have “narrow” microarchitectures (e.g. 2-wide microarchitectures).
Example: Microarchitectures of the ARM CPUs underlying tablets and smartphones [1]
Figure 1.2
By contrast, traditional processors typically have wide microarchitectures, as the next
example shows.
Example: Basic layout of the microarchitecture of Intel’s Core 2 to Haswell processors underlying laptops, PCs
and servers.
Figure 1.3
b) Low-power CPUs need to have relatively low base clock frequencies
Figure 1.4
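Here, take into account that $D = \mathrm{const} \cdot f_c \cdot V^2$, where $D$ is the dissipation, $f_c$ the clock frequency and $V$ the supply voltage; in addition, a higher $f_c$ requires a higher $V$.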
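The dissipation relation above can be made concrete with a small calculation. The following Python sketch computes the ratio of the dissipation at two operating points; the constant cancels in the ratio. The function name and the sample scaling factors (for instance the 20% voltage reduction) are illustrative assumptions, not measured values of any particular processor.

```python
def relative_dissipation(freq_ratio: float, voltage_ratio: float) -> float:
    """Return D_new / D_old for D = const * fc * V^2.

    freq_ratio    = fc_new / fc_old
    voltage_ratio = V_new / V_old
    The technology-dependent constant cancels out of the ratio.
    """
    return freq_ratio * voltage_ratio ** 2


# Halving the clock frequency at unchanged voltage halves the dissipation.
half_clock = relative_dissipation(0.5, 1.0)

# Assumed example: the lower clock also allows a 20% lower supply voltage,
# which reduces the dissipation further through the quadratic term.
half_clock_lower_v = relative_dissipation(0.5, 0.8)

print(f"half clock, same voltage:   {half_clock:.2f} x original dissipation")
print(f"half clock, 0.8 x voltage:  {half_clock_lower_v:.2f} x original dissipation")
```

Because the voltage enters quadratically and a lower clock frequency typically permits a lower voltage, reducing the clock frequency lowers dissipation more than proportionally, which is why low-power cores favour relatively low clock rates.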
3. Consequences of the low power requirement of
mobile devices for Intel and AMD
Figure 1.5
4. Foreseeable market situation
Figure 1.6
5. Intel’s response to the mobile challenge
Intel introduced the Atom line of processors in April 2008.
6. Evolution of Intel’s basic architectures [2]
Figure 1.7
7. AMD’s response to the mobile challenge
AMD introduced the Bobcat line of processors in January 2011.
8. Evolution of AMD’s basic architectures
Figure 1.8
9. Overview of Intel’s and AMD’s actual processor
lines
Figure 1.9
10. Scope of these slides
Of all the processor lines above, these slides focus on two high performance/power-oriented
families, as indicated below.
Figure 1.10
11. Reasons for this decision
a) Engineering work today typically relies on laptops, desktops and servers.
These computers are usually built on Core 2-based or Bulldozer-based processors.
b) Intel’s Itanium processors target mission-critical servers; they are not widely used and are approaching their
end-of-life, as they will lose Microsoft’s OS support in the future.
Core 2-based processor lines are presented in Part II, whereas Bulldozer-based lines are presented in Part III.
Chapter 2 - References
[1.1] Goto H., ARM Cortex-A Family Architecture, 2010,
http://pc.watch.impress.co.jp/video/pcw/docs/423/409/p1.pdf
[1.2] Smith S.L., Intel Strategy & Technology Update, Barclays Capital Global Technology Conf., Dec. 2011,
http://files.shareholder.com/downloads/INTC/1576180143x0x526852/c9868a3a-494e-4506-bcc6a631aca1fd75/Steve%20Smith%20Barclays%20Dec%202011.pdf
Part II - Intel’s Core 2-based processor lines
Contents
1. Introduction .................................................................................................................................. 15
1. The evolution of Intel’s basic microarchitectures ............................................................... 15
2. Intel’s Tick-Tock model ...................................................................................................... 15
3. Basic architectures and their related shrinks ....................................................................... 16
2. The Core 2 line ............................................................................................................................. 19
1. 2.1 Introduction ................................................................................................................... 19
2. 2.2 Major innovations of the Core 2 line ............................................................................. 20
2.1. 2.2.1 Wide execution .............................................................................................. 20
2.1.1. 4-wide core ................................................................................................. 20
2.1.2. Enhanced execution resources ................................................................... 22
2.1.3. Performance leadership changes between Intel and AMD ......................... 23
2.1.4. Example 1: DP web-server performance comparison (2003) .................... 24
2.1.5. Example 2: Summary assessment of extensive benchmark tests contrasting dual
Opterons vs dual Xeons (2003) [7] ...................................................................... 24
2.1.6. Example: DP web-server performance comparison (2006) ....................... 25
2.2. 2.2.2 Smart L2 cache .............................................................................................. 25
2.2.1. Shared L2 cache ......................................................................................... 25
2.2.2. Benefits of shared caches ........................................................................... 26
2.2.3. Drawbacks of shared caches ...................................................................... 26
2.3. 2.2.3 Smart memory accesses ................................................................................. 26
2.3.1. Hardware prefetchers [9] ............................................................................ 26
2.3.2. Intensive use of hardware prefetchers [11] ................................................ 27
2.3.3. Hardware prefetchers within the Core 2 microarchitecture ........................ 27
2.4. 2.2.4 Enhanced digital media support ..................................................................... 27
2.4.1. Widening the width of FP/SSE Execution units from 64-bit to 128-bit ..... 27
2.4.2. Overview of the x86 ISA extensions in Intel’s processor lines ................... 28
2.4.3. Achieved performance boost in Core 2 for gaming apps ........................... 30
2.5. 2.2.5 Intelligent Power management ...................................................................... 31
2.5.1. Ultra fine grained power control ................................................................ 31
2.5.2. Platform Thermal Control .......................................................................... 31
2.5.3. Possible solution for the Platform Thermal Control Manager [88] ............ 32
3. 2.3 Overview of Core 2 based processor lines .................................................................... 32
3. The Penryn line ............................................................................................................................. 34
1. 3.1 Introduction ................................................................................................................... 34
1.1. Penryn ..................................................................................................................... 34
2. 3.2 Key enhancements of Penryn line ................................................................................. 35
2.1. 3.2.2 More advanced power management .............................................................. 36
2.1.1. Deep Power Down technology (DPD) ....................................................... 36
2.1.2. Enhanced Dynamic Acceleration Technology (EDAT) (for mobiles) ....... 38
2.1.3. Overall performance achievements with Penryn (1) ................................. 39
3. 3.3 Overview of Penryn based processor lines .................................................................... 40
4. The Nehalem line .......................................................................................................................... 42
1. 4.1 Introduction to the 1. generation Nehalem line (Bloomfield) ...................................... 42
1.1. Die shot of the 1. generation Nehalem desktop processor (Bloomfield) [45] ......... 43
2. 4.2 Major innovations of the 1. generation Nehalem line [54] ............................................ 44
2.1. 4.2.1 Simultaneous Multithreading (SMT) ............................................................. 45
2.1.1. Performance gains of SMT ........................................................................ 46
2.2. 4.2.2 New cache architecture .................................................................................. 47
2.2.1. Distinguished features of Nehalem’s cache architecture ............................ 47
2.3. 4.2.3 Integrated memory controller ........................................................................ 48
2.3.1. Main features .............................................................................................. 49
2.3.2. Benefit of integrated memory controllers .................................................. 49
2.3.3. Drawback of integrated memory controllers .............................................. 49
2.3.4. Non Uniform Memory Access (NUMA) .................................................. 49
2.3.5. Memory latency comparison: Nehalem vs Penryn ..................................... 50
2.4. 4.2.4 QuickPath Interconnect bus (QPI) ................................................................ 51
2.4.1. Signals of the QuickPath Interconnect bus (QPI bus) ............................... 52
2.4.2. QuickPath Interconnect bus (QPI) ............................................................. 52
2.4.3. QPI based DP and MP server system architectures ................................... 52
2.4.4. Comparison of the transfer rates of the QPI, FSB and HT buses .............. 53
2.4.5. The notion of “Uncore” .............................................................................. 53
2.5. 4.2.5 Enhanced power management ....................................................................... 54
2.5.1. Nehalem’s Turbo Mode ............................................................................. 55
2.5.2. ACPI states [26] ......................................................................................... 55
2.6. 4.2.6 New socket .................................................................................................... 58
3. 4.3 Major innovations of the 2. generation Nehalem line (Lynnfield) (1) [46] ................... 58
3.1. Major innovations of the 2. generation Nehalem line (Lynnfield) (2) [46] ............ 58
3.2. Evolution of providing PCIe lanes for graphics ..................................................... 59
3.3. Evolution of the topology and type of available PCIe lanes for graphics cards ..... 59
3.4. Major innovations of the 2. generation Nehalem line (Lynnfield) (3) [46] ............ 60
3.5. Die photos of the 1. and 2. gen. Nehalem desktop chips ........................................ 60
5. The Nehalem-EX line ................................................................................................................... 62
1. 5.1 Introduction ................................................................................................................... 62
1.1. Overview of the Nehalem-EX based processor lines (based on [44]) .................... 62
2. 5.2 Major innovations of the Nehalem-EX processors ....................................................... 62
2.1. 5.2.1 Overview ...................................................................................................... 62
2.2. 5.2.2 Native 8 cores with 24 MB L3 cache (LLC) [55] .......................................... 62
2.2.1. Die micrograph of the 8 core Nehalem-EX (Xeon 7500/Beckton) MP server [71],
[72] ...................................................................................................................... 63
2.3. 5.2.3 On-die ring interconnect bus [56] .................................................................. 63
2.4. 5.2.4 Serial memory channels [55] ......................................................................... 64
2.5. 5.2.5 Scalable platform configurations [55] ........................................................... 65
3. 5.3 Performance features of the 8-core Nehalem-EX based Xeon 7500 vs the Penryn based 6-core
Xeon 7400 [67] ....................................................................................................................... 66
6. The Westmere line ........................................................................................................................ 67
1. 6.1 Introduction ................................................................................................................... 67
1.1. Westmere 2-core and 6-core die plots [57] ............................................................. 67
2. 6.2 Key enhancements of the Westmere lines vs. the Nehalem lines [44] .......................... 68
2.1. Overview of the Westmere lines ............................................................................. 68
3. 6.3 Dual-core Westmere-based mobile/desktop lines ........................................................ 68
3.1. 6.3.1 Overview ....................................................................................................... 68
3.2. 6.3.2 Innovations and enhancements of the dual-core mobile/desktop lines .......... 69
3.2.1. 6.3.2.1 Overview ........................................................................................ 69
3.2.2. 6.3.2.2 In-package integrated CPU/GPU for the 2 core mobile and desktop
segments .............................................................................................................. 69
3.2.3. 6.3.2.3 Enhanced Turbo Boost technology in the mobile Arrandale line [57] 73
4. 6.4 The six core Westmere-based desktop line ................................................................... 73
4.1. Platform and main features of the six core Westmere-based desktop line [93] ...... 74
5. 6.5 Six core Westmere-EP server lines ............................................................................... 74
5.1. Native 6 cores with 12 MB L3 cache (LLC) for UP/DP servers [58] ..................... 74
5.2. Overview of the models of the Westmere-EP based Xeon 5600 family [94] ......... 74
5.3. Example Westmere-EP DP server platform [57] .................................................... 75
7. The Westmere-EX line ................................................................................................................ 76
1. 7.1 Introduction ................................................................................................................... 76
2. 7.2 Key enhancement of the Westmere-EX line vs. the Nehalem-EX server line [95] ....... 76
3. 7.3 Selected details of the Westmere-EX processors .......................................................... 77
3.1. 7.3.1 Native 10 cores with 30 MB L3 cache (LLC) [60] ........................................ 77
3.2. 7.3.2 Basic building blocks of the Westmere-EX processor (10 cores/30 MB L3 cache)
(LLC) [60] ..................................................................................................................... 77
3.3. 7.3.3 Interconnection of the basic building blocks of the Westmere-EX processors [60]
78
8. The Sandy Bridge line .................................................................................................................. 79
1. 8.1 Introduction ................................................................................................................... 79
1.1. Overview of the Sandy Bridge family .................................................................... 79
1.2. Overview of the Sandy Bridge based processor lines ............................................. 79
1.3. Main functional units of Sandy Bridge [96] ........................................................... 80
2. 8.2 Major innovations of the Sandy Bridge line vs. the 1. generation Nehalem line [61] ... 81
2.1. 8.2.1 Overview ....................................................................................................... 81
2.2. 8.2.2 Extension of the ISA (of the cores) by the AVX instruction set .................... 82
2.3. 8.2.2 Extension of the ISA (of the cores) by the AVX instruction set (Based on [18]) 82
2.3.1. The AVX extension includes [97]: ............................................................. 83
2.3.2. Implementation of AVX ............................................................................ 84
2.3.3. Subsequent evolution of AVX [97] ............................................................ 84
2.4. 8.2.3 New microarchitecture of the cores .............................................................. 85
2.5. 8.2.4 On die ring interconnect bus [66] .................................................................. 85
2.5.1. Main features of the on-die interconnect bus [64] ...................................... 86
2.6. 8.2.5 On die graphics unit [99] ............................................................................... 86
2.6.1. Support of both media and graphics processing by the graphics unit [99] . 87
2.6.2. Main features of the on die graphics unit [99] ........................................... 87
2.6.3. Specification data of the HD 2000 and HD 3000 graphics [100] .............. 88
2.6.4. Performance comparison of the Sandy Bridge’s graphics: gaming [101] .. 88
2.7. 8.2.6 Enhanced Turbo Boost technology [64] ........................................................ 89
2.7.1. Intelligent power sharing between the cores and the integrated graphics [64] 90
3. 8.3 Example for a Sandy Bridge based desktop platform with the H67 chipset [102] ........ 91
4. 8.4 The E3-1200 UP server line [103] ................................................................................ 92
9. The Sandy Bridge-E line .............................................................................................................. 93
1. 9.1 Introduction ................................................................................................................... 93
1.1. Overview of the Sandy Bridge-E based processor lines ......................................... 93
1.2. Comparison of die parameters of recent DT processors [77] ................................. 93
2. 9.2 Differences to the original Sandy Bridge line ............................................................... 93
2.1. 9.2.1 Overview ...................................................................................................... 93
2.1.1. 9.2.2 6 cores, no integrated graphics ......................................................... 94
2.1.2. 9.2.3 4 parallel memory channels instead of 2 available in the Sandy Bridge lines
.............................................................................................................................. 95
2.1.3. 9.2.4 40 PCIe 2. gen. lanes to connect multiple graphics cards to the processor
96
2.2. 9.2.2 LGA-2011 socket instead of the LGA-1155 used in the original Sandy Bridge line
98
2.2.1. Main features of the Sandy Bridge-E line vs the Sandy Bridge line [77] .. 98
2.2.2. Example for a Sandy Bridge-E/X79 based 4-way SLI multi graphics card
configuration ........................................................................................................ 99
10. The Sandy Bridge-EN/EP line .................................................................................................. 100
1. 10.1 Introduction ............................................................................................................... 100
1.1. Overview of the Sandy Bridge-EN/EP lines ......................................................... 100
1.2. Improvements of the microarchitecture of the Sandy Bridge-EN/EP processors [107]
100
1.3. Die shot of the Xeon E5-2600 [107] ..................................................................... 101
1.4. The interconnection ring connecting main units of the processor [107] ............... 101
2. 10.2 Main enhancements of the Sandy Bridge-EN line over the previous Westmere-EP Xeon
5600 line [108] ...................................................................................................................... 102
3. 10.3 Main enhancements of the Sandy Bridge-EP line over the Sandy Bridge-EN line [108] 102
3.1. Feature comparison Westmere-EP 5600, Sandy Bridge-EN (E5-2400) and Sandy Bridge-EP (E5-2600) [108] .................................................................................................... 102
3.2. Comparison of the dual socket (DP) Sandy Bridge-EN and Sandy Bridge-EP platforms
[109] ............................................................................................................................ 103
3.3. The dual socket (DP) Xeon E5-2600 (Sandy Bridge-EP) Romley platform [110] 104
3.4. The quad socket (MP) Xeon E5-2600 (Sandy Bridge-EP) Romley platform [110] 105
4. 10.4 Main features of selected E5-EP models [111] ......................................................... 105
5. 10.5 More details on the Romley server platform ............................................................. 106
5.1. The Patsburg (C600) chipset ............................................................................... 106
6. 10.6 Performance comparison Sandy Bridge-EP vs. Westmere-EP X5680 [112] ............ 107
6.1. Summary assessment of the performance comparison ......................................... 108
6.2. Historical increase of the integer performance of 2 Socket (2S) configurations [113]
108
7. 10.7 Intel’s Xeon E5 family server roadmap [114] .......................................................... 109
11. The Ivy Bridge line ................................................................................................................... 110
1. 11.1 Introduction ............................................................................................................... 110
1.1. Overview of the Ivy Bridge family-1 ................................................................... 110
1.2. Overview of the Ivy Bridge family-2 ................................................................... 110
1.3. Contrasting the Sandy Bridge and Ivy Bridge dies [81] ....................................... 112
1.4. Main implementation parameters of recent processors [81] ................................. 112
1.5. Overview of the Ivy Bridge based processor lines ............................................... 112
2. 11.2 Major innovations of Ivy Bridge [80] ....................................................................... 113
2.1. 11.2.1 Overview ................................................................................................... 113
2.2. 11.2.2 The 22 nm tri-gate process technology within Intel’s technology roadmap [82]
114
2.2.1. The traditional planar transistor [82] ........................................................ 114
2.2.2. The 22 nm Tri-Gate transistor-1 [82] ....................................................... 115
2.2.3. The 22 nm Tri-Gate transistor-2 [82] ....................................................... 116
2.2.4. Transistor characteristics [82] .................................................................. 117
2.2.5. Transistor gate delay [82] ......................................................................... 118
2.2.6. Intel’s 22 nm manufacturing fabs [82] ..................................................... 119
2.2.7. Ivy Bridge chips on a 300 mm wafer [82] ............................................... 120
2.3. 11.2.3 Supervisory Mode Execute Protection [83] ............................................... 121
2.4. 11.2.4 Next generation processor graphics and media [81] .................................. 121
2.4.1. Overview of video interfaces of computing devices to external displays 122
3. 11.3 Main features of Ivy Bridge-based first introduced processors ................................. 122
3.1. 11.3.1 Main features of the first introduced Ivy Bridge-based desktop models [116] 122
3.2. 11.3.2 Main features of the first introduced Ivy Bridge-based mobile models [116] 123
4. 11.4 Ivy Bridge-based desktop platform [81] ................................................................... 123
5. 11.5 Performance assessment of the desktop models ........................................................ 124
5.1. 11.5.1 CPU performance of the highest clocked Ivy Bridge model Core i7-3770K [81]
(Higher is better) ......................................................................................................... 124
5.2. 11.5.2 Relative GPU performance (with games DX9/DX10/DX11) of the highest
performance Ivy Bridge DT model Core i7-3770K (Resolutions 1440x900, 1680x1050) [81]
(Higher is better) ......................................................................................................... 125
5.2.1. Increasing performance of Intel’s integrated graphics [117] ................... 125
6. 11.6 Main features of first introduced Ivy Bridge-based Xeon E3-12xx v2 models [118] 126
7. 11.7 The Ivy Bridge-based Xeon E3-1200 v2 platform (called the Bromolow refresh server
platform) [119] ...................................................................................................................... 126
12. The Ivy Bridge-E line ............................................................................................................... 128
1. 12.1 Introduction ............................................................................................................... 128
1.1. Overview of the Ivy Bridge-E based processor lines ............................................ 128
2. 12.2 Differences to the previous Sandy Bridge-E line [132] ............................................ 128
2.1. Overview of providing PCIe lanes on Intel desktop processors ........................... 128
2.2. Die plot of an Ivy Bridge-E processor [133] ........................................................ 129
2.3. Main features of Ivy Bridge-E models [131] ........................................................ 129
3. 12.3 Example for an Ivy Bridge-E based desktop platform with the X79 chipset [134] ... 130
4. 12.4 Performance increase achieved by the Ivy Bridge-E line vs. the previous Sandy Bridge-E
line [135] ............................................................................................................................... 130
13. The Ivy Bridge-EN/EP lines ..................................................................................................... 132
1. 13.1 Introduction ............................................................................................................... 132
1.1. Overview of the Ivy Bridge-EN/EP lines ............................................................. 132
1.2. Die layouts [137] .................................................................................................. 132
1.3. Die shot of the ten-core Ivy Bridge-EP processor [138] ....................................... 133
2. 13.2 Main enhancements of the Ivy Bridge-EP-based Xeon E5-2600 v2 line vs. the Sandy Bridge-EP-based Xeon E5-2600 line [138] ....................................................................................... 134
2.1. Comparison of main features of the Ivy Bridge-EP-based Xeon E5-2600 v2 line vs. the
Sandy Bridge-EP-based Xeon E5-2600 line [138] ..................................................... 134
3. 13.3 Main features of specific models of the Xeon E5-2600 v2 series [139] .................... 135
4. 13.4 Main features of specific models of the Xeon E5-1600 v2 series [140] .................... 135
5. 13.5 The Romley server platform [138] ............................................................................ 136
5.1. Intel Xeon E5 family server roadmap [136] ......................................................... 136
14. The Ivy Bridge-EX line ............................................................................................................ 138
1. 14.1 Introduction ............................................................................................................... 138
1.1. Ivy Bridge-EX ...................................................................................................... 138
2. 14.2 Main features of the Ivy Bridge-EX line [142] ......................................................... 138
15. The Haswell line ....................................................................................................................... 139
1. 15.1 Introduction ............................................................................................................... 139
1.1. Overview of Haswell-based processor lines (Based on [120]) ............................. 139
1.2. Die plot of a Haswell processor [121] .................................................................. 139
1.3. Sub-families of Haswell [144]
2. 15.2 Key enhancements of the Haswell cores [80]
2.1. Buffer sizes of subsequent generations of Core processors [80]
2.2. Cache sizes, latencies and bandwidth values of subsequent Core generations [122]
2.3. Issue rate and execution unit enhancements of Haswell [80]
3. 15.3 ISA enhancements of the Haswell cores [80]
3.1. Evolution of the AVX ISA extension [97]
3.2. Enhancements of AVX2 [97]
3.3. FMA and peak FLOPs of Haswell [97]
4. 15.4 Main innovations of the Haswell processor
4.1. 15.4.1 Overview
4.2. 15.4.2 Enhanced graphics
4.2.1. Main enhancements of the Iris Pro and Iris graphics units [123]
4.2.2. Performance boost provided by the Iris Pro/Iris graphics vs. the previous generation [123]
4.2.3. Graphics performance increase of subsequent Core generations [117]
4.3. 15.4.3 On-package eDRAM cache [117]
4.3.1. Principle of operation [117]
4.3.2. Implemented on-package eDRAM [124]
4.3.3. Memory latency vs. access range in a memory system with eDRAM cache (L4) [117]
5. 15.5 Main features of the Haswell line of mobile and desktop processors
5.1. 15.5.1 Example 1: Main features of Haswell-based mobile Core i7 M-Series processors [125]
5.2. 15.5.2 Example 2: Main features of Haswell-based Core i7 desktop processors [126]
6. 15.6 Haswell-based desktop platform [145]
7. 15.7 Integer and FP performance of subsequent generations of Core processors [127]
8. 15.8 Graphics performance of subsequent generations of Core processors [127]
9. 15.9 Main features of Haswell-based Xeon E3-12xx v3 line of server processors [128]
9.1. Main features of subsequent generations of E3-1200 Xeon processors [129]
10. 15.10 Haswell-based Xeon E3-1200 v3 server platform [130]
16. The Haswell-E line
1. 16.1 Introduction
2. 16.2 Differences to the previous Ivy Bridge-E line [143]
2.1. 16.2.1 Overview
2.2. 16.2.2 The Haswell-E processor [143]
2.3. 16.2.3 The Wellsburg-X PCH [143]
2.4. 16.2.4 DDR4 memory [143]
2.5. 16.2.4 DDR4 memory [143]
17. 17. References
Chapter 1 - Introduction
Remarks
1. As release dates we give the dates of the first processor shipments in a given line rather than the dates of the processor announcements.
New models of a line shipped subsequently are not taken into account, in order to keep the overviews comprehensible.
2. The core numbers on the slides reflect the maximum number of cores.
Manufacturers usually also offer processors with fewer than the maximum number of cores.
1. The evolution of Intel’s basic microarchitectures
Figure 1.1 - Intel’s Tick-Tock development model (Based on [1])
2. Intel’s Tick-Tock model
Figure 1.2 - Overview of Intel’s Tick-Tock model (Based on [3])
3. Basic architectures and their related shrinks
Considered from the Pentium 4 Prescott (the third core of the Pentium 4) onward.
Figure 1.3
Remarks
1. In 2003 Intel shifted the focus of their processor development from pure performance to performance per watt, as stated in a slide from 4/2006 (see below).
2. It is remarkable how far ahead Intel planned the development of both their process and processor technology, and how accurately these plans were met.
Figure 1.4 - Intel’s plan to develop their manufacturing technology and processor lines, revealed at a shareholders’ meeting in 4/2006 [74]
3. Intel renamed their Gesher architecture (meaning bridge in Hebrew) to Sandy Bridge at the IDF Spring 2007 (4/2007), as an ill-fated political party in Israel bore the same name [75].
The Sandy Bridge architecture was developed in Intel’s Development Center in Haifa, Israel, starting in 2006, after the development of Core 2 had been completed.
Figure 1.5 - Intel’s plan to develop their manufacturing technology and processor lines, revealed at the IDF Spring 2007 in 4/2007 [75]
4. At their shareholders’ meeting in 4/2006 Intel also revealed their principles for designing microprocessors [74].
Possibly the most important design decision was to develop a single microarchitecture for all high-volume market segments.
Figure 1.6 - Intel’s design principles for developing microprocessors, revealed at their shareholders’ meeting in 4/2006 [74]
Note
The policy of “one microarchitecture for all high-volume market segments” was later changed: in 2008, more or less along with their Nehalem processor, Intel also introduced their low-power Atom processor line for the mobile market segment.
Chapter 2 - The Core 2 line
1. 2.1 Introduction
Figure 2.1
Figure 2.2 - Key features of the Core 2 microarchitecture [16]
2. 2.2 Major innovations of the Core 2 line
2.1. 2.2.1 Wide execution
• 4-wide core
• Enhanced execution resources
2.1.1. 4-wide core
4-wide front end and retire unit.
This is the key benefit of the Core family.
By contrast, both Intel’s previous Pentium 4 family and AMD’s K8 have 3-wide cores.
Figure 2.3 - Block diagram of Intel’s Core 2 microarchitecture [4]
Figure 2.4 - Block diagram of Intel’s Pentium 4 microarchitecture [5]
Figure 2.5 - Block diagram of AMD’s K8 microarchitecture [4]
2.1.2. Enhanced execution resources
The Core 2 has three complex SSE units.
By contrast:
• The Pentium 4 provides a single complex SSE unit and a second, simple SSE unit performing only SSE move and store operations.
• AMD’s K8 has two SSE units.
Figure 2.6 - Issue ports and execution units of the Core 2 [4]
Figure 2.7 - Issue ports and execution units of the Pentium 4 [9]
Figure 2.8 - Block diagram of AMD’s K8 microarchitecture [4]
2.1.3. Performance leadership changes between Intel and AMD
• In 2003 AMD introduced their K8-based processors implementing
• the 64-bit x86 ISA and
• the direct connect architecture concept, that includes
• integrated memory controllers and
• high speed point-to-point serial buses (the HyperTransport bus) used to connect processors to processors
and processors to south bridges.
• AMD’s K8-based processors became the performance leader, first of all on the DP and MP server market,
where the 64-bit direct connect architecture has clear benefits vs Intel’s 32-bit Pentium 4 based processors
using shared FSBs to connect processors to north bridges.
2.1.4. Example 1: DP web-server performance comparison (2003)
Figure 2.9 - DP web server performance comparison: AMD Opteron 248 vs. Intel Xeon 2.8 [6]
2.1.5. Example 2: Summary assessment of extensive benchmark tests
contrasting dual Opterons vs dual Xeons (2003) [7]
“In the extensive benchmark tests under Linux Enterprise Server 8 (32 bit as well as 64 bit), the AMD Opteron
made a good impression.
Especially in the server disciplines, the benchmarks (MySQL, Whetstone, ARC 2D, NPB, etc.) show quite
clearly that the Dual Opteron puts the Dual Xeon in its place”.
• This situation changed completely in 2006, when Intel introduced their Core 2 microarchitecture.
Figure 2.10
• The Core 2 has
• a 4-wide front end and retire unit compared to the 3-wide K8 or Pentium 4,
• three complex FP/SSE units compared to the two units available in the K8, or to the single complex unit and second simple unit (performing only FP move and FP store operations) of the Pentium 4.
• These and further enhancements of the Core microarchitecture, detailed subsequently, resulted in record-breaking performance figures.
⇒ Intel regained performance leadership vs. AMD.
2.1.6. Example: DP web-server performance comparison (2006)
Figure 2.11 - DP web server performance comparison: AMD Opteron 275/280 vs. Intel Xeon 5160 [8]
Remark
Both web-server benchmark results were published by the same source (AnandTech).
2.2. 2.2.2 Smart L2 cache
• Using a shared L2 cache instead of private L2 caches
• Intensive use of hardware prefetchers
2.2.1. Shared L2 cache
Shared L2 instead of private L2 caches associated with the cores.
Figure 2.12 - Core’s shared L2 cache vs. Pentium 4’s private L2 caches
2.2.2. Benefits of shared caches
• Dynamic cache allocation to the individual cores
• Efficient data sharing (no replicated data)
+ 2x bandwidth to L1 caches.
2.2.3. Drawbacks of shared caches
Figure 2.13
Shared caches combine the access patterns of the cores.
⇒ This reduces the efficiency of hardware prefetching vs. private caches.
2.3. 2.2.3 Smart memory accesses
2.3.1. Hardware prefetchers [9]
Remarks
• Intel’s first hardware prefetcher appeared in the Pentium 4 family, associated with the L2 cache (2000).
• Intel’s first on-die L2 cache debuted only about one year earlier (10/1999), in the second core of the Pentium
III line (called the Coppermine core, built on 180 nm technology, with a size of 256 KB).
Principle of operation of the L2 hardware prefetcher
• It monitors data access patterns and prefetches data automatically into the L2 cache.
• It attempts to stay 256 bytes ahead of the current data access location.
• It remembers the history of cache misses in order to detect concurrent, independent data streams, which it tries to prefetch ahead of their use.
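The principle of operation listed above can be illustrated with a toy stream prefetcher. This is a minimal sketch and not Intel’s implementation: the 64-byte line size, the stream detection rule (a miss directly following a miss to the preceding line), the length of the miss history and all names are assumptions made for illustration; only the 256-byte prefetch distance is taken from the text.

```python
LINE = 64      # assumed cache line size in bytes (not stated in the text)
AHEAD = 256    # prefetch distance taken from the text


class StreamPrefetcher:
    """Toy model of an L2 stream prefetcher (hypothetical, for illustration)."""

    def __init__(self):
        self.cache = set()      # line addresses present in the modelled L2
        self.streams = set()    # next expected line of each detected stream
        self.misses = []        # short history of recently missed lines

    def access(self, addr):
        """Model one data access; return True on an L2 hit, False on a miss."""
        line = addr - addr % LINE
        hit = line in self.cache
        if not hit:
            self.cache.add(line)
            # A miss right after a miss to the preceding line starts a stream.
            if line - LINE in self.misses:
                self.streams.add(line)
            self.misses = (self.misses + [line])[-8:]
        if line in self.streams:
            # Advance the stream and fetch every line up to AHEAD bytes ahead.
            self.streams.remove(line)
            self.streams.add(line + LINE)
            for nxt in range(line + LINE, line + AHEAD + 1, LINE):
                self.cache.add(nxt)
        return hit


def count_misses(addresses):
    """Run a fresh prefetcher over an address trace and count the misses."""
    p = StreamPrefetcher()
    return sum(not p.access(a) for a in addresses)
```

In this model a sequential walk over 16 lines, `count_misses(range(0, 16 * LINE, LINE))`, incurs only the two misses needed to detect the stream, and two interleaved sequential streams are tracked independently.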
2.3.2. Intensive use of hardware prefetchers [11]
8 prefetchers per two-core processor
• 2 data and 1 L1 instruction prefetchers per core, able to handle multiple simultaneous patterns.
• 2 prefetchers in the L2 cache tracking multiple access patterns per core.
2.3.3. Hardware prefetchers within the Core 2 microarchitecture
Figure 2.14 - Hardware prefetchers within the Core 2 microarchitecture [11]
2.4. 2.2.4 Enhanced digital media support
• Widening the FP/SSE Execution units from 64-bit to 128-bit.
• Supplemental enhancement of the SSE3 ISA extension.
2.4.1. Widening the width of FP/SSE Execution units from 64-bit to 128-bit
Figure 2.15 - Widening the FP/SSE Execution Units from 64-bit to 128-bit [12]
• Supplemental enhancement of the SSE3 ISA extension,
as shown in the next Figure.
2.4.2. Overview of the x86 ISA extensions in Intel’s processor lines
Figure 2.16 - Intel’s x86 ISA extensions - the SIMD register space (based on [18]) BMA
Figure 2.17 - SIMD execution resources in Intel’s basic processors (based on [18])
Figure 2.18 - Overview of Intel’s x86 ISA extensions (based on [18])
Figure 2.19 - Intel’s x86 ISA extensions - the operations introduced (based on [17])
Figure 2.20
2.4.3. Achieved performance boost in Core 2 for gaming apps
Figure 2.21 - Achieved performance boost in Core 2 for gaming vs. AMD’s Athlon 64 FX60 [13]
2.5. 2.2.5 Intelligent Power management
• Ultra fine grained power control
• Platform Thermal Control
2.5.1. Ultra fine grained power control
Shutting down processor units that are currently not needed.
Figure 2.22 - The operation of the Ultra fine grained power control – an example [11]
2.5.2. Platform Thermal Control
Figure 2.23 - Principle of the Platform Thermal Control [11], [20]
2.5.3. Possible solution for the Platform Thermal Control Manager [88]
Figure 2.24 - The aSC7621 hardware monitor with fan control and PECI from Andigilog
3. 2.3 Overview of Core 2 based processor lines
Figure 2.25
Chapter 3 - The Penryn line
1. 3.1 Introduction
1.1. Penryn
Basically a shrink (tick) from the 65 nm Core to 45 nm with a few microarchitectural and ISA enhancements,
discussed subsequently.
Figure 3.1 - Dynamic and static power dissipation trends in chips [21]
Figure 3.2 - Structure of a high-k + metal gate transistor [23]
Figure 3.3 - Benefits of high-k + metal gate transistors [23], [24]
Figure 3.4 - The 45 nm Penryn is a shrink of the 65 nm Core 2 with a few enhancements [25]
2. 3.2 Key enhancements of the Penryn line
Figure 3.5 - Key enhancements introduced into Penryn’s microarchitecture vs. the Core (based on [25])
2.1. 3.2.2 More advanced power management
Figure 3.6 - Intel’s x86 ISA extensions - the operations introduced (based on [17])
• Deep Power Down (DPD) technology
• Enhanced Dynamic Acceleration Technology (EDAT)
available only on mobile platforms.
(Both techniques were introduced for general use in Nehalem.)
2.1.1. Deep Power Down technology (DPD)
(First introduced in the Core Duo, the third core of the Pentium M line.)
Figure 3.7 - Intel’s Deep Power Down technology [26]
Figure 3.8 - Operation of Intel’s Deep Power Down technology [27]
Figure 3.9 - Power reduction achieved by the Deep Power Down Technology [27]
2.1.2. Enhanced Dynamic Acceleration Technology (EDAT) (for mobiles)
Principle of EDAT in the dual core Penryn processors
Figure 3.10 - Principle of the Enhanced Dynamic Acceleration Technology [27]
Remark
Intel’s next basic core, Nehalem, includes a more advanced technology than the Enhanced Dynamic Acceleration Technology, called Turbo Boost Technology, which increases the clock frequency when cores are inactive or workloads are light.
2.1.3. Overall performance achievements with Penryn (1)
Figure 3.11 - Performance improvements of Penryn vs. Core at the same clock frequency [26]
3. 3.3 Overview of Penryn based processor lines
Figure 3.12
Chapter 4 - The Nehalem line
1. 4.1 Introduction to the 1. generation Nehalem line
(Bloomfield)
Developed at Hillsboro, Oregon, at the site where the Pentium 4 emerged.
Figure 4.1
The design effort took about five years and required thousands of engineers (Ronak Singhal, lead architect of Nehalem) [37].
The first implementation of the Nehalem microarchitecture, the quad-core desktop Core i7-9xx (Bloomfield), appeared in 11/2008.
Design objective: the same core for all major segments
Figure 4.2 - Design objective of Nehalem [1]
Nevertheless, in 2008 Intel also introduced their mobile-oriented, low-power Atom family.
Figure 4.3 - (Based on [44])
1.1. Die shot of the 1. generation Nehalem desktop processor
(Bloomfield) [45]
Figure 4.4
Note
• Both the desktop oriented Bloomfield chip and the DP server oriented Gainestown chip have the same layout.
• The Bloomfield die has two QPI bus controllers, in spite of the fact that they are not needed for the desktop
part.
In the Bloomfield die one of the controllers is simply not activated [45], whereas both are active in the DP
alternative (Gainestown).
2. 4.2 Major innovations of the 1. generation Nehalem
line [54]
• Native 4C
• Simultaneous Multithreading (SMT) (Section 4.2.1)
• New cache architecture (Section 4.2.2)
• SSE 4.2 ISA extension (Not detailed)
• Integrated memory controller (Section 4.2.3)
• QuickPath Interconnect bus (QPI) (Section 4.2.4)
• Enhanced power management (Section 4.2.5)
• Advanced virtualization (Not detailed)
• New socket (Section 4.2.6)
Figure 4.5 - Die photo of the Bloomfield/Gainestown chip
2.1. 4.2.1 Simultaneous Multithreading (SMT)
• Two-way multithreaded (two threads at the same time)
Benefits
• A 4-wide core is fed more efficiently (from 2 threads).
• Hides the latency of a single thread.
• More performance at a low additional die area cost.
• May provide a significant performance increase in dedicated applications.
Figure 4.6 - Simultaneous Multithreading (SMT) of Nehalem [1]
2.1.1. Performance gains of SMT
Figure 4.7 - Performance gains achieved by Nehalem’s SMT [1]
2.2. 4.2.2 New cache architecture
Figure 4.8 - The 3-level cache architecture of Nehalem (based on [1])
2.2.1. Distinguished features of Nehalem’s cache architecture
• The L2 cache is private again rather than shared as in the Core and Penryn processors
Figure 4.9
Assumed reason for returning to the private scheme
Private caches allow a more effective hardware prefetching than shared ones.
Reason
• Hardware prefetchers look for memory access patterns.
• Private L2 caches have more easily detectable memory access patterns than shared L2 caches.
Remark
Figure 4.10
• The L3 cache is inclusive rather than exclusive as in a number of competing designs, such as UltraSPARC
IV+ (2005), POWER5 (2005), POWER6 (2007), AMD’s K10-based processors (2007).
2.2.1.1. Intel’s argumentation for inclusive caches [38]
Inclusive L3 caches prevent L2 snoop traffic on L3 cache misses, since
• with an inclusive L3 cache, an L3 miss means that the referenced data does not exist in any core’s L2 cache, thus no L2 snooping is needed.
• By contrast, with an exclusive L3 cache the referenced data may exist in any of the L2 caches, thus L2 snooping is required.
For higher core numbers L2 snooping becomes a more demanding task and may overshadow the benefits arising from the more efficient cache use of the exclusive cache scheme.
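The argument can be made concrete by counting the L2 caches that have to be snooped after an L3 miss. This is a deliberately simplified sketch: it assumes that there is no snoop filter or directory and that every other core’s L2 must be probed under the exclusive scheme; the function name is hypothetical.

```python
def l2_snoops_on_l3_miss(policy, num_cores):
    """Number of other cores' L2 caches to snoop after an L3 miss (toy model)."""
    if policy == "inclusive":
        # The inclusive L3 holds a copy of every line kept in any L2,
        # so an L3 miss proves that no L2 holds the line.
        return 0
    if policy == "exclusive":
        # The line may still reside in the L2 of any other core.
        return num_cores - 1
    raise ValueError("policy must be 'inclusive' or 'exclusive'")


for cores in (4, 8):
    print(cores, "cores:",
          l2_snoops_on_l3_miss("inclusive", cores), "snoops (inclusive),",
          l2_snoops_on_l3_miss("exclusive", cores), "snoops (exclusive)")
```

In this model the snoop count of the exclusive scheme grows linearly with the number of cores, whereas it stays at zero for the inclusive scheme.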
2.3. 4.2.3 Integrated memory controller
Figure 4.11 - Integrated memory controller of Nehalem [33]
2.3.1. Main features
• 3 channels per socket
• Up to 3 DIMMs per channel (impl. dependent)
• DDR3-800, 1066, 1333,…
• Supports both RDIMM and UDIMM (impl. dependent)
• Supports single, dual and quad-rank DIMMs
2.3.2. Benefit of integrated memory controllers
Figure 4.12
2.3.3. Drawback of integrated memory controllers
• Processor becomes memory technology dependent
• For an enhanced memory solution (e.g. for increased memory speed) a new processor modification is needed.
2.3.4. Non Uniform Memory Access (NUMA)
It is a consequence of using integrated memory controllers in multi-socket servers.
Figure 4.13 - Non Uniform Memory Access (NUMA) in multi-socket servers [1]
• Most multi-socket platforms use NUMA.
• Remote memory access latency is about 1.7x longer than local memory access latency.
• Local memory bandwidth is up to 2x greater than remote bandwidth.
• NUMA demands a fast processor-to-processor interconnect (QPI) to relay memory traffic.
• Operating systems differ in their allocation strategies and APIs.
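The effect of the latency figure quoted above on an application can be estimated with a simple weighted average. This is a sketch under stated assumptions: the 1.7x remote latency factor comes from the text, whereas the 60 ns local latency used in the example and the function name are hypothetical.

```python
REMOTE_FACTOR = 1.7   # remote vs. local access latency, from the text


def average_latency_ns(local_ns, remote_fraction, remote_factor=REMOTE_FACTOR):
    """Mean memory latency when a given fraction of accesses goes to remote memory."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie between 0 and 1")
    local_part = (1.0 - remote_fraction) * local_ns
    remote_part = remote_fraction * remote_factor * local_ns
    return local_part + remote_part


# Hypothetical local latency of 60 ns, evaluated for three placement qualities.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, round(average_latency_ns(60.0, fraction), 1))
```

With half of the accesses going to remote memory the mean latency rises by 35 % over the purely local case, which indicates why the memory allocation strategy of the operating system matters on NUMA platforms.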
2.3.5. Memory latency comparison: Nehalem vs Penryn
Figure 4.14 - Memory latency comparison: Nehalem vs. Penryn [1]
Remark
Intel’s Timna – a forerunner to integrated memory controllers [34]
Timna (announced in 1999, due in 2H 2000, cancelled in Sept. 2000)
• Developed in Intel’s Haifa Design and Development Center.
• Low cost microprocessor with integrated graphics and memory controller (for Rambus DRAMs).
• Due to design problems and lack of interest from many vendors, Intel finally cancelled Timna in Sept. 2000.
Figure 4.15 - The low cost (<600 $) Timna PC [40]
Figure 4.16 - Point of attaching memory
2.4. 4.2.4 QuickPath Interconnect bus (QPI)
• A processor interconnect bus, connecting processors to processors or South Bridges.
Note
(The QPI isn’t an I/O interface; the standard I/O interface remains the PCI-Express bus.)
• Formerly designated as the Common System Interface bus (CSI bus).
• A serial, high-speed differential point-to-point interconnect (similar to the HyperTransport bus).
• Consists of 2 unidirectional links, one in each direction, called the TX and RX links.
• Its use is strongly motivated by the introduction of integrated memory controllers.
2.4.1. Signals of the QuickPath Interconnect bus (QPI bus)
Figure 4.17 - Signals of the QuickPath Interconnect bus (QPI bus) [22]
2.4.2. QuickPath Interconnect bus (QPI)
• A processor interconnect bus, connecting processors to processors or South Bridges.
• Formerly designated as the Common System Interface bus (CSI bus).
• A serial, high-speed differential point-to-point interconnect (similar to the HyperTransport bus).
• Consists of 2 unidirectional links, one in each direction, called the TX and RX links.
• Each unidirectional link comprises 20 data lanes and a clock lane, with each lane consisting of a pair of differential signals.
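The lane counts above determine the signal count of a QPI port and, together with a transfer rate, its bandwidth. The following sketch is illustrative only: the lane structure is taken from the text, whereas the transfer rate of 6.4 GT/s and the assumption that 16 of the 20 data lanes carry payload are not stated here and are used as commonly quoted figures for first-generation QPI.

```python
DATA_LANES = 20            # per unidirectional link, from the text
CLOCK_LANES = 1            # per unidirectional link, from the text
SIGNALS_PER_LANE = 2       # each lane is a differential pair, from the text
LINKS = 2                  # TX and RX, from the text

PAYLOAD_LANES = 16         # assumption: 16 of the 20 data lanes carry payload
TRANSFER_RATE_GT_S = 6.4   # assumption: 6.4 GT/s transfer rate


def qpi_signal_count():
    """Total number of signals of one QPI port (both directions)."""
    return (DATA_LANES + CLOCK_LANES) * SIGNALS_PER_LANE * LINKS


def qpi_bandwidth_gb_s(rate_gt_s=TRANSFER_RATE_GT_S, payload_lanes=PAYLOAD_LANES):
    """Return (per-direction, bidirectional) payload bandwidth in GB/s."""
    per_direction = rate_gt_s * payload_lanes / 8   # one bit per lane per transfer
    return per_direction, LINKS * per_direction


print(qpi_signal_count())
print(qpi_bandwidth_gb_s())
```

Under these assumptions a port has 84 signals and delivers about 12.8 GB/s per direction, i.e. about 25.6 GB/s in both directions together.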
2.4.3. QPI based DP and MP server system architectures
Figure 4.18 - QPI based DP and MP server system architectures [31], [33]
2.4.4. Comparison of the transfer rates of the QPI, FSB and HT buses
Figure 4.19
2.4.5. The notion of “Uncore”
Figure 4.20 - Interpretation of the notion “Uncore” [1]
2.5. 4.2.5 Enhanced power management
Discussed issues
• Integrated power gates
• Integrated Power Control Unit
• Turbo boost technology
Figure 4.21 - Use of integrated power gates [32]
Figure 4.22 - Overview of the Power Control unit [32]
2.5.1. Nehalem’s Turbo Mode
2.5.1.1. Aim
Utilizing the power headroom of inactive cores and of active cores with light workloads to increase the clock frequency.
Remarks
1. The Penryn core already introduced a less intricate technology for the same purpose, termed Enhanced Dynamic Acceleration Technology (EDAT), which increases the clock frequency only on the mobile platform and only when cores are inactive.
2. The turbo boost technology considers a core “active” if it is in the ACPI C0 or C1 state, whereas cores in the ACPI C3 to C6 states are considered “inactive”.
2.5.2. ACPI states [26]
Figure 4.23
2.5.2.1. Understanding the notion of TDP and the related potential to boost performance (1)
[50]
TDP (Thermal Design Power) is the maximum power consumed while running realistic worst-case applications (TDP applications).
The cooling system (thermal solution) needs to ensure that, at the maximum core frequency specified in connection with TDP, the junction temperature (Tj) does not exceed the junction temperature limit (Tjmax) while the processor runs TDP applications.
The maximum clock frequency related to TDP (2 GHz in the above example) is determined while running (worst-case) TDP applications that intensively utilize all four cores, such that at this frequency the dissipation still remains below TDP (i.e. 55 W in the above example).
2.5.2.2. Understanding the notion of TDP and the related potential to boost performance (2)
[50]
Typical workloads, however, are not intensive enough to push power consumption to the TDP limit.
The remaining power headroom can be utilized to increase the core frequency fc if
• the OS requests the highest-performance ACPI state (the P0 state),
• provided that the processor operates within its TDP and temperature limits.
The possible frequency increase (up to the given limit of the particular processor, called the turbo frequency) depends on the intensity of the workload and the number of active cores.
2.5.2.3. Principle of the turbo boost technology (1) [51]
If the OS requests an active core to increase fc beyond the TDP-limited maximum frequency (i.e. to enter the P0 state), and there is power headroom available
• either due to idle cores
• or due to a lightly threaded workload,
the turbo mode controller will increase the core frequency of the active cores, provided that the power consumption of the processor (socket) and the junction temperatures of the cores do not exceed the given limits.
In turbo mode all active cores in the processor operate at the same fc and voltage.
2.5.2.4. Principle of the turbo boost technology (2) [52]
Figure 4.24 - Turbo mode uses the available power headroom in processor package power limits [52]
2.5.2.5. Maximum Turbo boost frequency [53]
Maximum turbo frequencies are factory configured and kept in an internal register (MSR 1ADH).
E.g. in the case of the Core i7-920XM the base core frequency is 2.0 GHz and the maximum turbo frequency is 3.2 GHz, which is 9 frequency bins (of 133 MHz each) higher than the TDP-limited base core frequency of 2.0 GHz.
2.5.2.6. Increasing/decreasing turbo boost frequencies [51]
• If the OS is requesting the ACPI P0 state and the calculated power consumption of the package and the measured junction temperatures (Tj) of the cores remain below the factory-configured limits, the turbo boost controller automatically steps up the core frequency, typically by 133 MHz, until it reaches the maximum frequency dictated by the number of active cores or the intensity of the workload.
• When the power consumption of the processor or the junction temperature of any core exceeds the factory-configured limits, the turbo boost controller automatically steps down the core frequency in decrements of e.g. 133 MHz.
Remark
In the above example the PLL of the clock generator generates the actual clock frequency fc as n x 133 MHz, with n being stepped up and down.
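The stepping of the multiplier n can be sketched as a simple control step. This is an illustration under stated assumptions, not Intel’s controller: the 133 MHz bin, the 2.0 GHz base and 3.2 GHz maximum turbo frequency and the 55 W limit are taken from the example in the text, whereas the 100 °C junction temperature limit, the rule that n never falls below the base multiplier, the return to the base multiplier when P0 is not requested and the function name are assumptions. The dependence of the maximum on the number of active cores is ignored.

```python
BIN_MHZ = 400 / 3    # one frequency bin, about 133 MHz
BASE_N = 15          # 15 x 133 MHz = 2.0 GHz base frequency
MAX_TURBO_N = 24     # 24 x 133 MHz = 3.2 GHz maximum turbo frequency


def next_multiplier(n, p0_requested, power_w, tj_c,
                    power_limit_w=55.0, tj_limit_c=100.0):
    """Return the multiplier for the next control interval (toy model)."""
    if power_w > power_limit_w or tj_c > tj_limit_c:
        # A limit is exceeded: step down by one bin, but not below base (assumption).
        return max(BASE_N, n - 1)
    if not p0_requested:
        # No turbo requested by the OS: fall back to the base multiplier (assumption).
        return min(n, BASE_N)
    # Headroom available and P0 requested: step up by one bin up to the maximum.
    return min(MAX_TURBO_N, n + 1)


n = BASE_N
for _ in range(12):                       # a lightly loaded, cool processor
    n = next_multiplier(n, True, power_w=40.0, tj_c=70.0)
print(n - BASE_N, "bins above base,", round(n * BIN_MHZ), "MHz")
```

Starting from the base multiplier, the loop climbs nine bins and then stays at the maximum turbo frequency of 3.2 GHz, matching the Core i7-920XM example above.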
2.5.2.7. Assuring that power and temperature values do not exceed specified limits [53], [50]
To control fc, the turbo boost controller samples the current power consumption and die temperatures at 5 ms intervals [53].
Power consumption is determined by monitoring the processor current at its input pin as well as the associated voltage (Vcc), and calculating the power consumption as a moving average.
The junction temperatures of the cores are monitored by DTSs (Digital Thermal Sensors) with an error of ± 5 % [50].
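The power calculation described above can be sketched as a moving average over the sampled current and voltage. This is a minimal sketch: the window length of eight samples and all names are assumptions, as the text specifies only the 5 ms sampling interval and the use of a moving average.

```python
from collections import deque


class PowerMonitor:
    """Moving average of package power from current and voltage samples (toy model)."""

    def __init__(self, window=8):          # window length is an assumption
        self.samples = deque(maxlen=window)

    def sample(self, current_a, vcc_v):
        """Record one sample (taken e.g. every 5 ms) and return the averaged power in W."""
        self.samples.append(current_a * vcc_v)   # P = I x Vcc
        return sum(self.samples) / len(self.samples)


monitor = PowerMonitor(window=2)
for current in (10.0, 30.0, 50.0):
    print(monitor.sample(current, 1.0))
```

The averaging smooths out short current spikes, so a brief excursion does not immediately force the controller to step the frequency down.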
2.6. 4.2.6 New socket
Figure 4.25 - New LGA sockets
3. 4.3 Major innovations of the 2. generation Nehalem
line (Lynnfield) (1) [46]
• The Lynnfield chip is a major redesign of the Bloomfield chip that provides a cheaper and more effective
two-chip system solution instead of the previous three chip solution.
• The Lynnfield chip is connected to the P55 PCH by a DMI interface instead of a QPI interface that was used
in the previous solution for connecting the Bloomfield chip to the X58 chipset.
Figure 4.26
3.1. Major innovations of the 2. generation Nehalem line
(Lynnfield) (2) [46]
• It provides only 16 PCIe 2.0 lanes, which attach a graphics card directly to the processor, rather than the 36 lanes attached to the north bridge in the previous solution.
Figure 4.27
3.2. Evolution of providing PCIe lanes for graphics
Figure 4.28 - Main options of providing PCIe lanes on the processor for graphics cards in DT systems
3.3. Evolution of the topology and type of available PCIe lanes for
graphics cards
Figure 4.29
3.4. Major innovations of the 2. generation Nehalem line
(Lynnfield) (3) [46]
• It supports only two DDR3 memory channels instead of three as in the previous solution.
• Its socket needs fewer contacts (LGA-1156) than that of the Bloomfield chip (LGA-1366).
• All in all, the Lynnfield chip is a cheaper and more effective successor of the Bloomfield chip.
Figure 4.30
3.5. Die photos of the 1. and 2. gen. Nehalem desktop chips
Figure 4.31
Chapter 5 - The Nehalem-EX line
1. 5.1 Introduction
• The Nehalem-EX line targets DP/MP servers.
• This line is also designated as the Beckton family.
• The first Nehalem-EX processors were delivered in 3/2010.
1.1. Overview of the Nehalem-EX based processor lines (based
on [44])
Figure 5.1
2. 5.2 Major innovations of the Nehalem-EX
processors
2.1. 5.2.1 Overview
• Native 8 cores with 24 MB L3 cache (LLC) (Section 5.2.2)
• On-die ring interconnect (Section 5.2.3)
• Serial memory channels, designated as the Scalable Memory Interface (SMI) (Section 5.2.4)
• Enhanced support for RAS, virtualization and power control (not detailed here)
• Scalable platform configurations (Section 5.2.5)
• Extending turbo boost to 8 cores
2.2. 5.2.2 Native 8 cores with 24 MB L3 cache (LLC) [55]
Figure 5.2
2.2.1. Die micrograph of the 8 core Nehalem-EX (Xeon 7500/Beckton) MP
server [71], [72]
Figure 5.3
2.3. 5.2.3 On-die ring interconnect bus [56]
Intel’s first ring bus implementation, used subsequently in the Westmere-EX and Sandy Bridge.
Figure 5.4
2.4. 5.2.4 Serial memory channels [55]
Figure 5.5
Nehalem-EX processors provide an FB-DIMM like memory subsystem [55].
Figure 5.6
Remark: The SMI interface was formerly designated as the Fully Buffered DIMM2 interface
2.5. 5.2.5 Scalable platform configurations [55]
Nehalem-EX allows platform scaling from 2 to 256 sockets.
Figure 5.7
3. 5.3 Performance features of the 8-core Nehalem-EX
based Xeon 7500 vs the Penryn based 6-core Xeon
7400 [67]
Figure 5.8
Chapter 6 - The Westmere line
1. 6.1 Introduction
• Westmere (formerly Nehalem-C) is the 32 nm die shrink of Nehalem.
• The first Westmere-based processors were launched in 1/2010.
Figure 6.1
1.1. Westmere 2-core and 6-core die plots [57]
Figure 6.2
2. 6.2 Key enhancements of the Westmere lines vs.
the Nehalem lines [44]
• Over 100 incremental improvements in the microarchitecture [58] (not discussed here).
• Enhanced support for AES (Advanced Encryption Standard) by providing a set of instructions to perform
hardware accelerated encryption/decryption (not discussed here).
• Enhanced support for virtualization (not discussed here).
2.1. Overview of the Westmere lines
Figure 6.3
3. 6.3 Dual-core Westmere-based mobile/desktop
lines
3.1. 6.3.1 Overview
Figure 6.4
3.2. 6.3.2 Innovations and enhancements of the dual-core
mobile/desktop lines
3.2.1. 6.3.2.1 Overview
• In-package integrated CPU/GPU for the 2 core mobile and desktop segments (Section 6.3.2.2).
• Enhanced Turbo Boost technology in the dual-core mobile Arrandale line (Section 6.3.2.3).
3.2.2. 6.3.2.2 In-package integrated CPU/GPU for the 2 core mobile and desktop
segments
Figure 6.5
3.2.2.1. a) In-package integrated CPU/GPU of the mobile Arrandale line [89]
Figure 6.6 - Westmere-based dual-core mobile and desktop platform
3.2.2.2. Basic components of the mobile Arrandale line [90]
Figure 6.7
3.2.2.3. b) In-package integrated CPU/GPU of the desktop Clarkdale line
Figure 6.8 - The Clarkdale processor with in-package integrated graphics along with the H57 chipset [91]
3.2.2.4. Key features of the Clarkdale desktop line [92]
Figure 6.9
3.2.2.5. Integrated Graphics Media (IGM) architecture of Clarkdale [92]
Figure 6.10
3.2.3. 6.3.2.3 Enhanced Turbo Boost technology in the mobile Arrandale line
[57]
• The dual-core mobile Arrandale line extends Turbo Boost to the graphics core as well, by means of dynamic frequency scaling.
• In this way the CPU cores and the graphics core are co-optimized for overall performance and power efficiency.
• In addition, Westmere makes use of per-core and uncore power gates.
• Furthermore, the Clarkdale desktop alternative is the first Intel processor supporting both normal and low-voltage DDR3 memory.
4. 6.4 The six core Westmere-based desktop line
Figure 6.11
4.1. Platform and main features of the six core Westmere-based
desktop line [93]
Figure 6.12
5. 6.5 Six core Westmere-EP server lines
5.1. Native 6 cores with 12 MB L3 cache (LLC) for UP/DP servers
[58]
Figure 6.13
5.2. Overview of the models of the Westmere-EP based Xeon
5600 family [94]
Figure 6.14
5.3. Example Westmere-EP DP server platform [57]
Figure 6.15 - Westmere-EP 6-core DP server platform
Remark
In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the on-die integrated Sandy Bridge
line.
Chapter 7 - The Westmere-EX line
1. 7.1 Introduction
• Westmere-EX processors are 32 nm die shrinks of the 45 nm Nehalem-EX line.
• They are intended for UP/DP or MP servers.
• The first Westmere-EX processors were shipped in 4/2011.
• They are socket compatible with the Nehalem-EX line (Xeon 75xx or Beckton line).
Overview of the Westmere-EX based processor lines (based on [44])
Figure 7.1
2. 7.2 Key enhancement of the Westmere-EX line vs.
the Nehalem-EX server line [95]
Figure 7.2
3. 7.3 Selected details of the Westmere-EX processors
3.1. 7.3.1 Native 10 cores with 30 MB L3 cache (LLC) [60]
Westmere-EX processors have 10 cores with 30 MB of L3 cache (LLC) on the die vs. Nehalem-EX’s 8 cores
with 24 MB of L3 cache (LLC), as indicated in the next Figure, in order to compete with AMD’s 2x6-core
(dual-chip) Magny-Cours processors.
3.2. 7.3.2 Basic building blocks of the Westmere-EX processor
(10 cores/30 MB L3 cache) (LLC) [60]
Figure 7.3
3.3. 7.3.3 Interconnection of the basic building blocks of the
Westmere-EX processors [60]
As in Nehalem-EX, the basic building blocks of Westmere-EX are interconnected through a ring bus, as shown
below.
Figure 7.4
Chapter 8 - The Sandy Bridge line
1. 8.1 Introduction
• Sandy Bridge is Intel’s next new microarchitecture, manufactured with 32 nm line width.
• It was first delivered in 1/2011.
• It is also termed Intel’s second-generation Core processor line.
Figure 8.1 - Intel’s Tick-Tock development model (Based on [1])
1.1. Overview of the Sandy Bridge family
Figure 8.2
1.2. Overview of the Sandy Bridge based processor lines
Figure 8.3
1.3. Main functional units of Sandy Bridge [96]
Figure 8.4
2. 8.2 Major innovations of the Sandy Bridge line vs.
the 1. generation Nehalem line [61]
2.1. 8.2.1 Overview
• Extension of the ISA (of the cores) by the AVX instruction set (Section 8.2.2)
• New microarchitecture for the cores (Section 8.2.3)
• On-die ring interconnect bus (Section 8.2.4)
• On-die graphics unit (Section 8.2.5)
• Enhanced Turbo Boost technology (Section 8.2.6)
Figure 8.5
2.2. 8.2.2 Extension of the ISA (of the cores) by the AVX
instruction set
AVX (Advanced Vector Extensions) extends the 128-bit SIMD instruction set (SSE, introduced with the
Pentium III in 1999) to 256 bits, as follows:
2.3. 8.2.2 Extension of the ISA (of the cores) by the AVX
instruction set (Based on [18])
Figure 8.6
2.3.1. The AVX extension includes [97]:
Figure 8.7
Note
AVX doubled only the FP vector width, as indicated in the Figure below [97].
Figure 8.8
2.3.2. Implementation of AVX
To implement 256-bit FP operations, Intel did not widen the related data paths and FP execution units to 256 bits;
instead, the designers use two 128-bit data paths and two FP execution units at the same time, as indicated in the
previous Figure [98].
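The idea of serving one 256-bit operation with two 128-bit units can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `add_128` and `add_256` are hypothetical, single-precision (32-bit) elements are assumed, and the code models the data split only, not Intel's actual hardware.

```python
LANE_BITS = 128                              # width of one data path / FP unit
FLOAT_BITS = 32                              # assumed single-precision elements
FLOATS_PER_LANE = LANE_BITS // FLOAT_BITS    # 4 elements per 128-bit half


def add_128(a, b):
    """Model one 128-bit FP unit: element-wise add of 4 floats."""
    assert len(a) == len(b) == FLOATS_PER_LANE
    return [x + y for x, y in zip(a, b)]


def add_256(a, b):
    """Model a 256-bit add served by two 128-bit units working side by side."""
    assert len(a) == len(b) == 2 * FLOATS_PER_LANE
    low = add_128(a[:FLOATS_PER_LANE], b[:FLOATS_PER_LANE])    # lower half
    high = add_128(a[FLOATS_PER_LANE:], b[FLOATS_PER_LANE:])   # upper half
    return low + high


result = add_256([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [10.0] * 8)
```

In this model a 256-bit vector holds 8 single-precision values, and each half is processed by its own 128-bit unit in the same step.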
2.3.3. Subsequent evolution of AVX [97]
Figure 8.9
2.4. 8.2.3 New microarchitecture of the cores
Intel redesigned large parts of the microarchitecture of the cores, as indicated by yellow boxes in the Figure
below.
Figure 8.10 - Microarchitecture of the cores of Sandy Bridge [64]
We do not go into the details of the microarchitecture here, but refer to two very detailed descriptions [64],
[98].
2.5. 8.2.4 On die ring interconnect bus [66]
Figure 8.11
2.5.1. Main features of the on-die interconnect bus [64]
Figure 8.12
2.6. 8.2.5 On die graphics unit [99]
Figure 8.13 - Evolution of graphics implementation from Westmere to Sandy Bridge [99]
2.6.1. Support of both media and graphics processing by the graphics unit [99]
Figure 8.14
2.6.2. Main features of the on die graphics unit [99]
Figure 8.15
2.6.3. Specification data of the HD 2000 and HD 3000 graphics [100]
Figure 8.16
2.6.4. Performance comparison of the Sandy Bridge’s graphics: gaming [101]
Figure 8.17
2.7. 8.2.6 Enhanced Turbo Boost technology [64]
It is also designated as Turbo Boost 2.0 technology.
The concept utilizes the real temperature response of processors to power changes in order to increase the extent
of overclocking [64].
Figure 8.18
Based on the real temperature response, the thermal energy budget accumulated during idle periods can be
utilized to push the core beyond the TDP for short periods of time (e.g. for 20 sec).
Figure 8.19
Multiple algorithms manage current, power and die temperature in parallel [64].
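The notion of a thermal energy budget can be made concrete with a toy model. The sketch below is not Intel's algorithm: the function name `simulate_turbo` and all numbers (95 W TDP, 120 W turbo power, 20 W idle power, 500 J budget cap, one-second steps) are assumptions chosen only for illustration.

```python
def simulate_turbo(load_profile, tdp_w=95.0, turbo_w=120.0,
                   idle_w=20.0, budget_cap_j=500.0):
    """Return the power drawn in each one-second step (toy model).

    Running below TDP accumulates energy credit (up to a cap);
    running above TDP spends it. All parameter values are assumed.
    """
    budget_j = 0.0
    trace = []
    for busy in load_profile:                 # one boolean per second
        if not busy:
            power = idle_w                    # idle: credit accumulates
        elif budget_j >= turbo_w - tdp_w:
            power = turbo_w                   # enough credit: exceed the TDP
        else:
            power = tdp_w                     # credit used up: fall back to TDP
        budget_j = min(budget_cap_j, max(0.0, budget_j + (tdp_w - power)))
        trace.append(power)
    return trace


# 10 s idle followed by 30 s of full load
trace = simulate_turbo([False] * 10 + [True] * 30)
turbo_seconds = trace.count(120.0)
```

With these assumed values the accumulated credit sustains operation above the TDP for 20 one-second steps, after which the model falls back to the TDP.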
2.7.1. Intelligent power sharing between the cores and the integrated graphics
[64]
Figure 8.20
Figure 8.21
Remark
• Individual cores may run at different frequencies but all cores share the same power plane.
• Idle cores may be shut down individually by power gates.
3. 8.3 Example for a Sandy Bridge based desktop
platform with the H67 chipset [102]
Figure 8.22
4. 8.4 The E3-1200 UP server line [103]
Figure 8.23
Chapter 9 - The Sandy Bridge-E line
1. 9.1 Introduction
• It also belongs to the second-generation Core processor line.
• Introduced in 11/2011 as a “precursor” of the upcoming Sandy Bridge-EN/EP server lines, with two of the
8 cores of the Sandy Bridge-EN/EP design disabled.
• It targets high-performance desktops for enthusiast gamers.
• It provides 40 configurable PCIe lanes (PCIe 2. gen., see Section 9.2.4), which allow up to 4 graphics cards to be attached.
1.1. Overview of the Sandy Bridge-E based processor lines
Figure 9.1
Data based on [62], [63]
1.2. Comparison of die parameters of recent DT processors [77]
Sandy Bridge-E has 2x the die area of Sandy Bridge with 2.27 billion transistors, as the next Table indicates.
Figure 9.2
2. 9.2 Differences to the original Sandy Bridge line
2.1. 9.2.1 Overview
• Up to 6 cores, no integrated graphics (Section 9.2.2)
• Up to 15 MB shared L3 cache instead of 8 MB in the Sandy Bridge lines (Section 9.2.2)
• 4 DDR3 memory channels instead of the 2 available in the Sandy Bridge lines (Section 9.2.3)
• 40 PCIe 2. gen. lanes to connect multiple graphics cards to the processor (Section 9.2.4)
• LGA-2011 socket instead of the LGA-1155 used in the original Sandy Bridge lines (Section 9.2.5)
2.1.1. 9.2.2 6 cores, no integrated graphics
Two cores of the original Sandy Bridge-EN/EP design are disabled.
As the Sandy Bridge-E targets desktops with high-end discrete graphics, there is no need for integrated graphics;
for this reason the Sandy Bridge-E die does not include any [76].
Figure 9.3
Figure 9.4
2.1.1.1. Cache/memory latencies of the Sandy Bridge-E processors [77]
Figure 9.5
2.1.2. 9.2.3 4 parallel memory channels instead of 2 available in the Sandy
Bridge lines
The four memory channels are inherited from the Sandy Bridge-EN/EP server design; the original Sandy Bridge line provides only two.
DDR3 is supported at up to 1600 MT/s.
Each channel takes a single DDR3-1600 DIMM or two DDR3-1333 DIMMs [78].
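The effect of the channel count on the theoretical peak memory bandwidth can be worked out as follows. This is a sketch, not a measured figure: the helper name `peak_bandwidth_gbs` is hypothetical, and it assumes the standard 64-bit (8-byte) DDR3 channel width. The two-channel case is evaluated at the same 1600 MT/s rate purely to isolate the effect of the channel count.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s.

    channels * (MT/s) * bytes per transfer gives MB/s; divide by 1000 for GB/s.
    bus_bytes=8 assumes a 64-bit DDR3 channel.
    """
    return channels * mega_transfers_per_s * bus_bytes / 1000


quad_channel = peak_bandwidth_gbs(4, 1600)   # four channels of DDR3-1600
dual_channel = peak_bandwidth_gbs(2, 1600)   # two channels at the same rate
```

Under these assumptions four DDR3-1600 channels give 51.2 GB/s, twice the 25.6 GB/s of two channels at the same transfer rate.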
Figure 9.6
Note
There are 4 memory channels provided to support up to 4 graphics cards.
2.1.3. 9.2.4 40 PCIe 2. gen. lanes to connect multiple graphics cards to the
processor
This is an increase in the number of PCIe 2. gen. lanes compared to 16 lanes provided by the original Sandy
Bridge line [78].
Figure 9.7 - The Sandy Bridge-E platform with the X79 chipset [78]
2.1.3.1. Overview of providing PCIe lanes on Intel desktop processors
Figure 9.8
2.1.3.2. Lane configuration options - Sandy Bridge-E [104]
Figure 9.9
2.2. 9.2.5 LGA-2011 socket instead of the LGA-1155 used in the
original Sandy Bridge line
Due to the increased number of memory channels of the Sandy Bridge-E processor, its socket needs more
pins than that of the Sandy Bridge processor, which has only two memory channels and uses the LGA-1155 socket.
Figure 9.10
2.2.1. Main features of the Sandy Bridge-E line vs the Sandy Bridge line [77]
Figure 9.11
2.2.2. Example for a Sandy Bridge-E/X79 based 4-way SLI multi graphics card
configuration
(ASUS’s 4-Way SLI “Rampage IV Formula” motherboard with GTX 680 4-way ready graphics cards) [105]
Figure 9.12
Chapter 10 - The Sandy Bridge-EN/EP line
1. 10.1 Introduction
• Launched in 3/2012 (-EP lines) and in 5/2012 (-EN lines).
• Up to 8 cores, 20 MB L3.
• They target servers, as indicated below.
1.1. Overview of the Sandy Bridge-EN/EP lines
Figure 10.1
1.2. Improvements of the microarchitecture of the Sandy Bridge-EN/EP processors [107]
Figure 10.2
1.3. Die shot of the Xeon E5-2600 [107]
Figure 10.3
1.4. The interconnection ring connecting main units of the
processor [107]
Figure 10.4
2. 10.2 Main enhancements of the Sandy Bridge-EN
line over the previous Westmere-EP Xeon 5600 line
[108]
• Up to 8 cores vs. up to 6 cores
• Up to 20 MB L3 vs. 12 MB L3
• Additional AVX support
• 1600 MT/s memory speed vs. 1333 MT/s
• QPI 1.1 with 8 GT/s speed vs. QPI 1.0 with 6.4 GT/s
• but only 24 PCIe 3.0 lanes on the die instead of 36 PCIe 2.0 lanes on the chipset.
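The lane counts in the last item are easier to compare once they are converted to aggregate bandwidth. The sketch below uses the per-lane signalling rates and line encodings of PCIe 2.0 (5 GT/s, 8b/10b) and PCIe 3.0 (8 GT/s, 128b/130b); the table name `PCIE` and the function name `pcie_bandwidth_gbs` are hypothetical, and protocol overhead beyond the line encoding is ignored.

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per PCIe generation.
PCIE = {
    2: (5.0, 8 / 10),       # PCIe 2.0: 8b/10b encoding
    3: (8.0, 128 / 130),    # PCIe 3.0: 128b/130b encoding
}


def pcie_bandwidth_gbs(generation, lanes):
    """Aggregate payload bandwidth per direction in GB/s (encoding overhead only)."""
    rate_gt, efficiency = PCIE[generation]
    return lanes * rate_gt * efficiency / 8   # bits -> bytes


on_chipset = pcie_bandwidth_gbs(2, 36)   # 36 PCIe 2.0 lanes
on_die = pcie_bandwidth_gbs(3, 24)       # 24 PCIe 3.0 lanes
```

Under these assumptions the 24 PCIe 3.0 lanes provide roughly 23.6 GB/s per direction against 18 GB/s for the 36 PCIe 2.0 lanes, so the lower lane count does not mean lower aggregate bandwidth.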
3. 10.3 Main enhancements of the Sandy Bridge-EP
line over the Sandy Bridge-EN line [108]
• 4 DDR3 memory channels vs. 3
• Dual inter socket QPI links vs. 1
• 40 PCIe 3.0 lanes vs. 24
• Socket LGA 2011 vs. LGA 1356 (due to the additional memory channel and PCIe lanes).
3.1. Feature comparison Westmere-EP 5600, Sandy Bridge-EN
(E5-2400) and Sandy Bridge-EP (E5-2600) [108]
Figure 10.5
3.2. Comparison of the dual socket (DP) Sandy Bridge-EN and
Sandy Bridge-EP platforms [109]
Figure 10.6
Remark
By comparison, the original Sandy Bridge processor (2011) did not provide a QPI link, and had only dual DDR3
memory channels as well as 16 PCIe 2.0 lanes.
Figure 10.7 - The original Sandy Bridge processor [109]
3.3. The dual socket (DP) Xeon E5-2600 (Sandy Bridge-EP)
Romley platform [110]
Figure 10.8
3.4. The quad socket (MP) Xeon E5-2600 (Sandy Bridge-EP)
Romley platform [110]
Figure 10.9
4. 10.4 Main features of selected E5-EP models [111]
Figure 10.10
Remarks
• The 8-core model E5-2687W targets workstations.
• It has the highest clock frequency (3.1 GHz) among the 8-core models achieved by increased supply voltage
and TDP rating.
5. 10.5 More details on the Romley server platform
It supports both the Sandy Bridge-EN/EP and the subsequent Ivy Bridge-EN/EP lines [107].
Figure 10.11 - The Romley server platform [107]
5.1. The Patsburg (C600) chipset
Figure 10.12 - Intel's Patsburg chipset diagram [107]
6. 10.6 Performance comparison Sandy Bridge-EP vs.
Westmere-EP X5680 [112]
The Figure on the right compares the performance of the highest clocked Sandy Bridge-EP based 8-core Xeon
E5-2687W with the 6-core Westmere-EP based Xeon X5680.
Figure 10.13
6.1. Summary assessment of the performance comparison
As the Figure above shows, the Sandy Bridge-EP based 8-core Xeon E5-2687W provides on average about 21 %
more performance than the previous 6-core Westmere-EP based Xeon X5680.
It is interesting to note that this figure is approximately the same as the product of the ratios of the core counts
and clock rates, as indicated below.
Ratio of core counts (E5-2687W/X5680): 8/6 = 1.33
Ratio of clock speeds (E5-2687W/X5680): 3.1 GHz/3.33 GHz = 0.93
Product of both ratios: 1.33 x 0.93 = 1.24
This figure is close to the average performance boost of the E5-2687W vs. the X5680 of about 21 %.
Nevertheless, the utilization of the performance potential of more cores required a higher memory bandwidth,
which was achieved by moving from three to four memory channels and
increasing the memory transfer rate from 1.3 GT/s to 1.6 GT/s [107].
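The back-of-the-envelope estimate above can be reproduced in a few lines. The function name `throughput_scaling` is hypothetical, and the model deliberately assumes that performance scales linearly with both core count and clock rate, which real workloads only approximate.

```python
def throughput_scaling(cores_new, cores_old, ghz_new, ghz_old):
    """First-order estimate: (ratio of core counts) x (ratio of clock rates)."""
    return (cores_new / cores_old) * (ghz_new / ghz_old)


# Xeon E5-2687W (8 cores, 3.1 GHz) vs. Xeon X5680 (6 cores, 3.33 GHz)
estimate = throughput_scaling(8, 6, 3.1, 3.33)
measured = 1.21    # average boost of about 21 % quoted in the text
gap = estimate - measured
```

The estimate evaluates to about 1.24, a few percentage points above the quoted average of about 21 %.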
6.2. Historical increase of the integer performance of 2 Socket
(2S) configurations [113]
Figure 10.14
7. 10.7 Intel’s Xeon E5 family server roadmap [114]
Figure 10.15
Chapter 11 - The Ivy Bridge line
1. 11.1 Introduction
Ivy Bridge processors are also termed third-generation Intel Core processors.
Figure 11.1
Launched: 4/2012
1.1. Overview of the Ivy Bridge family-1
Figure 11.2
1.2. Overview of the Ivy Bridge family-2
Figure 11.3
1.3. Contrasting the Sandy Bridge and Ivy Bridge dies [81]
Figure 11.4
As the pictures show, Intel puts much more emphasis on graphics in the Ivy Bridge die.
1.4. Main implementation parameters of recent processors [81]
Figure 11.5
1.5. Overview of the Ivy Bridge based processor lines
Figure 11.6
2. 11.2 Major innovations of Ivy Bridge [80]
2.1. 11.2.1 Overview
Figure 11.7
2.2. 11.2.2 The 22 nm tri-gate process technology within Intel’s
technology roadmap [82]
Figure 11.8
2.2.1. The traditional planar transistor [82]
Figure 11.9
2.2.2. The 22 nm Tri-Gate transistor-1 [82]
Figure 11.10
2.2.3. The 22 nm Tri-Gate transistor-2 [82]
Figure 11.11
2.2.4. Transistor characteristics [82]
Figure 11.12
2.2.5. Transistor gate delay [82]
Figure 11.13
2.2.6. Intel’s 22 nm manufacturing fabs [82]
Figure 11.14
2.2.7. Ivy Bridge chips on a 300 mm wafer [82]
Figure 11.15
2.3. 11.2.3 Supervisory Mode Execute Protection [83]
Figure 11.16
2.4. 11.2.4 Next generation processor graphics and media [81]
Figure 11.17
2.4.1. Overview of video interfaces of computing devices to external displays
Figure 11.18
3. 11.3 Main features of Ivy Bridge-based first
introduced processors
3.1. 11.3.1 Main features of the first introduced Ivy Bridge-based
desktop models [116]
Figure 11.19
3.2. 11.3.2 Main features of the first introduced Ivy Bridge-based
mobile models [116]
Figure 11.20
4. 11.4 Ivy Bridge-based desktop platform [81]
Figure 11.21
5. 11.5 Performance assessment of the desktop
models
5.1. 11.5.1 CPU performance of the highest clocked Ivy Bridge
model Core i7-3770K [81] (Higher is better)
Figure 11.22
5.2. 11.5.2 Relative GPU performance (with games
DX9/DX10/DX11) of the highest performance Ivy Bridge DT model
Core i7-3770K (Resolutions 1440x900, 1680x1050) [81] (Higher is
better)
Figure 11.23
5.2.1. Increasing performance of Intel’s integrated graphics [117]
Figure 11.24
6. 11.6 Main features of first introduced Ivy Bridge-based Xeon E3-12xx v2 models [118]
Figure 11.25
Remark
The Xeon E3-12xx v2 models closely resemble the Ivy Bridge-based i5/i7-3xxx desktop models.
7. 11.7 The Ivy Bridge-based Xeon E3-1200 v2
platform (called the Bromolow refresh server
platform) [119]
Figure 11.26
Chapter 12 - The Ivy Bridge-E line
1. 12.1 Introduction
• It also belongs to the third-generation Core processor family.
• Introduced in 9/2013, one week before Intel’s IDF Fall 2013.
• It targets high-performance desktops for hardcore gamers and graphics enthusiasts.
• It provides 40 configurable PCIe 3.0 lanes, which allow up to 4 graphics cards to be attached.
1.1. Overview of the Ivy Bridge-E based processor lines
Figure 12.1
Data based on [131]
2. 12.2 Differences to the previous Sandy Bridge-E
line [132]
The Ivy Bridge-E line provides basically the same features as the previous Sandy Bridge-E line,
such as
• Up to 6 cores, no integrated graphics
• Up to 15 MB shared L3 cache
• LGA-2011 socket.
On the other hand, it provides the following main enhancements vs. the previous Sandy Bridge-E line:
• 4 parallel DDR3 memory channels with up to 1866 MT/s rather than up to 1600 MT/s,
• 40 PCIe 3. gen. lanes to connect up to 4 graphics cards to the processor rather than 40 PCIe 2. gen. lanes, as
indicated in the next Figure.
2.1. Overview of providing PCIe lanes on Intel desktop
processors
Figure 12.2
2.2. Die plot of an Ivy Bridge-E processor [133]
Figure 12.3
2.3. Main features of Ivy Bridge-E models [131]
Figure 12.4
3. 12.3 Example for an Ivy Bridge-E based desktop
platform with the X79 chipset [134]
Figure 12.5
4. 12.4 Performance increase achieved by the Ivy
Bridge-E line vs. the previous Sandy Bridge-E line
[135]
Figure 12.6
Chapter 13 - The Ivy Bridge-EN/EP lines
1. 13.1 Introduction
• Launched in 9/2013 (only the Xeon E5-2600 v2 line, i.e. the Ivy Bridge-EP line).
• Further models of the EP line and the EN line are expected to be launched in 2014.
• Up to 12 cores, 30 MB L3.
• They target servers, as indicated below.
1.1. Overview of the Ivy Bridge-EN/EP lines
Figure 13.1
1.2. Die layouts [137]
Figure 13.2
1.3. Die shot of the ten-core Ivy Bridge-EP processor [138]
Figure 13.3
2. 13.2 Main enhancements of the Ivy Bridge-EP-based Xeon E5-2600 v2 line vs. the Sandy Bridge-EP-based Xeon E5-2600 line [138]
• Up to 12 cores vs. up to 8 cores
• Up to 30 MB L3 vs. 20 MB L3
• 1866 MT/s DDR3 memory speed vs. 1600 MT/s.
Some more improvements are indicated in the next Table and discussed in [139].
2.1. Comparison of main features of the Ivy Bridge-EP-based
Xeon E5-2600 v2 line vs. the Sandy Bridge-EP-based Xeon E5-2600 line [138]
Figure 13.4
3. 13.3 Main features of specific models of the Xeon
E5-2600 v2 series [139]
Figure 13.5
4. 13.4 Main features of specific models of the Xeon
E5-1600 v2 series [140]
Figure 13.6
• Xeon E5-1600 v2 models are targeting single-socket workstations.
• Their specifications are very similar to those of the Core i7 Extreme parts, which were launched at about the same
time (9/2013).
5. 13.5 The Romley server platform [138]
As indicated before, it is designed to support the Ivy Bridge-EP/EN parts in addition to the Sandy Bridge parts.
Figure 13.7
5.1. Intel Xeon E5 family server roadmap [136]
Figure 13.8
Chapter 14 - The Ivy Bridge-EX line
1. 14.1 Introduction
• Ivy Bridge-EX models are expected to arrive in 2H2013.
• They are intended for UP/DP or MP servers.
• Ivy Bridge-EX processors will be part of the Brickland platform that is intended to span three generations
including Ivy Bridge-EX, Haswell-EX and Broadwell-EX [141].
1.1. Ivy Bridge-EX
Figure 14.1
2. 14.2 Main features of the Ivy Bridge-EX line [142]
• Up to 15 cores
• Up to 37.5 MB L3 cache
• C602J chipset
• Up to 4 C102/C104 scalable memory buffers per socket
• Each memory buffer will support up to 3 DDR3-1600 DIMMs
• Maximum number of DIMMs per socket: 24
• 3 QPI links
• Up to 32 PCIe lanes.
Chapter 15 - The Haswell line
1. 15.1 Introduction
Haswell processors are also termed fourth-generation Intel Core processors, as indicated below.
Figure 15.1 - Intel’s Tick-Tock development model (Based on [1])
Launched: 6/2013 at Computex.
1.1. Overview of Haswell-based processor lines (Based on [120])
Figure 15.2
1.2. Die plot of a Haswell processor [121]
Figure 15.3
1.3. Sub-families of Haswell [144]
Figure 15.4
2. 15.2 Key enhancements of the Haswell cores [80]
Figure 15.5
2.1. Buffer sizes of subsequent generations of Core processors
[80]
Figure 15.6
2.2. Cache sizes, latencies and bandwidth values of subsequent
Core generations [122]
Figure 15.7
2.3. Issue rate and execution unit enhancements of Haswell [80]
Figure 15.8
3. 15.3 ISA enhancements of the Haswell cores [80]
Figure 15.9
3.1. Evolution of the AVX ISA extension [97]
Figure 15.10
3.2. Enhancements of AVX2 [97]
Figure 15.11
3.3. FMA and peak FLOPs of Haswell [97]
Figure 15.12
4. 15.4 Main innovations of the Haswell processor
4.1. 15.4.1 Overview
• Enhanced graphics (Section 15.4.2)
• On-package e-DRAM cache (Section 15.4.3)
4.2. 15.4.2 Enhanced graphics
Figure 15.13
• To compete with AMD’s advanced graphics solutions, Intel put great emphasis on enhancing Haswell’s
integrated graphics.
• The new graphics units are termed Iris Pro and Iris graphics.
4.2.1. Main enhancements of the Iris Pro and Iris graphics units [123]
Figure 15.14
4.2.2. Performance boost provided by the Iris Pro/Iris graphics vs. the previous
generation [123]
Figure 15.15
4.2.3. Graphics performance increase of subsequent Core generations [117]
Figure 15.16
4.3. 15.4.3 On-package eDRAM cache [117]
Figure 15.17
4.3.1. Principle of operation [117]
• The on-package eDRAM, also designated as Crystalwell, operates as a true 4th-level cache of the memory
hierarchy.
• It acts as a victim buffer to the L3 cache, in the sense that anything evicted from the L3 cache immediately
goes into the L4 cache.
• Both CPU and GPU requests are cached.
• The cache partitioning between CPU and GPU is dynamic.
• If the GPU is not in use, the whole L4 cache may be devoted to the CPU; in this case the CPU has a 128 MB L4
cache.
• The access latency after an L3 miss is 30-32 ns.
• The L4 cache is capable of delivering 50 GB/s in each direction.
• The Crystalwell die consumes between 0.5 and 1.0 W if idle and between 3.5 and 4.5 W under full load.
• The PCU (Power Control Unit) of the processor also takes over the power management of the eDRAM, in addition to
the power management of the CPU cores, GPU, L3 cache etc.
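The victim-buffer behaviour described above can be sketched with a small two-level model. This is an illustrative toy, not Intel's implementation: the class name `VictimCacheHierarchy`, the LRU replacement policy, the tiny capacities and the choice to remove a line from the L4 when it is promoted back into the L3 are all assumptions.

```python
from collections import OrderedDict


class VictimCacheHierarchy:
    """Toy L3 + L4 model in which the L4 is filled only by L3 evictions."""

    def __init__(self, l3_lines, l4_lines):
        self.l3 = OrderedDict()      # oldest entry = least recently used
        self.l4 = OrderedDict()
        self.l3_lines = l3_lines
        self.l4_lines = l4_lines

    def _insert_l3(self, addr):
        self.l3[addr] = True
        if len(self.l3) > self.l3_lines:
            victim, _ = self.l3.popitem(last=False)   # evict the LRU line from L3
            self.l4[victim] = True                    # the victim goes into L4
            if len(self.l4) > self.l4_lines:
                self.l4.popitem(last=False)           # L4 full: drop its LRU line

    def access(self, addr):
        """Return the level that served the request: 'L3', 'L4' or 'DRAM'."""
        if addr in self.l3:
            self.l3.move_to_end(addr)                 # refresh LRU position
            return "L3"
        if addr in self.l4:
            del self.l4[addr]                         # promote the line back to L3
            self._insert_l3(addr)
            return "L4"
        self._insert_l3(addr)                         # miss everywhere: fetch from DRAM
        return "DRAM"


hierarchy = VictimCacheHierarchy(l3_lines=2, l4_lines=4)
served = [hierarchy.access(a) for a in (1, 2, 3, 1, 2, 2)]
```

In this run the first three accesses come from DRAM, lines 1 and 2 are then found in the L4 because they had been evicted from the two-line L3, and the final repeated access to line 2 hits the L3.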
4.3.2. Implemented on-package eDRAM [124]
Figure 15.18
4.3.3. Memory latency vs. access range in a memory system with eDRAM
cache (L4) [117]
Figure 15.19
5. 15.5 Main features of the Haswell line of mobile and
desktop processors
5.1. 15.5.1 Example 1: Main features of Haswell-based mobile
Core i7 M-Series processors [125]
Figure 15.20
5.2. 15.5.2 Example 2: Main features of Haswell-based Core i7
desktop processors [126]
Figure 15.21
6. 15.6 Haswell-based desktop platform [145]
Figure 15.22
7. 15.7 Integer and FP performance of subsequent
generations of Core processors [127]
Figure 15.23
Note
The AVX2 extension of Haswell processors introduces 256-bit integer (FX) operations; due to this fact, Haswell
processors have a considerably higher integer performance than the previous Core generations.
8. 15.8 Graphics performance of subsequent
generations of Core processors [127]
Figure 15.24
9. 15.9 Main features of Haswell-based Xeon E3-12xx
v3 line of server processors [128]
Figure 15.25
9.1. Main features of subsequent generations of E3-1200 Xeon
processors [129]
Figure 15.26
10. 15.10 Haswell-based Xeon E3-1200 v3 server
platform [130]
Figure 15.27
Chapter 16 - The Haswell-E line
1. 16.1 Introduction
• It belongs to the fourth-generation Core processor family.
• To be introduced in 2014.
• It targets high-performance desktops for hardcore gamers and graphics enthusiasts.
• It provides 40 configurable PCIe 3.0 lanes, which allow up to 4 graphics cards to be attached.
2. 16.2 Differences to the previous Ivy Bridge-E line
[143]
2.1. 16.2.1 Overview
Figure 16.1
2.2. 16.2.2 The Haswell-E processor [143]
Figure 16.2
2.3. 16.2.3 The Wellsburg-X PCH [143]
Figure 16.3
2.4. 16.2.4 DDR4 memory [143]
Figure 16.4
2.5. 16.2.4 DDR4 memory [143]
Figure 16.5
Chapter 17 - References
[2.1] Singhal R., “Next Generation Intel Microarchitecture (Nehalem) Family: Architecture Insight and Power Management,” IDF Taipei, Oct. 2008,
http://intel.wingateweb.com/taiwan08/published/sessions/TPTS001/FA08%20IDFTaipei_TPTS001_100.pdf
[2.2] Bryant D., “Intel Hitting on All Cylinders,” UBS Conf., Nov. 2007,
http://files.shareholder.com/downloads/INTC/0x0x191011/e2b3bcc5-0a37-4d06-aa5a0c46e8a1a76d/UBSConfNov2007Bryant.pdf
[2.3] Fisher S., “Technical Overview of the 45 nm Next Generation Intel Core Microarchitecture (Penryn),” IDF
2007, ITPS001, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf
[2.4] De Gelas J., “Intel Core versus AMD’s K8 architecture,” AnandTech, May 1 2006,
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=2748&p=1
[2.5] Carmean D., “Inside the Pentium 4 Processor Micro-architecture,” Aug. 2000,
http://people.virginia.edu/~zl4j/CS854/pda_s01_cd.pdf
[2.6] Shimpi A. L. & Clark J., “AMD Opteron 248 vs. Intel Xeon 2.8: 2-way Web Servers go Head to Head,”
AnandTech, Dec. 17 2003, http://www.anandtech.com/showdoc.aspx?i=1935&p=1
[2.7] Völkel F., “Duel of the Titans: Opteron vs. Xeon : Hammer Time: AMD On The Attack,” Tom’s
hardware, Apr. 22 2003, http://www.tomshardware.com/reviews/duel-titans,620.html
[2.8] De Gelas J., “Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1: Server CPU Shoot-out,”
AnandTech, June 17 2006, http://www.anandtech.com/IT/showdoc.aspx?i=2772&p=1
[2.9] Hinton G. & al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, Q1 2001,
pp. 1-13
[2.10] Wechsler O., “Inside Intel Core Microarchitecture,” White Paper, Intel, 2006
[2.11] Lee V., “Inside the Intel Core Microarchitecture,” IDF Shenzhen, May 2006,
http://www.prcidf.com.cn/sz/systems_conf/track_sz/SMC/Intel%20Core%20uArch.pdf
[2.12] Doweck J., “Inside Intel Core Microarchitecture,” Hot Chips 18, 2006,
http://www.hotchips.org/archives/hc18/
[2.13] Gruen H., “Intel’s new Core Microarchitecture,” Develop Brighton, AMD Technical Day, July 2006,
http://ati.amd.com/developer/brighton/03%20Intel%20MicroArchitecture.pdf
[2.14] Doweck J., “Intel Smart Memory access: Minimizing Latency on Intel Core Microarchitecture,” Technology@Intel Magazine, Sept. 2006, pp. 1-7,
ftp://download.intel.com/corporate/pressroom/emea/deu/fotos/06-10Strategie_Tag/Intel/Intel_Core2_Prozessoren/Texte/ENGSmart_Memory_Access_Technology@Intel_Magazine_Article.pdf
[2.15] Sima D., Fountain T., Kacsuk P., Advanced Computer Architectures, Addison Wesley, Harlow etc., 1997
[2.16] Jafarjead B., “Intel Core Duo Processor,” Intel, 2006,
http://masih0111.persiangig.com/document/peresentation/behrooz%20jafarnejad.ppt
[2.17] Pawlowski S. & Wechsler O., “Intel Core Microarchitecture,” IDF Spring, 2006,
http://www.intel.com/pressroom/kits/core2duo/pdf/ICM_tech_overview.pdf
[2.18] Goto H., “Larrabee architecture can be integrated into CPU,” PC Watch, Oct. 6 2008,
http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[2.19] SIMD Instruction Sets, http://softpixel.com/~cwright/programming/simd/index.php
[2.20] Wikipedia: Platform Environment Control Interface,
http://en.wikipedia.org/wiki/Platform_Environment_Control_Interface
[2.21] Kim N. S. et al., „Leakage Current: Moore’s Law Meets Static Power”, Computer, Dec. 2003, pp. 68-75.
[2.22] Ng P. K., “High End Desktop Platform Design Overview for the Next Generation Intel Microarchitecture (Nehalem) Processor,” IDF Taipei, TDPS001, 2008,
http://intel.wingateweb.com/taiwan08/published/sessions/TDPS001/FA08%20IDFTaipei_TDPS001_100.pdf
[2.23] Bohr M., Mistry K., Smith S., “Intel Demonstrates High-k + Metal Gate Transistor Breakthrough in 45 nm Microprocessors,” Intel, Jan. 2007,
http://download.intel.com/pressroom/kits/45nm/Press45nm107_FINAL.pdf
[2.24] Scott D. S., “Toward Petascale and Beyond,” APAC Conference, Oct. 2007,
http://www.apac.edu.au/apac07/pages/program/presentations/Tuesday%20Harbour%20A%20B/David_Scott.pdf
[2.25] Smith S. L., “45 nm Product Press Briefing,” IDF Fall, 2007,
http://download.intel.com/pressroom/kits/events/idffall_2007/BriefingSmith45nm.pdf
[2.26] Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture (Penryn),”
IPTS001, Fall IDF 2007, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf
[2.27] George V., “45nm Next Generation Intel Core Microarchitecture (Penryn),” Hot Chips 19, 2007,
http://www.hotchips.org/archives/hc19/3_Tues/HC19.08/HC19.08.01.pdf
[2.28] Wikipedia: Foxton Technology, http://en.wikipedia.org/wiki/Foxton_Technology
[2.29] Coke J. et al., “Improvements in the Intel Core Penryn Processor Family Architecture and
Microarchitecture,” Intel Technology Journal, Vol. 12, No. 3, 2008, pp. 179-192
[2.30] Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture (Penryn),” BMA S004, IDF 2007,
http://my.ocworkbench.com/bbs/attachment.php?attachmentid=318&d=1176911500
[2.31] Gelsinger P. P., “Intel Architecture,” Press Briefing, March 2008,
http://www.slideshare.net/angsikod/gelsinger-briefing-on-intel-architecture
[2.32] Gelsinger P., “Invent the new reality,” IDF Fall 2008, San Francisco,
http://download.intel.com/pressroom/kits/events/idffall_2008/PatGelsinger_day1.pdf
[2.34] Intel Timna Microprocessor Family, CPU World, http://www.cpu-world.com/CPUs/Timna/
[2.35] Smith T., “Timna - Intel's first system-on-a-chip. Before 'Tolapai', before 'Banias',” Register Hardware,
Febr. 6 2007, http://www.reghardware.co.uk/2007/02/06/forgotten_tech_intel_timna/
[2.36] Images, Xtreview, http://xtreview.com/images/K10%20processor%2045nm%20architec%203.jpg
[2.37] Singhal R., Intel’s i7, Podtech, http://www.podtech.net/home/5436/intels-core-i7
[2.38] Strong B., “A Look Inside Intel: The Core (Nehalem) Microarchitecture,”
http://www.cs.utexas.edu/users/cart/arch/beeman.ppt
[2.39] Valles A. C., Ansari Z., Mehrotra P.: “Tuning Your Software for the Next Generation Intel Microarchitecture (Nehalem) Family,” IDF 2008, NGMS002,
http://www.benchmark.rs/tests/editorial/Nehalem_munich/presentations/SoftwareTuning_for_Nehalem.pdf
[2.40] The low cost Timna CPU, Tom’s Hardware, Febr. 25 2000, http://www.tomshardware.com/reviews/idf2000,166-3.html
[2.41] Intel® Virtualization Technology, Special issue, Vol. 10 No. 03, Aug. 2006,
http://www.intel.com/technology/itj/2006/v10i3/
[2.42] Intel Core 2 Duo Processor E8000 and E7000 Series Datasheet, Intel, Jan. 2009
[2.43] Wikipedia: List of Intel Core 2 microprocessors,
http://en.wikipedia.org/wiki/List_of_Intel_Core_2_microprocessors
[2.44] Wikipedia: Nehalem (microarchitecture) http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
[2.45] Glaskowsky P.: Investigating Intel's Lynnfield mysteries, cnet News, Sept. 21 2009,
http://news.cnet.com/8301-13512_3-10357328-23.html
[2.46] Shimpi A. L.: Intel's Core i7 870 & i5 750, Lynnfield: Harder, Better, Faster Stronger, AnandTech, Sept.
8 2009, http://www.anandtech.com/show/2832
[2.47] Intel Xeon Processor C5500/C3500 Series, Datasheet – Volume 1, Febr. 2010,
http://download.intel.com/embedded/processor/datasheet/323103.pdf
[2.48] Intel Core™ i7-800 and i5-700 Desktop Processor Series Datasheet – Volume 1, July 2010,
http://download.intel.com/design/processor/datashts/322164.pdf
[2.49] Glaskowsky P.: Intel's Lynnfield mysteries solved, cnet News, Sept. 28 2009, http://news.cnet.com/830113512_3-10362512-23.html
[2.50] Intel Core™ i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and i7-700 Mobile Processor Series, Datasheet – Volume One, Sept. 2009,
http://download.intel.com/design/processor/datashts/320765.pdf
[2.51] Intel Turbo Boost Technology in Intel Core™ Microarchitecture (Nehalem) Based Processors, White
Paper, Nov. 2008 http://download.intel.com/design/processor/applnots/320354.pdf
[2.52] Power Management in Intel Architecture Servers, White Paper, April 2009,
http://download.intel.com/support/motherboards/server/sb/power_management_of_intel_architecture_servers.pdf
[2.53] Glaskowsky P.: Explaining Intel’s Turbo Boost technology, cnet News, Sept. 28 2009,
http://news.cnet.com/8301-13512_3-10362882-23.html
[2.54] Intel Xeon Processor 7500 Series, Datasheet – Volume 2, March 2010,
http://www.intel.com/Assets/PDF/datasheet/323341.pdf
[2.55] Pawlowski S.: Intelligent and Expandable High-End Intel Server Platform, Codenamed Nehalem-EX,
IDF 2009
[2.56] Kottapalli S., Baxter J.: Nehalem-EX CPU Architecture, Hot Chips 2009, Sept. 10 2009
http://www.hotchips.org/archives/hc21/2_mon/HC21.24.100.ServerSystemsI-Epub/HC21.24.122Kottapalli-Intel-NHM-EX.pdf
[2.57] Kurd N. A. et al.: A Family of 32 nm IA Processors, IEEE Journal of Solid-State Circuits, Vol. 46, Issue
1, Jan. 2011, pp. 119-130
[2.58] Hill D., Chowdhury M.: Westmere Xeon-56xx “Tick” CPU, Hot Chips 2010,
http://www.hotchips.org/uploads/archive22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf
[2.59] Intel Core™ i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series, Datasheet - Vol.1, Jan. 2010,
http://download.intel.com/design/processor/datashts/322812.pdf
[2.60] Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
[2.61] Kahn O., Piazza T., Valentine B.: Technology Insight: Intel Next Generation Microarchitecture
Codename Sandy Bridge, IDF 2010 extreme.pcgameshardware.de/.../281270d1288260884bonusmaterial-pc- games-hardware-12-2010-sf10_spcs001_100.pdf
[2.62] Wikipedia: Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge
[2.63] http://ark.intel.com
[2.64] Kahn O., Valentine B.: Intel Next Generation Microarchitecture Codename Sandy Bridge: New Processor
Innovations, IDF 2010
[2.65] Shimpi A. L.: Intel Pentium 4 1.4GHz & 1.5GHz, AnandTech, Nov. 20 2000,
http://www.anandtech.com/show/661/5
[2.66] Yuffe M., Knoll E., Mehalel M., Shor J., Kurts T.: A fully integrated multi-CPU, GPU and memory
controller 32nm processor, ISSCC, Febr. 20-24 2011, pp. 264-266
[2.67] Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, Data Center Group, March 30. 2010,
http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf
[2.68] Tang H., Cheng H.: Intel Xeon Processor E3 Family Based Servers: A Smart Investment for Managing
Your Small Business, IDF 2011
[2.69] Thomadakis M. E. PhD: The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms, Texas A&M University, March 17 2011,
http://alphamike.tamu.edu/web_home/papers/perf_nehalem.pdf
[2.70] Intel Xeon Processor E7-8800/4800/2800 Product Families, Datasheet Vol. 1 of 2, April 2011,
http://www.intel.com/Assets/PDF/datasheet/325119.pdf
[2.71] Mitchell D., Intel Nehalem-EX review, PCPro, http://www.pcpro.co.uk/reviews/processors/357709/intelnehalem-ex
[2.72] Rusu et al.: A 45 nm 8-Core Enterprise Xeon Processor, IEEE Journal of Solid-State Circuits, Vol. 45,
No. 1, Jan. 2010, pp. 7-14
[2.73] Wechsler O., Multicore Architecture Challenges and Opportunities, Intel’s Annual Symposium on VLSI CAD and Validation, Keynote presentation, July 2007,
http://www3.ee.technion.ac.il/sites/Workshop/upload/Events/ACRC%20-%20Intel/OfriWechsler%20Keynotes%20Speech.pdf
[2.74] Razin A., Core, Nehalem, Gesher. Intel: New Architecture Every Two Years, Xbit Laboratories, April 28
2006, http://www.xbitlabs.com/news/cpu/display/20060428162855.html
[2.75] Har-Even B., Justin Rattner Keynote, Trusted Reviews, April 17 2007,
http://www.trustedreviews.com/news/IDF-Spring-2007-Day-One-Opening-Keynote
[2.76] Weatherstone R., Intel Core i7-3960X Extreme Edition Processor (Sandy Bridge-E) Review, Nov. 14 2011, Vortez,
http://www.vortez.net/articles_pages/intel_core_i7_3960x_extreme_edition_processor_sandy_bridge_e,2.html
[2.77] Shimpi A. L., Intel Core i7 3960X (Sandy Bridge E) Review: Keeping the High End Alive, Nov. 14
2011, AnandTech, http://www.anandtech.com/show/5091/intel-core-i7-3960x-sandy-bridge-e-reviewkeeping-the-high-end-alive/4
[2.78] Intel® X79 Express Chipset,
http://www.intel.com/content/www/us/en/chipsets/performance-chipsets/x79-express-chipset.html
[2.79] Intel® Desktop Board DX79SI Extreme Series, 2011, http://www.intel.com/content/dam/doc/productbrief/desktop-board-dx79si-extreme-brief.pdf
[2.80] Chappell R., Toll B., Singhal R.: Intel Next Generation Microarchitecture Codename Haswell: New
Processor Innovations, IDF 2012
[2.81] Olivera, A régóta várt Intel Ivy Bridge tesztje [Test of the long-awaited Intel Ivy Bridge], Prohardware, April 13 2012,
http://prohardver.hu/teszt/intel_ivy_bridge_teszt/az_ivy_bridge.html
[2.82] Bohr M., Mistry K.: Intel’s Revolutionary 22 nm transistor technology, May 2011,
http://download.intel.com/newsroom/kits/22nm/pdfs/22nm-Details_Presentation.pdf
[2.83] George V., Piazza T., Jiang H.: Technology Insight: Intel Next Generation Microarchitecture Codename
Ivy Bridge, IDF 2011
[2.84] 3rd Generation Intel Core Processor Family Quad Core Launch Product Information, April 23 2012,
http://download.intel.com/newsroom/kits/core/3rdgen/pdfs/3rd_Generation_Intel_Core_Product_Information.pdf
[2.85] Ivy Bridge and Haswell die configurations (estimates included), Anandtech, March 21 2012,
http://forums.anandtech.com/showthread.php?t=2234017
[2.86] Piazza T., Jiang H., Hammarlund P., Singhal R.: Technology Insight: Intel Next Generation
Microarchitecture Codename Haswell, IDF 2012 SPCS001
[2.87] Haynes D.: 2012 Socket Guide, Aug. 4 2012, http://www.ocmodshop.com/cpu-socket-guide2012/lga2011/
[2.88] O’Shea P., Thermal management solutions include support for Intel platform environmental control
interface, EE Times, Oct. 17 2006, http://www.eetimes.com/document.asp?doc_id=1301982
[2.89] Altavilla D., Intel Arrandale Core i5 and Core i3 Mobile Unveiled, Hot Hardware, Jan. 4 2010,
http://hothardware.com/Reviews/Intel-Arrandale-Core-i5-and-Core-i3-Mobile-Unveiled/
[2.90] Shimpi A. L., The Intel Core i3 530 Review - Great for Overclockers & Gamers, AnandTech, Jan. 22
2010, http://www.anandtech.com/show/2921
[2.91] Chiappetta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware, Jan. 3 2010,
http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/
[2.92] Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem) Based Platform,
Presentation ARCS001, IDF 2009
[2.93] Kirsch N., Intel Core i7-980X Six-Core Processor Extreme Edition Review, Legit Reviews, March 11 2010,
http://www.legitreviews.com/intel-core-i7-980x-six-core-processor-extreme-editionreview_1245
[2.94] Kanter D., Westmere Arrives, Real World Tech, March 17 2010,
http://www.realworldtech.com/westmere/2/
[2.95] Hruska J., Intel Unveils 10-Core Xeons, Mission-Critical Servers, Hot Hardware, Apr. 5 2011,
http://hothardware.com/Reviews/Intel-Unveils-10Core-Xeons-MissionCritical-Servers/
[2.96] Intel Sandy Bridge Review, Bit-tech, Jan. 3 2011,
http://www.bit-tech.net/hardware/cpus/2011/01/03/intel-sandy-bridge-review/1
[2.97] Toll B., Locktyukhin M., Girkar M., Intel Advanced Vector Extensions 2 and Bit Manipulation New
Instructions, IDF 2012
[2.98] Torres G., Inside the Intel Sandy Bridge Microarchitecture, Hardware Secrets, Dec. 30 2010,
http://www.hardwaresecrets.com/article/Inside-the-Intel-Sandy-Bridge-Microarchitecture/1161/2
[2.99] Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics, Presentation
ARCS002, IDF San Francisco, Sept. 2010
[2.100] Wikipedia: Intel GMA, 2011, http://en.wikipedia.org/wiki/Intel_GMA
[2.101] Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100 Tested,
AnandTech, Jan. 3 2011, http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7600k-i5-2500k-core-i3-2100-tested/11
[2.102] Intel® Q67 Express Chipset,
http://www.intel.com/content/www/us/en/chipsets/mainstream-chipsets/q67-express-chipset.html
[2.103] Rodolf K., Small and Medium Business Server Solutions from Intel, Sept. 20 2012
[2.104] Intel Core i7 Processor Family for the LGA-2011 Socket, Datasheet, Vol.1, Nov. 2012,
http://www.intel.com/content/www/us/en/processors/core/core-i7-lga-2011-datasheet-vol-1.html
[2.105] Crijns K., nVidia GeForce GTX 680 Quad-SLI review, Hardware.info, March 23 2012,
http://nl.hardware.info/reviews/2641/nvidia-geforce-gtx-680-quad-sli-review-english-version
[2.106] Wikipedia: List of Intel Xeon microprocessors,
http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors#Sandy_Bridge-based_Xeons
[2.107] Morgan T.P., Intel plugs both your sockets with 'Jaketown' Xeon E5-2600s, The Register, March 6
2012, http://www.theregister.co.uk/2012/03/06/intel_xeon_2600_server_chip_launch/
[2.108] Intel Xeon E5-2690 001, CPU, Tom’s Hardware, http://www.tomshardware.com/gallery/Intel-Xeon-E52690-001,0101-320460-0-2-3-1-jpg-.html
[2.109] System Architecture and Configuration, Oct. 2012,
http://www.qdpma.com/systemarchitecture/SystemArchitecture_2012Q4.html
[2.110] Intel Xeon Processor E5-1600/E5-2600/E5-4600 Product Families, Datasheet, Vol.1, May 2012,
http://www.intel.la/content/www/xl/es/processors/xeon/xeon-e5-1600-2600-vol-1-datasheet.html
[2.111] Bolaria J., Romley Revamps Server Platforms, The Linley Group, April 2 2012,
http://www.linleygroup.com/mpr/article.php?id=10879
[2.112] Angelini C., Intel Xeon E5-2600: Doing Damage With Two Eight-Core CPUs, Tom’s Hardware, March
6 2012, http://www.tomshardware.com/reviews/xeon-e5-2687w-benchmark-review,3149-10.html
[2.113] Ulvr P., Fait F., Built to Scale: Introducing the Intel Xeon Processor E5 Family, 2012, http://www05.ibm.com/cz/events/truck2012/bp/pdf/system_x_cast/Intel_Xeon_E5_obchodni.pdf
[2.114] Panagiotidis N.G., Intel Xeon: the Enabling Technology to the Performing, Flexible & Energy-efficient Cloud Infrastructure, April 23 2013,
http://www.cisco.com/web/GR/connect2013/pdfs/007_intel_nikos_panagiotidis.pdf
[2.115] Wikipedia: Ivy Bridge (microarchitecture),
http://en.wikipedia.org/wiki/Ivy_Bridge_%28microarchitecture%29
[2.116] Vilches J., Intel Ivy Bridge: everything you need to know, TechSpot, March 8 2012,
http://www.techspot.com/guides/502-intel-ivy-bridge/page3.html
[2.117] Shimpi A.L., Intel Iris Pro 5200 Graphics Review: Core i7-4950HQ Tested, AnandTech, June 1 2013,
http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested
[2.118] Product Brief: Intel Xeon Processor E3-1200 v2 Product Family, 2012,
http://software.intel.com/sites/products/vcsource/files/xeon-e3-1200v2-brief__2_.pdf
[2.119] Leberecht M., Expanding Intel Architecture Flexibility in the Data Center, March 20 2013,
http://www.worldhostingdays.com/downloads/2013/hStag2a1.pdf
[2.120] Wikipedia: Haswell (microarchitecture),
http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
[2.121] Von Holzbauer F., Kugler A., Neue Intel-Architektur mit Grafik-Fokus, Chip Online, June 1 2013,
http://www.chip.de/artikel/Intel-Haswell-Neue-CPUs-fuer-Notebooks-und-PCs_62209040.html
[2.122] Valentine B., Intel Next Generation Microarchitecture Codename Haswell: New Processor Innovations,
2012, http://ftp.software-sources.co.il/Processor_Architecture_Update-Bob_Valentine.pdf
[2.123] Brown M., Intel lifts the veil on Haswell graphics, PC World, May 2 2013,
http://www.pcworld.com/article/2037063/intel-lifts-the-veil-on-haswell-graphics.html
[2.124] Scansen D., Intel Launches Next Generation of Microprocessors, Engineering, June 10 2013,
http://www.engineering.com/ElectronicsDesign/ElectronicsDesignArticles/ArticleID/5838/IntelLaunches-Next-Generation-of-Microprocessors.aspx
[2.125] Walton J., Hit the Road, Jack: Intel’s Mobile Quad-Core Haswell SKUs, AnandTech, June 1 2013,
http://www.anandtech.com/show/7002/hit-the-road-jack-intels-mobile-quadcore-haswell-skus
[2.126] Walton J., Intel’s Haswell: Quad-Core Desktop Processor SKUs, AnandTech, June 1 2013,
http://www.anandtech.com/show/7001/intels-haswell-quadcore-desktop-processor-skus
[2.127] Kirsch N., Intel Core i7-4770K Haswell Processor Performance Review Posted, Legit Reviews, March
18 2013, http://www.legitreviews.com/intel-core-i7-4770k-haswell-processor-performance-reviewposted_15286
[2.128] Kennedy P., Intel Xeon E3-1200 V3 Series Compute Value Comparison, STH, June 13 2013,
http://www.servethehome.com/intel-xeon-e3-1200-v3-series-compute-comparison/
[2.129] Kennedy P., Leaked Intel Xeon E3-1200 V3 Series and Comparison to V1 and V2, STH, April 3 2013,
http://www.servethehome.com/leaked-intel-xeon-e3-1200-v3-series-comparison-v1-v2/
[2.130] Intel® Xeon® Processor E3-1200 v3 Product Family with Intel® HD Graphics P4600 and Intel® C226
Chipset: Block Diagram, http://www.intel.com/content/www/us/en/intelligent-systems/denlow/xeone3-1200-v3-c266-chipset-ibd.html
[2.131] Chiappetta M., Intel Core i7-4960X Ivy Bridge-E CPU Review, Hot Hardware, Sept. 3 2013,
http://hothardware.com/Reviews/Intel-Core-i74690X-Extreme-Edition-Ivy-Bridge-E-CPUReview/?page=2
[2.132] Intel Core i7 High End Desktop Processor Family for Socket LGA-2011, Aug. 2013,
[2.133] Pop S., Intel Releases Core i7 Extreme Central Processing Units, Softpedia, Sept. 9 2013,
http://news.softpedia.com/news/Intel-Releases-Core-i7-Extreme-Central-Processing-Units381369.shtml
[2.134] Shimpi A.L., Intel Core i7 4960X (Ivy Bridge E) Review, AnandTech, Sept. 3 2013,
http://www.anandtech.com/show/7255/intel-core-i7-4960x-ivy-bridge-e-review
[2.135] Hruska J., Core i7-4960X Ivy Bridge-E review: Intel’s Great Limp Forward, ExtremeTech, Sept. 3 2013,
http://www.extremetech.com/gaming/165498-core-i7-4960x-ivy-bridge-e-review-intels-greatlimp-forward
[2.136] Novakovic N., Intel Xeon 2013 update – A bit later, but a bit better too, VR-Zone, April 9 2013,
http://vr-zone.com/articles/intel-xeon-2013-update--a-bit-later-but-a-bit-bettertoo/19581.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+TheTomorrowTimes+%28The+Tomorrow+Times%29
[2.137] De Gelas J., Intel's Xeon E5-2600 V2: 12-core Ivy Bridge EP for Servers, AnandTech, Sept. 17 2013,
http://www.anandtech.com/show/7285/intel-xeon-e5-2600-v2-12-core-ivy-bridge-ep
[2.138] Morgan T.P., Intel carves up Xeon E5-2600 v2 chips for two-socket boxes, The Register, Sept. 10 2013,
http://www.theregister.co.uk/2013/09/10/intel_ivy_bridge_xeon_e5_2600_v2_launch/
[2.139] Eshelman E., Intel Xeon E5-2600v2 “Ivy Bridge” Processor Review, Microway, Sept. 10 2013,
http://www.microway.com/hpc-tech-tips/intel-xeon-e5-2600v2-ivy-bridge-processor-review/
[2.140] Shvets A., Intel Xeon E5-1600 v2 and E5-2600 v2 CPUs launched, CPU World, Sept. 14 2013,
http://www.cpu-world.com/news_2013/2013091401_Intel_Xeon_E5-1600_v2_and_E52600_v2_CPUs_launched.html
[2.141] Mujtaba H., Intel Ivy Bridge-EX Confirmed To Feature 15 Cores – Arrives in 2H 2013, WCCF Tech,
2013, http://wccftech.com/intel-ivy-bridge-ex-feature-15-cores/
[2.142] Shvets G., Some details of upcoming Intel Xeon E5 v2 and E7 v2 CPUs, CPU World, April 2 2013,
http://www.cpuworld.com/news_2013/2013040201_Some_details_of_upcoming_Intel_Xeon_E5_v2_and_E7_v2_CPUs.html
[2.143] Mouthaan M., Intel Haswell-E slides published; DDR4 and octa-cores, Hardware Info, June 16 2013,
http://us.hardware.info/news/35581/intel-haswell-e-slides-published-ddr4-and-octa-cores
[2.144] Introducing the 4th Generation Intel Core Processor (code-named Haswell), 2013,
http://software.intel.com/sites/default/files/introduction-to-intel-4th-generation-core-processor.pdf
[2.145] Unparalleled Performance and Responsiveness, Product Brief, Intel Z87 Chipset, 2013,
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/z87-chipset-brief.pdf
Part III. AMD’s high performance oriented Family 15h (Bulldozer-based) processor lines
Contents
1. Overview of AMD’s high performance oriented Family 15h (Bulldozer-based) processor lines 173
1. Overview of AMD’s high performance oriented Family 15h processor lines (based on [1]) 173
2. Performance increase of AMD’s DP servers up to the Bulldozer-based Interlagos [18] .. 174
3. AMD’s projection to increase performance in post Bulldozer architectures [19] ............. 174
4. Recent roadmaps of AMD’s basic lines [2] ...................................................................... 174
5. Introduction to the Family 15h lines of processors, designated also as the Bulldozer lines 175
6. The compute module of the Family 15h processors .......................................................... 176
7. Shared and dedicated components of the Bulldozer cores ................................................ 178
8. Design philosophy of using compute modules in Bulldozer-based designs ...................... 178
8.1. Main design aspects-1 [3] ..................................................................................... 178
9. Design philosophy of using compute modules .................................................................. 179
9.1. Main design aspects-2 [3] ..................................................................................... 179
10. Example: Clock speed gain achieved by the 1. generation Bulldozer design vs. the previous K10.5 design-1 ................ 179
11. a) Servers ......................................................................................................................... 179
12. Main operational parameters of AMD’s K10.5 Istanbul-based DP servers (Lisbon) [13] 180
13. Main operational parameters of AMD’s Family 15h-based DP servers (Valencia) [13] 180
14. b) Desktops ..................................................................................................................... 181
15. Main features of AMD’s K10.5-based Phenom™ II X6 desktop processors [14] .......... 181
16. Main features of AMD’s 1. generation Bulldozer-based FX desktop processors [14] .... 181
17. Example: Clock speed gain achieved by the 1. generation Bulldozer design vs. the previous K10.5 design - Summary ................ 182
18. The width of the Bulldozer cores .................................................................................... 182
2. First generation Family 15h Bulldozer-based processor lines .................................................... 183
1. 2.1 Overview of Family 15h Bulldozer-based processor lines [3] .................................... 183
1.1. AMD’s Bulldozer-based server and desktop lines – Overview-1 (based on [1]) . 183
1.2. Brand names of AMD’s Bulldozer-based server and desktop lines ..................... 184
1.3. Positioning AMD’s Bulldozer-based server lines ................................................. 184
1.4. Positioning AMD’s Bulldozer-based desktop lines .............................................. 185
2. 2.2 The Bulldozer Compute Module ................................................................................. 185
2.1. 2.2.1 Overview of the Bulldozer Compute Module .............................................. 185
2.1.1. The Bulldozer Compute module .............................................................. 185
2.1.2. Principle of operation of a Bulldozer module [4] ..................................... 186
2.2. 2.2.2 ISA extensions introduced in the Bulldozer design ..................................... 186
2.2.1. New Bulldozer instructions and their possible use [15] ........................... 186
2.2.2. Introduction of ISA x86 extensions by Intel vs. AMD ............................ 187
2.2.3. Comparison of FP-capabilities of Bulldozer, Magny-Cours and Sandy Bridge [16] ................ 188
2.2.4. Compiler support of Bulldozer’s new instructions [15] ............................ 189
2.3. 2.2.3 The microarchitecture of the Bulldozer Compute Module .......................... 189
2.3.1. AMD’s Bulldozer module contrasted with two cores of Magny-Cours [4] 190
2.3.2. The microarchitecture of a Bulldozer core [10] ....................................... 190
2.3.3. Block diagram of Intel’s Core 2 microarchitecture [11] ......................... 191
2.3.4. Block diagram of AMD’s K8 microarchitecture [11] .............................. 191
2.3.5. The microarchitecture of a Bulldozer core [10] ....................................... 192
2.3.6. The microarchitecture of Intel’s Sandy Bridge cores [17] ................... 193
2.3.7. The microarchitecture of Intel’s Westmere cores [10] ............................. 194
2.4. 2.2.4 Assessing the performance potential of the Bulldozer module-1 [3] ........... 195
2.4.1. Contrasting the execution resources of the Bulldozer core with previous designs ................ 196
2.5. 2.2.4 Assessing the performance potential of the Bulldozer module-2 [3] ........... 196
2.5.1. Contrasting the FP execution resources of the Bulldozer core with previous designs ................ 197
2.5.2. Contrasting the FP execution resources of the Bulldozer core with previous designs ................ 197
2.5.3. Comparing Bulldozer’s per module and Sandy Bridge’s per core available 256-bit execution resources-1 [17] ................ 199
2.5.4. Comparing Bulldozer’s per module and Sandy Bridge’s per core available 256-bit execution resources-1 [17] ................ 200
2.6. 2.2.4 Assessing the performance potential of the Bulldozer module-3 [3] ........... 200
2.6.1. Cache/main memory latencies of K10/K10.5, Bulldozer and Sandy Bridge processors [3] ................ 200
2.6.2. Cache sizes of K10/K10.5, Bulldozer and Sandy Bridge processors ...... 201
2.6.3. AMD’s projection to increase performance in post Bulldozer architectures [19] ................ 201
3. 2.3 The Orochi die ............................................................................................................ 202
3.1. The floorplan of the Orochi die ............................................................................ 202
3.2. The North Bridge of Orochi [21] .......................................................................... 203
3.3. Block diagram of the Orochi die ........................................................................... 204
4. 2.4 New power management features of the Bulldozer design ......................................... 204
4.1. AMD’s power management techniques K8 – 1. gen. Family 15h (Bulldozer) (based on [4]) ................ 204
4.2. New power management features of the Bulldozer design ................................... 205
4.3. TDP Power Cap [23] ............................................................................................ 205
4.4. Module C6 state [24], [6] ...................................................................................... 205
4.5. Module level VSS power gating ........................................................................... 206
4.6. Benefit of module level power gating (C6) vs. C1E state [7] ............................... 207
4.7. Contrasting the Smart Fetch technique with entering the Module C6 state [7] ... 208
4.8. LV-DDR3 support ................................................................................................ 208
5. 2.5 Bulldozer-based server lines ....................................................................................... 209
5.1. 2.5.1 Overview of the Bulldozer-based server lines ............................................. 209
5.1.1. Overview of the Bulldozer-based server lines-1 (Based on [1]) ............. 209
5.1.2. Overview of the Bulldozer-based server lines-2 (Based on [1]) ............. 210
5.2. 2.5.2 The Bulldozer-based Interlagos MP server line ........................................... 211
5.2.1. Positioning the Bulldozer-based Interlagos MP server line ..................... 211
5.2.2. Block diagram of Interlagos [6] ............................................................... 211
5.2.3. Example: Interlagos-based MP system [6] .............................................. 212
5.2.4. Performance increase of AMD’s MP servers up to the Bulldozer-based Interlagos [18] ................ 213
5.2.5. Performance/Watt evolution of AMD’s server lines [2] .......................... 213
5.2.6. Main features of Bulldozer-based Interlagos MP server lines [13] .......... 214
5.2.7. Comparing main features of Bulldozer-based lines with the previous generation [4] ................ 214
5.2.8. Performance assessment of Family 15h Bulldozer-based MP servers [13] 214
5.2.9. Throughput results of the Open Source server workload runs [26] .......... 215
5.2.10. Response time results of the Open Source server workload runs [26] ... 215
5.2.11. Power consumption results of the Open Source server workload runs [26] 216
5.2.12. Assessing the benchmark results gained for the Bulldozer-based Interlagos 6276 server ................ 216
5.3. 2.5.3 The Turbo core technology of Bulldozer-based MP servers ....................... 216
5.3.1. Principle of operation [6] ......................................................................... 217
5.3.2. Full and half load turbo frequencies of Family 15h Bulldozer-based Interlagos MP servers [13] ................ 217
5.4. 2.5.4 Bulldozer-based DP (Valencia) and UP (Zurich) server lines ..................... 218
5.4.1. AMD’s 2012 – 2013 server roadmap [2] ................................................. 218
5.4.2. The Family 15h Bulldozer-based DP system (Valencia) [6] .................... 218
5.4.3. Example Family 15h Bulldozer-based DP system (Valencia) [6] ............ 219
5.4.4. Main parameters of the Family 15h Bulldozer-based Valencia DP server line [13] ................ 219
5.4.5. Main parameters of the Family 15h Bulldozer-based Zurich UP server line [13] ................ 220
5.4.6. AMD’s 2012 – 2013 server roadmap [2] ................................................. 220
5.4.7. Recent roadmaps of AMD’s basic lines [27] ........................................... 221
6. 2.6 The Bulldozer-based Zambezi DT line ....................................................................... 221
6.1. 2.6.1 Overview of the Bulldozer-based Zambezi high performance desktop line [1] 221
6.1.1. Brand name of the Bulldozer-based high performance Zambezi desktop line 222
6.1.2. Positioning the Bulldozer-based Zambezi high performance desktop line 222
6.1.3. The Family 15h Bulldozer-based high performance Zambezi desktop line [6] ................ 223
6.1.4. Die plot of Zambezi [28] .......................................................................... 223
6.1.5. Key parameters of the Family 15h Bulldozer-based Zambezi desktop line [29] ................ 224
6.1.6. System example of a Zambezi desktop system (Scorpius platform) [30] 224
6.2. 2.6.2 The Turbo core technology of the Bulldozer-based Zambezi desktop line . 225
6.2.1. Contrasting AMD’s 1. and 2. gen. Turbo core implementations [36] ...... 225
6.2.2. AMD’s 2. generation Turbo core technology .......................................... 226
6.2.3. Principle of operation [6] ......................................................................... 226
6.2.4. Nominal, 8-core Turbo, and 4-core max. Turbo frequencies of the Zambezi DT
[29] ..................................................................................................................... 227
6.2.5. Example for the operation of AMD’s 2. generation Turbo core technology [37]
227
6.2.6. Example: Running a single threaded workload on the 8150 Zambezi DT with
Turbo core enabled [36] ..................................................................................... 228
6.2.7. Run time reduction achieved by enabling Turbo core for a single threaded
workload running on an FX-8150 (Zambezi) [38] ............................................. 228
6.2.8. Run time reduction achieved by enabling Turbo core for a multi-threaded
workload running on an FX-8150 (Zambezi) [38] ............................................. 229
6.2.9. Contrasting the operation of AMD’s 2. gen. Turbo core with that of Intel’s Turbo
Boost technology, as implemented in Sandy Bridge-based desktops (i5-2500K) [36]
230
6.2.10. Principle of operation of Intel’s Deep Power Down technology [39] .... 231
6.2.11. a) Precursor of Intel’s Turbo Boost: EDAT-2 ........................................ 231
6.2.12. b) Intel’s 1. gen. Turbo Boost ................................................................ 231
6.2.13. c) Intel’s enhanced 1. gen. Turbo Boost ................................................. 232
6.2.14. Available Turbo Boost bins (133 MHz) for the 1. and 2. gen. Nehalem
processors [38] ................................................................................................... 232
6.2.15. d) Intel’s 2. gen. (Next gen.) Turbo Boost (Dynamic Turbo Boost) ...... 232
6.2.16. Contrasting the introduction of Intel’s and AMD’s Turbo and Power gating
technologies ....................................................................................................... 233
6.2.17. Evolution of Intel’s Turbo technology [34] ........................................... 234
6.3. 2.6.3 Performance assessment of the Bulldozer-based Zambezi desktop line ...... 234
6.3.1. Summary benchmark results including all tests excl. games [32] ............ 234
6.3.2. Summary performance assessment of Zambezi-1 .................................... 235
6.3.3. Summary benchmark results including all tests excl. games [32] ............ 235
6.3.4. Summary performance assessment of Zambezi-2 .................................... 236
6.3.5. Summary benchmark results including all tests excl. games [32] ............ 236
6.3.6. Example: Impact of Windows 7’s scheduling policy to the activation of Max.
Turbo mode [9] .................................................................................................. 237
6.3.7. Summary assessment of the benchmark results of the Zambezi FX 8150 line [32]
239
6.3.8. Summary assessment of all Bulldozer based designs ............................... 239
6.3.9. Remark – AMD’s reorganization after the Bulldozer disaster ................. 239
3. Second generation Family 15h Piledriver-based processor lines ................................................ 240
1. 3.1 Overview of the Piledriver-based processor lines (based on [1]) .............................. 240
1.1. Brand names of Piledriver-based processor lines ................................................. 240
2. Piledriver-based processor lines ........................................................................................ 241
3. 3.2 The Piledriver Compute Module ................................................................................. 241
3.1. 3.2.1 Overview of the Piledriver Compute Module .............................................. 241
3.2. 3.2.1 Piledriver’s performance enhancements vs. Bulldozer [54] ........................ 242
3.2.1. Piledriver’s performance enhancements vs. the (Fam. 12h) Husky and Bulldozer
cores [55] ........................................................................................................... 242
3.3. 3.2.3 Piledriver’s power management enhancement vs. Bulldozer – The RCM technology
[63] .............................................................................................................................. 243
3.3.1. 3.2.2.1 A brief introduction into clock distribution networks [57] ........... 243
3.3.2. 3.2.3.2 Principle of the Resonant Clock Mesh (RCM) technology ......... 247
3.3.3. 3.2.3.3 The evolution of implementing RCM .......................................... 254
3.3.4. Main features of AMD’s Bulldozer- and Piledriver based Opteron server lines
[65] ..................................................................................................................... 255
3.3.5. Plans to implement Cyclos’s RCM in ARM Cortex-A15 [66] ................ 256
4. 3.3 Piledriver-based GPU-less processor lines .................................................................. 256
4.1. 3.3.1 Overview of the Piledriver-based GPU-less processor lines-1 .................... 256
4.1.1. Comparing the Bulldozer-based and Piledriver-based 4-module (8 cores) dies [6],
[54] .................................................................................................................... 257
4.1.2. Main functional blocks of a Piledriver-based GPU-less processor die [54] 258
4.2. 3.3.2 The Abu Dhabi Opteron 6300 server line .................................................... 258
4.2.1. Main functional blocks of the dual-chip Opteron 6300 (Abu Dhabi) 4P server
processor [67] ..................................................................................................... 259
4.2.2. Die plot of the dual-chip Opteron 6300 (Abu Dhabi) server processor [68] 260
4.2.3. Model numbers and main features of the Opteron 6300 (Abu Dhabi) 4P line [69]
260
4.2.4. Comparison of the Bulldozer-based Opteron 6200 and the Piledriver-based
Opteron 6300 server lines [67] ........................................................................... 261
4.3. 3.3.3 The Vishera high performance FX desktop line .......................................... 261
4.3.1. Main functional blocks of the high performance Vishera FX desktop line [54]
262
4.3.2. Die plot of the high performance Vishera FX desktop line [54] .............. 262
4.3.3. Model numbers and main features of the high performance Vishera FX desktop
line [60] .............................................................................................................. 263
4.3.4. Comparing main features of AMD’s Vishera and Zambezi FX desktop lines [49]
263
4.3.5. Main features of the 9-Series chipset supporting the high performance Vishera DT
[70] ..................................................................................................................... 264
4.3.6. AMD’s high-performance processor roadmap from 10/2011 [44] ........... 264
5. 3.4 Piledriver-based Trinity APU lines ............................................................................. 265
5.1. 3.4.1 Overview of the Piledriver-based Trinity APU lines ................................... 265
5.1.1. Piledriver-based Trinity APU lines .......................................................... 265
5.2. 3.4.2 The Trinity APU die ................................................................................... 265
5.2.1. AMD’s Trinity APU die [71] .................................................................. 266
5.2.2. Comparing die plots of AMD’s Llano and Trinity dies [72] .................... 266
5.2.3. Improvements of the Piledriver APU family over the Llano APU family 267
5.2.4. a) Enhancements of the microarchitecture of the Trinity APU [73] ........ 267
5.2.5. b) Improvement of the power management ............................................ 267
5.2.6. The Turbo Core technology of the Llano APU [74], [75] ........................ 268
5.2.7. Illustration of the operation of the Turbo Core Technology 3.0 of the Trinity APU
[77] ..................................................................................................................... 270
5.3. 3.4.3 The Trinity mainstream desktop APU line .................................................. 271
5.3.1. Positioning the Trinity mainstream desktop APU line [51] .................... 272
5.3.2. Main components of the Trinity mainstream desktop APU [78] ............. 272
5.3.3. Model numbers and main features of the mainstream Trinity desktop APU line
[78] (Virgo platform) ......................................................................................... 273
5.3.4. The new FM2 socket of the Trinity mainstream desktop APU line [78] 273
5.3.5. System architecture of the mainstream Trinity desktop APU with the A85X FCH
[79] ..................................................................................................................... 274
5.3.6. Performance increase achieved over the previous A-Series Llano APU line [78]
274
5.4. 3.4.4 The Trinity mobile APU line ....................................................................... 275
5.4.1. Positioning the Trinity mobile APU line-1 [51] ...................................... 275
5.4.2. Positioning the Trinity mobile APU line-2 [52] ...................................... 276
5.4.3. Model numbers and main features of the Trinity mobile APU line [80] (Comal
platform) ............................................................................................................ 276
5.4.4. The Comal mobile platform including the (Piledriver-based) Trinity APU and the
A70M/A60M FCH [52] ..................................................................................... 277
6. 3.5 Piledriver-based Richland APU lines .......................................................................... 277
6.1. 3.5.1 Overview of the Piledriver-based Trinity APU lines ................................... 277
6.1.1. Positioning the Trinity mainstream desktop and mobile APU lines [52] 278
6.1.2. Die shot of the Richland APU [81] .......................................................... 278
6.1.3. Key features of the Richland mobile APU line as exposed by AMD [82] 279
6.1.4. Major improvements of the Richland mobile APU line discussed [83], [84] 279
6.1.5. Principle of operation of the Temperature Smart Turbo Core (TSTC) technique-1
280
6.1.6. Principle of operation of the Temperature Smart Turbo Core (TSTC) technique-2
[85] .................................................................................................................... 280
6.1.7. Comparing clock frequencies of the Richland and the Trinity APU lines [86]
281
6.1.8. Principle of operation of the Temperature Smart Turbo Core (TSTC) technique-3
[85] .................................................................................................................... 281
6.1.9. Introducing additional frequency/voltage operating points ...................... 281
6.1.10. An innovative suite of apps. available typically on the Richland A8 and A10
models [87] ........................................................................................................ 282
6.1.11. AMD Face Login [88] ............................................................................ 282
6.1.12. AMD Gesture Control [88] .................................................................... 283
6.1.13. AMD Screen Mirror [88] ....................................................................... 283
6.1.14. AMD optimized games [88] ................................................................... 283
6.2. 3.5.2 The Richland mainstream desktop APU line ............................................... 283
6.2.1. Overview of the Richland mainstream desktop APU line ........................ 283
6.2.2. Positioning the Richland mainstream desktop and mobile APU lines [52] 284
6.2.3. Model numbers and expected key features of the Richland desktop APU line [89]
(Elite Performance platform) ............................................................................. 284
6.3. 3.5.3 The Richland mobile APU line .................................................................... 285
6.3.1. Positioning the Richland mobile APU line [52] ...................................... 285
6.3.2. Model numbers and expected main features of the Richland mobile APU line [84]
(Elite performance APU platform) ..................................................................... 286
6.3.3. AMD’s graphics performance figures of the Richland mobile APU line vs. Intel’s
Ivy Bridge-based mobile processors [83] .......................................................... 286
4. Third generation Family 15h Steamroller-based processor lines ................................................ 288
1. 4.1 Overview of Family 15h Steamroller-based processor lines (based on [1]) ............... 288
1.1. Brand names of Family 15h Steamroller-based processor lines ........................... 288
1.2. Overview of AMD’s Family 15h Steamroller-based processor lines ................... 289
2. 4.2 The Steamroller Compute Module .............................................................................. 289
2.1. Planned introduction of the Steamroller compute module .................................... 289
2.2. Preview of the Steamroller compute module (CM) ............................................. 290
2.3. Block diagram of the Steamroller compute module [45] ...................................... 290
2.4. Improvements of the front-end part of the Steamroller compute module [45] .... 290
2.5. Improving integer scheduling, integer execution and reducing average load latency in the
Steamroller compute module [45] ............................................................................... 291
2.6. Improving the power efficiency (performance/Watt figure) of the Steamroller compute
module [45] ................................................................................................................. 291
2.7. Comparing the block diagrams of three generations of the Family 15h Bulldozer design-1
...................................................................................................................................... 292
2.8. Improvements made in the microarchitecture of the Steamroller compute module 292
3. 4.3 Steamroller-based Opteron server lines ...................................................................... 293
3.1. Overview of AMD’s Family 15h Steamroller-based processor lines ................... 294
3.2. 4.3.1 Overview of Steamroller-based server lines (based on [1]) ........................ 294
3.2.1. Bringing forward the introduction of the Steamroller based server line .. 294
3.2.2. AMD’s server roadmap from 2/2012 [27] ............................................... 294
3.2.3. AMD’s indication of introducing the Steamroller-based server line already in
2013 [50] ............................................................................................................ 295
4. 4.4 Overview of Steamroller-based Kaveri desktop and mobile APU lines (based on [1]) 295
4.1. AMD’s Family 15h Steamroller-based mobile APU lines (based on [1]) ........... 296
4.2. Positioning the Steamroller-based Kaveri APU line as mainstream desktop line [51]
297
4.3. Positioning the Steamroller-based Kaveri APU as performance/mainstream mobile line
[51] ............................................................................................................................. 297
4.4. Revised positioning the Steamroller-based Kaveri APU line [52] ...................... 298
4.5. Overview of AMD’s Family 15h Steamroller-based APU lines .......................... 298
4.6. Main components of Kaveri APUs ....................................................................... 299
4.7. Architectural integration of the CPU and the GPU in Kaveri APU lines ............. 299
4.8. Evolution of HSA in subsequent mobile APU lines [48] ..................................... 299
4.9. GPU co-processing without pointers and data sharing – Without HSA [91] ........ 299
4.10. GPU co-processing with pointers and data sharing – With HSA [91] ................ 300
4.11. Data transfers in the memory hierarchy of the Llano APU [53] ......................... 301
5. References .................................................................................................................................. 302
Chapter 1 - Overview of AMD’s high
performance oriented Family 15h
(Bulldozer-based) processor lines
To date, the high performance oriented Family 15h processor lines comprise three generations, as follows:
Figure 1.1
1. Overview of AMD’s high performance oriented
Family 15h processor lines (based on [1])
Figure 1.2
2. Performance increase of AMD’s DP servers up to
the Bulldozer-based Interlagos [18]
Figure 1.3
3. AMD’s projection to increase performance in post
Bulldozer architectures [19]
Figure 1.4
Given the slow rate of performance increase shown above, it is highly questionable whether AMD will ever be able to catch up with Intel’s future processor lines.
4. Recent roadmaps of AMD’s basic lines [2]
Published: February 2011
Note: This roadmap no longer shows any estimate of the performance increase.
Figure 1.5
Note on terminology
In the literature and also in these slides the term Bulldozer is used in two distinct senses:
a) The Bulldozer designation typically refers to the 1. generation Family 15h processor lines.
b) In addition, the Bulldozer designation is also used to refer to the whole set of performance oriented Family
15h processor lines.
According to this interpretation, we designate
• the Piledriver processor lines as the 2. generation Bulldozer lines and
• the Steamroller processor lines as the 3. generation Bulldozer lines.
Usually, the context clarifies which interpretation fits.
5. Introduction to the Family 15h lines of processors,
designated also as the Bulldozer lines
• The Bulldozer project started in 2005 [4].
• First release of Bulldozer-based desktops: 10/2011
• First release of Bulldozer-based servers: 11/2011
• Bulldozer-based processors are built up of compute modules.
Figure 1.6
6. The compute module of the Family 15h processors
It is also designated as the Bulldozer module.
A Bulldozer module can execute two threads in parallel, i.e. it can be considered as being built up of two cores [15].
Figure 1.7
Unlike traditional cores, the two cores of a Bulldozer module have shared resources in addition to their dedicated ones, in order to reduce silicon area and save power [4].
Figure 1.8
7. Shared and dedicated components of the Bulldozer
cores
Dedicated and shared components of the Bulldozer cores are indicated in the figure below.
Shared components may be shared either at the module level or at the chip level, as the figure indicates [4].
Figure 1.9
8. Design philosophy of using compute modules in
Bulldozer-based designs
8.1. Main design aspects-1 [3]
a) AMD optimized the microarchitecture of their Bulldozer-based processors for multithreaded workloads rather than for single-threaded performance.
b) In light of the Fusion system architecture concept, AMD’s belief is that heavy FP tasks should be executed not on the CPU cores but on an integrated GPGPU.
As a consequence, the FP part of the microarchitecture may be designed for a low FP load.
c) A further key aspect was reducing power consumption.
This goal motivated a number of design decisions related to the microarchitecture as well as the use of a number of low power techniques, which are discussed in Section 2.4.
AMD’s decision to integrate two conventional cores into a Bulldozer module is in line with aspects a) to c), since
• a module provides two high performance, separate FX cores to support multithreading (aspect a)),
• the choice to include a single, moderately high performance shared FP unit satisfies aspects b) and c),
• sharing the complex x86 decoding between two cores reduces power consumption (aspect c)), although it also reduces the decode bandwidth.
9. Design philosophy of using compute modules
9.1. Main design aspects-2 [3]
d) As far as single-threaded performance is concerned, AMD focused on increasing clock speed rather than ILP.
To increase clock speed, AMD lengthened Bulldozer’s pipeline compared to their previous K10/K10.5 designs (which used 12 FX stages and 17 FP stages).
AMD declined to disclose the pipeline depth of Bulldozer; according to unofficial sources, however, Bulldozer has 18 pipeline stages [12].
Remark
Number of pipeline stages in recent Intel and AMD processors
Figure 1.10
10. Example: Clock speed gain achieved by the 1.
generation Bulldozer design vs. the previous K10.5
design-1
The basic building blocks of the 1. generation Bulldozer-based processors (and also of all further generations) are 4-module units, called Orochi dies in the 1. generation. Since these 4-module units include 8 cores (albeit with shared resources), we compare the clock frequencies of 8-core Family 15h Bulldozer systems with those of the previous 6-core K10.5 Istanbul-based designs.
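The core count of such a 4-module unit can be sketched as follows. This is a minimal illustration, assuming each compute module contributes two cores and one shared FP unit as described in the text; the class and field names are hypothetical and not AMD terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleBasedDie:
    """Illustrative model of a die built from two-core compute modules."""
    modules: int
    cores_per_module: int = 2  # per the text: one module = two cores

    @property
    def cores(self) -> int:
        # Total cores visible to software.
        return self.modules * self.cores_per_module

    @property
    def shared_fp_units(self) -> int:
        # One FP unit is shared by the two cores of a module.
        return self.modules


orochi = ModuleBasedDie(modules=4)  # the 4-module unit named in the text
print(orochi.cores)            # 8
print(orochi.shared_fp_units)  # 4
```

The sketch only counts resources; it says nothing about how the shared units are arbitrated between the two cores.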
11. a) Servers
Comparing clock speeds of K10.5 Istanbul-based 6C DP servers (Lisbon) vs. Family 15h Bulldozer-based 8C
DP servers (called Valencia).
12. Main operational parameters of AMD’s K10.5
Istanbul-based DP servers (Lisbon) [13]
Figure 1.11
13. Main operational parameters of AMD’s Family 15h-based DP servers (Valencia) [13]
Figure 1.12
Example: Clock speed gain achieved by the 1. generation Bulldozer design vs. the previous K10.5 design-2
14. b) Desktops
Comparing clock speeds of K10.5 Istanbul-based 6C desktops (Phenom II X6) vs. Family 15h Bulldozer-based
8C desktops (FX).
15. Main features of AMD’s K10.5-based Phenom™ II
X6 desktop processors [14]
Figure 1.13
16. Main features of AMD’s 1. generation Bulldozer-based FX desktop processors [14]
Figure 1.14
17. Example: Clock speed gain achieved by the 1.
generation Bulldozer design vs. the previous K10.5
design - Summary
Figure 1.15
The clock speed gain achieved by the Bulldozer-based designs is about 10 – 20 %.
This gain is quite moderate, given that the K10.5 Istanbul-based processors are fabricated with a 45 nm feature size whereas the Family 15h Bulldozer-based processors use 32 nm.
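The arithmetic behind such a gain figure can be sketched as below. The two frequencies are hypothetical placeholders chosen only to illustrate the calculation; they are not values taken from the tables referenced above, and the function name is likewise an assumption.

```python
def relative_gain(f_old_ghz: float, f_new_ghz: float) -> float:
    """Return the relative clock speed gain of f_new over f_old."""
    if f_old_ghz <= 0:
        raise ValueError("reference frequency must be positive")
    return (f_new_ghz - f_old_ghz) / f_old_ghz


# Hypothetical placeholder frequencies, NOT data from the source tables.
gain = relative_gain(3.0, 3.45)
print(f"{gain:.0%}")  # 15%
```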
18. The width of the Bulldozer cores
Bulldozer’s cores have a new, 4-wide microarchitecture, unlike the previous 3-wide K8-Family 12h designs, as
detailed in Section 2.2.3.
Remark
With the 4-wide Bulldozer design AMD caught up with Intel’s 4-wide Core 2 (2006) and subsequent designs.
Chapter 2 - First generation Family
15h Bulldozer-based processor lines
1. 2.1 Overview of Family 15h Bulldozer-based
processor lines [3]
Officially designated as the Family 15h Models 00h-0Fh processor lines.
They are also called the 1. generation Bulldozer-based processor lines.
Figure 2.1
1.1. AMD’s Bulldozer-based server and desktop lines – Overview-1 (based on [1])
Figure 2.2
1.2. Brand names of AMD’s Bulldozer-based server and desktop
lines
Figure 2.3
1.3. Positioning AMD’s Bulldozer-based server lines
Figure 2.4
1.4. Positioning AMD’s Bulldozer-based desktop lines
Figure 2.5
2. 2.2 The Bulldozer Compute Module
2.1. 2.2.1 Overview of the Bulldozer Compute Module
2.1.1. The Bulldozer Compute module
It includes two cores with dedicated and shared resources, as already discussed in Chapter 1 and redrawn
below [4].
Figure 2.6
2.1.2. Principle of operation of a Bulldozer module [4]
Figure 2.7
2.2. 2.2.2 ISA extensions introduced in the Bulldozer design
2.2.1. New Bulldozer instructions and their possible use [15]
Figure 2.8
2.2.2. Introduction of x86 ISA extensions by Intel vs. AMD
Figure 2.9
Figure 2.10 - Overview of Intel’s x86 ISA extensions (based on [44])
2.2.3. Comparison of FP-capabilities of Bulldozer, Magny-Cours and Sandy
Bridge [16]
Figure 2.11
2.2.4. Compiler support of Bulldozer’s new instructions [15]
Figure 2.12
2.3. 2.2.3 The microarchitecture of the Bulldozer Compute
Module
Bulldozer-based lines are built from Bulldozer Compute Modules, each of which can be regarded as
two conventional cores.
2.3.1. AMD’s Bulldozer module contrasted with two cores of Magny-Cours [4]
Figure 2.13
2.3.2. The microarchitecture of a Bulldozer core [10]
• 3. gen. superscalar
• Front-end: 4 wide
Figure 2.14
By introducing a 4-wide microarchitecture, AMD eliminated the inherent disadvantage vs. Intel that had arisen in 2006, when Intel introduced its 4-wide Core 2 microarchitecture while AMD remained with its 3-wide K8 design until Bulldozer.
2.3.3. Block diagram of Intel’s Core 2 microarchitecture [11]
Figure 2.15
2.3.4. Block diagram of AMD’s K8 microarchitecture [11]
Figure 2.16
2.3.5. The microarchitecture of a Bulldozer core [10]
• 3. gen. superscalar
• Front-end: 4 wide
• Issue rate to the EUs: 4 + 4 (shared)
Figure 2.17
2.3.6. The microarchitecture of Intel’s Sandy Bridge cores [17]
• 3. gen. superscalar
• Front-end: 4 wide
• Issue rate to the EUs: 6
Figure 2.18
2.3.7. The microarchitecture of Intel’s Westmere cores [10]
• 3. gen. superscalar
• Front-end: 4 wide
• Issue rate to the EUs: 6
Figure 2.19
Remark
A very detailed description of Bulldozer’s microarchitecture can be found in [10].
2.4. 2.2.4 Assessing the performance potential of the Bulldozer
module-1 [3]
The FX-part of a Bulldozer core-1
• It can be considered a peculiar core in that it shares specific resources (decoding, FP and multimedia
processing, L2 cache) with another core.
• A Bulldozer core includes fewer execution resources than the previous K8-Family 12h cores, as indicated in the
next figure, presumably in order to reduce power consumption or to optimize performance/power.
2.4.1. Contrasting the execution resources of the Bulldozer core with previous
designs
Figure 2.20
1) The FX-part of a Bulldozer core-2 [3]
Previous K8-Family 12h designs provided basically
• three 64-bit FX ALUs and
• three 64-bit AGUs
(used as Address Generation Units to calculate memory addresses of load/store operations).
By contrast, a Bulldozer core is equipped with only
• two 64-bit FX ALUs and
• two 64-bit AGUs.
As a consequence, a Bulldozer core can execute up to two ALU and up to two AGU operations per cycle, fewer
than previous AMD designs, which could perform up to three ALU and up to three AGU operations per cycle.
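The comparison above reduces to simple counting, as the following sketch shows. It assumes that peak throughput is simply the number of ALUs plus the number of AGUs, ignoring decode width and scheduling limits; the class and variable names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegerCore:
    """Integer (FX) execution resources of one core, as counted in the text."""
    alus: int
    agus: int

    def peak_ops_per_cycle(self) -> int:
        # Upper bound only: assumes every unit can be kept busy each cycle.
        return self.alus + self.agus


previous_core = IntegerCore(alus=3, agus=3)   # K8-Family 12h style core
bulldozer_core = IntegerCore(alus=2, agus=2)  # one core of a Bulldozer module

print(previous_core.peak_ops_per_cycle())       # 6
print(bulldozer_core.peak_ops_per_cycle())      # 4
print(2 * bulldozer_core.peak_ops_per_cycle())  # 8 per two-core module
```

Per core the Bulldozer design offers fewer integer units, while a two-core module taken as a whole offers more than a single earlier core.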
2.5. 2.2.4 Assessing the performance potential of the Bulldozer
module-2 [3]
2) The FP-part of the Bulldozer module-1
It is shared by two cores and incorporates four 128-bit units.
2.5.1. Contrasting the FP execution resources of the Bulldozer core with
previous designs
Figure 2.21
2) The FP-part of the Bulldozer module-2 [3]
From the available four units
• two serve multimedia operations (MMX and SSE) and
• only two can be used for FP operations (FMAC).
The two FMAC (FP Multiply Accumulate) units can be ganged together to execute 256-bit AVX (Advanced
Vector Extension) instructions.
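The lane arithmetic behind ganging the two FMAC units can be sketched as below. The sketch assumes the common convention of counting a fused multiply-accumulate as two floating point operations per lane and treats the results as upper bounds; the function names are hypothetical.

```python
DP_BITS = 64  # width of one double precision (DP) operand


def dp_lanes(width_bits: int) -> int:
    """Number of DP values that fit into a unit of the given width."""
    return width_bits // DP_BITS


def peak_dp_flops_per_cycle(units: int, width_bits: int,
                            flops_per_lane: int = 2) -> int:
    """Upper bound; flops_per_lane=2 counts multiply + accumulate (assumption)."""
    return units * dp_lanes(width_bits) * flops_per_lane


# Two independent 128-bit FMAC units, shared by the two cores of a module.
separate = peak_dp_flops_per_cycle(units=2, width_bits=128)
# The same two units ganged together as one 256-bit AVX unit.
ganged = peak_dp_flops_per_cycle(units=1, width_bits=256)

print(separate, ganged)  # 8 8
print(separate // 2)     # 4 per core if both cores issue FP work
```

Under these assumptions, ganging does not raise the peak rate of the module; it lets a single 256-bit instruction use both units at once.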
2) The FP-part of the Bulldozer module-3
On the other hand, AMD’s K10-Family 12h cores have
• three 128-bit FP units, as indicated in the next figure.
2.5.2. Contrasting the FP execution resources of the Bulldozer core with
previous designs
Figure 2.22
2) The FP-part of the Bulldozer module-4
On the other hand, AMD’s K10-Family 12h cores have
• three 128-bit FP units.
Each of AMD’s K10-Family 12h cores can perform up to two 64-bit FP operations and, in addition, 64-bit
MMX or 128-bit SSE operations.
Remark
The FP-units of the K8 cores were only 64-bit wide and each of them could perform only a single FP DP
operation.
2) The FP-part of the Bulldozer module-5
Comparison of the number of FP DP operations that can be executed per cycle
Figure 2.23
Obviously, Bulldozer has considerably fewer FP execution resources available per thread than the K10-Family 12h
cores, presumably in order to reduce power consumption.
3) 256-bit execution resources
Bulldozer uses its two 128-bit FMAC units together as a single 256-bit AVX unit [15] (called FLEX FP by AMD).
Figure 2.24
2.5.3. Comparing Bulldozer’s per module and Sandy Bridge’s per core
available 256-bit execution resources-1 [17]
2.25. ábra -
199
Created by XMLmind XSL-FO Converter.
First generation Family 15h
Bulldozer-based processor lines
2.5.4. Comparing Bulldozer’s per module and Sandy Bridge’s per core
available 256-bit execution resources-1 [17]
Whereas Bulldozer has a single 256-bit execution resource (two ganged 128-bit FMAC units) per module (two
cores),
Intel’s Sandy Bridge includes three 256-bit units per core, i.e. it has considerably more 256-bit execution
resources.
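The comparison above comes down to per-core arithmetic. The following is a minimal sketch that uses only the unit counts stated in the text; the function and variable names are illustrative, and it counts units only, not achievable throughput.

```python
from fractions import Fraction


def units_per_core(units: int, cores_sharing: int) -> Fraction:
    """Return how many 256-bit units each core gets when `units`
    are shared by `cores_sharing` cores."""
    return Fraction(units, cores_sharing)


# Unit counts as stated in the text above.
bulldozer_per_core = units_per_core(1, 2)     # one ganged unit per two-core module
sandy_bridge_per_core = units_per_core(3, 1)  # three units per core

ratio = sandy_bridge_per_core / bulldozer_per_core
print(f"Bulldozer: {bulldozer_per_core} per core, "
      f"Sandy Bridge: {sandy_bridge_per_core} per core, ratio: {ratio}")
```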
2.6. 2.2.4 Assessing the performance potential of the Bulldozer
module-3 [3]
4) The pipeline depth of Bulldozer-1
In order to increase single thread performance, the designers of Bulldozer lengthened its
FX pipeline to about 18 to 20 stages, compared to the 12 stages of the K8 to Family 12h designs.
Consequences of the longer pipelines
• Increased penalty of incorrectly guessed branches
• Longer cache and main memory latencies.
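The first consequence can be illustrated with the usual textbook CPI model, in which each mispredicted branch adds a fixed number of stall cycles. This is only a sketch: the workload parameters are hypothetical, and the penalty is assumed to equal the pipeline depth, which is a simplification rather than a measured Bulldozer figure.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Average cycles per instruction including branch misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload parameters, chosen only for illustration.
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20   # share of instructions that are branches
MISPREDICT_RATE = 0.05   # share of branches that are mispredicted

for stages in (12, 20):
    # Simplifying assumption: the misprediction penalty equals the pipeline depth.
    cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, stages)
    print(f"{stages} stages: CPI = {cpi:.2f}")
```

With identical branch behaviour, the deeper pipeline loses more cycles per misprediction, so it needs a higher clock frequency or a better branch predictor to compensate.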
2.6.1. Cache/main memory latencies of K10/K10.5, Bulldozer and Sandy Bridge
processors [3]
2.26. ábra -
Cache latencies, however, can only be assessed in relation to cache sizes.
2.6.2. Cache sizes of K10/K10.5, Bulldozer and Sandy Bridge processors
2.27. ábra -
4) The pipeline depth of Bulldozer-2 [3]
• As the above data show, Bulldozer has larger L2/L3 caches.
• Bulldozer’s larger L2/L3 caches compared with AMD’s previous designs, as well as its higher clock speed, gave rise to higher
access latencies.
• Larger caches result in fewer cache misses but cause higher access latencies that impede IPC.
• It remains to be evaluated whether AMD’s decisions related to the Bulldozer design, including the
module concept and the trade-offs concerning pipeline length, cache sizes, and cache and memory latencies, pay
off.
Sections 2.5.2 and 2.6.3 compare the achieved performance of Bulldozer-based server and desktop designs
with AMD’s previous K10.5 (Istanbul)-based designs as well as Intel’s Westmere-EP based server and
Sandy Bridge based desktop processors.
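The trade-off between fewer misses and higher access latency is commonly expressed as the average memory access time (AMAT). The sketch below uses that standard formula with hypothetical latencies and miss rates; none of the numbers are taken from the figures above.

```python
def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


MISS_PENALTY = 150  # hypothetical cost of going to the next level, in cycles

# Hypothetical cache configurations: (hit latency in cycles, miss rate).
smaller_cache = amat(12, 0.10, MISS_PENALTY)
larger_cache_modest_gain = amat(20, 0.06, MISS_PENALTY)
larger_cache_big_gain = amat(20, 0.04, MISS_PENALTY)

print(f"smaller cache:              {smaller_cache:.1f} cycles")
print(f"larger cache, modest gain:  {larger_cache_modest_gain:.1f} cycles")
print(f"larger cache, big gain:     {larger_cache_big_gain:.1f} cycles")
```

Whether the larger cache wins depends on how much the miss rate actually drops, which is why the text treats the outcome of these design decisions as an open question.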
2.6.3. AMD’s projection to increase performance in post Bulldozer
architectures [19]
2.28. ábra -
With the above slow rate of performance increase it is highly questionable whether AMD will ever be able to catch up
with Intel’s future processor lines.
3. 2.3 The Orochi die
The high-level building block of the first generation Bulldozer family is the Orochi die.
2.29. ábra -
3.1. The floorplan of the Orochi die
2.30. ábra -
Main parameters of the Orochi die
• 32 nm feature size
• 1.2 billion transistors
• 315 mm² die area
• 1 MB L2/core
• 8 MB L3
Bulldozer-based processors are built up of one or two Orochi dies as follows:
Servers
Interlagos: 2 dies (16 cores) implemented as a Multi-Chip Module (MCM)
Valencia: 1 die (8 cores)
Desktops
Zambezi: 1 die (8 cores)
3.2. The North Bridge of Orochi [21]
2.31. ábra -
3.3. Block diagram of the Orochi die
It incorporates 4 Bulldozer-modules (8 cores) [22]
2.32. ábra -
4. 2.4 New power management features of the
Bulldozer design
4.1. AMD’s power management techniques K8 – 1. gen. Family
15h (Bulldozer) (based on [4])
2.33. ábra -
4.2. New power management features of the Bulldozer design
• TDP Power Cap
• Module C6 state
• Module level VSS power gating
• Ultra LV-DDR3 support
4.3. TDP Power Cap [23]
• Power Capping was introduced in K10.5 Shanghai-based servers to set a power limit by setting the max. P-state via BIOS.
This kind of operation, however, prevents the processor from using the highest clock frequencies, which are
associated with the locked out P-states. This results in longer response or run times.
• TDP Power Cap, however, allows users to restrict power consumption without capping clock frequencies.
As long as the processor runs under normal circumstances (e.g. at 40-70 % of its full load), the response or
run times remain about the same as without power capping. The max. TDP can be set either via BIOS or
APML.
4.4. Module C6 state [24], [6]
(designated as Core C6 state or CC6 state by AMD)
The related BIOS and Kernel Developer’s Guide (BKDG) and most AMD literature designate the Module C6
state as the Core C6 state.
In the Module C6 state
• the L1 data caches of both cores and the shared L1 instruction cache and the L2 cache of the module are
flushed into the L3,
• the module state (register state) is dumped to DRAM, and
• VSS is power gated.
Entering the Module C6 state
When both Bulldozer cores of a module enter an idle state (non C0 state, like C1 to C3) and the condition for
flushing the L1/L2 caches remains valid for a preset period of time (checked by a counter)
• the L1/L2 caches of the module are flushed to the L3 cache,
• the internal state of the module is dumped to the DRAM and
• VSS of the module becomes power gated.
Module level VSS power gating results in an approximately 95 % reduction of the leakage power [25].
Exiting the Module C6 state
Exiting happens in the reverse order of entering the Module C6 state.
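The entry and exit sequences described above can be sketched as a small state machine. This is an illustrative model only: the class, the tick-based counter and its preset value are hypothetical, "C0" stands here for any state other than Module C6, and the real hardware and firmware interplay is considerably more involved.

```python
from dataclasses import dataclass, field

# Entry steps in the order given in the text.
ENTRY_STEPS = [
    "flush L1/L2 caches to L3",
    "dump module state to DRAM",
    "power-gate module VSS",
]
IDLE_THRESHOLD = 3  # hypothetical counter preset, in ticks


@dataclass
class Module:
    core_idle: list = field(default_factory=lambda: [False, False])
    idle_ticks: int = 0
    state: str = "C0"
    log: list = field(default_factory=list)


def tick(module: Module) -> None:
    """Advance the module by one time step."""
    both_idle = all(module.core_idle)
    if module.state == "C6":
        if not both_idle:
            # Exit: undo the entry steps in reverse order.
            module.log.extend("undo: " + step for step in reversed(ENTRY_STEPS))
            module.state = "C0"
            module.idle_ticks = 0
        return
    if both_idle:
        module.idle_ticks += 1
        if module.idle_ticks >= IDLE_THRESHOLD:
            module.log.extend(ENTRY_STEPS)
            module.state = "C6"
    else:
        module.idle_ticks = 0  # idle condition broken, restart the counter


m = Module(core_idle=[True, True])
for _ in range(IDLE_THRESHOLD):
    tick(m)
print(m.state, m.log)
```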
Remark [15]
• Entering a Core C6 state with power gating would be possible only for the components which are dedicated
for a core, such as the integer unit and the L1 data cache.
• Shared components can be power gated obviously only at the module level.
2.34. ábra -
4.5. Module level VSS power gating
The last step of entering the Module C6 state is power gating of the module.
A Bulldozer module will be power gated by a dedicated power gating ring that isolates the core VSS from the
real VSS [6], as detailed for the Fam. 12h Llano processor in Section 11.
2.35. ábra -
4.6. Benefit of module level power gating (C6) vs. C1E state [7]
2.36. ábra -
4.7. Contrasting the Smart Fetch technique with entering the
Module C6 state [7]
2.37. ábra -
4.8. LV-DDR3 support
LV-DDR3 support for 1.35 V low-voltage DDR3 devices was already introduced in the K10.5 Magny-Cours
servers.
LV-DDR3 support is now extended to 1.25 V ultra low-voltage DDR3 devices as well.
Remark
Summary of AMD’s power management techniques used in the Family 15h [15]
2.38. ábra -
5. 2.5 Bulldozer-based server lines
5.1. 2.5.1 Overview of the Bulldozer-based server lines
5.1.1. Overview of the Bulldozer-based server lines-1 (Based on [1])
2.39. ábra -
5.1.2. Overview of the Bulldozer-based server lines-2 (Based on [1])
2.40. ábra -
5.2. 2.5.2 The Bulldozer-based Interlagos MP server line
5.2.1. Positioning the Bulldozer-based Interlagos MP server line
2.41. ábra -
5.2.2. Block diagram of Interlagos [6]
2.42. ábra -
5.2.3. Example: Interlagos-based MP system [6]
2.43. ábra -
5.2.4. Performance increase of AMD’s MP servers up to the Bulldozer-based
Interlagos [18]
2.44. ábra -
5.2.5. Performance/Watt evolution of AMD’s server lines [2]
2.45. ábra -
5.2.6. Main features of Bulldozer-based Interlagos MP server lines [13]
Released: 11/2011 – Socket G34
2.46. ábra -
5.2.7. Comparing main features of Bulldozer-based lines with the previous
generation [4]
2.47. ábra -
5.2.8. Performance assessment of Family 15h Bulldozer-based MP servers [13]
Results are available for an open source server workload running on four DP configurations covering
competing AMD and Intel server processors.
The open source server workload (termed VApus FOS) was created by Anandtech.
It is a mix of four Virtual Machines (VM) with open source workloads including
• Apache2,
• MySQL,
• Community server 5.1.37 database,
• VMware’s open source groupware Zimbra 7.1.0.
The processors compared in DP configurations are
• AMD Opteron "Bulldozer"-based Interlagos 6276 at 2.3 GHz - 16 cores
• AMD Opteron "Magny-Cours" 6174 at 2.2 GHz - 12 cores
• Intel Xeon X5670 "Westmere" at 2.93 GHz - 6 cores
• Intel Xeon X5650 "Westmere" at 2.66 GHz - 6 cores
These processors have roughly the same price point.
The software environment and the hardware configurations are detailed in [26].
5.2.9. Throughput results of the Open Source server workload runs [26]
2.48. ábra -
5.2.10. Response time results of the Open Source server workload runs [26]
2.49. ábra -
5.2.11. Power consumption results of the Open Source server workload runs
[26]
2.50. ábra -
5.2.12. Assessing the benchmark results gained for the Bulldozer-based
Interlagos 6276 server
• The Bulldozer-based 8-module (16-core) Opteron 6276 (fc = 2.3 GHz) was, at the time the report was written, AMD’s
second highest performing Bulldozer server processor.
(The flagship model 6282 SE is clocked at 2.6 GHz.)
• The benchmark results show that AMD’s 16-core 2.3 GHz Opteron 6276
• provides at most a moderate performance increase over the previous K10.5 Magny-Cours-based 12-core
Opteron 6174, and
• has lower performance figures than Intel’s 6-core Westmere-based Xeon X5650/X5670 processors clocked
at 2.66 and 2.93 GHz, respectively.
5.3. 2.5.3 The Turbo core technology of Bulldozer-based MP
servers
Aim of the Turbo core technology
Increase performance of lightly threaded workloads by raising fc if there is a TDP headroom available.
Second generation Turbo core technology
The Turbo core technology of Bulldozer-based MP servers is already AMD’s second generation Turbo core
technology.
The first generation Turbo core technology was introduced with the 6-core K10.5 Istanbul-based desktop line
(Phenom II X6 line) called Thuban (4/2010).
5.3.1. Principle of operation [6]
2.51. ábra -
5.3.2. Full and half load turbo frequencies of Family 15h Bulldozer-based
Interlagos MP servers [13]
2.52. ábra -
A detailed description of the Bulldozer-based Turbo core technique is given in connection with the
Zambezi desktop processor, in Section 2.6.
5.4. 2.5.4 Bulldozer-based DP (Valencia) and UP (Zurich) server
lines
5.4.1. AMD’s 2012 – 2013 server roadmap [2]
2.53. ábra -
5.4.2. The Family 15h Bulldozer-based DP system (Valencia) [6]
2.54. ábra -
5.4.3. Example Family 15h Bulldozer-based DP system (Valencia) [6]
2.55. ábra -
5.4.4. Main parameters of the Family 15h Bulldozer-based Valencia DP server
line [13]
Released 11/2011 - Socket C32
2.56. ábra -
5.4.5. Main parameters of the Family 15h Bulldozer-based Zurich UP server line
[13]
Released 3/2012 – Socket AM3+
2.57. ábra -
5.4.6. AMD’s 2012 – 2013 server roadmap [2]
2.58. ábra -
5.4.7. Recent roadmaps of AMD’s basic lines [27]
Published: Oct. 2011
2.59. ábra -
Note
If AMD achieves only the estimated performance increase of about 10-15 % per year, it has no hope of
competing with Intel in the high performance segment.
6. 2.6 The Bulldozer-based Zambezi DT line
6.1. 2.6.1 Overview of the Bulldozer-based Zambezi high
performance desktop line [1]
2.60. ábra -
6.1.1. Brand name of the Bulldozer-based high performance Zambezi desktop
line
2.61. ábra -
6.1.2. Positioning the Bulldozer-based Zambezi high performance desktop line
2.62. ábra -
6.1.3. The Family 15h Bulldozer-based high performance Zambezi desktop line
[6]
2.63. ábra -
6.1.4. Die plot of Zambezi [28]
2.64. ábra -
Zambezi is based on the Orochi die (it includes 4 Bulldozer modules)
6.1.5. Key parameters of the Family 15h Bulldozer-based Zambezi desktop line
[29]
2.65. ábra -
6.1.6. System example of a Zambezi desktop system (Scorpius platform) [30]
2.66. ábra -
6.2. 2.6.2 The Turbo core technology of the Bulldozer-based
Zambezi desktop line
6.2.1. Contrasting AMD’s 1. and 2. gen. Turbo core implementations [36]
6.2.1.1. The 1. generation Turbo core technology
• It appeared in K10.5 Istanbul-based desktops (Phenom II X6, Thuban) in 4/2010.
• This processor did not yet support power-gating.
Much less headroom was available for the Turbo core technology.
• Beyond the base clock frequency there was only a single higher frequency value, the turbo frequency.
Turbo core was seldom activated and, when it was, it remained in effect only for short periods.
Example [36]
2.67. ábra -
6.2.2. AMD’s 2. generation Turbo core technology
• The 2. generation Turbo core was introduced in Family 15h Bulldozer-based servers and desktops (Interlagos,
Zambezi) in 10/2011.
• These processors do support power-gating.
So much more headroom remains for utilizing Turbo core.
• Beyond the base clock frequency there are two turbo levels,
• the 8-core Turbo frequency, which is activated if all cores are active but power
headroom up to the TDP remains, and
• the 4-core Turbo frequency, which can be activated if at least half of the cores are in the CC6 state and the
active cores request max. performance.
For single threaded applications the active core will basically run at the 8-core Turbo frequency and, if
enough headroom to the TDP remains, even at the 4-core Turbo frequency, as demonstrated below.
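The selection between the base clock and the two turbo levels can be sketched as a simple decision function. This is a simplified illustration of the rules listed above, not AMD's actual control algorithm. The 3.6 GHz and 3.9 GHz values are the FX-8150 figures quoted elsewhere in this chapter; the 4.2 GHz max. turbo value and all names are assumptions made for the example.

```python
BASE_GHZ = 3.6            # FX-8150 base clock
ALL_CORE_TURBO_GHZ = 3.9  # "8-core Turbo" frequency
MAX_TURBO_GHZ = 4.2       # assumed "4-core Turbo" value, not stated in this section


def select_frequency(cores_in_cc6: int, total_cores: int,
                     has_tdp_headroom: bool,
                     requests_max_performance: bool) -> float:
    """Pick a clock frequency following the two-level turbo rules (simplified)."""
    if not (has_tdp_headroom and requests_max_performance):
        return BASE_GHZ
    # Max. turbo needs at least half of the cores to be in the CC6 state.
    if 2 * cores_in_cc6 >= total_cores:
        return MAX_TURBO_GHZ
    return ALL_CORE_TURBO_GHZ


print(select_frequency(0, 8, True, True))   # all cores active, headroom left
print(select_frequency(7, 8, True, True))   # single threaded workload
print(select_frequency(0, 8, False, True))  # no headroom up to the TDP
```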
6.2.3. Principle of operation [6]
2.68. ábra -
6.2.4. Nominal, 8-core Turbo, and 4-core max. Turbo frequencies of the
Zambezi DT [29]
2.69. ábra -
6.2.5. Example for the operation of AMD’s 2. generation Turbo core technology
[37]
2.70. ábra -
6.2.6. Example: Running a single threaded workload on the 8150 Zambezi DT
with Turbo core enabled [36]
2.71. ábra -
While running a single threaded workload, essentially seven of the 8 cores remain idle. The processor runs most
of the time at the Turbo core frequency (3.9 GHz for the FX-8150). The average clock speed is 3.93 GHz, about 9 %
above the 3.6 GHz base clock of the FX-8150.
6.2.7. Run time reduction achieved by enabling Turbo core for a single
threaded workload running on an FX-8150 (Zambezi) [38]
2.72. ábra -
In the single threaded example, 7 of the 8 cores remain typically idle, and with the Turbo core mode enabled, the
processor runs mostly at the Turbo frequency, and partly at the max. Turbo frequency.
This results in a run time reduction of about 10 s (~ 7 %) while the Turbo core mode is activated.
6.2.8. Run time reduction achieved by enabling Turbo core for a multi-threaded
workload running on an FX-8150 (Zambezi) [38]
2.73. ábra -
The multi-threaded workload is spread across all 8 cores, and if Turbo core is enabled, clock frequency
alternates between the base clock of 3.6 GHz and the (8-core) Turbo frequency of 3.9 GHz.
The resulting run time reduction is about 0.2 s (~ 4 %), much less than for a single threaded workload.
Remark
The efficiency of Turbo core is affected also by the scheduler of the OS, as discussed in Section 2.6.3.
6.2.9. Contrasting the operation of AMD’s 2. gen. Turbo core with that of Intel’s
Turbo Boost technology, as implemented in Sandy Bridge-based desktops (i5-2500K) [36]
2.74. ábra -
Intel’s Turbo Boost implementation gives rise to more frequent clock fluctuations than AMD’s Turbo core.
The average clock frequency is 3.5 GHz, only about 6 % above the base frequency.
So it seems that AMD’s Turbo core technology, at least in the example shown, is more efficient than Intel’s
Turbo Boost.
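The two percentages quoted for the single threaded example can be reproduced with a one-line calculation. In the sketch below the FX-8150 figures come from the text; the 3.3 GHz base clock of the i5-2500K is not stated in this section and is supplied here as an assumption.

```python
def uplift_percent(average_ghz: float, base_ghz: float) -> float:
    """Relative increase of the average clock over the base clock, in percent."""
    return (average_ghz - base_ghz) / base_ghz * 100.0


amd_uplift = uplift_percent(3.93, 3.6)   # FX-8150: average vs. base clock
intel_uplift = uplift_percent(3.5, 3.3)  # i5-2500K: 3.3 GHz base is an assumption

print(f"FX-8150: {amd_uplift:.1f} %, i5-2500K: {intel_uplift:.1f} %")
```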
Remark
Brief comparison with Intel’s Turbo Boost implementations
a) Precursor of Intel’s Turbo Boost: EDAT-1
(Enhanced Dynamic Acceleration Technology)
• Introduced in Penryn-based 2-core mobiles in 2008, along with the DPD technology
(Deep Power Down Technology).
• The DPD technology is activated by the OS (through the MWAIT API) if a core is idle “long enough”.
“Long enough” is decided by the OS based on a heuristic, to prevent situations in which saving and restoring
the state needs more power than is gained by entering this state.
When activated, the MWAIT API causes the L2 cache to be flushed and the state of the idle core to be saved into an SRAM that
has a private power supply, and then the core voltage is reduced to a very low level.
Thus entering the DPD state provides power headroom for increasing the clock frequency of the active core.
6.2.10. Principle of operation of Intel’s Deep Power Down technology [39]
2.75. ábra -
6.2.11. a) Precursor of Intel’s Turbo Boost: EDAT-2
The EDAT technology
If one of the cores becomes idle and enters the C3 state or deeper, and the OS requests the highest performance
state for the active core, the clock frequency of the active core will be raised by a single turbo bin (133 MHz).
6.2.12. b) Intel’s 1. gen. Turbo Boost
Introduced in 1. gen. Nehalem processors (such as the 4-core Bloomfield (2008) targeting desktops and servers),
along with
• Integrated Power Gates (for VCC) to reduce leakage current to near zero, and a
• Power Control Unit (integrated microcontroller of the complexity of a 486 processor) that has real time
sensors for current, voltage and temperatures, samples these values in 5 ms intervals and controls Turbo
Boost based on sophisticated algorithms [40], [41].
• If the OS requests an active core to increase fc beyond the TDP limited maximum frequency (i.e. to enter the
P0 state), and there is available power headroom
• either by having idle cores
• or a lightly threaded workload
the Power Control Unit will increase the core frequency of the active cores provided that the power consumption
of the socket and the junction temperatures of the cores do not exceed given limits.
• In turbo mode all active cores in the processor will operate at the same fc and voltage.
• There are only 2 turbo bins available for boosting the clock frequency (2 x 133 MHz).
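Since Turbo Boost raises the clock in fixed steps, the resulting frequency is the base frequency plus a whole number of bins. A minimal sketch, using the 133 MHz bin size from the text and a hypothetical base frequency of 2667 MHz:

```python
BIN_MHZ = 133  # size of one turbo bin, as given in the text


def turbo_frequency_mhz(base_mhz: int, bins: int) -> int:
    """Clock frequency after raising the base clock by `bins` turbo bins."""
    return base_mhz + bins * BIN_MHZ


BASE_MHZ = 2667  # hypothetical base clock, chosen only for illustration

for bins in (0, 1, 2):
    print(f"{bins} bin(s): {turbo_frequency_mhz(BASE_MHZ, bins)} MHz")
```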
6.2.13. c) Intel’s enhanced 1. gen. Turbo Boost
• Introduced in 2. gen. Nehalem processors (such as the 4-core Lynnfield line (2009)), targeting all processor
categories.
• The enhancement is that more than the two turbo bins (2 x 133 MHz) of the first implementation are available for raising the core
frequency, as shown in the next Table.
6.2.14. Available Turbo Boost bins (133 MHz) for the 1. and 2. gen. Nehalem
processors [38]
2.76. ábra -
6.2.15. d) Intel’s 2. gen. (Next gen.) Turbo Boost (Dynamic Turbo Boost)
• Introduced along with the Sandy Bridge line of mobile and desktop processors in 2011.
These processors incorporate up to 4 cores and a GPU.
• It allows the energy budget accumulated during idle periods to be used for boosting fc, such that the power
consumption can rise above the TDP for a short period of time [8].
2.77. ábra -
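The idea of an energy budget that is filled while the processor runs below the TDP and drained while it runs above can be illustrated with a toy model. This is not Intel's algorithm: the function, the budget cap, the time step and the power values are all hypothetical and serve only to show why the excursion above the TDP is limited in time.

```python
def simulate(power_demand_w, tdp_w, budget_cap_j, dt_s=1.0):
    """Return the power actually delivered in each time step (toy model)."""
    budget_j = 0.0
    delivered = []
    for demand in power_demand_w:
        if demand <= tdp_w:
            # Running below the TDP accumulates energy budget, up to a cap.
            budget_j = min(budget_cap_j, budget_j + (tdp_w - demand) * dt_s)
            delivered.append(demand)
        else:
            # Running above the TDP is allowed only while budget remains.
            extra_w = min(demand - tdp_w, budget_j / dt_s)
            budget_j -= extra_w * dt_s
            delivered.append(tdp_w + extra_w)
    return delivered


# Two idle-ish steps followed by three steps of high demand.
result = simulate([35, 35, 120, 120, 120], tdp_w=95, budget_cap_j=40)
print(result)
```

In this run the processor exceeds the TDP for two steps and then falls back to the TDP once the accumulated budget is used up.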
6.2.16. Contrasting the introduction of Intel’s and AMD’s Turbo and Power
gating technologies
2.78. ábra -
6.2.17. Evolution of Intel’s Turbo technology [34]
2.79. ábra -
As indicated in the above Figures, Intel has a lead of about two years over AMD in both the Turbo and the
power gating technologies.
6.3. 2.6.3 Performance assessment of the Bulldozer-based
Zambezi desktop line
There are many benchmark investigations related to AMD’s Zambezi, e.g. [31], [32].
Below we show key results of the very extensive report [32] covering a wide range of application areas,
including
• synthetic benchmarks
• audio processing
• video processing
• image processing
• packing data
• rendering
• games.
6.3.1. Summary benchmark results including all tests excl. games [32]
2.80. ábra -
6.3.2. Summary performance assessment of Zambezi-1
a) AMD’s Bulldozer-based 4-module, 8-core FX 8150 flagship processor is far from taking over the
performance leadership from Intel’s 6-core i7 990X (a Westmere-based design).
The i7 990X provides about 30 % higher performance across all benchmarks (excl. games) than AMD’s
Bulldozer-based FX 8150 flagship desktop processor, albeit at a considerably higher price (~ 1000 $ at
the time of publishing the benchmark report).
The fact that Intel has no competition in the high end desktop market implies that Intel can set high end
desktop prices as high as the market allows.
b) Comparably priced Sandy Bridge based processors typically have higher performance than AMD’s FX 8150.
E.g. although Intel’s Sandy Bridge based 4-core i5 2500K cost less than AMD’s FX 8150 at the time of the
cited benchmark review [33], it performs about 10 % better than AMD’s FX 8150.
Other benchmark investigations also reveal that the Bulldozer-based Zambezi underperforms Intel’s
Sandy Bridge-based desktop processors [33], [31].
Remark
In order to take into account AMD’s module concept Microsoft released two patches to Windows 7 (patch,
patch v2) in cooperation with AMD [32].
Nevertheless, these patches actually did not improve the performance of the FX 8150 [32].
6.3.3. Summary benchmark results including all tests excl. games [32]
2.81. ábra -
6.3.4. Summary performance assessment of Zambezi-2
c) The Bulldozer-based 8-core (four modules) flagship FX 8150 achieves only a moderate gain (~ 10 – 20 %)
vs. AMD’s previous 4 to 6-core K10.5 (Phenom X6/Phenom X4) designs.
6.3.5. Summary benchmark results including all tests excl. games [32]
2.82. ábra -
Remarks
a) The scheduling policy of the OS affects the efficiency of the Turbo core technology and thus the achieved
performance [3].
Windows 7 recognizes neither the module structure of Bulldozer-based processors nor the peculiarities of their
Turbo core technology.
It spreads threads across modules, preventing the activation of the max. turbo speed, since max. turbo speed can
only be reached when at least half of the Bulldozer modules are idle (being in the C6 state).
Furthermore, the scheduler of Windows 7 re-schedules threads from time to time.
As a consequence, the processor cannot reach its peak performance for workloads that do not utilize all 8
available cores.
6.3.6. Example: Impact of Windows 7’s scheduling policy to the activation of
Max. Turbo mode [9]
2.83. ábra -
b) There are OS patches available for the FX series, worked out by Microsoft in cooperation with AMD, to
remedy this problem [32].
These patches are not very effective, as benchmark results indicate [32].
c) Windows 8 already has a redesigned scheduler.
The new scheduler takes into account the module structure and Turbo core features of Bulldozer.
This allows a performance boost of a few percent, as indicated below for games [35].
2.84. ábra -
6.3.7. Summary assessment of the benchmark results of the Zambezi FX 8150
line [32]
2.85. ábra -
6.3.8. Summary assessment of all Bulldozer based designs
All in all Bulldozer-based server and desktop lines were assessed by many market observers as “disappointing”,
e.g. [3], [31], [32].
6.3.9. Remark – AMD’s reorganization after the Bulldozer disaster
• After Intel announced their Sandy Bridge (1/2011) AMD’s Board of Directors pressured AMD’s CEO Dirk
Meyer to step down.
Dirk Meyer was originally a processor architect (co-architect of DEC’s 21064 and 21264 processors and chief
architect of AMD’s highly successful Athlon (K7) processor).
• In 8/2011 he was followed by Lenovo’s former CEO Rory Read.
• Read reorganized AMD and laid off 1400 employees out of a work force of about 40000 in 11/2011 [42],
[43].
Chapter 3 - Second generation Family
15h Piledriver-based processor lines
1. 3.1 Overview of the Piledriver-based processor
lines (based on [1])
3.1. ábra -
1.1. Brand names of Piledriver-based processor lines
3.2. ábra -
2. Piledriver-based processor lines
3.3. ábra -
3. 3.2 The Piledriver Compute Module
3.1. 3.2.1 Overview of the Piledriver Compute Module
The Piledriver Compute Module includes two cores like the Bulldozer Compute Module, but is a thorough
redesign of the ill-fated Bulldozer Compute Module [54].
3.4. ábra -
3.2. 3.2.2 Piledriver’s performance enhancements vs. Bulldozer
[54]
3.5. ábra -
3.2.1. Piledriver’s performance enhancements vs. the (Fam. 12h) Husky and
Bulldozer cores [55]
3.6. ábra -
Remark
A detailed description of Piledriver’s improvements and enhancements can be found in [56].
3.3. 3.2.3 Piledriver’s power management enhancement vs.
Bulldozer – The RCM technology [63]
• Along with the Piledriver design AMD introduced the Resonant Clock Mesh technology (RCM) in order to
reduce power consumption of the clock distribution network.
Reduced overall power consumption can be utilized also to increase clock frequency within the same TDP
limit.
• Announcement of RCM: in 2/2012 at the ISSCC.
• As the RCM technology aims at reducing the power consumption of clock distribution networks, first we
provide a brief overview about them.
Then we will discuss the principle of operation and the introduction of RCM.
3.3.1. 3.2.3.1 A brief introduction to clock distribution networks [57]
Along with the increasing number of transistors on a chip and rising clock frequencies, clock distribution
became an increasingly intricate issue and thus a field of intensive research as early as the beginning of the
1990s.
Without going into details, we next give an overview of the main steps of the evolution of clock distribution
networks.
3.3.1.1. Main types of clock distribution networks [58]
3.7. ábra -
Remark
Both illustrations of clock distribution networks include already clock gating to be discussed later.
3.3.1.2. Main types of tree-based clock distribution networks [58]
3.8. ábra -
3.3.1.3. Main types of grid-based clock distribution networks
3.9. ábra -
3.3.1.4. Example: Experimental grid-based clock distribution network with H-tree grid driving
The Figure below is an illustration for an experimental grid-based clock distribution network with H-tree driving
[59].
3.10. ábra -
3.3.1.5. Drawback of the grid-based clock distribution
High power consumption due to the buffers needed to drive the grid.
3.11. ábra - Distribution of power consumption in a Bulldozer processor [60]
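The reason a clock grid is expensive to drive can be seen from the standard CMOS switching power relation, P = a · C · V² · f, where the clock net toggles every cycle (activity a = 1). The sketch below applies this formula to hypothetical values; the capacitance, voltage and frequency are assumptions for illustration and are not Bulldozer data.

```python
def switching_power_w(capacitance_f: float, voltage_v: float,
                      frequency_hz: float, activity: float = 1.0) -> float:
    """Dynamic switching power P = activity * C * V^2 * f, in watts."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical values: 1 nF of total grid capacitance, 1.2 V supply, 3.6 GHz clock.
grid_power = switching_power_w(1e-9, 1.2, 3.6e9)
print(f"Clock grid switching power: {grid_power:.2f} W")
```

Because the power grows linearly with the capacitance being charged and discharged, a large, low-resistance grid costs more than a sparse tree at the same voltage and frequency.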
3.3.1.6. Main steps of the evolution of clock distribution networks
3.12. ábra -
3.3.1.7. Clock gaters
Clock gating is widely used to reduce power consumption by switching off the clocking of temporarily unused
parts of the processor.
E.g. Intel’s Pentium 4 (2000) already utilized aggressive clock gating, whereas AMD made use of this technique
later, presumably beginning with their K8 family (2003).
Clock gaters are on/off switches of the clock that are implemented simply by an AND function introduced in the
clock wires, as indicated below.
3.13. ábra - Use of clock gating to switch off temporarily not used units in a grid-based
clock distribution network [57]
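The AND function of a clock gater can be shown on sampled signal values. This is a purely logical sketch with illustrative names; a real clock gater typically also contains a latch to avoid glitches, which is omitted here.

```python
def gated_clock(clock, enable):
    """AND each clock sample with the enable signal of the gated unit."""
    return [c & e for c, e in zip(clock, enable)]


clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 1, 1, 0, 0, 0, 0]  # the unit is switched off in the second half

print(gated_clock(clock, enable))
```

While the enable signal is 0, the gated unit sees no clock edges, so its clocked elements do not switch and consume no dynamic power.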
3.3.1.8. Resonant clock meshing (RCM)
It aims at reducing the high power consumption of the grid-based clock distribution network.
Reduced overall power consumption can be utilized to boost clock frequency within a given TDP limit.
The RCM technology will be introduced in the next Section.
3.3.2. 3.2.3.2 Principle of the Resonant Clock Mesh (RCM) technology
Cyclos, the inventor of RCM, provides a very good brief explanation of the RCM technology that we cite
subsequently [57].
3.3.2.1. The Power Challenge [57]
“A modern SOC may consume up to 30% of its power just on the clock buffers, which is really a big contributor
to the overall power consumption. Other EDA (Electronic Design Automation) vendors are focused on reducing
power for the areas marked below with red arrows:
3.14. ábra -
The promise of Cyclos technology is to reduce the power consumption on the clock buffers:
3.15. ábra -
Many chips today use the familiar clock tree approach to distribute a clock signal across the chip.
3.16. ábra -
Another clock distribution approach is called the Mesh where a metal layer ties all the distributed clock signals
together to form a low resistance Mesh after the initial clock driver cells:
3.17. ábra -
The clock mesh gives you a very low skew value however it's capacitance requires increase energy to drive
which also increases power consumption. We like the low skew but we don't like increasing the power.
In EE theory classes we all learned about oscilators built out of LC circuits:
3.18. ábra -
What if we could combine the benefits of the clock mesh topology with the resonance of an oscillator to reduce
the energy required to drive a clock network?
Hmm, that idea could work in theory:
Figure 3.19 -
Benefits of such an approach:
• Low clock skews because of the low-resistance mesh
• Metal mesh less impacted by On Chip Variation (OCV) and Process/Voltage/Temperature (PVT)
variations
• The Post-gater trees timing are isolated, so ECOs are easier in the design cycle
• Lower power consumed by the clock distribution network
Challenges of this approach:
• EDA tools not commercially developed yet
• Design flow not well understood or built
The LC circuit created by the inductors basically helps to recycle clock power thus lowering consumption:
Figure 3.20 -
OK, lower power consumption for my clock network is always a good thing but are there more benefits? Yes,
you even have reduced jitter on your clock edges:
Figure 3.21 -
The theory of using a resonant mesh for clocks is appealing, but who is really using this in production chips?
I did a quick Google search and found hundreds of articles and patents on the subject, so it looks like the leap
from theory to practice has been bridged. A few more side benefits of resonant mesh clock designs are lower RF
noise than clock trees, and electromigration reduction from bidirectional current flow in the clock net.
3.3.2.2. Silicon Confirmation
The Cyclos Semi approach has been used with an ARM926 chip where first silicon showed a 25% to 35%
reduction in total power. Several other chips have used this approach with early customers and one DSP chip
showed a 75% lower clock power number while using GHz speeds.
Figure 3.22 -
3.3.2.3. Implementation
The theory matches the silicon results, so how do we get inductors onto an IC design?
Here's a mesh with distributed inductors built in a top-level of metal using standard processing steps. You don't
want circuits underneath these on-chip inductors so that will increase your silicon area up to 5% typically:
Figure 3.23 -
There are at least three clock distribution choices: clock tree, clock mesh, resonant mesh
Figure 3.24 -
The engineers at Cyclos are promoting the resonant mesh for GHz designs as a way to reduce power and tighten
up the clock specs.
3.3.2.4. EDA Tool Flow for Design
Initially the way to get this resonant clock mesh (RCM) for your chip requires some manual work so you could
hire Cyclos as a consulting company or wait until their tool flow is released in 2012. The idea is to create a
compiler that automates the layout implementation parts.
Figure 3.25 -
Getting the RCM implementation either costs $500K as a design service or means waiting until the RCM
Compiler tool is ready around the DAC time frame. An IP license will run you $1M per process node, and
finally there's a usage fee.
Compared to the digital libraries from Artisan/ARM there were only usage fees, called royalties”.
This is the end of Cyclos’s introduction to the RCM technology.
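As a rough numerical illustration of the principle cited above, the following sketch uses the textbook relation for an ideal LC circuit, f0 = 1 / (2π√(LC)), to estimate the inductance that would make a clock mesh resonate at the clock frequency, and the familiar C·V²·f expression for the power needed to drive the same capacitance without resonance. All component values and names (`C_MESH`, `V_DD`, `F_CLK`) are illustrative assumptions, not figures from [57].

```python
import math

def resonant_frequency(l_henry: float, c_farad: float) -> float:
    """Resonant frequency of an ideal LC circuit: f0 = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))

def resonant_inductance(c_farad: float, f_hz: float) -> float:
    """Inductance that resonates with c_farad at f_hz (the relation above solved for L)."""
    return 1.0 / ((2.0 * math.pi * f_hz) ** 2 * c_farad)

def direct_drive_power(c_farad: float, v_volt: float, f_hz: float) -> float:
    """Dynamic power of charging and discharging c_farad once per cycle: P = C * V^2 * f."""
    return c_farad * v_volt ** 2 * f_hz

# Illustrative assumptions, not data from the cited sources.
C_MESH = 1e-9   # total clock mesh capacitance in farads
V_DD = 1.0      # supply voltage in volts
F_CLK = 4e9     # target clock frequency in hertz

L_TOTAL = resonant_inductance(C_MESH, F_CLK)
print(f"inductance for resonance: {L_TOTAL * 1e12:.2f} pH")
print(f"non-resonant drive power: {direct_drive_power(C_MESH, V_DD, F_CLK):.2f} W")
```

In an ideal resonant circuit the energy oscillates between the mesh capacitance and the inductors, so the drivers only have to replace resistive losses. This is the sense in which the clock power is "recycled".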
3.3.3. 3.2.3.3 The evolution of implementing RCM
• Interest in using RCM to reduce the power consumption of clock distribution networks arose around the
beginning of the 2010s.
• A number of papers appeared, and later numerous small test chips were developed to demonstrate the
potential of RCM [61].
Below we briefly review only the large-scale efforts to implement RCM in commercial
processors.
3.3.3.1. Experimental implementation of RCM in the ARM9EJ-S [62]
In 2009 Cyclos, a small start-up company founded to market the RCM IP, announced that it had
completed the experimental implementation of RCM in the ARM9EJ-S.
This was the first experimental implementation of RCM covering the entire clock distribution network of a
commercial processor; nevertheless, it served only to demonstrate the feasibility of Cyclos’ IP.
The implementation made use of an off-chip inductor to resonate the parasitic clock capacitance from the root of
the clock network.
The total power savings achieved amounted to 20 % to 35 % depending on the workload.
3.3.3.2. Experimental implementation of RCM in the Cell/B.E. (2009) [61], [62].
• The work aiming at the implementation of the RCM technology into the Cell/B.E. design started when the
Cell/B.E. was already in volume production.
• The implementation was limited to the global clock distribution network and did not include the driving flip-flops, resulting in moderate total power savings of about 5%.
• RCM was implemented by using 830 on-chip spiral inductors.
3.3.3.3. Implementation of RCM in AMD’s Piledriver-based processor lines (2012) [63]
• It is the first volume production enabled microprocessor that makes use of the RCM technology.
• The clock system operates in two modes: direct-drive (without RCM) and resonant (RCM) mode.
In direct-drive mode the inductors are shunted by a switch.
• In the chosen implementation a set of five horizontal folded clock trees drive a global clock grid, where each
clock tree has up to 25 on-chip inductors.
• The power savings achieved in the clock distribution network are up to 24 %.
• Power savings can be utilized to boost clock speed.
• The implementation of RCM in AMD’s Piledriver-based processor lines is, however, restricted to a Rev.2
of an existing clock mesh [60].
This is in line with some industry observers stating that AMD had not yet implemented RCM in its first
shipped Piledriver-based processors [56].
• Nevertheless, according to a paper, the Piledriver-based Trinity A10-4600M processor already includes RCM
[64].
This publication states that the design uses 92 100 µm-wide inductors, spread out over each dual-core
processor module.
Remark
In 3/2013 AMD quietly introduced the Piledriver-based 6-core Opteron 635O and 4-core 4350 processors with
remarkably increased clock speeds, as shown in the subsequent Table [65].
It can be suspected therefore that these processors are already Rev.2 parts using RCM.
3.3.4. Main features of AMD’s Bulldozer- and Piledriver based Opteron server
lines [65]
Figure 3.26 -
3.3.5. Plans to implement Cyclos’s RCM in ARM Cortex-A15 [66]
• In 2/2013 Global Foundries and ARM announced plans to implement Cyclos’ RCM in ARM’s Cortex-A15 in order to boost clock speed.
• The RCM design will include on-chip inductors.
4. 3.3 Piledriver-based GPU-less processor lines
4.1. 3.3.1 Overview of the Piledriver-based GPU-less processor
lines-1
Figure 3.27 -
Piledriver underlies AMD’s Abu Dhabi Opteron 6300 server line and the Vishera FX high performance desktop
line.
The first Piledriver-based GPU-less processor line was introduced in 10/2012 as the Vishera high performance
FX desktop line.
Key features of the Piledriver-based GPU-less processor die:
• 32 nm feature size,
• 315 mm²,
• 1.2 billion transistors.
(These are exactly the same figures as those for the related Bulldozer-based Orochi die).
4.1.1. Comparing the Bulldozer-based and Piledriver-based 4-module (8 cores)
dies [6], [54]
Figure 3.28 -
No significant differences can be observed in the layout of the Bulldozer and Piledriver dies.
4.1.2. Main functional blocks of a Piledriver-based GPU-less processor die [54]
It underlies both the Opteron 6300 Abu Dhabi server line and the Vishera high performance desktop line.
Figure 3.29 -
4.2. 3.3.2 The Abu Dhabi Opteron 6300 server line
Figure 3.30 - Sub-families of the Opteron 6300 (Abu Dhabi) server line [51]
Remark
In fact, only the Opteron 6300 4P server line is dubbed as the Abu Dhabi line, whereas the 2P and 1P models are
designated differently and have also different key features, as shown below.
Figure 3.31 -
Nevertheless, often the whole Opteron 6300 line (4P/2P/1P) is referred to as the Abu Dhabi line.
4.2.1. Main functional blocks of the dual-chip Opteron 6300 (Abu Dhabi) 4P
server processor [67]
Figure 3.32 -
4.2.2. Die plot of the dual-chip Opteron 6300 (Abu Dhabi) server processor [68]
Figure 3.33 -
4.2.3. Model numbers and main features of the Opteron 6300 (Abu Dhabi) 4P
line [69]
Figure 3.34 -
4.2.4. Comparison of the Bulldozer-based Opteron 6200 and the Piledriver-based Opteron 6300 server lines [67]
Figure 3.35 -
4.3. 3.3.3 The Vishera high performance FX desktop line
Figure 3.36 -
4.3.1. Main functional blocks of the high performance Vishera FX desktop line
[54]
Figure 3.37 -
4.3.2. Die plot of the high performance Vishera FX desktop line [54]
Figure 3.38 -
4.3.3. Model numbers and main features of the high performance Vishera FX
desktop line [60]
Figure 3.39 -
4.3.4. Comparing main features of AMD’s Vishera and Zambezi FX desktop
lines [49]
Figure 3.40 -
It can be noticed that high end models of the Piledriver-based Vishera FX line offer about 10% higher base
clock speed than the related models of the previous Bulldozer-based Zambezi line.
4.3.5. Main features of the 9-Series chipset supporting the high performance
Vishera DT [70]
The Vishera FX line makes use of the same chipset (9-Series) as the previous Zambezi FX DT line.
Figure 3.41 -
4.3.6. AMD’s high-performance processor roadmap from 10/2011 [44]
Figure 3.42 -
5. 3.4 Piledriver-based Trinity APU lines
5.1. 3.4.1 Overview of the Piledriver-based Trinity APU lines
5.1.1. Piledriver-based Trinity APU lines
Figure 3.43 -
5.2. 3.4.2 The Trinity APU die
It underlies AMD’s mainstream desktop and mobile Trinity APU lines.
The first Trinity APU line was introduced in 5/2012 as the Trinity mobile APU line.
Key features of the Trinity die:
• 32 nm feature size,
• 226 mm²,
• 1.303 billion transistors.
(These are almost the same figures as those of the Llano die).
5.2.1. AMD’s Trinity APU die [71]
Figure 3.44 -
5.2.2. Comparing die plots of AMD’s Llano and Trinity dies [72]
Figure 3.45 -
5.2.3. Improvements of the Piledriver APU family over the Llano APU family
a) Enhancements of the microarchitecture of the Trinity APU
b) Improvement of the power management
5.2.4. a) Enhancements of the microarchitecture of the Trinity APU [73]
Figure 3.46 - Simplified layout of the digital power monitoring system of the Llano APU [75]
Here we do not go into details relating to the improvements of the microarchitecture of the Trinity APU but
refer e.g. to the following sources [55], [73].
5.2.5. b) Improvement of the power management
In their Trinity APU family AMD introduced the Turbo Core Technology 3.0, first in the Trinity mobile line
in 5/2012.
The Turbo Core Technology 3.0 is an improvement of Llano’s Turbo Core Technology.
For this reason – before discussing the introduced improved technology - first let us recap the Turbo Core
Technology of the Llano APU.
5.2.6. The Turbo Core technology of the Llano APU [74], [75]
Based on a patent filed in 2008 by Naffziger (one of the key processor architects of AMD) [74], Llano became
AMD’s second processor including the Turbo Core technology (the first one was the K10.5 based high
performance Thuban desktop processor (2010)).
Llano digitally monitors a large number (95) of relevant events in each core, such as FX and FP operations,
L1/L2 cache accesses etc. to calculate the power dissipation of each unit (4 cores plus the GPU), and also the
entire chip, as indicated below.
Figure 3.47 - Simplified layout of the digital power monitoring system of the Trinity APU [76]
Based on the calculated power consumption the Turbo Core Manager determines the actual energy margins as
the difference between the actual power consumption of the cores and the chip and the related TDP figures.
• Positive margins indicate power headroom
• Negative margins indicate power overage
A power headroom can be utilized for increasing the clock frequency.
Power overages, on the other hand, initiate throttling (clock reduction) of the cores or even the GPU.
If the OS requests higher CPU performance for particular cores and there is a power headroom available, the
Turbo Core Manager initiates a clock frequency increase for the related core.
Note that the TDP is considered a given static value and that only the clock frequencies of the cores may be
increased; the clock frequency of the GPU cannot be boosted, not even when the CPU cores are not
fully utilized or are inactive.
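The decision logic described above can be sketched as follows. This is a minimal model, not AMD's implementation: the power estimate is assumed to be a weighted sum of the monitored event counts, and the event names, weights, clock limits and step size are hypothetical. Only the rule "margin = TDP minus estimated power; boost on headroom, throttle on overage, never boost the GPU" is taken from the text.

```python
def estimate_power(event_counts, event_weights, static_w=0.0):
    """Estimate power as a weighted sum of activity counters (weights are hypothetical)."""
    return static_w + sum(event_weights[name] * count
                          for name, count in event_counts.items())

def power_margin(tdp_w, estimated_w):
    """Positive result means power headroom, negative result means power overage."""
    return tdp_w - estimated_w

def llano_core_clock(core_mhz, margin_w, boost_requested,
                     min_mhz, max_mhz, step_mhz=100):
    """Next clock of one CPU core; the GPU clock is never raised in this model."""
    if margin_w < 0:
        # Power overage: throttle the core by one step.
        return max(core_mhz - step_mhz, min_mhz)
    if boost_requested and margin_w > 0:
        # Headroom available and the OS asks for more performance: boost by one step.
        return min(core_mhz + step_mhz, max_mhz)
    return core_mhz

# Hypothetical example values.
estimated = estimate_power({"fp_ops": 100, "l2_access": 50},
                           {"fp_ops": 0.01, "l2_access": 0.02}, static_w=3.0)
margin = power_margin(35.0, estimated)
print(llano_core_clock(1500, margin, True, min_mhz=800, max_mhz=2400))
```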
5.2.6.1. The Turbo Core Technology 3.0 of the Trinity APU line [76]
It is an enhancement of Llano’s Turbo Core Technology, as indicated below.
Unlike the Llano APU die the Trinity APU die includes two compute modules (CU0 and CU1) and the GPU,
rather than 4 cores and the GPU.
Accordingly, the basic layout of the digital power monitoring system has been modified, as follows:
Figure 3.48 - Example for the operation of the AMD Turbo Core Technology 3.0 [55]
The major enhancement of the Turbo Core technology of the Trinity vs. the Llano APU is that the Trinity APU
implements a bi-directional turbo management, unlike the Llano APU that could boost only the core
frequencies.
This now also allows the clock frequency of the GPU to be increased when there is a heavy GPU load and enough
power headroom is available, as the following Figure demonstrates for the Trinity A10-4600M mobile APU.
Figure 3.49 -
Related to the above Figure we note that in the Trinity A10-4600M mobile APU
• the compute modules have a base clock frequency of 2300 MHz that can be boosted up to 3200 MHz, in steps
of 100 MHz, whereas
• the GPU has a base clock frequency of 496 MHz that can be raised to 685 MHz.
Now, according to the actual load pattern, the Turbo core manager (not shown in the Figure) will increase either
the core frequency of the compute units or the GPU, when the OS requires a performance increase and enough
power headroom is available, as demonstrated in the Figure below.
5.2.7. Illustration of the operation of the Turbo Core Technology 3.0 of the
Trinity APU [77]
Figure 3.50 -
On the other hand, when there is a power overage, the Turbo core manager initiates a clock throttling of the
compute units or the GPU to reduce power dissipation below the allowed TDP limit (not shown in the Figure).
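A simplified sketch of such a bi-directional turbo decision is given here, using the A10-4600M clock figures quoted above. The policy details are assumptions made for illustration: the `gpu_bound` flag stands in for "the actual load pattern", the GPU is switched between its two quoted clocks in one step, and the sketch never throttles below the base clocks, which real hardware may do.

```python
# Clock figures of the Trinity A10-4600M as quoted in the text.
CU_BASE_MHZ, CU_MAX_MHZ, CU_STEP_MHZ = 2300, 3200, 100
GPU_BASE_MHZ, GPU_MAX_MHZ = 496, 685

def trinity_turbo_step(cu_mhz, gpu_mhz, margin_w, boost_requested, gpu_bound):
    """One decision of a simplified bi-directional turbo manager.

    margin_w is TDP minus estimated power; gpu_bound is a hypothetical flag
    telling whether the current load is GPU-heavy. Returns (cu_mhz, gpu_mhz).
    """
    if margin_w < 0:
        # Power overage: back off the GPU boost first, then the compute units.
        if gpu_mhz > GPU_BASE_MHZ:
            return cu_mhz, GPU_BASE_MHZ
        return max(cu_mhz - CU_STEP_MHZ, CU_BASE_MHZ), gpu_mhz
    if boost_requested and margin_w > 0:
        # Headroom: boost whichever side the load pattern favours.
        if gpu_bound:
            return cu_mhz, GPU_MAX_MHZ
        return min(cu_mhz + CU_STEP_MHZ, CU_MAX_MHZ), gpu_mhz
    return cu_mhz, gpu_mhz

print(trinity_turbo_step(2300, 496, margin_w=5.0, boost_requested=True, gpu_bound=True))
```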
5.3. 3.4.3 The Trinity mainstream desktop APU line
Figure 3.51 -
5.3.1. Positioning the Trinity mainstream desktop APU line [51]
Figure 3.52 -
5.3.2. Main components of the Trinity mainstream desktop APU [78]
Figure 3.53 -
5.3.3. Model numbers and main features of the mainstream Trinity desktop
APU line [78] (Virgo platform)
Figure 3.54 -
5.3.4. The new FM2 socket of the Trinity mainstream desktop APU line [78]
Figure 3.55 -
Remark
The A55/A75 FCHs (Fusion Controller Hubs) were already introduced for the Llano A-Series APUs, whereas the A85 is
new and supports the high performance unlocked models of the line.
5.3.5. System architecture of the mainstream Trinity desktop APU with the
A85X FCH [79]
Figure 3.56 -
5.3.6. Performance increase achieved over the previous A-Series Llano APU
line [78]
Figure 3.57 -
5.4. 3.4.4 The Trinity mobile APU line
Figure 3.58 -
5.4.1. Positioning the Trinity mobile APU line-1 [51]
Figure 3.59 -
5.4.2. Positioning the Trinity mobile APU line-2 [52]
Figure 3.60 -
5.4.3. Model numbers and main features of the Trinity mobile APU line [80]
(Comal platform)
Figure 3.61 -
Trinity based A10/A8 mobiles appeared in 5/2012.
5.4.4. The Comal mobile platform including the (Piledriver-based) Trinity APU
and the A70M/A60M FCH [52]
Figure 3.62 -
6. 3.5 Piledriver-based Richland APU lines
6.1. 3.5.1 Overview of the Piledriver-based Richland APU lines
Figure 3.63 -
6.1.1. Positioning the Trinity mainstream desktop and mobile APU lines [52]
Figure 3.64 -
6.1.2. Die shot of the Richland APU [81]
Figure 3.65 -
6.1.3. Key features of the Richland mobile APU line as exposed by AMD [82]
Figure 3.66 -
6.1.4. Major improvements of the Richland mobile APU line discussed [83], [84]
Richland APUs are also based on Piledriver cores, are fabricated at the same feature size as the Trinity
APUs and incorporate the same number of transistors.
Their major improvements are as follows:
• about 10 % faster CPU and 4-7 % faster GPU clock speed both in base mode and turbo mode,
• new HD 8000G series GPUs, which are, however, still based on the Cayman core (Northern Islands
family) including VLIW4 ALUs.
The new GPUs are claimed to provide 20-40 % more graphics performance in high-end models than the
previous HD 7000G series GPUs,
• improved power management, including
• an enhanced power management technique, called the Temperature Smart Turbo Core (TSTC), that increases
battery life (to be detailed subsequently),
• introducing additional frequency/voltage operating points (P points) to enhance the efficiency of power
management (to be detailed subsequently), and
• innovative software features (to be detailed subsequently).
6.1.5. Principle of operation of the Temperature Smart Turbo Core (TSTC)
technique-1
It enhances turbo core management by including 17 temperature sensors
• 5 on each compute module and
• 7 on the GPU
along with a package sensor (not shown), as indicated in the Figure [82].
Figure 3.67 -
6.1.6. Principle of operation of the Temperature Smart Turbo Core (TSTC)
technique-2 [85]
Taking into account the temperature data of the compute modules, the GPU and the package, delivered by the
respective sensors, the Turbo Core Manager can set the clock speeds of both the compute modules and the
GPU in real time in a more sophisticated way, according to the actual load pattern, while staying within the
chip’s thermal limits.
This typically results in higher clock speeds than those granted by the previous Trinity APU implementation, as
demonstrated in the next Table.
6.1.7. Comparing clock frequencies of the Richland and the Trinity APU lines
[86]
Figure 3.68 - Additional frequency/voltage points (P points) introduced in the Richland APU [85]
6.1.8. Principle of operation of the Temperature Smart Turbo Core (TSTC)
technique-3 [85]
6.1.8.1. Handling of possible bottlenecks
A further improvement of the power management algorithm relates to handling of possible bottlenecks.
The previous algorithm granted a higher clock speed to the compute modules or the GPU whenever required, regardless
of whether the higher clock speed could actually be exploited in view of possible resource bottlenecks.
The new algorithm takes possible bottlenecks into account, and grants higher clock frequencies only if
bottlenecks do not limit the utilization of the higher clock speed granted.
6.1.9. Introducing additional frequency/voltage operating points
With the Richland APU line AMD added more frequency/voltage operating points, termed P points.
P points are used to adjust dynamically the operating point of the individual compute modules or the GPU to the
actual performance need, determined by the OS, as indicated in the Figure.
Figure 3.69 -
New frequency/voltage operating points (P points) enable an improved adjustment of the chosen operating point
to the actual performance need.
This results in a more efficient power management in terms of less power consumption for a given workload
(i.e. performance need).
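Why a finer set of operating points saves power can be illustrated with a small sketch. It assumes that the manager picks the slowest operating point that still meets the requested frequency, and that dynamic power scales roughly with f·V². The two P point tables (`COARSE`, `FINE`) and the requested frequency are hypothetical values, not AMD data.

```python
def pick_p_point(p_points, needed_mhz):
    """Slowest (frequency_mhz, voltage_v) point meeting needed_mhz, else the fastest point."""
    candidates = [p for p in p_points if p[0] >= needed_mhz]
    return min(candidates) if candidates else max(p_points)

def relative_power(p_point):
    """Relative dynamic power of an operating point, assuming P ~ f * V^2."""
    freq_mhz, volt = p_point
    return freq_mhz * volt ** 2

# Hypothetical P point tables: FINE adds intermediate points to COARSE.
COARSE = [(1400, 0.90), (2300, 1.05), (3200, 1.25)]
FINE = COARSE + [(1800, 0.97), (2700, 1.15)]

needed = 1700  # performance need expressed as a frequency in MHz (assumed)
for name, table in (("coarse", COARSE), ("fine", FINE)):
    chosen = pick_p_point(table, needed)
    print(name, chosen, round(relative_power(chosen), 1))
```

With the finer table the chosen point lies closer to the actual need, so the same workload runs at a lower frequency and voltage.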
6.1.10. An innovative suite of apps, typically available on the Richland A8 and
A10 models [87]
Figure 3.70 -
6.1.11. AMD Face Login [88]
• It is designed as a convenient tool to help log in to Windows and many popular web sites quickly, but it should
not be used to protect the computer and personal information from unwanted access.
• Only available on the Richland A10 and A8 APUs.
• Requires a webcam, and will only operate on PCs running Windows 7 or Windows 8 operating systems and
Internet Explorer version 9 or 10.
6.1.12. AMD Gesture Control [88]
It is designed to enable gesture recognition as a tool for controlling certain applications on the PC.
Only available on Richland A10 and A8 APUs.
Requires a web camera, and will only operate on PCs running Windows 7 or Windows 8.
Supported Windows desktop apps include: Windows Media Player, Windows Photo Viewer, Microsoft
PowerPoint and Adobe Acrobat Reader.
Supported Windows Store apps include: Microsoft Photos, Microsoft Music, Microsoft Reader and Kindle.
6.1.13. AMD Screen Mirror [88]
• It is designed to enable the transmission and display of the PC screen on other compatible networked "mirror"
devices.
• Only available on Richland A10, A8 and A6 APUs.
• AMD Screen Mirror supports almost all popular image, audio and video file formats as well as applications,
but will not mirror protected content.
6.1.14. AMD optimized games [88]
• Provides driver optimizations for a select set of games.
• The optimized-for-AMD software will be pre-loaded on select Richland A-Series APU-based notebooks or is
downloadable from AMD’s website.
6.2. 3.5.2 The Richland mainstream desktop APU line
6.2.1. Overview of the Richland mainstream desktop APU line
Figure 3.71 -
6.2.2. Positioning the Richland mainstream desktop and mobile APU lines [52]
Figure 3.72 -
6.2.3. Model numbers and expected key features of the Richland desktop APU
line [89] (Elite Performance platform)
Figure 3.73 -
6.3. 3.5.3 The Richland mobile APU line
3.5.3 Overview of the Richland mobile APU line
Figure 3.74 -
6.3.1. Positioning the Richland mobile APU line [52]
Figure 3.75 -
6.3.2. Model numbers and expected main features of the Richland mobile APU
line [84] (Elite performance APU platform)
Figure 3.76 -
The new chips have the same socket as the previous Trinity mobile APU line (FS1r2 socket), so they are drop-in
compatible with the previous platforms and OEMs can quickly ramp up systems based on Richland mobile
APUs.
6.3.3. AMD’s graphics performance figures of the Richland mobile APU line vs.
Intel’s Ivy Bridge-based mobile processors [83]
Figure 3.77 -
Remark
At its introduction in 6/2012 the Intel Core i7-3520M was the fastest dual-core, dual threads/core mobile
processor for laptops, based on the Ivy Bridge architecture, with the following key parameters: [90]
Figure 3.78 -
Chapter 4 - Third generation Family 15h Steamroller-based processor lines
1. 4.1 Overview of Family 15h Steamroller-based
processor lines (based on [1])
Figure 4.1 -
1.1. Brand names of Family 15h Steamroller-based processor
lines
Figure 4.2 -
1.2. Overview of AMD’s Family 15h Steamroller-based processor
lines
Figure 4.3 -
2. 4.2 The Steamroller Compute Module
2.1. Planned introduction of the Steamroller compute module
While introducing their Bulldozer-based high performance Zambezi DT line (10/2011) AMD revealed their plan
to introduce Steamroller-based compute modules in 2013, as shown below [44].
Figure 4.4 -
2.2. Preview of the Steamroller compute module (CM)
At the Hot Chips 2012 (8/2012) AMD’s CTO (Chief Technical Officer) gave a preview of the CM of
Steamroller, revealing some high level details of the microarchitecture, as shown in the next Figures [45].
2.3. Block diagram of the Steamroller compute module [45]
Figure 4.5 -
2.4. Improvements of the front-end part of the Steamroller
compute module [45]
Figure 4.6 -
2.5. Improving integer scheduling, integer execution and
reducing average load latency in the Steamroller compute
module [45]
Figure 4.7 -
2.6. Improving the power efficiency (performance/Watt figure) of
the Steamroller compute module [45]
Figure 4.8 -
2.7. Comparing the block diagrams of three generations of the
Family 15h Bulldozer design-1
A comparison of the block diagrams of the three subsequent generations of the Family 15h Bulldozer CM design
reveals that at the level of the high level block diagram AMD did not make any noticeable change, as shown below.
Figure 4.9 -
2.8. Improvements made in the microarchitecture of the
Steamroller compute module
Although not noticeable in the high level block diagram, AMD made a vast number of improvements practically
in all parts of the microarchitecture both in their Piledriver design over the Bulldozer CM, and then in their
Steamroller design over the Piledriver CM.
As far as the efficiency of the microarchitecture is concerned, these improvements aimed at eliminating
bottlenecks brought to light through extensive simulations while using a large number of relevant applications in
order to increase IPC.
Major changes of the microarchitecture are revealed in the "Preliminary BIOS and Kernel Developer's Guide
for AMD Family 15h Models 30h-3Fh Processors" .
From these changes, discussed in [48], we point out the following two:
• For integer micro-operations (Cops), increasing the dispatch bandwidth from 4 to 8 per cycle, while dispatching
up to 4 micro-operations per cycle to each core (as in previous designs).
• Dispatching and retiring up to 2 stores per cycle instead of just one.
Nevertheless, increasing the dispatch bandwidth required further enhancements in the related units to avoid
bottlenecks, including
• Increasing the L1 instruction cache size from 64 KB to 96 KB and changing its associativity from 2-way to 3-way.
• Increasing the size of associated internal buffers, such as
• Load queue (LDQ) size increased to 48, from 44.
• Store queue (STQ) size increased to 32, from 24.
• Increased L2 BTB size from 5K to 10K and from 8 to 16 banks.
• Increased PFB (Prefetch Buffer) size from 8 to 16 entries; while the 8 additional entries can be used either
for prefetching or as a loop buffer.
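A quick check of the figures listed above is sketched here. It shows that moving the L1 instruction cache from 64 KB 2-way to 96 KB 3-way keeps the capacity per way, and hence the number of sets, unchanged, and it computes the relative growth of the listed buffers. The 64-byte cache line size and the variable names are assumptions made for the illustration. The equality of the set counts does not depend on the line size.

```python
KIB = 1024

def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets of a set-associative cache (64-byte lines are an assumption)."""
    return size_bytes // (ways * line_bytes)

def growth_percent(old, new):
    """Relative increase from old to new in percent."""
    return (new - old) / old * 100.0

piledriver_sets = cache_sets(64 * KIB, ways=2)    # L1 I-cache before the change
steamroller_sets = cache_sets(96 * KIB, ways=3)   # L1 I-cache after the change
print("sets before/after:", piledriver_sets, steamroller_sets)

# Buffer sizes as listed in the text: (previous, Steamroller).
BUFFER_CHANGES = {
    "LDQ entries": (44, 48),
    "STQ entries": (24, 32),
    "L2 BTB entries (K)": (5, 10),
    "PFB entries": (8, 16),
}
for name, (old, new) in BUFFER_CHANGES.items():
    print(f"{name}: {old} -> {new} (+{growth_percent(old, new):.0f} %)")
```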
In addition, AMD introduced a large number of further enhancements to increase IPC, as listed below [48].
• Optimizations of certain features, including
• Reducing the number of FP pipeline stages from 4 to 3.
• Optimizing store to load forwarding.
• Improved loop prediction.
• Accelerate SYSCALL/SYSRET.
• Increased snoop tag throughput.
• Enhancing the microarchitecture by
• Virtualized interrupt controller.
• Support of the XSAVEOPT instruction.
Remark
In addition to significantly increasing the efficiency of the Steamroller compute unit, in the overall architecture
of the related APU processors AMD made also substantial changes, as will be briefly discussed in the Section
introducing the Kaveri APU (Section xx) [48].
3. 4.3 Steamroller-based Opteron server lines
3.1. Overview of AMD’s Family 15h Steamroller-based processor
lines
Figure 4.10 -
3.2. 4.3.1 Overview of Steamroller-based server lines (based on
[1])
Figure 4.11 -
3.2.1. Bringing forward the introduction of the Steamroller based server line
AMD’s 2012-2013 server roadmap (from 2/2012) still foresees Piledriver-based server lines for 2013, as
indicated in the next Figure.
3.2.2. AMD’s server roadmap from 2/2012 [27]
Figure 4.12 -
By contrast, in a presentation held for investors in 3/2013 AMD indicated that their Steamroller-based server
lines will be introduced as early as 2013, as shown below.
3.2.3. AMD’s indication of introducing the Steamroller based server line
already in 2013 [50]
Figure 4.13 -
It remains to be seen whether or not AMD can realize their intention to introduce their Steamroller-based server
lines already in 2013.
4. 4.4 Overview of Steamroller-based Kaveri desktop
and mobile APU lines (based on [1])
Figure 4.14 -
4.1. AMD’s Family 15h Steamroller-based mobile APU lines
(based on [1])
Figure 4.15 -
4.2. Positioning the Steamroller-based Kaveri APU line as
mainstream desktop line [51]
Figure 4.16 -
4.3. Positioning the Steamroller-based Kaveri APU as
performance/mainstream mobile line [51]
Figure 4.17 -
4.4. Revised positioning the Steamroller-based Kaveri APU line
[52]
Figure 4.18 -
4.5. Overview of AMD’s Family 15h Steamroller-based APU lines
Figure 4.19 -
4.6. Main components of Kaveri APUs
• 1-2 Steamroller Compute Modules (2-4 Steamroller CPU cores)
• GCN (Graphics Core Next) GPU (Presumably models of the Sea Islands (HD8000) family)
4.7. Architectural integration of the CPU and the GPU in Kaveri
APU lines
Main features
• Unified address space,
• The GPU accesses pageable system memory via CPU pointers, and memory is fully coherent between the
CPU and the GPU, as indicated in the next Figure.
4.8. Evolution of HSA in subsequent mobile APU lines [48]
Figure 4.20.
4.9. GPU co-processing without pointers and data sharing –
Without HSA [91]
Figure 4.21.
4.10. GPU co-processing with pointers and data sharing – With
HSA [91]
Figure 4.22.
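The difference between the two cases above (GPU co-processing without and with pointers and data sharing) can be mimicked in ordinary host code. The following Python sketch is only an analogy and does not use any AMD programming interface: the "GPU" is simulated by plain Python code, and all names (Node, run_without_hsa, run_with_hsa) are hypothetical, introduced here for illustration. Without HSA, a pointer-linked structure has to be flattened, copied to a separate device buffer, processed, and copied back. With HSA, the kernel is assumed to receive the same pointer as the CPU and to work on the structure in place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One element of a pointer-linked list living in system memory."""
    value: int
    next: Optional["Node"] = None


def build_list(values: List[int]) -> Optional[Node]:
    """Build a linked list on the 'CPU' side."""
    head: Optional[Node] = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_values(head: Optional[Node]) -> List[int]:
    """Walk the list and collect its values."""
    out = []
    node = head
    while node is not None:
        out.append(node.value)
        node = node.next
    return out


def run_without_hsa(head: Optional[Node]) -> int:
    """Simulated flow without HSA; returns the number of elements copied."""
    staging = to_values(head)                    # flatten: CPU pointers are unusable on the device
    device_buffer = list(staging)                # copy host -> device memory
    device_buffer = [2 * v for v in device_buffer]  # simulated kernel: double each value
    result = list(device_buffer)                 # copy device -> host memory
    node = head
    for v in result:                             # write results back into the original structure
        node.value = v
        node = node.next
    return len(staging) + len(result)


def run_with_hsa(head: Optional[Node]) -> int:
    """Simulated flow with HSA; the kernel follows the CPU pointers directly."""
    node = head
    while node is not None:                      # simulated kernel works in place
        node.value *= 2
        node = node.next
    return 0                                     # no staging copies were needed


if __name__ == "__main__":
    a = build_list([1, 2, 3])
    print(run_without_hsa(a), to_values(a))
    b = build_list([1, 2, 3])
    print(run_with_hsa(b), to_values(b))
```

Both variants produce the same result; the sketch only makes visible the extra flattening and copying that the model without shared pointers requires.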
Remark [48]
In order to increase the efficiency of the HSA APU, the internal interface that connects
• the GPU to the coherent system memory space and
• the CPU to the Frame Buffer part of the memory
(also termed the Fusion Control Link (FCL) or Onion interface, shown in the next Figure) has been widened
from 128 bits to 256 bits in both directions.
This enhancement increases the data transfer bandwidth between the CPU and the GPU significantly.
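As a rough illustration of why the wider interface matters: for a parallel link that moves one word per clock, peak bandwidth per direction is (width in bits / 8) × clock frequency, so doubling the width at an unchanged clock doubles the peak bandwidth. The short Python sketch below works this out. The clock value (ASSUMED_CLOCK_MHZ) and the one-transfer-per-clock model are assumptions made only for this example and are not figures from the source.

```python
def peak_bandwidth_gb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth per direction in GB/s, assuming one transfer per clock."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 / 1e9


# Hypothetical interface clock, chosen only to make the arithmetic concrete.
ASSUMED_CLOCK_MHZ = 500.0

old_bw = peak_bandwidth_gb_s(128, ASSUMED_CLOCK_MHZ)  # 128-bit interface
new_bw = peak_bandwidth_gb_s(256, ASSUMED_CLOCK_MHZ)  # 256-bit interface

if __name__ == "__main__":
    print(f"128-bit: {old_bw} GB/s, 256-bit: {new_bw} GB/s, ratio: {new_bw / old_bw}")
```

Whatever the real clock is, the ratio between the two widths stays at 2 as long as the clock and the transfer scheme are unchanged.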
4.11. Data transfers in the memory hierarchy of the Llano APU
[53]
Figure 4.23.
Chapter 5 - References
[3.1] Három referenciatabletet demonstrált az AMD az MWC-n, Prohardver, Febr. 27 2013,
http://prohardver.hu/hir/harom_referenciatabletet_demonstralt_amd_mwc.html
[3.2] Su L., Consumerization, Cloud, Convergence, AMD 2012 Financial Analyst Day, Febr. 2 2012
[3.3] Bright P., Can AMD survive Bulldozer's disappointing debut?, Ars Technica, Oct. 20 2011,
http://arstechnica.com/gadgets/news/2011/10/can-amd-survive-bulldozers-disappointing-debut.ars/1
[3.4] Heidekrüger A., CPU / GPU Technologies Now and Future, 2010,
http://www.hpcadvisorycouncil.com/events/2011/switzerland_workshop/pdf/Presentations/Day%202/10_AMD_CPU.pdf
[3.5] A Nagy AMD Llano APU Megateszt, Pro Hardver, Aug. 1 2011,
http://prohardver.hu/teszt/amd_llano_apu_megateszt/hammertol_huskyig.html
[3.6] White S., High-Performance Power-Efficient X86-64 Server and Desktop Processors, Using the core
codenamed „Bulldozer”, Aug. 19 2011, http://hotchips.org/uploads/hc23/HC23.19.9-DesktopCPUs/HC23.19.940-Bulldozer-White-AMD.pdf
[3.7] AMD Opteron Platform Overview and Product Strategy, April 2011, http://www.hpsp.ch/events/techcircle/pastEvents/server_storage_juni2011/images/hp_techcircle_bern_amd_part.pdf
[3.8] Valentine B., Introducing Sandy Bridge
[3.9] Shimpi A.L., The AMD FX (Bulldozer) Scheduling Hotfixes Tested, Jan. 27 2012,
http://www.anandtech.com/show/5448/the-bulldozer-scheduling-patch-tested
[3.10] Kanter D., AMD's Bulldozer Microarchitecture, Real World Technologies, Aug. 26 2010,
http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=10
[3.11] De Gelas J., Intel Core versus AMD’s K8 architecture, AnandTech, May 1 2006,
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=2748&p=1
[3.12] Expected Ivy Bridge performance, AnandTech Forums, Febr. 4 2012,
http://forums.anandtech.com/showthread.php?p=32952446
[3.13] Wikipedia, List of AMD Opteron microprocessors,
http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors#Opteron_4200series_.22Valencia.22_.2832_nm.29
[3.14] AMD Phenom™ II Processor Model Number and Feature Comparisons,
http://www.amd.com/us/products/desktop/processors/phenom-ii/Pages/phenom-ii-model-numbercomparison.aspx
[3.15] CAS2K11 / UCAR AMD Opteron Platform Overview and Product Strategy, 2011,
http://www.cisl.ucar.edu/dir/CAS2K11/Presentations/laurie/CAS2K11-AMD-September-2011-web.pdf
[3.16] Wesner S., HERMIT – Petaflop/s Performance for Engineering Applications, May 26 2011, http://www.tsystems-sfr.com/e/downloads/2011/vortraege/08Wesner.pdf
[3.17] Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies, Sept. 25 2010,
http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=10
[3.18] Bergman R., AMD Financial Analyst Day, Nov. 11 2009,
http://www.slideshare.net/AMDUnprocessed/amd-financial-analyst-day
[3.19] Schilling A., Bulldozer-Nachfolger kommen im Jahresrhythmus, Hardware Luxx, Oct. 12 2011,
http://www.hardwareluxx.de/index.php/news/hardware/prozessoren/20155-bulldozer-nachfolgerkommen-im-jahresrhythmus.html
[3.20] Shilov A., Ex-AMD Engineer Explains Bulldozer Fiasco: Lack of Fine Tuning, Xbit Labs, Oct. 13 2011,
http://www.xbitlabs.com/news/cpu/display/20111013232215_Ex_AMD_Engineer_Explains_Bulldozer_Fiasco.html
[3.21] Angelini C., Meet AMD Zambezi, Valencia, And Interlagos, Tom’s Hardware, Oct. 12 2011,
http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-10.html
[3.22] Goto H., AMD has pulled back the veil of Bulldozer chip 8-core version finally, 2011
http://pc.watch.impress.co.jp/docs/column/kaigai/20110830_473823.html
[3.23] AMD Opteron 4200 Series Processor, http://www.siliconmechanics.com/files/BulldozerValencialInfo.pdf
[3.24] BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors, 42301
Rev. 3.08, March 12 2012, http://support.amd.com/us/Processor_TechDocs/42301_15h_Mod_00h0Fh_BKDG.pdf
[3.25] McIntyre H., Arekapudi S., Busta E., Fischer T., Golden M., Horiuchi A., Meneghini T., Naffziger S.,
Vinh J., Design of the Two-Core x86-64 AMD „Bulldozer” Module in 32 nm SOI CMOS, IEEE Vol.
47 No. 1, Jan. 2012, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=06060836
[3.26] De Gelas J., Bulldozer for Servers: Testing AMD's "Interlagos" Opteron 6200 Series, AnandTech, Nov.
15 2011, http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200/5
[3.27] Shimpi A. L., AMD's 2012 - 2013 Server Roadmap: Abu Dhabi, Seoul & Delhi CPUs, AnandTech, Febr.
2 2012, http://www.anandtech.com/show/5488/amds-2012-2013-server-roadmap-abu-dhabi-seouldelhi-cpus
[3.28] White J., The AMD FX-8150 “Bulldozer CPU and Scorpius (FX) Platform Reviewed – Part One, Future
Looks, Oct. 14 2011, http://www.futurelooks.com/the-amd-fx-8150-bulldozer-cpu-and-scorpius-fxplatform-reviewed-part-one/1/
[3.29] AMD FX-8150 review, Channel Pro, Jan. 25 2012, http://www.channelpro.co.uk/reviews/6503/amd-fx8150-review
[3.30] Négy csúcs 990FX-es AM3+ alaplap a porondon, Pro Hardver, Dec. 19 2011,
http://prohardver.hu/teszt/negy_csucs_990fx-es_am3_alaplap_a_porondon/a_scorpius_platform.html
[3.31] Angelini C., Benchmark Results: PCMark 7, Tom’s Hardware, Oct. 12 2011,
http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-12.html
[3.32] AMD FX-8150 - Bulldozer im ausführlichen Test, HT4U.net, Oct. 12 2011,
http://ht4u.net/reviews/2011/amd_bulldozer_fx_prozessoren/
[3.33] Woligroski D., Best Gaming CPUs For The Money: April 2012, Tom’s Hardware, April 2 2012,
http://www.tomshardware.com/reviews/gaming-cpu-review-overclock,3106.html
[3.34] Rotem E., Power Management Architecture of the 2nd Generation Intel Core Microarchitecture, Formerly
Codename Sandy Bridge, Hot Chips, Aug. 2011,
http://www.hotchips.org/wpcontent/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10Rotem-Intel.pdf
[3.35] AMD | Bulldozer, Fusion, AM3+, FM1, and What's To Come, NeoGaf Belive,
http://www.neogaf.com/forum/showthread.php?p=31582137
[3.36] Shimpi A. L., The Bulldozer Review: AMD FX-8150 Tested, AnandTech, Oct. 12 2011,
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/4
[3.37] Prior J., AMD Bulldozer - FX 8150 Performance Review, Rage 3D, Oct. 11 2011,
http://www.rage3d.com/reviews/cpu/amd_fx_8150/index.php?p=2
[3.38] Angelini C., Enabling Turbo Core, Tom’s Hardware, Oct. 12 2011,
http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-8.html
[3.39] George V., 45nm Next Generation Intel Core Microarchitecture (Penryn), Hot Chips 19, 2007,
http://www.hotchips.org/archives/hc19/3_Tues/HC19.08/HC19.08.01.pdf
[3.40] Gelsinger P., “Invent the new reality,” IDF Fall 2008, San Francisco,
http://download.intel.com/pressroom/kits/events/idffall_2008/PatGelsinger_day1.pdf
[3.41] Glaskowsky P., Explaining Intel’s Turbo Boost technology, cnet News, Sept. 28 2009,
http://news.cnet.com/8301-13512_3-10362882-23.html
[3.42] McGrath D., Former IBM, Lenovo exec takes the helm at AMD, EE Times, Aug. 25 2011,
http://www.eetimes.com/electronics-news/4219307/AMD-appoints-former-Lenovo-exec-CEO
[3.43] Patel N., AMD lays off 1400 employees, including some senior execs, The Verge, Nov. 3 2011,
http://www.theverge.com/2011/11/3/2536299/amd-lays-off-1400-employees-including-some-seniorexecs
[3.44] AMD High-Performance Core Roadmap 2011-2014, 3D Center, Oct. 13 2011,
http://www.3dcenter.org/abbildung/amd-high-performance-core-roadmap-2011-2014
[3.45] Papermaster M., The Surround Computing Era, Hot Chips Symposium, Aug. 28 2012,
http://www.hotchips.org/wp-content/uploads/2012/08/HC24.28.key1-SurroundComputingEraPapermaster-AMD.pdf
[3.46] Butler M., Barnes L., Sarma D.D., Gelinas B., Bulldozer: An Approach to Multithreaded Compute
Performance, IEEE Micro, Vol. 31, Issue 2, March-April 2011
[3.47] Angelini C., The Piledriver Architecture: Improving On Bulldozer, Tom’s Hardware, Oct. 23 2012,
http://www.tomshardware.com/reviews/fx-8350-vishera-review,3328-3.html
[3.48] Pollice M., Analysis: AMD Kaveri APU and Steamroller Core Architectural Enhancements Unveiled,
BSN, March 6 2013, http://www.brightsideofnews.com/news/2013/3/6/analysis-amd-kaveri-apu-andsteamroller-core-architectural-enhancements-unveiled.aspx
[3.49] Shimpi A.L., The Vishera Review: AMD FX-8350, FX-8320, FX-6300 and FX-4300 Tested,
AnandTech, Oct. 23 2012, http://www.anandtech.com/show/6396/the-vishera-review-amd-fx8350fx8320-fx6300-and-fx4300-tested
[3.50] AMD Q1 2013 Investor Presentation, March 2013
[3.51] Su L., Consumerization, Cloud, Convergence, AMD 2012 Financial Analyst Day, Febr. 2 2012, AMD
Product and Technology Roadmaps,
http://ir.amd.com/phoenix.zhtml?c=74093&p=irol2012analystday
[3.52] Su L., Consumers and the World of Surround Computing, CES 2013 Press Conference, Jan. 7 2013,
http://www.slideshare.net/AMD/amd-ces-2013-press-conference
[3.53] Spafford K., Meredith J.S., Lee S., Li D., Roth P.C., Vetter J.S., The Tradeoffs of Fused Memory
Hierarchies in Heterogeneous Computing Architectures, May 15-17 2012,
http://ft.ornl.gov/~dol/papers/cf12_llano.pdf
[3.54] Bennett K., AMD FX-8350 Piledriver Processor - IPC and Overclocking, Hard OCP, Oct. 22 2012,
http://www.hardocp.com/article/2012/10/22/amd_fx8350_piledriver_processor_ipc_overclocking/#.UcLRYthVbps
[3.55] Walton J., The AMD Trinity Review (A10-4600M): A New Hope, AnandTech, May 15 2012,
http://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope
[3.56] Angelini C., AMD FX-8350 Review: Does Piledriver Fix Bulldozer's Flaws?, Tom’s Hardware, Oct. 22
2012, http://www.tomshardware.com/reviews/fx-8350-vishera-review,3328-3.html
[3.57] Payne D., Clock Design for SOCs with Lower Power and Better Specs, SemiWiki, Dec. 15 2011,
http://www.semiwiki.com/forum/content/917-clock-design-socs-lower-power-better-specs.html
[3.58] Clock Distribution, Acsel-lab.com, July 28 2004,
http://www.acsel-lab.com/Projects/clocking/clock_distribution.htm
[3.59] Restle P.J., A Clock Distribution Network for Microprocessors, IEEE Journal of Solid-State Circuits,
Vol. 36, No. 5, May 2001, http://weble.upc.es/ifsin/Block5/00918917.pdf
[3.60] AMD FX-8350: Vishera, a lánctalpas cölöpverő, Prohardver, Oct. 23 2012,
http://prohardver.hu/teszt/amd_fx8350_vishera_piledriver_teszt/piledriver_v2_bulldozer_kipofozva.html
[3.61] Chan S.C., A Resonant Global Clock Distribution for the Cell Broadband Engine Processor, IEEE
Journal of Solid-State Circuits, Vol. 44, No. 1, Jan. 2009,
http://www.ece.ncsu.edu/asic/ece733/2011/docs/ResonantClock.pdf
[3.62] Ishii A.T., A Resonant-Clock 200MHz ARM926EJ-S TM Microcontroller, ESSCIRC, 2009
[3.63] Sathe V., Arekapudi S., Ishii A., Ouyang C., Papaefthymiou M., Naffziger S., Resonant Clock Design for
a Power-efficient, High-volume x86-64 Microprocessor,
http://www.eecs.umich.edu/eecs/about/articles/2012/ISSCC_2012_Piledriver_final_submission.pdf
[3.64] Courtland R., Power-Saving Clock Scheme in New PCs, IEEE Spectrum, June 28 2012,
http://spectrum.ieee.org/semiconductors/processors/powersaving-clock-scheme-in-new-pcs
[3.65] Shilov A., AMD Quietly Starts to Sell Two New Six-Core and Quad-Core FX Processors, Xbit Labs,
March 11 2013,
http://www.xbitlabs.com/news/cpu/display/20130311070551_AMD_Quietly_Starts_to_Sell_Two_New_Six_Core_and_Quad_Core_FX_Processors.html
[3.66] Shilov A., GlobalFoundries Teams Up with Cyclos to Speed Up ARM Cortex-A15 Designs, Xbit Labs,
Febr. 6 2013,
http://www.xbitlabs.com/news/cpu/display/20130206061859_GlobalFoundries_Teams_Up_with_Cyclos_to_Speed_Up_ARM_Cortex_A15_Designs.html
[3.67] AMD Opteron 6300 Series Processor, Codenamed „Abu Dhabi”, Sales-in Presentation, July 2012,
http://www.abacus.cz/web/Konfig/AMD/Prodejn%C3%AD%20argumenty.pdf
[3.68] De Gelas J., AMD Launches Opteron 6300 series with "Piledriver" cores, AnandTech, Nov. 5 2012,
http://www.anandtech.com/show/6430/amd-launches-opteron-6300-series-with-piledriver-cores
[3.69] Hruska J., AMD Launches New Piledriver-Based Opteron 6300 Family, Hot Hardware, Nov. 5 2012,
http://hothardware.com/News/AMD-Launches-New-PiledriverBased-Opteron-6300-Family/
[3.70] Kozak A., AMD Desktop Platforms, 2012 AMD FX, Oct. 2012, http://www.xbitlabs.com/hot-gallery/20
[3.71] Shimpi A.L., AMD A10-5800K & A8-5600K Review: Trinity on the Desktop, Part 1, AnandTech, Sept.
27 2012, http://www.anandtech.com/show/6332/amd-trinity-a10-5800k-a8-5600k-review-part-1
[3.72] Woligroski D., AMD A10-4600M Review: Mobile Trinity Gets Tested, Tom’s Hardware, May 15 2012,
http://www.tomshardware.com/reviews/a10-4600m-trinity-piledriver,3202-4.html
[3.73] Wasson S., AMD's A10-4600M 'Trinity' APU reviewed, Tech Report, May 16 2012,
http://techreport.com/review/22932/amd-a10-4600m-trinity-apu-reviewed
[3.74] Naffziger S., US Patent 8010824, Aug. 30 2011
[3.75] Foley D., AMD’s „LLANO” Fusion APU, Hot Chips 23, Aug. 19 2011,
http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.9-Desktop-CPUs/HC23.19.930-LlanoFusion-Foley-AMD.pdf
[3.76] Angelini C., AMD Trinity On The Desktop: A10, A8, And A6 Get Benchmarked!, Tom’s Hardware,
Sept. 26 2012, http://www.tomshardware.com/reviews/a10-5800k-a8-5600k-a6-5400k,3224-3.html
[3.77] Két 65 wattos Trinity: A10-5700 és A8-5500, Prohardver, Febr. 27 2013,
http://prohardver.hu/teszt/ket_65_wattos_trinity_a10-5700_es_a8-5500/az_a10-5700_es_a8-5500.html
[3.78] Kozak A., AMD Desktop Platforms, 2012 AMD A-Series, Sept. 25 2012,
http://enfasys.net/ar/news/imagenes/pdf/review_amd.pdf
[3.79] Valich T., AMD Virgo Uncovered: Trinity Gives You Wings?, BSN, Sept. 27 2012,
http://www.brightsideofnews.com/news/2012/9/27/amd-virgo-uncovered-trinity-gives-you-wings.aspx
[3.80] Altavilla D., AMD Trinity A10-4600M Processor Review, Hot Hardware, May 15 2012,
http://hothardware.com/Reviews/AMD-Trinity-A104600M-Processor-Review/
[3.81] Richland Die Shot AMD APU, Tom’s Hardware, http://www.tomshardware.com/gallery/DieShot,0101375851-0-2-3-1-png-.html
[3.82] Valich T., AMD Launches “Elite APU” with Richland, Successor to Trinity, BSN, March 12 2013,
http://www.brightsideofnews.com/news/2013/3/12/amd-launches-e2809celite-apue2809d-withrichland2c-successor-to-trinity.aspx
[3.83] Sakr S., AMD Richland chips will arrive in notebooks next month, promise better graphics, battery life
and a few extras, Engadget, March 12 2013, http://www.engadget.com/2013/03/12/amd-richlanddetails/
[3.84] Hruska J., AMD’s new Richland APU boosts clocks and adds features, but is ultimately just a minor
Trinity refresh, Extreme Tech, March 12 2013, http://www.extremetech.com/computing/150451-amdsnew-richland-apu-boosts-clocks-and-adds-features-but-its-a-just-modest-refresh
[3.85] Broekhuijsen N., New Details Revealed on AMD's Upcoming Richland Chips, Tom’s Hardware, March
12 2013, http://www.tomshardware.com/news/Richland-APU-AMD,21318.html
[3.86] AMD Richland APU Preview: Trinity Gets a Facelift, Hardware Canucks, March 10 2013,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/60112-amd-richland-apupreview-trinity-gets-facelift.html
[3.87] Richland, Kaveri, Kabini & Temash; AMD’s 2013 APU Lineup Examined, Hardware Canucks, Jan. 9 2013,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/59053-richland-kaverikabini-temash-amd-s-2013-apu-lineup-examined.html
[3.88] New AMD A-Series APU Offers Mobile PC Users Innovative Experiences, Elite Graphics Performance
and Longer Battery Life, March 12 2013, http://www.amd.com/us/press-releases/Pages/new-amd-aseries-2013mar12.aspx
[3.89] Hagedoorn H., AMD Richland Desktop APU Lineup Details, Guru3D, Jan. 24 2013,
http://www.guru3d.com/news_story/amd_richland_apu_lineup_leaked.html
[3.90] Hinum K., Intel Core i7-3520M, Notebook Check, May 3 2012, http://www.notebookcheck.net/IntelCore-i7-3520M-Notebook-Processor.74446.0.html
[3.91] Rogers P., Macri J., Marinkovic S., AMD Heterogeneous Uniform Memory Access, Apr. 30 2013,
http://events.csdn.net/AMD/130410%20-%20hUMA_v6.6_FINAL.PDF