Post-Dennard Scaling and the final Years of Moore`s Law
Transcription
Post-Dennard Scaling and the final Years of Moore`s Law
Post-Dennard Scaling and the final Years of Moore’s Law Consequences for the Evolution of Multicore-Architectures Prof. Dr.-Ing. Christian Märtin [email protected] Technical Report Hochschule Augsburg University of Applied Sciences Faculty of Computer Science September 2014 Forschungsbericht 2014 Informatik und Interaktive Systeme Post-Dennard Scaling and the final Years of Moore’s Law Consequences for the Evolution of Multicore-Architectures This paper undertakes a critical review of the current chal- the effects of such pessimistic forecasts will affect embe- lenges in multicore processor evolution, underlying trends dded applications and system environments in a milder and design decisions for future multicore processor im- way than the software in more conservative standard plementations. It is a short version of a paper presented at and high-performance computing environments. In the following we discuss the reasons for these the Embedded World 2014 conference [1]. For keeping up with Moore´s law during the last decade, the VLSI developments together with other future challenges for scaling rules for processor design had to be dramatically multicore processors. We also examine possible solution However, in a similar way as the necessary transition from complex s However, in ato similar ay as topics. the necessary discussing transition from complex s approaches some w of operating frequencies to the multicore When processors with moderate fre operating frequencies to multicore processors with moderate fr the performance of thermal multicore systems, we must a complex sin dark silicon will be unavoidable and chip architects will exponentially growing design power (TDP) have of the exponentially growing thermal design power (TDP) of the complex sin linear performance improvements [3], the ongoing multicore evolution look on adequate multicore performance models that have to find new ways for balancing further performancelinear performance improvements [3], the ongoing multicore evolution will undergo dramatic changes during the next several years [4]. As ne both consider the effects of during Amdahl’s law on several different gains, energy efficiency and software complexity. The pa-will undergo dramatic changes next [4]. As n show [5], [6], power problems and the lthe imited degree of iyears nherent applica [ 5], [ 6], p ower p roblems a nd t he l imited d egree inherent aTpplic multicore architectures and workloads, and on theopf rocessors. per compares the various architectural alternatives on theshow percentages of dark or dim silicon in future multicore his m f dark r off dim silicon iwith n future ulticore processors. This Itm consequences ofothese models to multicore basis of specific analytical models for multicore systems.percentages have to be oswitched or operated at regard low mfrequencies all the time. have to be switched off or operated at low frequencies all the time. I effects of and such energy pessimistic forecasts We will affect embedded power requirements. use the models alsoapplication effects of way such than pessimistic forecasts will affect embedded application milder the software in more conservative standard an Introduction to introduce the different architectural classes for multimilder way than the software in more conservative standard an environments. core processors. More than 12 years after IBM started into the age of environments. As will be shown, the trend towards In the following we discuss the reasons for these developments toge heterogeneous and/or dynamic architectures and multicore processors with the IBM Power4, the first In more the following we discuss he reasons for these dsolution evelopments toge for multicore processors. We talso examine possible approach innovative design directions mitigatepossible several of the must commercial dual core processor chip, Moore´s law ap- for multicore processors. We also examine solution approach discussing the performance of can multicore systems, we have discussing the problems. performance of multicore systems, we must have expected pears still to be valid as demonstrated by Intel´s fast track performance models that both consider the effects of Amdahl’s law on and workloads, and on the consequences of these models with regard from 32 to 22nm mass production and towards its new performance models that both consider the effects of Amdahl’s law on requirements. We and use the consequences models also to the different ar and workloads, and on the of introduce these models with regard Moore´s law dark silicon 14nm CMOS process with even smaller and at the same processors. As will be shown, the trend towards more heterogeneous a requirements. We use the models also to introduce the different a time more energy efficient structures every two years processors. As will be shown, the trend towards more heterogeneous a The major reason for thecan current situation and innovative design directions mitigate several of the the expected problem [2]. At the extreme end of the performance spectrum, innovative upcoming trend towards large areas of dark silicon design directions can mitigate several of the are expected proble changed. In future multicore designs large quantities of Prof. Dr.-Ing. Christian Märtin Hochschule Augsburg Fakultät für Informatik Telefon +49 (0) 821 5586-3454 [email protected] Fachgebiete ■ ■ ■ ■ Rechnerarchitektur Intelligente Systeme Mensch-Maschine-Interaktion Software-Technik Moore´s law is also expressed by the industry´s multi- the new scaling rules for VLSI design. Dennard’s scaling law ain nd dark ilicon billion transistor multicore and many-core server chips 2 rulesMoore´s [7] were perceived 1974 and shave held for more The m ajor r eason f or t he c urrent s ituation nd the upower pcoming trend to 2 Moore´s l aw a nd d ark s ilicon and GPUs. Obviously the transistor raw material needed than 30 years until around 2005. As is well aknown, for integrating even more processor cores and larger caches onto future chips for all application areas and performance levels is still available. are the ajor new scaling frules for VLSI design. Dennard’s rules [7] wt The or tbe he current sas: ituation and the scaling upcoming trend in m CMOSreason chips can modeled held for more than 30 years until around 2005. As is well known, powe are the new scaling rules for VLSI design. Dennard’s scaling rules [7] w as: held for more than 30 years until around 2005. As is well known, powe 𝑃𝑃 = 𝑄𝑄𝑄𝑄𝑄𝑄𝑉𝑉 ! + 𝑉𝑉𝐼𝐼 (1) (1) as: (1) 𝑃𝑃 = 𝑄𝑄𝑄𝑄𝑄𝑄𝑉𝑉 ! + 𝑉𝑉𝐼𝐼!"#$#%" !"#$#%" However, in a similar way as the necessary transition (1) 𝑃𝑃 = 𝑄𝑄𝑄𝑄𝑄𝑄𝑉𝑉 ! + 𝑉𝑉𝐼𝐼! !"#$#%" ∝ 𝑄𝑄 is the number of transistor devices, he operating frequency of t operating Q number= of (transistor 𝑟𝑟) =devices, 𝑟𝑟 ∝/! 𝑓𝑓 tthe (2) 𝑃𝑃 is =the 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 could be neglected 𝐼𝐼and !"#$#%" ting frequencies to multicore processors with moderate the operating voltage. The leakage current of the chip, C the capacitance V the 𝑄𝑄 frequency is the number of transistor devices, 𝑓𝑓 the operating frequency of larger than 65nm. frequencies was caused by the exponentially growing the operating voltage. The leakage current operating voltage. The leakage current 𝐼𝐼!"#$#%" could be could be neglected With Dennard’s scaling rules the total chip power for a given area si larger than 65nm. thermal design power (TDP) of the complex single core neglected until 2005 with device structures larger than generation to process generation. At the same time, with a scaling fact processors for reaching linear performance improve65nm. With Dennard’s scaling rules the total chip power for a given area s at a rate of 1/𝑆𝑆 (the scaling ratio), transistor count doubled (Moore´s la ments [3], the ongoing multicore evolution has again generation to process generation. At the same time, with a scaling fac With Dennard’s scaling rules the total chip power by 40 % [5] every two years. With feature sizes below 65nm, these rul at a rate of (the scaling ratio), transistor count doubled (Moore´s hit the power wall and will undergo dramatic changes because for a given size stayed the same from process of 1/𝑆𝑆 the area exponential growth of the leakage current. To lessen by 40 % [5] every two years. With feature sizes below 65nm, these ru moving to 45nm, introduced extremely efficient new Hafnium based during the next several years [4]. As new analytical generation to process generation. At the same time, of the because exponential growth of the leakage current. To lessen When moving to 22nm, optimized models and studies show [5], [6], power problems and processor. with a scaling factor S=√2, featureIntel size shrinked at the a switching pr moving to 45nm, introduced new Hafnium based transistors that are currently uextremely sed in the Hefficient aswell processors and will also the limited degree of inherent application parallelism processor. rate of When 1/S (themoving scaling to ratio), transistor count doubled 22nm, Intel optimized the switching However, even these remarkable breakthroughs could not revive thep will lead to rising percentages of dark or dim silicon intransistors (Moore´s the frequency increased byprocessors 40 % [5] and will als that law) are and currently used in the Haswell because no further scaling of the threshold voltage is possible as long a future multicore processors. This means that large parts at However, every two years. With feature sizes below 65nm, even these very remarkable breakthroughs cthese ould not revive th the already low level. Therefore operating voltage has current of the chip have to be switched off or operated at low because no further scaling of the threshold voltage is possible as long a rules1could longer be sustained, because of the expoaround V for sno everal processor chip generations. current already very level. Therefore voltage has frequencies all the time. It has to be studied, whether at the !low like nential growth of the leakage current. To lessen the leaWith Post-‐Dennard scaling, Dennard soperating caling of t (𝑓𝑓) = processor with the number 𝑆𝑆!"#$!! ! c hip around 1 V f or s everal g enerations. !!! ! frequency with 𝑆𝑆 from generation to generation, i.e. the potential com from complex single core architectures with high opera- 2 ! ! Post-‐Dennard scaling, like generations. with Dennard scaling the number oaf 𝑆𝑆 With or 2.8 between two process Transistor capacitance frequency with 𝑆𝑆 f rom generation to generation, i.e. the potential com scaling regimes. However, as threshold and thus operating voltage can 𝑆𝑆 !longer possible to keep the power envelope constant from generation t or 2.8 between two process generations. Transistor capacitance a ars [4]. As new analytical models and studies erent application parallelism will lead to rising essors. This means that large parts of the chip ll the time. It has to be studied, whether the ures with high and system environments in a d applications used by and the high-‐performance computing standard s for reaching ower wall and pments together with other future challenges ls and studies approaches l on lead to rising to some of the topics. When must have a look on adequate multicore rts of the chip dahl’s law on different multicore architectures , whether the s onments with regard in a to multicore power and energy different architectural classes for multicore ce computing erogeneous and/or dynamic architectures and ected problems. ure challenges topics. When ate multicore architectures ming trend towards large areas of dark silicon er and energy ng rules [7] were perceived in 1974 and have for multicore kage current, Intel, when moving to 45nm, introduced known, power in CMOS chips can be modeled hitectures and extremely efficient new Hafnium based gate insulators for the Penryn processor. When moving to 22nm, Intel optimized the switching process by using new 3D Finrequency of the chip, 𝐶𝐶 the capacitance and 𝑉𝑉 FET transistors that are currently used in the Haswell be neglected until 2005 with device structures of dark silicon processors and will also be scaled down to 14nm. 974 and have n be modeled However, even these remarkable breakthroughs given area size stayed the same from process could revive the scaling of the operating voltage, a scaling factor 𝑆𝑆 =not2 , feature size shrinked d (Moore´s law) and the frequency increased because no further scaling of the threshold voltage is nm, these rules could no longer be sustained, possible as long as the operating frequency is kept at t. To lessen 𝑉𝑉the acitance and leakage current, Intel, when thegate current already very level. Therefore operating fnium based insulators for the low Penryn vice structures switching process new at3D FinFET value of around 1 V voltage by has using remained a constant s and will also scaled processor down to 14nm. forbe several chip generations. from process not r evive t he s caling o f the operating scaling, voltage, like with Dennard e size shrinked With Post-Dennard ble as long as the operating frequency is kept ncy increased scaling the number of transistors grows with S2 and the voltage has remained at a constant value of be sustained, frequency with S from generation to generation, i.e. the t, Intel, when 3 computing or the Penryn e number of potential transistors grows with performance 𝑆𝑆 ! and the increases by S or 2.8 ew 3D FinFET between two process generations. Transistor capacitance otential computing performance increases by ! 14nm. also scales both scaling regimes. Hoapacitance also scales down down to to uunder nder both ! rating voltage, voltage cannot be scaled any longer, it is no wever, as threshold and thus operating voltage cannot quency is kept generation to generation and simultaneously stant value be of scaled any longer, it is no longer possible to keep the with Dennard scaling power remains constant power envelope constant from generation to generation 2 per generation for the wer increase of 𝑆𝑆 ! = ! ! and simultaneously reach theof potential performance omputing resources decreases with a rate the ith 𝑆𝑆 and !! e increases by improvements. Whereas with Dennard scaling power ! under both remains constant between generations, Post-Dennard ! onger, it is no Scaling leads to a power increase of S2 = 2 per generatiimultaneously on for the same die area [8]. At the same time utilization mains constant of a chip’s computing resources decreases with a rate of ration for the ! ith a rate of ! per generation. ! This means that at runtime large quantities of the Forschungsbericht 2014 Informatik und Interaktive Systeme industry. With the advances in transistor and process technology, new architectural approaches, larger chip sizes, introduction of multicore processors and the partial use of the die area for huge caches microprocessor performance increased about 1000 times in the last two decades and 30 times every ten years [4]. But with the new voltage scaling constraints this trend cannot be maintained. Without significant architectural breakthroughs Shekar Borkar based on Intel’s internal data only predicts a six times performance increase for standard multicore based microprocessor chips in the decade between 2008 and 2018. This projection corresponds to the theoretical 40% increase per generation and over five generations in Post-Dennard scaling that amounts to S5=5.38. Therefore, computer architects and system designers have to find effective strategic solutions for handling these major technological challenges. Formulated as a question, we have to ask: Are there ways to increase performance by substantially more than 40% per generation, when novel architectures or heterogeneous systems are applied that are extremely energy-efficient and/or use knowledge about the software structure of the application load to make productive use of the dark silicon? Modeling Performance and Power of Multicore Architectures When Gene Amdahl wrote his famous article on how the amount of the serial software code to be executed transistors on the chip have to be switched off com- influences the overall performance of parallel compu- pletely, operated at lower frequencies or organized in ting systems, he could not have envisioned that a very completely different and more energy efficient ways. fast evolution in the fields of computer architecture and For a given chip area energy efficiency can only be microelectronics would lead to single chip multipro- improved by 40 % per generation. This dramatic effect, cessors with dozens and possibly hundreds of processor called dark silicon, already can be seen in current cores on one chip. However, in his paper, he made a multicore processor generations and will heavily affect visionary statement, valid until today, that pinpoints the future multicore and many-core processors. Appendix A current situation: “the effort expended on achieving high shows the amount of dark and dim (i.e. lower frequency) parallel processing rates is wasted unless it is accompa- silicon for the next two technology generations. nied by achievements in sequential processing rates of These considerations are mainly based on physical laws applied to MOSFET transistor scaling and CMOS very nearly the same magnitude” [10], [11]. Today we know, that following Pollack’s rule [4], technology. However, the scaling effects on perfor- single core performance can only be further improved mance evolution are confirmed by researchers from with great effort. When using an amount of r transistor 3 considerations are mainly based on physical laws applied to MOSFET scaling internal and data only predicts a six significant architectural breakthroughs Shekar Borkar transistor based on theoretical Intel’s aling constraints this trend cannot be These maintained. Without 2008 and 2018. This projection corresponds to the 40% increase per generation an CMOS technology. However, the scaling effects on performance evolution are confirmed by researchers hekar Borkar based on Intel’s internal data only predicts a six times performance increase for standard multicore based microprocessor chips in the decade between generations in Post-‐Dennard scaling that amounts to 𝑆𝑆 ! = 5.38. from industry. With the advances in 2transistor process technology, approaches, 2008 and 018. This and projection corresponds to new the tarchitectural heoretical 40% increase per generation and over five multicore based microprocessor chips in the decade between ! area fdesigners architects have to find effective strategic so larger chip sizes, of multicore rocessors acomputer nd the partial se of tand he ie or huge caches ds to the theoretical 40% increase per generation and introduction over five generations = 5.38. in PpTherefore, ost-‐Dennard scaling that aumounts to d𝑆𝑆system microprocessor performance increased ahandling bout 1000 times in the last two decades and 30 times every these major technological challenges. Formulated as ten a question, we have to ask: Are t mounts to 𝑆𝑆 ! = 5.38. Forschungsbericht 2014 computer architects system designers have to find effective strategic solutions for years [4]. But with the new Therefore, voltage scaling constraints this and trend cannot be more maintained. Without to increase performance by substantially than 40% per generation, when novel archit handling these major technological challenges. Formulated as a predicts question, e have to ask: Are there ways ystem Informatik designers have to find Interaktive effective strategic solutions for und Systeme significant architectural breakthroughs Shekar Borkar based on Intel’s internal data only a wsix heterogeneous systems are applied that are extremely energy-‐efficient and/or use knowledge by substantially more than 40% per pgeneration, when novel or nges. Formulated as a question, we htimes performance increase for standard multicore based microprocessor chips in the decade between ave to ask: Are there ways to increase performance software structure of the a pplication l oad t o m ake roductive u se of the dark architectures silicon? more than 40% per generation, when novel or heterogeneous systems are applied that are extremely energy-‐efficient and/or use knowledge about the 2008 and 2architectures 018. This projection corresponds to the theoretical 40% increase per generation and over five software structure of the application load to make productive use of the dark silicon? are extremely energy-‐efficient and/or use knowledge about the generations in Post-‐Dennard scaling that amounts to 𝑆𝑆 ! = 5.38. to make productive use of the dark silicon? 3 Modeling Performance nd Power of M have to find effective a strategic solutions for ulticore Architecture Therefore, computer architects and system designers When Gene Amdahl wrote his famous article on how the of the serial software c handling these major technological c hallenges. F ormulated a s a q uestion, w e h ave t o a sk: A re t here w ays 3 Modeling Performance and Power of Multicore amount Architectures influences executed the generation, overall performance of parchitectures arallel computing to increase performance by substantially more than 40% per when novel or systems, he could not have and Power of Multicore A rchitectures When Gene Amdahl his famous article the amount of the serial software code to be lea in the that a very fwrote ast evolution fields on of chow omputer architecture and microelectronics would heterogeneous systems are applied that are extremely energy-‐efficient and/or use knowledge about the the overall performance of parallel computing systems, he could not have envisioned s article on how the amount of the serial software code to be executed influences chip multiprocessors with dozens and possibly hundreds of processor cores on one chip. Howe software structure of the application load to make productive use of the dark silicon? evolution in the fields f cSymmetric omputer multicore architecture and icroelectronics would lead single L1 nce of parallel computing systems, he could not have envisioned that a very fast Fig.o1. processor: eachmblock is a basic core equivalent (BCE)to including paper, he made a visionary statement, valid until today, that pinpoints the current situation: are not modeled. The computing performance of a BCE is normalized to 1. omputer architecture and microelectronics w ould lead to single chip multiprocessors with dozens and possibly hundreds of processor cores on one chip. However, in his Fig. 1. Symmetric multicore processor: each block isexpended a basic core equivalent (BCE) including L1 and L2 caches. L3 cache andis onchip-network on achieving high parallel processing rates wasted unless it is accompanied by ach paper, he made a visionary statement, valid until today, that pinpoints the current situation: “the effort sibly hundreds of processor cores on one chip. However, in his The computing performance of a BCE normalized to p 1.rocessing rates of very nearly the same magnitude” [10], [11]. in issnd equential 3 are not modeled. Modeling Performance Power of M3.1 ulticore rchitectures Standard M odels expended on aachieving high parallel p rocessing rA ates is w asted unless is accompanied by aL1 chievements lid until today, that pinpoints the current situation: “the effort Fig. 1. Symmetric multicore processor: each block is a basic coreit equivalent (BCE) including and L2 caches. L3 Today we know, that following Pollack’s rule [4], single are rnot modeled. The performance a BCE is normalized to be 1. core performance can only When bStandard Gene Amdahl famous article on how amount of the serial software code in shis equential processing ates othe f very ncomputing early the same mof agnitude” [10], [to 11]. essing rates is wasted unless y achievements it is accompanied In 2008 Hill and Marty started this discussion with performance mo 3.1 Mwrote odels improved with great effort. When using an amount of 𝑟𝑟nvisioned transistor resources, scalar performan influences the overall performance of parallel computing systems, he symmetric could not h(fig. ave e1), ly the same magnitude” [10], [11]. executed architectures: asymmetric (fig. and In 2008 Hill and Marty started this discussion with performance models for three types of multicore Today we know, that following Pollack’s rule [4], single core performance can 2), only be dynamic further ( Fig. 1. Symmetric each block a basic core (BCE) including L1the and L2 caches. L3 cache and where onchip-network that multicore a very scalar fprocessor: ast evolution in tishe fields oequivalent f c𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 omputer architecture and m icroelectronics would lead to Mtime odels 𝑟𝑟 3.1 = 𝑟𝑟. Standard At same single core power 𝑃𝑃for s(ingle or the TDP) rises 𝑃𝑃 = 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 as in Amdahl’s law, the Speedup S achieved by with characterize resources, performance rises to evaluated their speedup models different workloads, architectures: symmetric (fig. 1), asymmetric (fig. 2), and dynamic (fig. 3) multicore chips [13] and improved with great effort. When using an amount of 𝑟𝑟 transistor resources, scalar performance rises to ack’s rule [4], single core areperformance can only be further not modeled. The computing performance of a BCE is normalized to 1. chip multiprocessors with dozens and possibly hundreds of processor cores on one chip. However, in his 𝛼𝛼 = 1.75 or nearly the square of the targeted scalar performance. For the rest of this paper w In 2008 Hill and Marty started this discussion with performance models for three code that can be parallelized as in Amdahl’s law, where the Speedup parallel At the same their time speedup single core power P (or rises time evaluated models workloads, characterized the is respective degree 𝑓𝑓 of with 𝑃𝑃 = 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 ! [3] with 𝑆𝑆 n amount of 𝑟𝑟 transistor resources, scalar performance rises to 𝑟𝑟for = different 𝑟𝑟. the At TDP) the same single processing core by power 𝑃𝑃defi (or ned the as: TDP) rises 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 paper, he made a visionary statement, valid until today, that pinpoints the current situation: “the effort that 𝑃𝑃 includes dynamic as well as leakage power. Thus, there is a direct relation between th architectures: symmetric (fig. 1), asymmetric (fig. 2), and dynamic (fig. 3) multic defined ! code that can be parallelized as in Amdahl’s law, where the Speedup 𝑆𝑆a s: achieved by parallel processing is 𝛼𝛼 = 1.75 o r nearly the square of the targeted scalar performance. For the rest of this paper we assume e core power 𝑃𝑃 (or the TDP) with 𝑃𝑃 = 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 [ 3] with with with or nearly the square 3.1 rises Standard M odels expended on achieving high parallel processing rates is wtasted unless it resources is accompanied bdifferent y achievements sequential core, he their transistor a!nd the performance. Power characterized can be expressed, as sres ho evaluated speedup models for workloads, by the defined as: that includes dynamic as well as leakage power. Thus, there is a direct relation between the TDP of a eted scalar performance. For the rest of this paper we assume (3) (3) 𝑆𝑆!"#$!! In 2008 Hill and Marty started this discussion with performance models for three types of multicore of targeted scalar performance. the rest of this in the sequential processing rates of v𝑃𝑃ery nFor early the same magnitude” [10], [(𝑓𝑓) 11]. = by: ! code that can be parallelized as in Amdahl’s law, where the Speedup 𝑆𝑆 a chieved by p !!! ! sequential (fig. core, the transistor resources nd the performance. ower can be expressed, as shown in [12] e power. Thus, there is a direct relation between the TDP of a ! 1), asymmetric architectures: symmetric dynamic 3) amulticore chips ! [13] Pand as: ∝(fig. =know, (fig. following dynamic 2), and defined (3) 𝑆𝑆!"#$!! we (𝑓𝑓) assume that well !as ! includes Today Pollack’s rule [4], performance can only be further by: = ( single 𝑟𝑟) =core 𝑟𝑟 ∝/! respective degree (2) 𝑃𝑃 =as 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 nd the performance. Power can be epaper xpressed, awe s shown in !that [P 12] !!! evaluated their speedup models for different workloads, characterized by the 𝑓𝑓 of ! ! equation ƒ is improved with great effort. When using an amount of 𝑟𝑟 transistor resources, scalar performance rises to In this equation 𝑓𝑓 i s the fraction of possible parallel work and 𝑛𝑛 is t leakage power. Thus, there is a direct relation between In this the fraction of ! (2) possible (3) parallel work 𝑆𝑆=!"#$!! ! code that can be parallelized as in Amdahl’s law, where the Speedup 𝑆𝑆 a=chieved by parallel processing is = (When Grochowski derived his equation from comparing Intel processor generations from t 𝑟𝑟)∝ power 𝑟𝑟 ∝/!𝑃𝑃 (𝑓𝑓) 𝑃𝑃 =time 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃single !!! ! rises Amdahl’s law uses fixed pplication sizes, the Amdahl’s achievable maximum spee 𝑟𝑟 of = a sequential 𝑟𝑟. At the same core (or the TDP) with 𝑃𝑃 of =a available 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 ! [3] with As 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 ! the number In this equation 𝑓𝑓 i s the fraction of possible parallel work and 𝑛𝑛 i s the number of available cores. As the TDP core, the transistor resources and is cores. law defined a s: ! (2) 𝛼𝛼 Amdahl’s = 1.75 or nearly the square of the targeted scalar performance. For the rest of this paper we assume the the Pentium 4 Cedarmill (all peedup, of them normalized to l65nm = multicore . With a evolution fixed appliw multicores are uw sed, is always imited technology), to 𝑆𝑆o!"# application izes, achievable maximum hether multiprocessors r maximum ! law uses fixed !!! and the performance. Power can be sexpressed, as shown When Grochowski derived his equation from comparing Intel processor generations from the i486 to uses fisxed application sizes, the achievable ! In this equation 𝑓𝑓 i s the fraction of possible parallel work and 𝑛𝑛 is the number o that 𝑃𝑃 i ncludes dynamic as well as leakage power. Thus, there is a direct relation between the TDP of a gathering momentum and the Post-‐Dennard scaling effects only were in their initial phase. W = re used, i s always limited to 𝑆𝑆 (3) 𝑆𝑆!"#$!! (𝑓𝑓)multicores !a = (all . Wof ith them a fby Hill and Marty also reach their performance limits relatively soon, an ixed application to size, the technology), models presented the Pentium a4 nd Cedarmill normalized 65nm evolution was slowly !"# n from comparing Intel processor generations from the i486 to !!! in [12]!!! by:!c!ore, the transistor multiprocessors or[multicore multicores arespeedup, law speedup, ses fixed pplication sizes, he achievable maximum whethe sequential resources the pAmdahl’s erformance. Puower can whether bae expressed, as sthown in 12] technology, leakage currents and not further down-‐scalable supply voltages would forbid the large number of cores does not s𝑓𝑓eem !adequate. rmalized to 65nm technology), multicore evolution was slowly gathering momentum and the Post-‐Dennard scaling effects only were in their initial phase. With current by Hill and Marty also reach their performance limits relatively soon, and for workloads with < 0.99 a by: entire die multicores area of mainstream microprocessor chips for a single core processor. Therefore . W ith a afixed application size, th are used, is always limited to to 𝑆𝑆!"# = used, is always limited With fi xed !!! technology, leakage currents and not further down-‐scalable supply voltages would forbid the use of the ard scaling effects only were in their initial phase. With current In this equation 𝑓𝑓 is the fraction of possible parallel work and 𝑛𝑛 fis the number of available cores. As large number of cores does not seem adequate. multicore models or existing or envisioned architectural alternatives have to consider the impl ! ∝ ∝/! by Hill and Marty also reach their performance limits relatively soon, and for worklo application size, the models presented by Hill and Marty (2) = ( 𝑟𝑟) = 𝑟𝑟 (2) 𝑃𝑃 = 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 entire die area of mainstream microprocessor chips for a single core processor. Therefore analytical her down-‐scalable supply voltages would forbid the use of the Amdahl’s law uses fixed application sizes, the achievable maximum speedup, whether m ultiprocessors or different workload characteristics, both, to the performance, and the power requirement ! of cores not seem adequate. or ealarge xisting opplication r envisioned adrchitectural aplternatives h ave to consider the and implications of cessor chips for a single core processor. analytical also reach their performance limits relatively soon, odels . Wfith fixed naumber size, toes he m odels resented multicores are uTherefore sed, is always limited tmulticore o 𝑆𝑆!"# =m architectures. !!! When Grochowski derived his equation from comparing Intel processor generations from the i486 to characteristics, both, to the performance, and the power requirements of these d architectural alternatives have to consider the implications of different workload for workloads with evolution a large number of cores When Grochowski derived his from comparing by Hill and Marty also reach their performance limits relatively soon, and for workloads with 𝑓𝑓 < 0.99 awas requirements the Pentium 4 Cedarmill of equation them normalized to 65nm technology), multicore slowly architectures. to the performance, and the power of these (all large number oprocessor f cores does not seem adequate. Intel generations from the i486 to the Pentium does not seem adequate. gathering momentum and the Post-‐Dennard scaling effects only were in their initial phase. With current 4technology, leakage currents and not further down-‐scalable supply voltages would forbid the use of the Cedarmill (all of them normalized to 65nm technolo entire die area of mainstream microprocessor chips for a single core processor. Therefore analytical gy), multicore evolution was slowly gathering momen multicore models for existing or envisioned architectural alternatives have to consider the implications of tum and the Post-Dennard scaling both, effectsto only different workload characteristics, the were performance, and the power requirements of these architectures. in their initial phase. With current technology, leakage currents and not further down-scalable supply voltages Fig. 2. Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 4 and 12 BCE would the use of the entire die area of main Fig. 2. forbid Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 4 and 12 BCEs. In the engineering and embedded system domains there are many s stream microprocessor chips for a single core processor. with moderate data size, irregular and data flow structure 4 and 12 BCEs. Fig. 2. Asymmetric multicore processor consisting of a complex core control with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = In the engineering and embedded system domains there are many standard and real time workloads Therefore analytical multicore models for existing or Fig. 2. Asymmetric multi- the easily exploitable parallelism. For these workload characteristics, with moderate data size, irregular control and data flow structures and with a limited degree of core processor consisting envisioned architectural have to characteristics, consider the the Marty with some extensions can lead to valuable insights of how diffe In the engineering and embedded system domains there are many standard and exploitable parallelism. alternatives For these workload easily understandable study of Hill and of a complex core with 4 andmoderate 12 BCEs.influence Fig. 2. Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 =with the airregular chievable control performance. data size, and data flow structures and with a implications of different workload characteristics, both, Marty with some extensions can lead to valuable insights of how different multicore architectures may Perƒ=√4 and 12easily BCEs.understandab exploitable parallelism. For these workload characteristics, the influence t he a chievable p erformance. to the performance, and the power requirementsMarty with some extensions can lead to valuable insights of how different multicor of these In the engineering and embedded system domains there are many standard and real time workloads with moderate structures a limited and degree of data size, irregular control and data flow architectures. the engineering embedded system domains there with influence the In aand chievable performance. exploitable parallelism. For these workload characteristics, the easily understandable study of Hill and are many standard and real time workloads with mode Marty with some extensions can lead to valuable insights of how different multicore architectures may rate data size, irregular control and data fl ow structures influence the a chievable performance. and with a limited degree of exploitable parallelism. For these workload characteristics, the easily understandable Fig. 1. Symmetric mul ticore processor: each study of Hill and Marty with some extensions can lead block is a basic core equi- to valuable insights of how different multicore architecvalent (BCE) including L1 tures may influence the achievable performance. and L2 caches. L3 cache and onchip-network are not modeled. The computing performance of a BCE is normalized to 1. Standard Models In 2008 Hill and Marty started this discussion with performance models for three types of multicore architectures: symmetric (fig. 1), asymmetric (fig. 2), and dynamic (fig. 3) multicore chips [13] and evaluated their speedup models for different workloads, characterized by the respective degree ƒ of code that can be parallelized 4 Fig. 3. Dynamic multicore processor: 16 BCEs or one large core with Perƒ=√16. Forschungsbericht 2014 Informatik und Interaktive Systeme . Symmetric multicore processor: each block is a basic core equivalent (BCE) including L1 and L2 caches. L3 cache and onchip-network ot modeled. The computing performance of a BCE is normalized to 1. Standard Models n 2008 Hill and Marty started this discussion with performance models for three types of multicore symmetric 1), L1asymmetric (fig. 2), andand dynamic (fig. 3) multicore chips [13] and aitectures: basic core equivalent (BCE) (fig. including and L2 caches. L3 cache onchip-network their to speedup models for different workloads, characterized by the respective degree 𝑓𝑓 of Euated is normalized 1. e that can be parallelized as in Amdahl’s law, where the Speedup 𝑆𝑆 achieved by parallel processing is ned as: ! discussion with performance models for three types of multicore (3) #$!! (𝑓𝑓) = ! mmetric (fig. !!! 2), !and dynamic (fig. 3) multicore chips [13] and ! Thecharacterized idea behind the models is that ifferent workloads, by different the respective degree 𝑓𝑓 othe f giahl’s law, where the Speedup 𝑆𝑆 achieved by parallel processing is n this equation is the fraction of possible parallel work and s the number of available cores. As ven 𝑓𝑓chip area allows for the implementation of 𝑛𝑛 isimple dahl’s law uses fixed aso pplication sizes, core the aequivalents chievable m(BCEs) aximum with speedup, whether multiprocessors or cores or called basic a ! = . W ith a f ixed a pplication size, the models presented ticores a re u sed, i s a lways l imited t o 𝑆𝑆 given (3) !!! performance of 1. !"# To construct the architectural Fig. 3. Dynamic multicore processor: 16 BCEs or one large core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 16. Hill and Marty also reach their performance limits relatively soon, and for workloads with 𝑓𝑓 < 0.99 a alternatives, more complex cores are modeled by consue number of cores does not seem adequate. 𝑛𝑛 possible parallel work and s the number of available cores. As mingThe idea behind the different models is that the given chip area allows for the implementation of r of the 𝑛𝑛 iBCEs, leading to a performance perf(r)= simple cores or so called basic core equivalents (BCEs) with a given performance of 1. To construct the s, the achievable m aximum s peedup, w hether m ultiprocessors o r √r per core, if we go along with Pollack’s performance ! architectural alternatives, more complex cores are modeled by consuming 𝑟𝑟 of the 𝑛𝑛 BCEs, leading to a 𝑆𝑆!"# = . With a fixed application size, the models presented Fig. 4. !!! rule for complex cores. 𝑟𝑟 = 𝑟𝑟 per core, if we go along with Pollack’s performance rule for complex cores. Asymmetric performance 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 Fig. 3. Dynamic multicore processor: 16 BCEs or one large with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 rmance limits relatively soon, and for workloads with 𝑓𝑓 core < 0.99 a = 16. multicore with 2 complex These leadto to the following speedup These cconsiderations onsiderations lead the following speedup equations for the three alternatives: cores and 8 BCEs dequate. equations for the three alternatives: The idea behind the different models is that the given chip area allows for the implementation of 𝑛𝑛 simple cores or so called basic core equivalents (BCEs) with a given performance of 1. To construct the architectural alternatives, ! more complex cores are modeled by consuming 𝑟𝑟 of the 𝑛𝑛 BCEs, leading to a 𝑆𝑆!"# 𝑓𝑓, 𝑛𝑛, 𝑟𝑟 = !!! (4) (4) !∙! 𝑟𝑟 = ! !"#$ 𝑟𝑟 p!er core, if we go along with Pollack’s performance rule for complex cores. performance 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝!"#$(!) ∙! These considerations lead to the following speedup equations for the three alternatives: . Asymmetric multicore processor consisting of a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 4 and 12 BCEs. ! (5) (5) 𝑆𝑆!"#$ (𝑓𝑓, 𝑛𝑛, 𝑟𝑟) = !!! ! ! ! 𝑆𝑆!"# 𝑓𝑓, 𝑛𝑛, 𝑟𝑟 = !!! (4) !"#$(!) !"#$ !∙! ! !!!! n the engineering and embedded system domains there are many standard and real time workloads ! !"#$(!) !"#$ ! ∙! Fig. 5. Dynamic multicore with 4 cores and frequency scaling using the power budget of 8 BCEs. Currently one core is at full core TDP, two cores are at ½ core TDP. One core is switched off. h moderate data size, irregular control and data flow structures and with a limited degree of oitable parallelism. For these workload characteristics, the easily understandable study of Hill and ty with some extensions can lead to valuable insights of how different multicore architectures may a complex core with 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 = 4 and 12 BCEs.! 𝑆𝑆!"# (𝑓𝑓, 𝑛𝑛, 𝑟𝑟) = !!! !! (6) uence the achievable erformance. ! 𝑟𝑟) = !"#$ (6) (5) 𝑆𝑆!"#$ (𝑓𝑓,p𝑛𝑛, ! !!!! ! ! ystem domains there are many standard and real time workloads !"#$(!) !"#$ ! !!!! ontrol and data Note, flow tstructures with a climited degree (of hat for the and asymmetric ase in equation 5), the parallel work is done together by the large core and the the load characteristics, understandable of Hill and or the small cores would be running, to keep the 𝑛𝑛 − easily 𝑟𝑟 small cores. If either study the large core, Note, thatpfor the asymmetric casethe in equation (5), thechange to: o valuable insights of how different multicore architectures may available ower budget in balance, equation w ould ! 𝑛𝑛, 𝑟𝑟) = (6) 𝑆𝑆!"# (𝑓𝑓,work parallel is done together by the large core and the !!! ! ! ! !!! (7) 𝑆𝑆!"#$ (𝑓𝑓, 𝑛𝑛, 𝑟𝑟) =!"#$ ! ! ! ! n - r small cores. If!"#$(!) either !!! the large core, or the small Fig. 6. Heterogeneous with a large cores would to keepcthe power Note, that be for running, the asymmetric ase available in equation (5), the parallel work is done together by the large cmulticore ore addition o achievable erformance, power (TDP) and energy consumption are important indicators and In the − 𝑟𝑟 stmall cores. If peither the change large core, the four BCEs, budget in𝑛𝑛 balance, the equation would to: or the small cores would be running, to keep core, for the pfeasibility and in appropriateness of multi-‐ and many-‐core-‐architectures with symmetric two and accelerators or available ower budget balance, the equation would change to: asymmetric structures. ! Woo and Lee have extended the work of Hill and Marty towards modeling the co-processors of type 𝑆𝑆average 𝑛𝑛, 𝑟𝑟) =envelope, !!! workloads (7)that (7) can be modeled with Amdahl’s law are executed on D, E, each. Each A, B, !"#$ (𝑓𝑓, power !when ! !"#$(!) multicore processors [14]. !!! co-processor/accelerator uses the same transistor If awe have at the evolution of commercial in are addition to the already In ddition o a alook chievable performance, power (TDP) amulticore nd energy processors, consumption important indicators In addition to tachievable performance, power (TDP) budget as a BCE. discussed architectural alternatives, we meanwhile can many-‐core-‐architectures see new architectural variants for asymmetric for the feasibility and appropriateness of multi-‐ and with symmetric and and energy consumption are important indicators for the multicore systems (fig. 6). systems ( fig. 4 ), d ynamic s ystems ( fig. 5 ) a nd h eterogeneous asymmetric structures. Woo and Lee have extended the work of Hill and Marty towards modeling the feasibility and appropriateness multi- and many-coreaverage power envelope, when ofworkloads that can be modeled with Amdahl’s law are executed on Models for Heterogeneous Multicores multicore processors [14]. architectures with symmetric and asymmetric structures. Woo and have Lee have extended the work of of commercial Hill and Marty Heterogeneous on already first glance look similar If we a look at the evolution multicore processors, in architectures addition to the discussed architectural alternatives, meanwhile can see new architectural variants for asymmetric towards modeling the average powerwe envelope, when to asymmetric multicores. However, in addition to a systems (fig. 4), dynamic systems (fig. 5) and heterogeneous multicore systems (fig. 6). workloads that can be modeled with Amdahl’s law are conventional complex core, they introduce unconventi executed on multicore processors [14]. onal or U-cores [12] that represent custom logic, GPU If we have a look at the evolution of commercial multicore processors, in addition to the already dis cussed architectural alternatives, we meanwhile can see resources, or FPGAs. Such cores can be interesting for specific application types with SIMD parallelism, GPU-like multithreaded parallelism, or specific parallel new architectural variants for asymmetric systems (fig. algorithms that have been mapped to custom or FPGA Fig. 4. Asymmetric multicore with 2 complex cores and 8 BCE 4), dynamic systems (fig. 5) and heterogeneous multilogic. To model such an unconventional core, Chung et core systems (fig. 6). al. suggest to take the same transistor budget for a U Fig. 4. Asymmetric multicore with 2 complex cores and 8 BCE 5 Forschungsbericht 2014 Informatik und Interaktive Systeme Fig. 5. Dynamic multicore with 4 cores and frequency scaling using the power budget of 8 BCEs. Currently one core is at full core TDP, two cores are at ½ core TDP. One core is switched off. C C C C A A B B co-processors budget as a D E D E Fig. 6. Heterogeneous multicore with a large core, four BCEs, two acceler-ators or of type A, B, D, E, each. Each co-processor/accelerator uses the same transistor BCE. as for a simple BCE core. The U-core then executes core BCE resources. If μi = 1, the coprocessor can be interpre- for Heterogeneous ulticores ted as a cluster of ri simple cores working in parallel. a3.2 specificModels parallel application section with aMrelative Heterogeneous architectures on first glance look similar to asymmetric multicores. However, in performance of μ, while consuming a relative power of addition to a conventional complex core, they introduce unconventional or U-‐cores [12] that represent Other Studies φcustom logic, GPU resources, or FPGAs. Such cores can be interesting for specific application types with compared to a BCE with μ = φ = 1. If μ > 1 for a specific workload, the U-core worksmultithreaded as an accelerator. With φ <or 1 specific In [16] it is shown, that there will be no limits to multiSIMD parallelism, GPU-‐like parallelism, parallel algorithms that have been Fig. 5. Dynamic multicore with 4 cores and frequency scaling using the power budget of 8 BCEs. Currently one core is at full core TDP, two mapped cconsumes ustom or core Fless PGA logic. Toff. o mcan odel help such to an make unconventional ore, Chung et scaling, al. suggest to tapplication ake the the and core cperformance if the size grows coresU-core are at t½o core TDP. One is power switched same transistor budget for a U-‐core as for a simple BCE core. The U-‐core then executes a specific parallel better use of the dark silicon. The resulting speedup with the number of available cores, as was stated first application section with a relative performance of 𝜇𝜇 , while consuming a relative power of 𝜑𝜑 compared to equation by𝜇𝜇Chung derived from the U-‐core for classic byWGustafson a BCE with = 𝜑𝜑 =et1. al. If can 𝜇𝜇 >directly 1 for a sbe pecific workload, works multiprocessors as an accelerator. ith 𝜑𝜑 < 1 [17]. Sun and C C C C the U-‐core consumes less power and can help to make better use of the dark silicon. The equation (7): Chen show that Gustafson’s lawresulting also holds for multicore speedup equation by Chung et al. can directly be derived from equation (7): architectures and that the multicore speedup scales as a A A ! (8) 𝑆𝑆!!"!#$%!&!$'( (𝑓𝑓, 𝑛𝑛, 𝑟𝑟, 𝜇𝜇) = !!! linear function of n, if the workload is also growing with (8) ! ! !"#$(!) !(!!!) B B the system size. They show, that even, if continuously growing parts of the workload will reside in slow DRAM This equation is very similar to equation (7) that This equation is very similar to equation (7) that gives the speedup model for asymmetric multicore Fig. 6. Heterogeneous multicore with a large core, four BCEs, two acceler-ators or E E D systems. In their D evaluation, and Lee multicore [14] that asymmetric multicores are much more unlimited speedup is still possible, gives the speedup model for Woo asymmetric co-processors of typeshow A, B,sysD, E, such each.memory, Each co-processor/accelerator uses thescaling same transistor appropriate architectures, either with complex or simple cores. Going a BCE. budget as a for power scaling than symmetric as long as the DRAM access time is a constant. Note, tems. In their evaluation, Woo and Lee [14] show that step further, it follows that heterogeneous multicores will even scale better, if 𝜇𝜇 > 1 and 𝜑𝜑 < 1 or at such multicores are much more least 𝜑𝜑 =Models 1. As prototypical results in current research towards a better utilization of the dark silicon in 3.2 asymmetric for Heterogeneous Mappropriate ulticores that without scaling up the workload of the LINPACK logic chips demonstrate, future architectures might similar profit in the performance energy benchmark, multi-million-core supercomputers for power scaling than symmetricon architectures, Heterogeneous architectures first glance either look to mostly asymmetric multicores. and However, in consumption, if not only one, but many additional coprocessors or accelerator cores are added to one or addition to a conventional complex core, they introduce unconventional or U-‐cores [12] that represent presented in the Top 500 list twice every year, would not with complex or simple cores. Going a step further, it a small number of large cores [15]. custom logic, GPU resources, or FPGAs. Such cores can be interesting for specific application types with make sense at all. follows that heterogeneous multicores will even scale SIMD parallelism, multithreaded parallelism, or different specific aparallel algorithms that chave If, in addition to GPU-‐like a conservative complex core, there are ccelerator/co-‐processor ores obeen n an recent andaepplication much more pessimistic better, if μ > 1 and φ < 1 or at least φ = 1. As prototypical mapped to custom FPGA logic. To mbodel such oan n duifferent nconventional hung t al. suggest to take Lthe heterogeneous chip, otr hey will typically e active parts oAf ctore, he eCntire workload. et study by same t ransistor b udget f or a U -‐core a s f or a s imple B CE c ore. T he U -‐core t hen e xecutes a s pecific p arallel [5] constructs a very fine grained results in current research towards utilization 𝑟𝑟! B CEs with 𝑟𝑟 denoting us assume that the entire die area can better hold resources for 𝑛𝑛 = 𝑟𝑟Esmaeilzadeh + 𝑟𝑟! + 𝑟𝑟! + ⋯et+al. application section wcith a relative paerformance f 𝜇𝜇 , resources while consuming relative power f 𝜑𝜑c core. ompared to processors. the resources for the omplex core nd 𝑟𝑟! giving otfuture he fperformance or each as pecific coprocessor We can model for ofuture multicore of the dark silicon in logic chips demonstrate, a BCE with 𝜇𝜇a =more 𝜑𝜑 =flexible 1. If 𝜇𝜇 speedup > 1 for aequation specific wthat orkload, the the U-‐core orks as an astructure ccelerator. With 𝜑𝜑 < 1 now derive mirrors real wapplication and that can Theuse model includes separate partial models for Postarchitectures might profi t mostly performance and the U-‐core consumes power in and can phelp make better of the dark silicon. The resulting easily be used to model less application-‐specific ower to and energy consumption: speedup equation b y Cifhung t al. one, can dbut irectly be dadditierived from eDennard quation (7): device scaling, microarchitecture core scaling, energy consumption, not eonly many ! (9) 𝑆𝑆!"#$%&"# = !!! ! ! ! ! onal coprocessors or𝑟𝑟, accelerator (𝑓𝑓, ! 𝑛𝑛, !!! cores! are added to one (8) and the complete multicore. The model’s starting point 𝑆𝑆!!"!#$%!&!$'( !!!𝜇𝜇) !! ∙!= !!"#(!) ! ! !"#$(!)[15]. !(!!!) is the 45nm i960 Intel Nehalem quadcore processor or a small number of large cores that is mapped to all future process generations until a If, in addition to a conservative complex core, This equation is very similar to equation (7) that gives the speedup model for asymmetric multicore fictitious 8nm multicores process. Following ITRS there are different accelerator/co-processor cores on systems. In their evaluation, Woo and Lee [14] show that such asymmetric are much optimistic more appropriate f or p ower s caling t han s ymmetric a rchitectures, e ither w ith c omplex o r s imple c ores. G oing a scaling, the downscaling of a core will result in a 3.9 x an heterogeneous chip, they will typically be active on step further, it follows that heterogeneous multicores will even scale better, if 𝜇𝜇 > 1 and 𝜑𝜑 < 1 or at core performance improvement and an 88% reduction different parts of the entire application workload. Let least 𝜑𝜑 = 1. As prototypical results in current research towards a better utilization of the dark silicon in in core power However, if the more uslogic assume that the entire die area can hold resources for chips demonstrate, future architectures might profit mostly in consumption. performance and energy o nly o ne, b ut m any a dditional c oprocessors o r a ccelerator c ores a re a dded t o o ne o r are applied, conservative projections by Shekar Borkar nconsumption, = r + r1 + r2 + ...if + nrot BCEs with r denoting the resources l a s mall n umber o f l arge c ores [ 15]. for the complex core and r giving the resources for each the performance will only increase by 34% with a power i If, cin coprocessor addition to a core. conservative omplex core, here are different accelerator/co-‐processor on an using the reduction of 74%. The modelcores is evaluated specifi We cancnow derive a tmore heterogeneous chip, they will typically be active on different parts of the entire application workload. Let 12 real benchmarks of the PARSEC suite for parallel flexible speedup equation that mirrors the real applius assume that the entire die area can hold resources for 𝑛𝑛 = 𝑟𝑟 + 𝑟𝑟! + 𝑟𝑟! + ⋯ + 𝑟𝑟! BCEs with 𝑟𝑟 denoting evaluation. cation structure and that can easily be used to model the resources for the complex core and 𝑟𝑟! giving the resources fperformance or each specific coprocessor core. We can now derive a more flexible speedup equation that mirrors the real application that can The results of structure the studyand show that the entire future application-specific power and energy consumption: easily be used to model application-‐specific power and energy c8nm onsumption: multicore chip will only reach a 3.7 x (conservative 𝑆𝑆!"#$%&"# = 6 ! ! !!! ! !!!! ! !! ∙!! !!"#(!) (9) (9) scaling) to 7.9 x speedup (ITRS scaling). These numbers are the geometric mean of the 12 PARSEC benchmarks executed on the model. The study also shows that 32 The ƒi are exploitable fractions of the application workload, , ƒ = ƒ1 + ƒ2 + ... + ƒl with ƒ ≤ 1 and μi is the to 35 cores at 8nm will be enough to reach 90% of the relative performance of a coprocessor, when using ri relatively much exploitable parallelism, there is not ideal speedup. Even though the benchmarks contain Forschungsbericht 2014 Informatik und Interaktive Systeme enough parallelism to utilize more cores, even with an cessors, we have compared and extended some straight- unlimited power budget. For more realistic standard ap- forward performance and power models. Although vari- plications, the speedup would be limited by the available ous results from research and industrial studies predict power envelope, rather than by exploitable parallelism. that huge parts of future multicore chips will be dark or The power gap would lead to large fractions of dark dim all the time, it can already be seen in contemporary silicon on the multicore chip. The executable model by commercial designs, how these areas can be dealt with Esmaeilzadeh et al. can be executed for different load for making intelligent use of the limited power budget specifications and hardware settings. The authors have with heterogeneous coprocessors, accelerators, larger made it available at a URL at the University of Wiscon- caches, faster memory buffers, and improved communi- sin [6]. cation networks. Although this model was devised very carefully In the future, the success of many-core architectures and on the basis of available facts and plausible future will first and foremost depend on the exact knowledge trends, it cannot predict future breakthroughs that of specific workload properties as well as easy-to-use might influence such hardware parameters that are software environments and software developing tools currently treated like immovable objects. An example that will support programmers to exploit the explicit is the slow DRAM access time, which is responsible and implicit regular and heterogeneous parallelism of for the memory wall and has existed for decades. In the application workloads. Esmaeilzadeh’s model it is kept constant through to the 8nm process, whereas off-chip memory bandwidth References is believed to scale linearly. Currently memory makers [1] Märtin, C.,”Multicore Processors: Challenges, Op- like Samsung and HMC are working on disruptive 3D- portunities, Emerging Trends”, Proceedings Embed- DRAM technologies that will organize DRAM chips ded World Conference 2014, 25-27 February, 2014, vertically and speedup access time and bandwidth per chip by a factor of ten, without needing new DRAM cells [18]. Such and other technology revolutions might affect all of the discussed models and could lead to more Nuremberg, Germany, Design & Elektronik, 2014 [2] Bohr, M., Mistry, K., “Intel’s revolutionary 22 nm Transistor Technology,” Intel Corporation, May 2011 [3] Grochowski, E., ”Energy per Instruction Trends in optimistic multicore performance scenarios. Read the Intel Microprocessors,” Technology@Intel Maga- full paper for a discussion of the ongoing evolution zine, March 2006 of commercial multicore processors, current research trends, and the consequences for software developers [1]. [4] Borkar, S., Chien, A.A.,”The Future of Microprocessors,” Communications of the ACM, May 2011, Vol. 54, No. 5, pp. 67-77 Conclusion As VLSI scaling puts new challenges on the agenda of chip makers, computer architects and processor designers have to find innovative solutions for future multicore processors that make the best and most efficient use of the abundant transistor budget offered by Moore’s law [5] Esmaeilzadeh, H. et al., “Power Challenges May End the Multicore Era,” Communications of the ACM, February 2013, Vol. 56, No. 2, pp. 93-102 [6] Esmaeilzadeh et al., “Executable Dark Silicon Performance Model”: http://research.cs.wisc.edu/ vertical/DarkSilicon, last access Jan. 20, 2014 [7] Dennard, R.H. et al., “Design of Ion-Implanted that will hopefully be valid for another decade. In this MOSFET’s with Very Small Physical Dimensions,” paper, we have clarified the reasons of the Post-Dennard IEEE J. Solid-State Circuits, vol. SC-9, 1974, pp. scaling regime and discussed the consequences for future multicore designs. For the four typical architectural classes: symmetric, asymmetric, dynamic, and heterogeneous multicore pro- 256-268 [8] Taylor, M.B., “A Landscape of the New Dark Silicon Design Regime,” IEEE Micro, September/October 2013, pp. 8-19 7 Forschungsbericht 2014 Informatik und Interaktive Systeme [9] Borkar, S., “3D Integration for Energy Efficient System Design,” DAC’11, June 5-10, San Diego, California, USA [10] Amdahl, G.M., “Validity of the Single Processor [14] Woo, D.H., Lee, H.H., “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era,” IEEE Computer, December 2008, pp. 24-31 [15] Venkatesh, G. et al., “Conservation Cores: Reducing Approach to Achieving Large Scale Computing Ca- the Energy of Mature Computations,” Proc. 15th Ar- pabilities,” Proc. AFIPS Conf., AFIPS Press, 1967, chitectural Support for Programming Languages and pp. 483-485 [11] Getov, V., “A Few Notes on Amdahl’s Law,” IEEE Computer, December 2013, p. 45 [12] Chung, E.S., et al., “Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?” Proc. 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010, pp. 225-236 [13] Hill, M.D., Marty, M., “Amdahl’s Law in the Multicore Era,” IEEE Computer, July 2008, pp. 33-38 Operating Systems Conf., ACM, 2010, pp. 205-218 [16] S un, X.-H., Chen, Y., ”Reevaluating Amdahl’s law in the multicore era,” J. Parallel Distrib. Comput. (2009), doi: 10.1016/j.jpdc2009.05.002 [17] G ustafson, J.L, “Reevaluating Amdahl’s Law,” Communications of the ACM, Vol. 31, No. 5, 1988, pp. 532-533 [18] C ourtland, R., “Chipmakers Push Memory Into the Third Dimension,” http://spectrum.ieee.org, last access: January 20, 2014 Appendix A: Dark and dim silicon in symmetric multicore processor generations from 22nm to 11nm. Each BE represents a RISC core with 8 MB last-level cache. Communication infrastructure is included in the transistor count. 8 Forschungsbericht 2014 Informatik und Interaktive Systeme 9