Compiler and API for Low Power High Performance Computing on Multicore and Manycore Processors

Transcription
Compiler and API for Low Power High Performance Computing on Multicore and Manycore Processors
Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society Board of Governors
http://www.kasahara.cs.waseda.ac.jp
UIUC UPCRC Seminar, 16:00-17:00, Oct. 29, 2009

Multi-core Everywhere
Multi-core from embedded systems to supercomputers:
• Consumer electronics (embedded): mobile phones, games, digital TV, car navigation, DVD, cameras. IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores).
[Photo: OSCAR-type multi-core chip by Renesas, developed in the METI/NEDO "Multicore for Real-time Consumer Electronics" project (Leader: Prof. Kasahara).]
• PCs, servers: Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 cores), an 80-core chip, Larrabee (32 cores); AMD Quad-Core Opteron, Phenom.
• Workstations, deskside and high-end servers: IBM Power4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock.
• Supercomputers: Earth Simulator (40 TFLOPS, 2002, 5120 vector processors); IBM Blue Gene/L (360 TFLOPS, 2005, low-power CMP, 128K processor chips); BG/Q (20 PFLOPS, 2011); Blue Waters (effective 1 PFLOPS, July 2011, NCSA, UIUC).
High-quality application software, productivity, cost performance, and low power consumption are important (e.g., mobile phones, games). Compiler-cooperative multi-core processors are promising for realizing these features.

METI/NEDO National Project: Multicore for Real-time Consumer Electronics
<Goal> R&D of compiler-cooperative multicore processor technology for consumer electronics such as mobile phones, games, DVD, digital TV, and car navigation systems.
<Period> From July 2005 to March 2008 (2005.7–2008.3).
<Features>
• Good cost performance and high performance
• Short hardware and software development periods
• Low power consumption
• Scalable performance improvement with the advancement of semiconductor integration
• High reliability; applications can be shared among chips
• Use of the same parallelizing compiler for multi-cores from different vendors through a newly developed API (API: Application Programming Interface)
Participating companies: Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC.
[Figure (labels originally in Japanese): the new multicore processor architecture — multicore chips CMP 0 .. CMP m, each with processor cores PC0 .. PCn containing a CPU, local program memory / instruction cache, local data memory / L1 data cache, a data transfer controller (DTC), distributed shared memory (DSM) and a network interface (NI); an intra-chip connection network (IntraCCN: multiple buses, crossbar, etc.); centralized shared memory or L2 cache; an inter-chip connection network (InterCCN: multiple buses, crossbar, multistage network, etc.); centralized shared memories CSM j and I/O multicore chips. The developed multicore chips target consumer electronics and integrated automotive ECUs.]

OSCAR Multi-Core Architecture
[Figure: CMP 0 (chip multiprocessor 0) .. CMP m, each containing PE 0 .. PE n; every PE has a CPU, LPM / I-cache, LDM / D-cache, DTC, DSM and FVR, and is connected by an intra-chip connection network (multiple buses, crossbar, etc.) to the CSM / L2 cache. The chips, centralized shared memories CSM j, I/O devices and I/O CMP k are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.).]
CSM: centralized shared memory; LDM: local data memory; DSM: distributed shared memory; LPM: local program memory;
DTC: data transfer controller; FVR: frequency/voltage control register.

Renesas-Hitachi-Waseda 8-core RP2: Chip Photo and Specifications
[Chip photo: eight cores (Core#0–Core#7), each with ILRAM, DLRAM, URAM, I$ and D$; two snoop controllers (SNC0, SNC1); centralized shared memory (CSM); bus and memory interfaces (LBSC, DBSC, DDRPAD, SHWY, VSWC); and the global clock pulse generator (GCPG).]
• Process technology: 90 nm, 8-layer, triple-Vth CMOS
• Chip size: 104.8 mm² (10.61 mm x 9.88 mm)
• CPU core size: 6.6 mm² (3.36 mm x 1.96 mm)
• Supply voltage: 1.0 V–1.4 V (internal), 1.8/3.3 V (I/O)
• Power domains: 17 (8 CPUs, 8 URAMs, common)
IEEE ISSCC 2008, Paper No. 4.5: M. Ito, ..., and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler."

8-Core RP2 Chip Block Diagram
[Block diagram: two clusters — Cluster #0 with Core #0–#3 and Cluster #1 with Core #4–#7. Each core contains a CPU, FPU, 16 KB I$ and 16 KB D$, local memory (I: 8 KB, D: 32 KB), a cache controller / barrier register (CCN/BAR) and 64 KB user RAM (URAM). Each cluster has a snoop controller (SNC0/SNC1) and a local clock pulse generator (LCPG0/LCPG1); each core has a power control register (PCR0–PCR7). Barrier synchronization lines connect the cores, and the clusters attach to the on-chip system bus (SuperHyway) together with the DDR2, SRAM and DMA controllers.]
LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM.

Demo of NEDO "Multicore for Real-Time Consumer Electronics" at the Council for Science and Technology Policy (CSTP), April 10, 2008
CSTP members at the demo: Prime Minister Mr. Y. Fukuda; Minister of State for Science, Technology and Innovation Policy Mr. F. Kishida; Chief Cabinet Secretary Mr. N. Machimura; Minister of Internal Affairs and Communications Mr. H. Masuda; Minister of Finance Mr. F. Nukaga; Minister of Education, Culture, Sports, Science and Technology Mr. K. Tokai; Minister of Economy, Trade and Industry Mr. A. Amari.

1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)
[System diagram: a host computer and a control & I/O processor (RISC processor plus I/O processor); three simultaneously readable centralized shared memories (CSM1–CSM3, each with multiple banks and a read & write request arbitrator); sixteen processor elements PE1–PE16, grouped into 5-PE clusters (SPC1–SPC3) and 8-PE processor clusters (LPC1, LPC2). Each PE (CP) is a 5 MFLOPS, 32-bit RISC processor with 64 registers, two banks of program memory, data memory, stack memory, a DMA controller, a bus interface and a dual-port distributed shared memory.]
OSCAR PE (Processor Element)
[Board diagram, 1987 OSCAR PE board: the PE connects to the system bus through a bus interface and has two local buses; its functional units are connected by an instruction bus and data paths.]
DMA: DMA controller; LPM: local program memory (128 KW x 2 banks); INSC: instruction control unit; DSM: distributed shared memory (2 KW); LSM: local stack memory (4 KW); LDM: local data memory (256 KW); DP: data path; IPU: integer processing unit; FPU: floating-point processing unit; REG: register file (64 registers).

OSCAR Memory Space (Global and Local Address Space)
[Memory map: the global (system) address space, from 00000000 to FFFFFFFF, maps the distributed shared memory of each PE (PE0–PE15), a broadcast area, the control processor (CP), the local data memory of each PE, a control / system-accessing area, and the centralized shared memories CSM1–3. The local address space of each PE maps its distributed shared memory, the two local program memory banks, the local data memory and the control area, with unused regions in between.]

OSCAR Parallelizing Compiler
Goal: improve effective performance, cost-performance and software productivity, and reduce power.
• Multigrain parallelization: coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
• Data localization: automatic data management for distributed shared memory, cache and local memory.
• Data transfer overlapping: data transfer overlapping using data transfer controllers (DMAs).
• Power reduction: reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.
[Figure: a macro-task graph whose nodes are grouped into data-localization groups dlg0–dlg3.]

Generation of Coarse-Grain Tasks
Macro-tasks (MTs):
• Block of pseudo assignments (BPA): basic block (BB)
• Repetition block (RB): natural loop
• Subroutine block (SB): subroutine
The program is decomposed hierarchically: in the first layer the total system is divided into BPAs (near-fine-grain parallelization), RBs (loop-level parallelization, near-fine-grain parallelization of the loop body, or coarse-grain parallelization) and SBs (coarse-grain parallelization); each RB and SB is further decomposed into BPAs, RBs and SBs in the second and third layers.

Earliest Executable Condition Analysis for Coarse-Grain Tasks (Macro-Tasks)
[Figure: a macro flow graph, whose edges represent data dependencies, control flow and conditional branches, is analyzed into a macro task graph, whose edges represent data dependencies and extended control dependencies combined through AND/OR conditions; the original control flow is kept for reference.]

Automatic Processor Assignment in su2cor
• Using 14 processors
• Coarse-grain parallelization within DO400 of subroutine LOOPS

MTG of su2cor LOOPS DO400
Coarse-grain parallelism PARA_ALD = 4.3.
[Figure: macro task graph containing DOALL loops, sequential loops, SBs and BBs.]

Data Localization: Loop Aligned Decomposition
• Decompose multiple loops (DOALL and sequential) into CARs and LRs considering inter-loop data dependences.
• Most data in an LR can be passed through local memory (LM).
LR: localizable region; CAR: commonly accessed region.

```fortran
C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO
```

The three loops are decomposed into aligned partial loops: RB1 into LR (I=1,33), CAR (I=34,35), LR (I=36,66), CAR (I=67,68), LR (I=69,101); RB2 into I=1,33 / I=34,34 / I=35,66 / I=67,67 / I=68,100; RB3 into I=2,34 / I=35,67 / I=68,100.

Data Localization
[Figure: the macro task graph before and after loop division, with data-localization groups dlg0–dlg3, and a schedule of the divided macro-tasks onto two processors (PE0, PE1).]

An Example of Data Localization for SPEC95 swim

```fortran
      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE
      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE
      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE
```

(a) An example of a target loop group for data localization.
(b) [Figure: image of the alignment of the arrays on a 4 MB cache as accessed by the target loops. Cache line conflicts occur among arrays that share the same location on the cache (UNEW/VNEW/PNEW, UOLD/VOLD/POLD, U/V/P, CU/CV, Z, H).]

Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of the arrays in SPEC95 swim (padding changes N2 from 513 to 544):

```fortran
C     before padding: PARAMETER (N1=513, N2=513)
C     after padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)
```

[Figure: with padding, the arrays are shifted relative to each other in the 4 MB cache so that the access ranges of DLG0 (boxed) no longer conflict.]

Power Reduction by Power Supply, Clock Frequency and Voltage Control by the OSCAR Compiler
• Shortest-execution-time mode

Generated Multigrain Parallelized Code
The nested coarse-grain task parallelization is realized using only the OpenMP "section", "flush" and "critical" directives.
[Figure: in the first layer, a SECTIONS construct assigns macro-tasks MT1_1, MT1_2 (DOALL), MT1_3 (SB) and MT1_4 (RB) to threads T0–T7 using centralized scheduling code; the second-layer macro-tasks MT1_3_1–MT1_3_6 and MT1_4_1–MT1_4_4 are executed by thread group 0 and thread group 1 with distributed scheduling code, synchronized by SYNC SEND / SYNC RECV operations. A sketch of this code structure is shown below.]
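As an illustration of the slide above, the following is a minimal sketch in C of how such nested coarse-grain task code can be expressed with only the OpenMP "sections" and "flush" directives. It is not the compiler's actual output: the macro-task bodies mt1_1() .. mt1_4() and the flag done_mt1_1 are hypothetical placeholders standing in for the scheduled macro-tasks and the SYNC SEND / SYNC RECV operations.

```c
/* Sketch only: hypothetical macro-task bodies and flag name. */
static volatile int done_mt1_1 = 0;      /* completion flag of MT1_1 */

static void mt1_1(void) { /* coarse-grain macro-task body */ }
static void mt1_2(void) { /* ... */ }
static void mt1_3(void) { /* contains 2nd-layer tasks MT1_3_1 .. MT1_3_6 */ }
static void mt1_4(void) { /* contains 2nd-layer tasks MT1_4_1 .. MT1_4_4 */ }

int main(void)
{
    /* 1st layer: one section per thread group, threads bound one-to-one to cores */
#pragma omp parallel sections
    {
#pragma omp section
        {                                 /* thread group 0 */
            mt1_1();
            done_mt1_1 = 1;               /* SYNC SEND: publish completion */
#pragma omp flush(done_mt1_1)
            mt1_3();                      /* 2nd layer runs inside this group */
        }
#pragma omp section
        {                                 /* thread group 1 */
            mt1_2();
            do {                          /* SYNC RECV: wait for MT1_1 */
#pragma omp flush(done_mt1_1)
            } while (!done_mt1_1);
            mt1_4();
        }
    }
    return 0;
}
```

The flush before and inside the wait loop makes the flag update visible across the thread groups without any other synchronization construct, which is how the generated code can remain within the small OpenMP subset the API relies on.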
Compilation Flow Using the OSCAR API
An application program in Fortran or parallelizable C (a sequential program) is processed by the Waseda University OSCAR parallelizing compiler, which performs:
• Coarse-grain task parallelization
• Global data localization
• Data transfer overlapping using DMA
• Power reduction control using DVFS, clock gating and power gating
The compiler outputs a parallelized Fortran or C program with API directives (per-processor code with directives, one thread per processor core). An API analyzer combined with an existing sequential compiler from each vendor (Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC) serves as the backend compiler and generates machine code, so the same parallelized program is executable on various multicores (e.g., multicores from vendor A and vendor B) and, through an OpenMP compiler, on shared-memory servers.
OSCAR: Optimally Scheduled Advanced Multiprocessor; API: Application Program Interface.
The OSCAR API for real-time, low-power, high-performance multicores provides directives for thread generation, memory mapping, data transfer using DMA, and power management.

OSCAR API Now Open! (http://www.kasahara.cs.waseda.ac.jp/)
• Targets mainly real-time consumer electronics devices
– Embedded computing
– Various kinds of memory architecture: SMP, local memory, distributed shared memory, ...
• Developed with six Japanese companies
– Fujitsu, Hitachi, NEC, Toshiba, Panasonic, Renesas
– Supported by METI/NEDO
• Based on a subset of OpenMP
– A very popular parallel processing API
– Shared-memory programming model
• Six categories
– Parallel execution (4 directives from OpenMP)
– Memory mapping (distributed shared memory, local memory)
– Data transfer overlapping using the DMA controller
– Power control (DVFS, clock gating, power gating)
– Timer for real-time control
– Synchronization (hierarchical barrier synchronization)

Current Application Development Environment for Multicores
• Parallel APIs
– pthread (SMP): an old thread library
– OpenMP (shared memory), MPI (distributed memory): for scientific applications
– Co-array Fortran (PGAS), UPC (PGAS): language extensions
– MCAPI (distributed memory): a message-passing API for embedded applications
– No good API for low-power and real-time multicores!
• Parallelizing compilers for scientific applications (Fortran)
– Several aggressive source-to-source compilers from academia: Polaris, SUIF, CETUS, Pluto, OSCAR, ...
• Parallelizing compilers for low-power and real-time multicores
– OSCAR

Parallel Execution
• Start of parallel execution
– #pragma omp parallel sections (C)
– !$omp parallel sections (Fortran)
• Specifying a critical section
– #pragma omp critical (C)
– !$omp critical (Fortran)
• Enforcing an order on memory operations
– #pragma omp flush (C)
– !$omp flush (Fortran)
• These directives are taken from OpenMP; a short usage sketch follows below.
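The sketch below, assuming a hypothetical shared work counter (next_task) and task function (do_task), shows how the "critical" directive listed above can serialize access to shared scheduling state inside "parallel sections". It is an illustrative example, not part of the API specification.

```c
/* Illustrative only: two sections pull work items from a shared counter,
 * with "critical" guarding the counter update. */
#include <stdio.h>

#define N_TASKS 8

static int next_task = 0;                 /* shared work counter */

static void do_task(int id) { printf("macro-task %d\n", id); }

int main(void)
{
#pragma omp parallel sections
    {
#pragma omp section
        {
            for (;;) {
                int id;
#pragma omp critical                      /* serialize counter access */
                id = next_task++;
                if (id >= N_TASKS) break;
                do_task(id);
            }
        }
#pragma omp section
        {
            for (;;) {
                int id;
#pragma omp critical
                id = next_task++;
                if (id >= N_TASKS) break;
                do_task(id);
            }
        }
    }
    return 0;
}
```

Because entry to and exit from a critical region imply a flush, no explicit flush directive is needed around the counter in this pattern.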
Thread Execution Model

```c
#pragma omp parallel sections
{
#pragma omp section
    main_vpc0();
#pragma omp section
    main_vpc1();
#pragma omp section
    main_vpc2();
#pragma omp section
    main_vpc3();
}
```

main_vpc0() through main_vpc3() execute in parallel; a thread and a core are bound one-to-one. VPC: virtual processor core.

Memory Mapping
• Placing variables on an on-chip centralized shared memory (on-chip CSM)
– #pragma oscar onchipshared (C)
– !$oscar onchipshared (Fortran)
• Placing variables on a local data memory (LDM)
– #pragma omp threadprivate (C)
– !$omp threadprivate (Fortran)
– This directive is an extension to OpenMP
• Placing variables on a distributed shared memory (DSM)
– #pragma oscar distributedshared (C)
– !$oscar distributedshared (Fortran)

Data Transfer
• Specifying data transfer lists
– #pragma oscar dma_transfer (C)
– !$oscar dma_transfer (Fortran)
– Contains the following parameter directives
• Specifying a contiguous data transfer
– #pragma oscar dma_contiguous_parameter (C)
– !$oscar dma_contiguous_parameter (Fortran)
• Specifying a stride data transfer
– #pragma oscar dma_stride_parameter (C)
– !$oscar dma_stride_parameter (Fortran)
– Can be used for scatter/gather data transfers
• Data transfer synchronization
– #pragma oscar dma_flag_check (C)
– !$oscar dma_flag_check (Fortran)

Power Control
• Putting a module into a specified frequency and voltage state
– #pragma oscar fvcontrol (C)
– !$oscar fvcontrol (Fortran)
– State examples: 100: maximum frequency; 50: half frequency; 0: clock off; -1: power off
• Getting the frequency and voltage state of a module
– #pragma oscar get_fvstatus (C)
– !$oscar get_fvstatus (Fortran)

Low-Power Optimization with the OSCAR API
[Figure: the schedule produced by the OSCAR compiler for two virtual cores (VC0, VC1) and the corresponding generated code image. Each virtual core's main function (main_VC0(), main_VC1()) executes its assigned macro-tasks (MT1–MT4) and inserts fvcontrol directives in idle periods, e.g. "#pragma oscar fvcontrol ((OSCAR_CPU(),0))" to put the core to sleep and "#pragma oscar fvcontrol (1,(OSCAR_CPU(),100))" to bring it back to full frequency before the next macro-task.]

Timer
• Getting elapsed wall-clock time in microseconds
– #pragma oscar get_current_time (C)
– !$oscar get_current_time (Fortran)
• For real-time execution

Hierarchical Barrier Synchronization
• Specifying a hierarchical group barrier
– #pragma oscar group_barrier (C)
– !$oscar group_barrier (Fortran)
[Figure: hierarchical processor groups on 8 cores — 1st layer: all 8 cores; 2nd layer: PG0 (4 cores) and PG1 (4 cores); 3rd layer: PG0-0 .. PG0-3 (1 core each) and PG1-0, PG1-1 (2 cores each).]
A consolidated usage sketch of these API categories is shown below.
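The following is a consolidated sketch, in C, of where the API categories described above would appear in a parallelized program. Only the directive names are taken from the slides; the argument and clause forms (variable lists, DMA parameters, timer variable, barrier group id) and the buffer, flag and function names are assumptions made for illustration and should be checked against the published OSCAR API specification.

```c
/* Sketch under assumed directive argument syntax; not the normative API. */
#include <stdio.h>

#define N 1024

/* Memory mapping (assumed variable-list form for the oscar directives). */
int shared_buf[N];
#pragma oscar onchipshared(shared_buf)      /* on-chip centralized shared memory */

int local_buf[N];
#pragma omp threadprivate(local_buf)        /* local data memory (OpenMP extension) */

int sync_flags[8];
#pragma oscar distributedshared(sync_flags) /* distributed shared memory */

void worker(void)
{
    long t_start = 0, t_end = 0;

#pragma oscar get_current_time(t_start)     /* Timer: wall-clock time in microseconds */

    /* Data transfer: start a contiguous DMA transfer and overlap it with
       computation (parameter directive nested under dma_transfer, assumed form). */
#pragma oscar dma_transfer
#pragma oscar dma_contiguous_parameter(local_buf, shared_buf, N)

    /* ... computation that does not depend on the transferred data ... */

#pragma oscar dma_flag_check                /* wait for the DMA to complete */

    /* Power control: drop this core to half frequency during a slack period,
       then restore full frequency (state values 100/50/0/-1 as listed above). */
#pragma oscar fvcontrol((OSCAR_CPU(), 50))
    /* ... low-criticality work ... */
#pragma oscar fvcontrol((OSCAR_CPU(), 100))

    /* Synchronization: hierarchical group barrier (group id assumed). */
#pragma oscar group_barrier(0)

#pragma oscar get_current_time(t_end)
    printf("elapsed: %ld us\n", t_end - t_start);
}

int main(void)
{
    worker();
    return 0;
}
```

A backend compiler that does not recognize the oscar pragmas will typically ignore them, which is consistent with the compilation flow above, where an API analyzer expands the directives before the vendor's sequential compiler is invoked.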
OSCAR Heterogeneous Multicore
• DTU: data transfer unit
• LPM: local program memory
• LDM: local data memory
• DSM: distributed shared memory
• CSM: centralized shared memory
• FVR: frequency/voltage control register

An Image of a Static Schedule for a Heterogeneous Multicore with Data Transfer Overlapping and Power Control
[Figure: static schedule over time.]

Performance of the OSCAR Compiler on an IBM p6 595 Power6 (4.2 GHz) 32-core SMP Server
OpenMP codes generated by the OSCAR compiler accelerate IBM XL Fortran for AIX Ver. 12.1 by about 3.3 times on average.
Compile options:
(*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
(*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
(Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto

Performance of the OSCAR Compiler Using the Multicore API on an Intel Quad-Core Xeon
[Bar chart: speedup ratios of Intel Compiler Ver. 10.1 vs. OSCAR on SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) and SPEC2000 (swim, mgrid, applu, apsi).]
The OSCAR compiler gives 2.1 times speedup on average against the Intel Compiler Ver. 10.1.

Performance of the OSCAR Compiler on a 16-core SGI Altix 450 Montvale Server
Compiler options for the Intel Compiler: automatic parallelization: -fast -parallel; OpenMP codes generated by OSCAR: -fast -openmp.
[Bar chart: speedup ratios on the same SPEC95 and SPEC2000 benchmarks.]
The OSCAR compiler gave 2.3 times speedup against the Intel Fortran Itanium Compiler revision 10.1.

Performance of the OSCAR Compiler for Fortran Programs on an IBM p550q 8-core Deskside Server
[Bar chart: loop parallelization vs. multigrain parallelization speedups on SPEC95 and SPEC2000 benchmarks.]
2.7 times speedup against the loop-parallelizing compiler on 8 cores.

Performance for C Programs on an IBM p550Q
[Bar chart: speedup ratios for 1, 2, 4 and 8 cores.]
5.8 times speedup against one processor on average.

Performance of the OSCAR Compiler on an NEC NaviEngine (ARM/NEC MPCore)
[Bar chart: g77 vs. OSCAR speedups for SPEC95 mgrid, su2cor and hydro2d on 1, 2 and 4 PEs; compile option: -O3.]
The OSCAR compiler gave 3.43 times speedup against 1 core on the ARM/NEC MPCore with four 400 MHz ARM cores.

Performance of the OSCAR Compiler Using the Multicore API on the Fujitsu FR1000 Multicore
[Bar chart: speedups for MPEG2 encode, MPEG2 decode, MP3 encode and JPEG 2000 encode on 1–4 cores.]
3.38 times speedup on average for 4 cores against single-core execution.

Performance of the OSCAR Compiler Using the Developed API on a 4-core (SH-4A) OSCAR-Type Multicore
[Bar chart: speedups for MPEG2 encode, MPEG2 decode, MP3 encode and JPEG 2000 encode on 1–4 cores.]
3.31 times speedup on average for 4 cores against 1 core.

Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedup against single-core execution for audio AAC (Advanced Audio Coding) encoding: 1.0 on 1 core, 1.9 on 2 cores, 3.6 on 4 cores, 5.8 on 8 cores.

Performance of the OSCAR Compiler for C Programs on a 4-core SMP on RP2
[Bar chart: speedups on 1, 2 and 4 cores for 179.art, 183.equake, mpeg2encode and AACencode.]
2.9 times speedup on average against one core.

Power Reduction by the OSCAR Parallelizing Compiler for Secure Audio Encoding
AAC encoding + AES encryption with 8 CPU cores:
• Without power control (voltage 1.4 V): average power 5.68 W
• With power control (frequency control, and resume-standby power shutdown with voltage lowering from 1.4 V to 1.0 V): average power 0.67 W
• 88.3% power reduction

Power Reduction by the OSCAR Parallelizing Compiler for MPEG2 Decoding
MPEG2 decoding with 8 CPU cores:
• Without power control (voltage 1.4 V): average power 5.73 W
• With power control (frequency control, and resume-standby power shutdown with voltage lowering from 1.4 V to 1.0 V): average power 1.52 W
• 73.5% power reduction

Low-Power, High-Performance Multicore Computer with Solar Panel
• Clean-energy autonomous operation
• Servers operational in deserts
• Green computing

Green Computing Systems R&D Center, Waseda University
Supported by METI (completion March 2011). Located beside the subway Waseda station, near the Waseda University main campus.
<R&D target> Hardware, software and applications for super-low-power manycore processors:
• 64–128 cores
• Natural air cooling (no cooling fan)
• Cool, compact, clear, quiet
• Possibly operable from a solar panel
<Industry, government, academia> Fujitsu, Hitachi, Renesas, Toshiba, NEC, etc.
<Ripple effect>
• Low CO2 (carbon dioxide) emissions
• Creation of value-added products: consumer electronics, automobiles, servers

Conclusions
• OSCAR-compiler-cooperative real-time low-power multicores with high effective performance and short software development periods will be important in a wide range of IT systems, from consumer electronics to robotics and automobiles.
• The OSCAR compiler with the API boosts the performance of various multicores and servers and reduces consumed power significantly:
– 3.3 times on an IBM p595 SMP server using Power6
– 2.1 times on an Intel quad-core Xeon
– 2.3 times on an SGI Altix 450 using Intel Itanium 2 (Montvale)
– 88% power reduction by compiler power control on the Renesas-Hitachi-Waseda 8-core (SH-4A) multicore RP2 for real-time secure AAC encoding
– 70% power reduction on the same multicore for MPEG2 decoding
• Currently, we are extending the API for heterogeneous and manycore processors.