Compiler and API for Low Power High Performance Computing on Multicore and Manycore Processors

Transcription

Compiler and API for Low Power High Performance Computing on
Multicore and Manycore Processors
Hironori Kasahara
Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Waseda University, Tokyo, Japan
IEEE Computer Society Board of Governors
http://www.kasahara.cs.waseda.ac.jp
UIUC UPCRC Seminar, 16:00-17:00, Oct. 29, 2009
Multi-core Everywhere
Multi-core from embedded to supercomputers
➢ Consumer Electronics (Embedded)
  Mobile Phone, Game, Digital TV, Car Navigation, DVD, Camera:
  IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic Uniphier,
  Renesas SH multi-core (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores)
  [Photo: OSCAR-type multi-core chip by Renesas in the METI/NEDO Multicore for Real-time
  Consumer Electronics Project (Leader: Prof. Kasahara)]
➢ PCs, Servers
  Intel Quad Xeon, Core 2 Quad, Montvale, Nehalem (8 core), 80 core, Larrabee (32 core),
  AMD Quad Core Opteron, Phenom
➢ WSs, Deskside & Highend Servers
  IBM Power4, 5, 5+, 6
  Sun Niagara (SPARC T1, T2), Rock
➢ Supercomputers
  Earth Simulator: 40 TFLOPS, 2002, 5120 vector processors
  IBM Blue Gene/L: 360 TFLOPS, 2005, low-power CMP, 128K processor chips; BG/Q: 20 PFLOPS, 2011
  Blue Waters: effective 1 PFLOPS, July 2011, NCSA UIUC
High quality application software, productivity, cost performance, and low power
consumption are important (e.g., mobile phones, games).
Compiler-cooperative multi-core processors are promising for realizing these features.
METI/NEDO National Project
Multi-core for Real-time Consumer Electronics (2005.7-2008.3)**

<Goal> R&D of compiler-cooperative multicore processor technology for consumer
electronics such as mobile phones, games, DVD, digital TV, and car navigation systems.
<Period> From July 2005 to March 2008
<Features>
・Good cost performance
・Short hardware and software development periods
・Low power consumption
・Scalable performance improvement with the advancement of semiconductor integration
・Use of the same parallelizing compiler for multi-cores from different vendors
  using the newly developed API

[Figure: new multicore processor architecture. Each multicore chip CMP 0 ... CMP m
contains processor cores PC 0 ... PC n, each with a CPU (processor core), LPM/I-cache
(local program memory / instruction cache), LDM/D-cache (local data memory / L1 data
cache), DTC (data transfer controller), DSM (distributed shared memory), and NI
(network interface), connected by an IntraCCN (intra-chip connection network: multiple
buses, crossbar, etc.) to a CSM / L2 cache (centralized shared memory). Chips, I/O
multicore chips (CSP k), I/O devices, and centralized shared memories (CSM j) are
connected by an InterCCN (inter-chip connection network: multiple buses, crossbar,
multistage network, etc.). Callouts: high performance, low power consumption, short
HW/SW development period, applications shareable among chips, high reliability,
performance scaling with semiconductor integration density, integrated ECU; the
developed multicore chips target digital consumer electronics.]

API: Application Programming Interface
**Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC
OSCAR Multi-Core Architecture
[Figure: CMP 0 (chip multiprocessor 0) through CMP m, each with processor elements
PE 0 ... PE n containing a CPU, LPM/I-cache, LDM/D-cache, DTC, DSM, FVR, and a network
interface, connected by an intra-chip connection network (multiple buses, crossbar, etc.)
to a CSM / L2 cache with its own FVR. Chips, I/O devices, I/O chips (CMP k), and CSM j
are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.).]
CSM: central shared mem.
LDM : local data mem.
DSM: distributed shared mem.
LPM : local program mem.
DTC: Data Transfer Controller
FVR: frequency / voltage control register
Renesas-Hitachi-Waseda 8-Core RP2 Chip Photo and Specifications

[Chip photo: Core#0-Core#7, each with ILRAM, DLRAM, URAM, I$, and D$; SNC0/SNC1
(snoop controllers), LBSC, DBSC, DDRPAD, CSM, SHWY, VSWC, GCPG.]

Process Technology: 90nm, 8-layer, triple-Vth CMOS
Chip Size: 104.8 mm2 (10.61 mm x 9.88 mm)
CPU Core Size: 6.6 mm2 (3.36 mm x 1.96 mm)
Supply Voltage: 1.0V-1.4V (internal), 1.8/3.3V (I/O)
Power Domains: 17 (8 CPUs, 8 URAMs, common)

IEEE ISSCC08: Paper No. 4.5, M. Ito, ... and H. Kasahara,
"An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs
by an Automatic Parallelizing Compiler"
8-Core RP2 Chip Block Diagram

[Block diagram: two clusters. Cluster #0 holds Core #0-#3 with snoop controller 0,
LCPG0, and power control registers PCR0-PCR3; Cluster #1 holds Core #4-#7 with snoop
controller 1, LCPG1, and PCR4-PCR7. Each core has a CPU, FPU, 16K I$, 16K D$, local
memory (I: 8K, D: 32K), a CCN/BAR, and 64K user RAM (URAM). The clusters are joined
by barrier synchronization lines and an on-chip system bus (SuperHyway) with DDR2,
SRAM, and DMA controllers.]

LCPG: Local clock pulse generator
PCR: Power Control Register
CCN/BAR: Cache controller / Barrier Register
URAM: User RAM
Demo of NEDO Multicore for Real Time Consumer Electronics
at the Council of Science and Engineering Policy on April 10, 2008
CSTP Members
Prime Minister: Mr. Y. FUKUDA
Minister of State for Science, Technology and Innovation Policy: Mr. F. KISHIDA
Chief Cabinet Secretary: Mr. N. MACHIMURA
Minister of Internal Affairs and Communications: Mr. H. MASUDA
Minister of Finance: Mr. F. NUKAGA
Minister of Education, Culture, Sports, Science and Technology: Mr. K. TOKAI
Minister of Economy, Trade and Industry: Mr. A. AMARI
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)

[System diagram: a host computer and a control & I/O processor (RISC processor, I/O
processor) are connected over a system bus to centralized shared memories CSM1-CSM3
(simultaneously readable: banks 1-3 addressable at the same address) and to 16
processor elements PE1-PE16, organized as a 5-PE cluster (SPC1), SPC2, SPC3, and
8-PE processor clusters (LPC1, LPC2). Each PE (CP) has a 5 MFLOPS 32-bit RISC
processor (64 registers), 2 banks of program memory, data memory, stack memory, a
DMA controller, a dual-port distributed shared memory, a bus interface, and a
read & write request arbitrator.]
OSCAR PE (Processor Element)

[PE diagram: system bus, bus interface, two local buses, and an instruction bus
connecting the DMA, LPM, LSM, LDM, DSM, IPU, FPU, REG, and INSC units.]

DMA: DMA controller
LPM: local program memory (128KW x 2 banks)
INSC: instruction control unit
DSM: distributed shared memory (2KW, dual-port)
LSM: local stack memory (4KW)
LDM: local data memory (256KW)
DP: data path
IPU: integer processing unit
FPU: floating-point processing unit
REG: register file (64 registers)
1987 OSCAR PE Board
OSCAR Memory Space (Global and Local Address Space)

SYSTEM MEMORY SPACE
  00000000  PE 0 (DSM, broadcast area)
  00100000  PE 1
  00200000  ...
            PE 15
  01000000  CP (Control Processor)
  01100000  NOT USED
  02000000  PE 0   (control / system accessing area)
  02100000  PE 1
  02200000  ...
  02F00000  PE 15
  03000000  CSM1, 2, 3 (Centralized Shared Memory)
  03400000  NOT USED
  FFFFFFFF

LOCAL MEMORY SPACE
  00000000  DSM (Distributed Shared Memory)
  00000800  BROADCAST
  00010000  NOT USED
  00020000  LPM Bank 0 (Local Program Memory)
  00030000  LPM Bank 1
  00040000  NOT USED
  00080000  LDM (Local Data Memory)
  000F0000  NOT USED
  00100000  UNDEFINED (Local Memory Area)
  FFFFFFFF
OSCAR Parallelizing Compiler
To improve effective performance, cost-performance, and software productivity, and to reduce power:

Multigrain Parallelization
  Coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism
  among statements, in addition to loop parallelism.
Data Localization
  Automatic data management for distributed shared memory, cache, and local memory.
Data Transfer Overlapping
  Data transfer overlapping using data transfer controllers (DMAs).
Power Reduction
  Reduction of consumed power by compiler-controlled DVFS and power gating
  with hardware support.

[Figure: a macro task graph whose nodes are grouped into data localization groups dlg0-dlg3.]
Generation of Coarse Grain Tasks
Macro-tasks (MTs)
➢ Block of Pseudo Assignments (BPA): Basic Block (BB)
➢ Repetition Block (RB): natural loop
➢ Subroutine Block (SB): subroutine

[Figure: the whole program (1st layer) is decomposed into BPAs, RBs, and SBs. A BPA is
parallelized at near fine grain; an RB is parallelized at the loop level or at near fine
grain within the loop body; an RB or SB is further decomposed by coarse grain
parallelization into 2nd-layer BPAs, RBs, and SBs, and those again into a 3rd layer.]
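As a concrete illustration, the following sketch (a hypothetical C fragment, not taken from the talk) marks which parts of a small function would correspond to the three macro-task kinds listed above.

/* Hypothetical C fragment (not from the talk) showing which program parts map
 * to the three macro-task kinds named above. */
#include <stddef.h>

static void filter(double *x, size_t n)      /* called as an SB below; its body  */
{                                            /* would be decomposed again at the */
    for (size_t i = 0; i < n; i++)           /* next layer                       */
        x[i] *= 0.25;
}

void kernel(double *a, const double *b, size_t n)
{
    double scale = 0.5;                      /* BPA: straight-line basic block   */
    double bias  = 1.0;                      /*      of (pseudo) assignments     */

    for (size_t i = 0; i < n; i++)           /* RB: natural loop, candidate for  */
        a[i] = scale * b[i] + bias;          /*     loop-level parallelization   */

    filter(a, n);                            /* SB: subroutine block             */
}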
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

[Figure: a Macro Flow Graph, whose nodes are BPAs (blocks of pseudo assignment
statements) and RBs (repetition blocks) and whose edges show data dependency,
control flow, and conditional branches (the original control flow), is analyzed
into a Macro Task Graph whose edges are data dependencies and extended control
dependencies, with OR/AND combinations of conditional branches.]
Automatic Processor Assignment in su2cor
• Using 14 processors
  – Coarse grain parallelization within DO400 of subroutine LOOPS
MTG of Su2cor-LOOPS-DO400
• Coarse grain parallelism PARA_ALD = 4.3
[Figure: macro task graph; node types are DOALL, sequential LOOP, SB, and BB.]
Data-Localization: Loop Aligned Decomposition
• Decompose multiple loops (Doall and Seq) into CARs and LRs considering
  inter-loop data dependence.
  – Most data in an LR can be passed through local memory (LM).
  – LR: Localizable Region, CAR: Commonly Accessed Region

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Decomposition:
  RB1: LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
  RB2:    DO I=1,33 |     DO I=34,34 |    DO I=35,66 |     DO I=67,67 |    DO I=68,100
  RB3:    DO I=2,34 |     DO I=35,67 |    DO I=68,100
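To make the alignment concrete, the following is a hypothetical C rendering (not compiler output) of the decomposed macro-tasks for the first partitions. The point is that the LR tasks of RB1, RB2, and RB3 that touch the same index slice can be scheduled onto one core so the corresponding slices of A and B stay in local memory, while the small CAR tasks carry the data needed across partition boundaries.

/* Hypothetical C rendering of the aligned decomposition above (first partitions
 * only); array bounds follow the Fortran example. */
#define N 102
static double A[N], B[N], C[N];

/* RB1 (Doall) partitions */
static void rb1_lr0(void)  { for (int i = 1;  i <= 33; i++) A[i] = 2.0 * i; }
static void rb1_car0(void) { for (int i = 34; i <= 35; i++) A[i] = 2.0 * i; }

/* RB2 (Doseq) partitions aligned to RB1: B(I) reads A(I) and A(I+1), so the
   CAR task rb1_car0 must have produced A[34] before rb2_lr0 runs */
static void rb2_lr0(void)  { for (int i = 1;  i <= 33; i++) B[i] = B[i-1] + A[i] + A[i+1]; }
static void rb2_car0(void) { for (int i = 34; i <= 34; i++) B[i] = B[i-1] + A[i] + A[i+1]; }

/* RB3 (Doall) partition aligned to the B slice produced above */
static void rb3_lr0(void)  { for (int i = 2;  i <= 34; i++) C[i] = B[i] + B[i-1]; }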
Data Localization
[Figure: an MTG, the MTG after division, and a schedule for two processors (PE0, PE1)
in which the macro-tasks of each data localization group (dlg0-dlg3) are assigned to
the same processor.]
An Example of Data Localization for Spec95 Swim
      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE
      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE
      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE
(a) An example of target loop group for data localization

[Figure: image of the alignment of the arrays (UN, VN, PN, UO, VO, PO, CU, CV, Z, H,
U, V, P) accessed by the target loops on a 4MB cache. Cache line conflicts occur among
arrays that share the same location on the cache.]
(b) Image of alignment of arrays on cache accessed by target loops
Data Layout for Removing Line Conflict Misses by Array Dimension Padding
Declaration part of arrays in spec95 swim

before padding:
      PARAMETER (N1=513, N2=513)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

after padding:
      PARAMETER (N1=513, N2=544)
      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: with padding, the arrays in the access range of DLG0 no longer map to the
same locations in the 4MB cache, removing line conflict misses.]
Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler
• Shortest execution time mode
Generated Multigrain Parallelized Code
(The nested coarse grain task parallelization is realized by only the OpenMP
"section", "flush" and "critical" directives.)

[Figure: threads T0-T7 are created by one SECTIONS / SECTION ... END SECTIONS
construct. In the 1st layer, centralized scheduling code assigns MT1_1, MT1_2 (DOALL),
MT1_3 (SB), and MT1_4 (RB). In the 2nd layer, thread group 0 executes the nested
macro-tasks 1_3_1-1_3_6 of MT1_3 and thread group 1 executes 1_4_1-1_4_4 of MT1_4
under distributed scheduling code, with SYNC SEND / SYNC RECV between the groups.]
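The slide notes that the nested coarse grain task parallelization is expressed with only the OpenMP section, flush, and critical directives. Below is a minimal hand-written sketch (not actual OSCAR compiler output; the macro-task bodies and the flag-based scheduling are illustrative placeholders) of that code style in C: one parallel sections region creates the threads, each section runs the code scheduled for its thread, and critical/flush implement synchronization between them.

/* Minimal sketch of the "one parallel sections region" code style described
 * above; macro-task bodies are placeholders. */
#include <omp.h>

static volatile int mt1_1_done = 0;        /* completion flag in shared memory */

static void run_MT1_1(void) { /* macro-task MT1_1 body (placeholder) */ }
static void run_MT1_2(void) { /* macro-task MT1_2 body (placeholder) */ }

static void thread0_code(void)
{
    run_MT1_1();
    #pragma omp critical
    { mt1_1_done = 1; }                    /* publish completion of MT1_1      */
    #pragma omp flush
    /* ... 2nd-layer macro-tasks assigned to this thread would follow here ... */
}

static void thread1_code(void)
{
    int ready = 0;
    while (!ready) {                       /* wait until MT1_1 has finished    */
        #pragma omp flush
        #pragma omp critical
        { ready = mt1_1_done; }
    }
    run_MT1_2();
}

int main(void)
{
    #pragma omp parallel sections          /* threads are created once, here   */
    {
        #pragma omp section
        thread0_code();
        #pragma omp section
        thread1_code();
    }
    return 0;
}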
Compilation Flow Using OSCAR API

Application Program: Fortran or Parallelizable C (sequential program)
  ↓
Waseda Univ. OSCAR Parallelizing Compiler
  ➢ Coarse grain task parallelization
  ➢ Global data localization
  ➢ Data transfer overlapping using DMA
  ➢ Power reduction control using DVFS, clock gating, and power gating
  ↓
Parallelized Fortran or C program with API
(Proc0: code with directives → Thread 0; Proc1: code with directives → Thread 1)
  ↓
Generation of parallel machine code using sequential compilers:
an API analyzer plus an existing sequential compiler (backend compiler) produces
machine code (executable code) for multicores from Vendor A, Vendor B, etc.;
an OpenMP-compiler backend targets shared memory servers.
→ Executable on various multicores

OSCAR API for Real-time Low Power High Performance Multicores
  Directives for thread generation, memory, data transfer using DMA, and power
  management, developed with Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC.

OSCAR: Optimally Scheduled Advanced Multiprocessor
API: Application Program Interface
OSCAR API
Now Open! (http://www.kasahara.cs.waseda.ac.jp/)
• Targeting mainly realtime consumer electronics devices
  – embedded computing
  – various kinds of memory architecture
    • SMP, local memory, distributed shared memory, ...
• Developed with 6 Japanese companies
  – Fujitsu, Hitachi, NEC, Toshiba, Panasonic, Renesas
  – Supported by METI/NEDO
• Based on a subset of OpenMP
  – very popular parallel processing API
  – shared memory programming model
• Six Categories
  – Parallel Execution (4 directives from OpenMP)
  – Memory Mapping (Distributed Shared Memory, Local Memory)
  – Data Transfer Overlapping Using DMA Controller
  – Power Control (DVFS, Clock Gating, Power Gating)
  – Timer for Real Time Control
  – Synchronization (Hierarchical Barrier Synchronization)
Current Application Development Environment for Multicores
• Parallel API
  – pthread (SMP)
    • Old thread library
  – OpenMP (Shared Memory), MPI (Distributed Memory)
    • for Scientific Applications
  – Co-array Fortran (PGAS), UPC (PGAS)
    • Language extension
  – MCAPI (Distributed Memory)
    • for Embedded Applications
    • Message Passing API
  – NO good API for Low-Power and Real-time Multicores!
• Parallelizing Compilers for Scientific Applications (Fortran Applications)
  – Several aggressive compilers from academia
    • Polaris, SUIF, CETUS, Pluto, OSCAR, ...
    • Source-to-source Compiler
  – Parallelizing Compilers for Low-Power and Real-time Multicores
    • OSCAR
Parallel Execution
• Start of parallel execution
  – #pragma omp parallel sections (C)
  – !$omp parallel sections (Fortran)
• Specifying a critical section
  – #pragma omp critical (C)
  – !$omp critical (Fortran)
• Enforcing an order of the memory operations
  – #pragma omp flush (C)
  – !$omp flush (Fortran)
• These are from OpenMP.
Thread Execution Model

#pragma omp parallel sections
{
#pragma omp section
  main_vpc0();
#pragma omp section
  main_vpc1();
#pragma omp section
  main_vpc2();
#pragma omp section
  main_vpc3();
}

Parallel execution of main_vpc0() ... main_vpc3()
(A thread and a core are bound one-by-one)
VPC: Virtual Processor Core
Memory Mapping
• Placing variables on an on-chip centralized shared memory (onchip CSM)
  – #pragma oscar onchipshared (C)
  – !$oscar onchipshared (Fortran)
• Placing variables on a local data memory (LDM)
  – #pragma omp threadprivate (C)
  – !$omp threadprivate (Fortran)
  – This directive is an extension to OpenMP
• Placing variables on a distributed shared memory (DSM)
  – #pragma oscar distributedshared (C)
  – !$oscar distributedshared (Fortran)
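As a concrete illustration of these memory-mapping directives, here is a minimal sketch in C. The parenthesized variable-list form used with the oscar pragmas is an assumption for illustration only (the slide gives just the directive names); threadprivate with a variable list is standard OpenMP syntax.

/* Sketch of the memory-mapping directives above applied to global data.
 * NOTE: the variable-list arguments of the oscar pragmas are assumed. */
int lookup_table[1024];                    /* shared by all cores, on-chip CSM  */
#pragma oscar onchipshared (lookup_table)

int sync_flags[8];                         /* placed on a distributed shared mem. */
#pragma oscar distributedshared (sync_flags)

int work_buffer[256];                      /* per-thread copy in local data mem.  */
#pragma omp threadprivate (work_buffer)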
Data Transfer
• Specifying data transfer lists
  – #pragma oscar dma_transfer (C)
  – !$oscar dma_transfer (Fortran)
  – Containing the following parameter directives
• Specifying a contiguous data transfer
  – #pragma oscar dma_contiguous_parameter (C)
  – !$oscar dma_contiguous_parameter (Fortran)
• Specifying a stride data transfer
  – #pragma oscar dma_stride_parameter (C)
  – !$oscar dma_stride_parameter (Fortran)
  – This can be used for scatter/gather data transfer
• Data transfer synchronization
  – #pragma oscar dma_flag_check (C)
  – !$oscar dma_flag_check (Fortran)
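The sketch below illustrates how these directives might be combined to overlap a DMA transfer with computation. It is an assumed illustration only: the parameter and flag arguments, the buffer names, and the placement of the parameter directive are not specified on the slide.

/* Assumed illustration of a DMA transfer list with a completion check.
 * Compilers that do not know these pragmas simply ignore them. */
#define N 1024
float in_csm[N];      /* source, e.g. in centralized shared memory (assumed)   */
float in_ldm[N];      /* destination in local data memory (assumed)            */
int   dma_done;       /* completion flag tested by dma_flag_check (assumed)    */

static void compute_on_previous_block(void) { /* placeholder overlapped work */ }

void stage_next_block(void)
{
#pragma oscar dma_transfer (dma_done)      /* start a transfer list             */
#pragma oscar dma_contiguous_parameter (in_ldm, in_csm, sizeof(in_csm))

    compute_on_previous_block();           /* overlap computation with the DMA  */

#pragma oscar dma_flag_check (dma_done)    /* wait until the transfer finishes  */
}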
Power Control
• Putting a module into a specified frequency and voltage state
  – #pragma oscar fvcontrol (C)
  – !$oscar fvcontrol (Fortran)
  – State examples:
    • 100: max frequency
    • 50: half frequency
    • 0: clock off
    • -1: power off
• Getting the frequency and voltage state of a module
  – #pragma oscar get_fvstatus (C)
  – !$oscar get_fvstatus (Fortran)
[Figure: timeline of fvcontrol(100) → fvcontrol(50) → fvcontrol(-1) → fvcontrol(100).]
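For concreteness, the sketch below reuses the fvcontrol argument forms that appear in the generated-code image on the next slide: ((OSCAR_CPU(),0)) to stop the issuing core's own CPU clock, and (1,(OSCAR_CPU(),100)) to bring a module back to full frequency. The task bodies, the schedule, and the reading of the leading "1" as a module number are illustrative assumptions, not compiler output.

/* Sketch of compiler-style power control around idle time. */
static void run_assigned_macro_task(void) { /* placeholder macro-task body */ }

void main_vc1(void)
{
    run_assigned_macro_task();

    /* nothing to run until another core produces data:
       put this CPU into state 0 (clock off / sleep)                          */
#pragma oscar fvcontrol ((OSCAR_CPU(), 0))

    /* after being resumed, continue with the next macro-task                 */
    run_assigned_macro_task();
}

void main_vc0(void)
{
    run_assigned_macro_task();

    /* wake the sleeping core: set module 1's CPU back to 100 (max frequency);
       the meaning of the leading "1" is an assumption                        */
#pragma oscar fvcontrol (1, (OSCAR_CPU(), 100))

    run_assigned_macro_task();
}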
Low-Power Optimization with OSCAR API

Scheduled result by OSCAR compiler:
  VC0: MT1 → MT3
  VC1: MT2 → Sleep → MT4

Generated code image by OSCAR compiler:

void main_VC0() {
  MT1
  #pragma oscar fvcontrol ¥
  (1,(OSCAR_CPU(),100))
  MT3
}

void main_VC1() {
  MT2
  Sleep:
  #pragma oscar fvcontrol ¥
  ((OSCAR_CPU(),0))
  MT4
}
Timer
• Getting an elapsed wall clock time in microseconds
  – #pragma oscar get_current_time (C)
  – !$oscar get_current_time (Fortran)
• For realtime execution
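A small sketch of how the timer directive could bound the work done in one period of a realtime task follows. How the elapsed time is returned (here through a variable named in the directive) and the period value are assumptions; the slide only states that the directive yields wall clock time in microseconds.

/* Assumed illustration of the timer directive for realtime control. */
#define PERIOD_US 33333UL                   /* e.g. one frame at ~30 fps (assumed) */

static void process_block(void) { /* placeholder unit of work */ }

void realtime_step(void)
{
    unsigned long start_us = 0, now_us = 0;

#pragma oscar get_current_time (start_us)
    do {
        process_block();
#pragma oscar get_current_time (now_us)
    } while (now_us - start_us < PERIOD_US);  /* keep working until the period budget is used */
}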
Hierarchical Barrier Synchronization
• Specifying a hierarchical group barrier
  – #pragma oscar group_barrier (C)
  – !$oscar group_barrier (Fortran)

1st layer → 8 cores
2nd layer → PG0 (4 cores), PG1 (4 cores)
3rd layer → PG0-0, PG0-1, PG0-2, PG0-3, PG1-0 (2 cores), PG1-1 (2 cores)
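The sketch below shows how a group barrier could be used within the hierarchy above. Naming the processor group as a directive argument is an assumption; the slide gives only the directive name and the PG hierarchy.

/* Assumed illustration of a hierarchical group barrier. */
static void compute_my_partition(void) { /* placeholder per-core work */ }

void phase_on_pg0_core(void)
{
    compute_my_partition();

    /* synchronize only the 4 cores of processor group PG0 (2nd layer),
       without involving the 4 cores of PG1                               */
#pragma oscar group_barrier (PG0)

    /* a 1st-layer barrier over all 8 cores would follow the same pattern */
}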
OSCAR Heterogeneous Multicore
• DTU: Data Transfer Unit
• LPM: Local Program Memory
• LDM: Local Data Memory
• DSM: Distributed Shared Memory
• CSM: Centralized Shared Memory
• FVR: Frequency/Voltage Control Register
An Image of Static Schedule for Heterogeneous Multicore with Data Transfer
Overlapping and Power Control
[Figure: Gantt-style schedule of macro-tasks, data transfers, and power states over time.]
Performance of OSCAR Compiler on an IBM p6 595 Power6 (4.2GHz) based 32-core SMP Server
OpenMP codes generated by the OSCAR compiler accelerate IBM XL Fortran for AIX Ver.12.1
about 3.3 times on average.
Compile options:
  (*1) Sequential: -O3 -qarch=pwr6; XLF: -O3 -qarch=pwr6 -qsmp=auto; OSCAR: -O3 -qarch=pwr6 -qsmp=noauto
  (*2) Sequential: -O5 -q64 -qarch=pwr6; XLF: -O5 -q64 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -q64 -qarch=pwr6 -qsmp=noauto
  (Others) Sequential: -O5 -qarch=pwr6; XLF: -O5 -qarch=pwr6 -qsmp=auto; OSCAR: -O5 -qarch=pwr6 -qsmp=noauto
Performance of OSCAR Compiler Using the Multicore API on Intel Quad-core Xeon
[Bar chart: speedup ratio of Intel Compiler Ver.10.1 vs. OSCAR for SPEC95
(tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5)
and SPEC2000 (swim, mgrid, applu, apsi) benchmarks.]
• The OSCAR compiler gives 2.1 times speedup on average against Intel Compiler ver.10.1.
Performance of OSCAR Compiler on a 16-core SGI Altix 450 Montvale Server
Compiler options for the Intel Compiler:
  for automatic parallelization: -fast -parallel
  for OpenMP codes generated by OSCAR: -fast -openmp
[Bar chart: speedup ratio of Intel Ver.10.1 vs. OSCAR for SPEC95
(tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5)
and SPEC2000 (swim, mgrid, applu, apsi) benchmarks.]
• The OSCAR compiler gave 2.3 times speedup against the Intel Fortran Itanium Compiler revision 10.1.
Performance of OSCAR Compiler for Fortran Programs on an IBM p550q 8-core Deskside Server
• 2.7 times speedup against the loop-parallelizing compiler on 8 cores
[Bar chart: speedup ratio of loop parallelization vs. multigrain parallelization for
SPEC95 (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5)
and SPEC2000 (swim, mgrid, applu, apsi) benchmarks.]
Performance for C Programs on IBM p550Q
[Bar chart: speed-up ratio for 1, 2, 4, and 8 cores; 8-core speedups range from about
1.9 to 7.05.]
• 5.8 times speedup against one processor on average.
Performance of OSCAR Compiler on NEC NaviEngine (ARM-NEC MPCore)
[Bar chart: speedup ratio of g77 vs. OSCAR for 1, 2, and 4 PEs on SPEC95 mgrid,
su2cor, and hydro2d.]
Compile option: -O3
• The OSCAR compiler gave 3.43 times speedup against 1 core on the ARM/NEC MPCore
  with four ARM 400MHz cores.
Performance of OSCAR Compiler Using the Multicore API on Fujitsu FR1000 Multicore
[Bar chart: speedup for 1, 2, 3, and 4 cores on MPEG2enc, MPEG2dec, MP3enc, and
JPEG 2000enc.]
• 3.38 times speedup on average for 4 cores against single-core execution.
Performance of OSCAR Compiler Using the Developed API on a 4-core (SH4A) OSCAR-Type Multicore
[Bar chart: speedup for 1, 2, 3, and 4 cores on MPEG2enc, MPEG2dec, MP3enc, and
JPEG 2000enc.]
• 3.31 times speedup on average for 4 cores against 1 core.
Processing Performance on the Developed Multicore Using the Automatic Parallelizing Compiler
Speedup against single-core execution for audio AAC* encoding
*) AAC: Advanced Audio Coding
[Bar chart: speedups of 1.0, 1.9, 3.6, and 5.8 for 1, 2, 4, and 8 processor cores.]
Performance of OSCAR Compiler for C Programs on a 4-core SMP on RP2
[Bar chart: speed-up ratio for 1, 2, and 4 cores on 179.art, 183.equake, mpeg2encode,
and AACencode; 4-core speedups range from about 2.5 to 3.4.]
• 2.9 times speedup on average against one core.
Power Reduction by OSCAR Parallelizing Compiler for Secure Audio Encoding
AAC Encoding + AES Encryption with 8 CPU cores
[Figure: power waveforms. Without power control (voltage 1.4V): average power 5.68 W.
With power control (frequency control, resume standby: power shutdown, and voltage
lowering from 1.4V to 1.0V): average power 0.67 W.]
• 88.3% power reduction.
Power Reduction by OSCAR Parallelizing Compiler for MPEG2 Decoding
MPEG2 Decoding with 8 CPU cores
[Figure: power waveforms. Without power control (voltage 1.4V): average power 5.73 W.
With power control (frequency control, resume standby: power shutdown, and voltage
lowering from 1.4V to 1.0V): average power 1.52 W.]
• 73.5% power reduction.
Green Computing Systems R&D Center, Waseda University
Supported by METI (Mar. 2011 Completion)

Clean-energy autonomous low power high performance multicore computer with solar panel
➢ Servers operational in deserts

<R&D Target> Hardware, software, and applications for super low-power manycore processors
➢ 64-128 cores
➢ Natural air cooling (no cooling fan)
➢ Cool, Compact, Clear, Quiet
➢ Possibly operational by solar panel
<Industry, Government, Academia> Fujitsu, Hitachi, Renesas, Toshiba, NEC, etc.
<Ripple Effect> Low CO2 (Carbon Dioxide) emissions
➢ Creation of value-added products
➢ Consumer electronics, automobiles, servers

Location: beside the subway Waseda Station, near the Waseda Univ. main campus
Conclusions
➢ OSCAR-compiler-cooperative real-time low power multicores with high effective
  performance and short software development periods will be important in a wide
  range of IT systems, from consumer electronics to robotics and automobiles.
➢ The OSCAR compiler with the API boosts the performance of various multicores and
  servers and reduces consumed power significantly:
  ➢ 3.3 times on an IBM p595 SMP server using Power6
  ➢ 2.1 times on an Intel Quad-core Xeon
  ➢ 2.3 times on an SGI Altix 450 using Intel Itanium2 (Montvale)
  ➢ 88% power reduction by compiler power control on the Renesas-Hitachi-Waseda
    8-core (SH4A) multicore RP2 for realtime secure AAC encoding
  ➢ 70% power reduction on the same multicore for MPEG2 decoding
➢ Currently, we are extending the API for heterogeneous and manycore processors.