Oakforest-PACS (OFP): Japan's #1 System driven by KNL and OPA

Taisuke Boku
Deputy Director, Center for Computational Sciences, University of Tsukuba
(with courtesy of JCAHPC members)

IXPUG Spring Conference 2017, Cambridge, 2017/04/11
Towards Exascale Computing

[Roadmap chart: peak performance (1-1000 PF) vs. year (2008-2020), showing T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.), TSUBAME2.0 (Tokyo Tech.), OFP (JCAHPC: U. Tsukuba and U. Tokyo), the K Computer and Post-K Computer (Riken AICS), and a future Exascale system. Tier-1 and tier-2 supercomputers form HPCI and move forward to Exascale computing like two wheels.]
Deployment plan of 9 supercomputing centers (Feb. 2017)

[Roadmap chart, FY2014-FY2025, covering the nine national supercomputing centers (Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, Kyushu) together with the K Computer (10 PF, 13 MW) and the Post-K Computer. Power consumption figures indicate the maximum of power supply, including the cooling facility. On the Tsukuba and Tokyo rows, HA-PACS (1166 TF), COMA/PACS-IX (1001 TF), Fujitsu FX10 (1 PFLOPS), and Reedbush (1.8-1.9 PF, 0.7 MW) lead to the jointly operated Oakforest-PACS (25 PF, UCC + TPF, 4.2 MW), followed by plans such as PACS-X (10 PF, TPF, 2 MW).]
JCAHPC: Joint Center for Advanced High Performance Computing
(http://jcahpc.jp)

- Very tight collaboration for "post-T2K" between the two universities
  - For the main supercomputer resources, a uniform specification for a single shared system
  - Each university is financially responsible for introducing the machine and operating it
    -> unified procurement toward a single system with the largest scale in Japan
  - To manage everything smoothly, a joint organization was established
    -> JCAHPC
Procurement Policies of JCAHPC
Based on the spirit of T2K, introducing open advanced technology:

- massively parallel PC cluster
- advanced processor for HPC
- easy-to-use and efficient interconnection
- large-scale shared file system flatly shared by all nodes
- joint procurement by the two universities
  - the largest class of budget among national universities' supercomputers in Japan
  - the largest system scale as a PC cluster in Japan
  - no accelerator, to support a wide variety of users and application fields
    -> not chasing absolute peak performance, and (basically) inheriting traditional application codes
- goodness of a single system
  - scale merit by merging budgets -> largest in Japan
  - ultra-large-scale single-job execution on special occasions such as the "Gordon Bell Prize Challenge"

⇒ Oakforest-PACS (OFP)
Oakforest-PACS (OFP)
"Oakforest" follows the U. Tokyo naming convention, "PACS" the U. Tsukuba convention.
⇒ Don't call it just "Oakforest"! "OFP" is much better.

- 25 PFLOPS peak
- 8,208 KNL CPUs
- FBB Fat-Tree by Omni-Path
- HPL 13.55 PFLOPS: #1 in Japan, #6 in the world
- HPCG: #3 in the world
- Green500: #6 in the world
- Full operation started Dec. 2016
- Official program started in April 2017
Computation node & chassis

- Computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps)
- Chassis with 8 nodes, 2U size
- Water cooling wheel & pipe
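As a cross-check of the quoted per-chip and system peak figures: one Xeon Phi 7250 provides 68 cores x 1.4 GHz x 32 DP FLOP/cycle (two 512-bit vector units with FMA) = 3,046.4 GFLOPS, roughly 3 TFLOPS per chip, and 8,208 nodes x 3.0464 TFLOPS = 25.0 PFLOPS, matching the 25 PFLOPS system peak (and the 25,004.9 TFLOPS Rpeak in the TOP500 table below).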
Water cooling pipes and IME (burst buffer)
Specification of Oakforest-PACS

- Total peak performance: 25 PFLOPS
- Total number of compute nodes: 8,208
- Compute node
  - Product: Fujitsu next-generation PRIMERGY server for HPC (under development)
  - Processor: Intel Xeon Phi (Knights Landing), Xeon Phi 7250 (1.4 GHz TDP) with 68 cores
  - Memory (high BW): 16 GB, > 400 GB/sec (MCDRAM, effective rate)
  - Memory (low BW): 96 GB, 115.2 GB/sec (DDR4-2400 x 6 ch, peak rate)
- Interconnect
  - Product: Intel Omni-Path Architecture
  - Link speed: 100 Gbps
  - Topology: Fat-tree with full-bisection bandwidth
- Login node
  - Product: Fujitsu PRIMERGY RX2530 M2 server
  - # of servers: 20
  - Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets)
  - Memory: 256 GB, 153 GB/sec (DDR4-2400 x 4 ch x 2 sockets)
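The DDR4 bandwidth figures follow directly from the channel configuration: 2,400 MT/s x 8 bytes x 6 channels = 115.2 GB/sec per compute node, and 2,400 MT/s x 8 bytes x 4 channels x 2 sockets = 153.6 GB/sec per login node.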
Specification of Oakforest-PACS (I/O)

- Parallel file system
  - Type: Lustre File System
  - Total capacity: 26.2 PB
  - Metadata
    - Product: DataDirect Networks MDS server + SFA7700X
    - # of MDS: 4 servers x 3 sets
    - MDT: 7.7 TB (SAS SSD) x 3 sets
  - Object storage
    - Product: DataDirect Networks SFA14KE
    - # of OSS (nodes): 10 (20)
    - Aggregate BW: ~500 GB/sec
- Fast file cache system
  - Type: Burst Buffer, Infinite Memory Engine (by DDN)
  - Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
  - Product: DataDirect Networks IME14K
  - # of servers (nodes): 25 (50)
  - Aggregate BW: ~1,560 GB/sec
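Applications reach the Lustre file system (and, via its POSIX/MPI-IO interfaces, the IME burst buffer) through standard parallel I/O. Below is a minimal, generic MPI-IO sketch, not specific to OFP's configuration; the path /work/out.dat is a placeholder, not an actual OFP mount point. Each rank collectively writes its own block of one shared file.

```c
#include <mpi.h>
#include <stdlib.h>

#define COUNT 1048576   /* doubles written per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank fills a local buffer. */
    double *buf = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++)
        buf[i] = (double)rank;

    /* Open one shared file on the parallel file system
       (placeholder path, not an OFP-specific location). */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/work/out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Collective write: rank r writes block r of the file. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```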
Full-bisection-bandwidth Fat-tree by Intel Omni-Path Architecture

- 12 x 768-port director switches (source: Intel)
- 362 x 48-port edge switches, each with 24 downlinks to nodes and 24 uplinks to the directors

Firstly, to reduce switches & cables, we considered:
- connecting all nodes within a subgroup by an FBB fat-tree
- connecting the subgroups to each other with > 20% of FBB
But the hardware quantity is not so different from a globally FBB configuration, and global FBB is preferred for flexible job management.

Nodes connected to the fabric:
- Compute nodes: 8,208
- Login nodes: 20
- Parallel FS: 64
- IME: 300
- Mgmt, etc.: 8
- Total: 8,600
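A quick check that these switch counts support full bisection for the 8,600 endpoints: 362 edge switches x 24 downlinks = 8,688 node-facing ports, which covers the 8,600 endpoints; the matching 362 x 24 = 8,688 uplinks fit within the 12 x 768 = 9,216 director ports; and every edge switch has equal uplink and downlink capacity (24:24).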
Facility of Oakforest-PACS system

- Power consumption: 4.2 MW (including cooling)
- # of racks: 102
- Cooling system
  - Compute node
    - Type: warm-water cooling; direct cooling (CPU) and rear-door cooling (except CPU)
    - Facility: cooling tower & chiller
  - Others
    - Type: air cooling
    - Facility: PAC
Software of Oakforest-PACS

- OS: CentOS 7, McKernel (compute node); Red Hat Enterprise Linux 7 (login node)
- Compiler: gcc, Intel compiler (C, C++, Fortran)
- MPI: Intel MPI, MVAPICH2
- Library: Intel MKL; LAPACK, FFTW, SuperLU, PETSc, METIS, Scotch, ScaLAPACK, GNU Scientific Library, NetCDF, Parallel netCDF, Xabclib, ppOpen-HPC, ppOpen-AT, MassiveThreads
- Application: mpijava, XcalableMP, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, FrontISTR, REVOCAP, OpenMX, xTAPP, AkaiKKR, MODYLAS, ALPS, feram, GROMACS, BLAST, R packages, Bioconductor, BioPerl, BioRuby
- Distributed FS: Globus Toolkit, Gfarm
- Job scheduler: Fujitsu Technical Computing Suite
- Debugger: Allinea DDT
- Profiler: Intel VTune Amplifier, Trace Analyzer & Collector
TOP500 list on Nov. 2016

#  | Machine                     | Architecture                | Country       | Rmax (TFLOPS) | Rpeak (TFLOPS) | MFLOPS/W
1  | TaihuLight, NSCW            | MPP (Sunway, SW26010)       | China         | 93,014.6      | 125,435.9      | 6051.3
2  | Tianhe-2 (MilkyWay-2), NSCG | Cluster (NUDT, CPU + KNC)   | China         | 33,862.7      | 54,902.4       | 1901.5
3  | Titan, ORNL                 | MPP (Cray, XK7: CPU + GPU)  | United States | 17,590.0      | 27,112.5       | 2142.8
4  | Sequoia, LLNL               | MPP (IBM, BlueGene/Q)       | United States | 17,173.2      | 20,132.7       | 2176.6
5  | Cori, NERSC-LBNL            | MPP (Cray, XC40: KNL)       | United States | 14,014.7      | 27,880.7       | ???
6  | Oakforest-PACS, JCAHPC      | Cluster (Fujitsu, KNL)      | Japan         | 13,554.6      | 25,004.9       | 4985.1
7  | K Computer, RIKEN AICS      | MPP (Fujitsu)               | Japan         | 10,510.0      | 11,280.4       | 830.2
8  | Piz Daint, CSCS             | MPP (Cray, XC50: CPU + GPU) | Switzerland   | 9,779.0       | 15,988.0       | 7453.5
9  | Mira, ANL                   | MPP (IBM, BlueGene/Q)       | United States | 8,586.6       | 10,066.3       | 2176.6
10 | Trinity, NNSA/LANL/SNL      | MPP (Cray, XC40: MIC)       | United States | 8,100.9       | 11,078.9       | 1913.7
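From the table, OFP's HPL efficiency is 13,554.6 / 25,004.9, roughly 54%, in line with the other KNL-based system listed (Cori: 14,014.7 / 27,880.7, roughly 50%).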
Green500 on Nov. 2016

#  | # HPL | Machine                | Architecture                | Country     | Rmax (TFLOPS) | MFLOPS/W
1  | 28    | DGX SaturnV            | GPU cluster (NVIDIA DGX1)   | USA         | 3,307.0       | 9462.1
2  | 8     | Piz Daint, CSCS        | MPP (Cray, XC50: CPU + GPU) | Switzerland | 9,779.0       | 7453.5
3  | 116   | Shoubu                 | PEZY ZettaScaler-1          | Japan       | 1,001.0       | 6673.8
4  | 1     | TaihuLight             | MPP (Sunway SW26010)        | China       | 93,014.6      | 6051.3
5  | 375   | QPACE3                 | Cluster (Fujitsu, KNL)      | Germany     | 447.1         | 5806.3
6  | 6     | Oakforest-PACS, JCAHPC | Cluster (Fujitsu, KNL)      | Japan       | 13,554.6      | 4985.1
7  | 18    | Theta                  | MPP (Cray XC40, KNL)        | USA         | 5,095.8       | 4688.0
8  | 162   | XStream                | MPP (Cray CS-Storm, GPU)    | USA         | 781.3         | 4112.1
9  | 33    | Camphor2               | MPP (Cray XC40, KNL)        | Japan       | 3,057.4       | 4086.8
10 | 397   | SciPhi XVI             | Cluster (KNL)               | USA         | 425.9         | 3836.7
HPCG on Nov. 2016

[HPCG ranking table; as noted earlier, OFP is #3 in the world on HPCG.]
McKernel support

- McKernel: a special lightweight kernel for many-core architectures
  - developed at U. Tokyo and now at AICS, RIKEN (led by Y. Ishikawa)
  - the KNL-ready version is almost completed
  - it can be loaded as a kernel module into Linux
  - the batch scheduler is notified via the user's script that McKernel should be used, and then applies it
  - the McKernel module is detached after job execution
XcalableMP (XMP) support

- XcalableMP: a massively parallel description language based on the PGAS model and user directives
  - originally developed at U. Tsukuba and now at AICS, RIKEN (led by M. Sato)
  - the KNL-ready version is under evaluation and tuning
  - it will be open to users as a (relatively) easy way to write large-scale parallelization as well as to tune performance
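For illustration, a minimal global-view XMP sketch in C (the directive spelling follows the XMP C syntax and may differ slightly between specification versions; the array size N and the node count 4 are arbitrary): an array is distributed block-wise over the nodes by directives, and the loop is partitioned accordingly.

```c
#include <stdio.h>
#define N 1024

/* Declare 4 executing nodes, a template of N elements,
   and distribute the template block-wise over the nodes. */
#pragma xmp nodes p[4]
#pragma xmp template t[N]
#pragma xmp distribute t[block] onto p

double a[N];
#pragma xmp align a[i] with t[i]   /* a[] is distributed like the template */

int main(void)
{
    double sum = 0.0;

    /* Iterations (and the aligned elements of a[]) are split across
       the nodes; the reduction clause combines the partial sums. */
#pragma xmp loop on t[i] reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    /* After the reduction, every node holds the global sum. */
    printf("sum = %f\n", sum);
    return 0;
}
```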
Memory Model (currently planned)

- Our challenge: semi-dynamic switching between CACHE and FLAT modes
  - Initially, the nodes are configured with a certain mixture (half and half) of Cache and Flat modes
  - the batch scheduler is notified of the requested memory configuration via the user's script
  - the batch scheduler tries to find appropriate nodes without reconfiguration
  - if there are not enough such nodes, some nodes are rebooted into the other memory configuration
  - rebooting is done by warm reboot, in groups of ~100 nodes
  - a size limitation (max. # of nodes) may be applied
- NUMA model
  - currently quadrant mode only
  - (perhaps) we will not change it dynamically
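As a user-level illustration of why the mode matters, here is a minimal sketch assuming the memkind/hbwmalloc library is available (common on KNL systems, though not stated in the slides): in Flat mode the 16 GB MCDRAM is exposed as a separate NUMA node that can be allocated explicitly, while in Cache mode it is invisible to the application and such allocations fall back to DDR4.

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory interface */

int main(void)
{
    const size_t n = 1 << 26;   /* 512 MB of doubles */

    /* In Flat mode, MCDRAM appears as a distinct NUMA node and
       hbw_check_available() returns 0; in Cache mode it does not. */
    if (hbw_check_available() == 0)
        printf("MCDRAM visible (Flat mode): allocating from high-bandwidth memory\n");
    else
        printf("MCDRAM not visible (likely Cache mode): hbw_malloc falls back per its policy\n");

    /* Allocate bandwidth-critical data from MCDRAM when possible;
       with the default preferred policy, hbw_malloc itself falls
       back to DDR4 if high-bandwidth memory is unavailable. */
    double *a = hbw_malloc(n * sizeof(double));
    if (a == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        a[i] = (double)i;

    printf("first/last: %f %f\n", a[0], a[n - 1]);

    hbw_free(a);
    return 0;
}
```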
System operation outline

- Regular operation
  - both universities share the CPU time based on the budget ratio
  - the system hardware is not split; instead the "CPU time" is split, for flexible operation (except several specially dedicated partitions)
  - single system entry for the HPCI program; each university's own programs also run under this "CPU time" sharing
- Special operation
  - (limited period) massively large-scale operation
    -> effectively using the largest-class resource in Japan for special occasions (e.g. Gordon Bell Challenge)
- Power-saving operation
  - power capping feature for energy saving
  - the scheduling feature reacts to power-saving requirements (e.g. summer time)
Machine location: Kashiwa Campus of U. Tokyo

[Map showing U. Tsukuba, the Kashiwa Campus of U. Tokyo, and the Hongo Campus of U. Tokyo]
Summary

- JCAHPC is a joint resource center for advanced HPC run by U. Tokyo and U. Tsukuba, the first such case in Japan.
- Oakforest-PACS (OFP), with 25 PFLOPS peak, is ranked #1 in Japan and #6 in the world, with Intel Xeon Phi (KNL) and OPA.
- Under JCAHPC, both universities run nation-wide resource-sharing programs including HPCI.
- JCAHPC is not just an organization to manage the resource but also a basic community for advanced HPC research.
- OFP is used not only for HPCI and other resource-sharing programs but also as a testbed for the McKernel and XcalableMP system software to support Post-K development.
