Oakforest-PACS (OFP)
Transcription
Slide 1: Oakforest-PACS (OFP): Japan's #1 System driven by KNL and OPA
Taisuke Boku, Deputy Director, Center for Computational Sciences, University of Tsukuba (with courtesy of JCAHPC members)
IXPUG Spring Conference 2017, Cambridge, 2017/04/11

Slide 2: Towards Exascale Computing
Tier-1 and tier-2 supercomputers form HPCI and move forward to exascale computing like two wheels.
[Chart: peak performance (PF, 1 to 1000) versus year (2008-2020), showing T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.), TSUBAME2.0 (Tokyo Tech.), OFP (JCAHPC: U. Tsukuba and U. Tokyo), and the Post-K Computer (RIKEN AICS), heading toward a future exascale system.]

Slide 3: Deployment plan of the 9 supercomputing centers (Feb. 2017)
Power consumption indicates the maximum of the power supply (including the cooling facility).
[Roadmap chart: planned systems at the nine national university supercomputing centers (Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, Kyushu) plus the K Computer and the Post-K Computer, over fiscal years 2014-2025. Entries relevant to this talk include HA-PACS (1,166 TF) and COMA/PACS-IX (1,001 TF) at Tsukuba, Fujitsu FX10 and Reedbush at Tokyo, and Oakforest-PACS at 25 PF / 4.2 MW operated by JCAHPC.]

Slide 4: JCAHPC: Joint Center for Advanced High Performance Computing (http://jcahpc.jp)
- Very tight collaboration between the two universities for "post-T2K".
- For the main supercomputer resources, a uniform specification for a single shared system.
- Each university is financially responsible for introducing the machine and for its operation -> unified procurement toward a single system of the largest scale in Japan.
- To manage everything smoothly, a joint organization was established -> JCAHPC.
Slide 5: Procurement Policies of JCAHPC
Based on the spirit of T2K, introducing open advanced technology:
- massively parallel PC cluster
- advanced processor for HPC
- easy-to-use and efficient interconnect
- large-scale shared file system flatly shared by all nodes
Joint procurement by the two universities:
- the largest class of budget for a national university supercomputer in Japan
- the largest system scale for a PC cluster in Japan
No accelerator, to support a wide variety of users and application fields -> not chasing absolute peak performance, and (basically) inheriting traditional application codes.
Goodness of a single system:
- scale merit by merging budgets -> largest in Japan
- ultra-large-scale single-job execution on special occasions such as a "Gordon Bell Prize Challenge"
⇒ Oakforest-PACS (OFP)

Slide 6: Oakforest-PACS (OFP)
"Oakforest" follows the U. Tokyo naming convention and "PACS" the U. Tsukuba convention ⇒ don't call it just "Oakforest"! "OFP" is much better.
- 25 PFLOPS peak
- 8,208 KNL CPUs
- Full-bisection-bandwidth Fat-tree by Omni-Path
- HPL 13.55 PFLOPS: #1 in Japan, #6 in the world
- HPCG: #3 in the world
- Green500: #6 in the world
- Full operation started in Dec. 2016
- Official program started in April 2017

Slide 7: Computation node & chassis
[Photos: water-cooling wheel and pipes; a 2U chassis holding 8 nodes.]
Each computation node (Fujitsu next-generation PRIMERGY) carries a single Intel Xeon Phi chip (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps).

Slide 8: Water cooling pipes and IME (burst buffer)
[Photos of the water-cooling pipes and the IME burst-buffer units.]

Slide 9: Specification of Oakforest-PACS
Total peak performance: 25 PFLOPS
Total number of compute nodes: 8,208
Compute node:
  Product: Fujitsu next-generation PRIMERGY server for HPC (under development at the time)
  Processor: Intel Xeon Phi 7250 (Knights Landing), 68 cores, 1.4 GHz
  Memory (high BW): 16 GB, > 400 GB/s (MCDRAM, effective rate)
  Memory (low BW): 96 GB, 115.2 GB/s (DDR4-2400 x 6 channels, peak rate)
Interconnect:
  Product: Intel Omni-Path Architecture
  Link speed: 100 Gbps
  Topology: Fat-tree with full bisection bandwidth
Login node:
  Product: Fujitsu PRIMERGY RX2530 M2
  Number of servers: 20
  Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets)
  Memory: 256 GB, 153 GB/s (DDR4-2400 x 4 channels x 2 sockets)

Slide 10: Specification of Oakforest-PACS (I/O)
Parallel file system:
  Type: Lustre
  Total capacity: 26.2 PB
  Metadata: DataDirect Networks MDS servers + SFA7700X; 4 MDS servers x 3 sets; MDT 7.7 TB (SAS SSD) x 3 sets
  Object storage: DataDirect Networks SFA14KE; 10 OSS (20 nodes); aggregate bandwidth ~500 GB/s
Fast file cache system:
  Type: burst buffer, Infinite Memory Engine (by DDN)
  Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding)
  Product: DataDirect Networks IME14K
  Number of servers: 25 (50 nodes)
  Aggregate bandwidth: ~1,560 GB/s
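Applications usually reach a parallel file system like this Lustre setup (optionally fronted by the IME burst buffer) through MPI-IO or libraries built on it; MPI-IO is available here via the Intel MPI and MVAPICH2 stacks listed later. Below is a minimal, generic MPI-IO sketch of a collective write; the file name and block size are illustrative only and nothing in it is specific to OFP.

/* Generic MPI-IO sketch: each rank writes one contiguous block of a shared
 * file with a collective call, which lets the MPI library aggregate the
 * requests into large, well-aligned I/O on a parallel file system.
 * File name and block size are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int block = 1 << 20;                      /* doubles per rank */
    double *buf = malloc((size_t)block * sizeof(double));
    for (int i = 0; i < block; i++)
        buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank targets its own disjoint region of the shared file. */
    MPI_Offset offset = (MPI_Offset)rank * block * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, block, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}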
Slide 11: Full bisection bandwidth Fat-tree by Intel Omni-Path Architecture
- 12 Director Switches of 768 ports each (source: Intel)
- 362 Edge Switches of 48 ports each: 24 uplinks to the Director Switches and 24 downlinks to nodes per Edge Switch
- Ports connected: compute nodes 8,208; login nodes 20; parallel FS 64; IME 300; management, etc. 8; total 8,600
Initially, to reduce switches and cables, we considered connecting all the nodes within subgroups by a full-bisection-bandwidth fat-tree and connecting the subgroups to each other with more than 20% of full bisection bandwidth. However, the hardware quantity turned out not to be so different from a globally full-bisection configuration, and global full bisection bandwidth is preferred for flexible job management.

Slide 12: Facility of the Oakforest-PACS system
Power consumption: 4.2 MW (including cooling)
Number of racks: 102
Cooling of compute nodes: warm-water cooling; direct cooling for the CPU, rear-door cooling for the rest; facility: cooling tower and chiller
Cooling of other equipment: air cooling; facility: PAC

Slide 13: Software of Oakforest-PACS
OS: CentOS 7 and McKernel on compute nodes; Red Hat Enterprise Linux 7 on login nodes
Compiler: gcc, Intel compiler (C, C++, Fortran)
MPI: Intel MPI, MVAPICH2
Library: Intel MKL; LAPACK, FFTW, SuperLU, PETSc, METIS, Scotch, ScaLAPACK, GNU Scientific Library, NetCDF, Parallel netCDF, Xabclib, ppOpen-HPC, ppOpen-AT, MassiveThreads
Application: mpijava, XcalableMP, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, FrontISTR, REVOCAP, OpenMX, xTAPP, AkaiKKR, MODYLAS, ALPS, feram, GROMACS, BLAST, R packages, Bioconductor, BioPerl, BioRuby
Distributed FS: Globus Toolkit, Gfarm
Job scheduler: Fujitsu Technical Computing Suite
Debugger: Allinea DDT
Profiler: Intel VTune Amplifier, Intel Trace Analyzer & Collector

Slide 14: TOP500 list on Nov. 2016
Rank / Machine / Architecture / Country / Rmax (TFLOPS) / Rpeak (TFLOPS) / MFLOPS/W
1. TaihuLight, NSCW / MPP (Sunway, SW26010) / China / 93,014.6 / 125,435.9 / 6051.3
2. Tianhe-2 (MilkyWay-2), NSCG / Cluster (NUDT, CPU + KNC) / China / 33,862.7 / 54,902.4 / 1901.5
3. Titan, ORNL / MPP (Cray XK7, CPU + GPU) / United States / 17,590.0 / 27,112.5 / 2142.8
4. Sequoia, LLNL / MPP (IBM BlueGene/Q) / United States / 17,173.2 / 20,132.7 / 2176.6
5. Cori, NERSC-LBNL / MPP (Cray XC40, KNL) / United States / 14,014.7 / 27,880.7 / not listed
6. Oakforest-PACS, JCAHPC / Cluster (Fujitsu, KNL) / Japan / 13,554.6 / 25,004.9 / 4985.1
7. K Computer, RIKEN AICS / MPP (Fujitsu) / Japan / 10,510.0 / 11,280.4 / 830.2
8. Piz Daint, CSCS / MPP (Cray XC50, CPU + GPU) / Switzerland / 9,779.0 / 15,988.0 / 7453.5
9. Mira, ANL / MPP (IBM BlueGene/Q) / United States / 8,586.6 / 10,066.3 / 2176.6
10. Trinity, NNSA/LANL/SNL / MPP (Cray XC40, MIC) / United States / 8,100.9 / 11,078.9 / 1913.7

Slide 15: Green500 on Nov. 2016
Green500 rank / HPL rank / Machine / Architecture / Country / Rmax (TFLOPS) / MFLOPS/W
1 / 28 / DGX SaturnV / GPU cluster (NVIDIA DGX-1) / USA / 3,307.0 / 9462.1
2 / 8 / Piz Daint, CSCS / MPP (Cray XC50, CPU + GPU) / Switzerland / 9,779.0 / 7453.5
3 / 116 / Shoubu / PEZY ZettaScaler-1 / Japan / 1,001.0 / 6673.8
4 / 1 / TaihuLight / MPP (Sunway SW26010) / China / 93,014.6 / 6051.3
5 / 375 / QPACE3 / Cluster (Fujitsu, KNL) / Germany / 447.1 / 5806.3
6 / 6 / Oakforest-PACS, JCAHPC / Cluster (Fujitsu, KNL) / Japan / 13,554.6 / 4985.1
7 / 18 / Theta / MPP (Cray XC40, KNL) / USA / 5,095.8 / 4688.0
8 / 162 / XStream / MPP (Cray CS-Storm, GPU) / USA / 781.3 / 4112.1
9 / 33 / Camphor2 / MPP (Cray XC40, KNL) / Japan / 3,057.4 / 4086.8
10 / 397 / SciPhi XVI / Cluster (KNL) / USA / 425.9 / 3836.7

Slide 16: HPCG on Nov. 2016
[Chart: HPCG results of Nov. 2016; Oakforest-PACS ranked #3 in the world.]
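For orientation, the OFP figures in the tables above can be cross-checked with a little arithmetic: HPL efficiency is Rmax/Rpeak, and the Green500 ratio implies the power drawn during the HPL run (a different quantity from the 4.2 MW maximum facility supply quoted earlier, which includes cooling). A small sketch using only the numbers listed above:

/* Cross-check of the Oakforest-PACS figures quoted in the tables above:
 * HPL efficiency = Rmax / Rpeak, and the Green500 ratio (MFLOPS/W)
 * implies the power drawn during the HPL run itself. */
#include <stdio.h>

int main(void)
{
    const double rmax_tflops  = 13554.6;  /* HPL result (TOP500 #6) */
    const double rpeak_tflops = 25004.9;  /* theoretical peak       */
    const double mflops_per_w = 4985.1;   /* Green500 #6 ratio      */

    double efficiency   = rmax_tflops / rpeak_tflops;           /* ~0.54   */
    double hpl_power_w  = rmax_tflops * 1.0e6 / mflops_per_w;   /* in W    */
    double hpl_power_mw = hpl_power_w / 1.0e6;                  /* ~2.7 MW */

    printf("HPL efficiency      : %.1f %%\n", 100.0 * efficiency);
    printf("Power during HPL run: %.2f MW\n", hpl_power_mw);
    return 0;
}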
Slide 17: McKernel support
- McKernel is a special lightweight kernel for many-core architectures.
- Developed at U. Tokyo and now at RIKEN AICS (led by Y. Ishikawa).
- The KNL-ready version is almost completed.
- It can be loaded as a kernel module into Linux.
- The batch scheduler is notified via the user's job script that McKernel should be used, and then applies it.
- The McKernel module is detached after job execution.

Slide 18: XcalableMP (XMP) support
- XcalableMP is a massively parallel description language based on the PGAS model and user directives.
- Originally developed at U. Tsukuba and now at RIKEN AICS (led by M. Sato).
- The KNL-ready version is under evaluation and tuning.
- It will be open to users as a (relatively) easy way to describe large-scale parallelization as well as to tune performance.

Slide 19: Memory model (currently planned)
Our challenge is semi-dynamic switching between Cache and Flat modes:
- Initially, the nodes are configured with a fixed mixture (half and half) of Cache and Flat modes.
- The batch scheduler is notified of the requested memory configuration by the user's job script.
- The batch scheduler first tries to find appropriate nodes without reconfiguration.
- If there are not enough such nodes, some nodes are rebooted into the other memory configuration.
- Rebooting is done as a warm reboot, in groups of roughly 100 nodes.
- A size limit (maximum number of nodes) may be applied.
NUMA model:
- Currently quadrant mode only.
- (Perhaps) we will not change it dynamically.
(A minimal sketch of flat-mode MCDRAM allocation follows after the summary.)

Slide 20: System operation outline
Regular operation:
- Both universities share the CPU time based on the budget ratio.
- The system hardware is not split; instead, the "CPU time" is split, for flexible operation (except for several specially dedicated partitions).
- There is a single system entry for the HPCI program, and each university's own programs also run under the "CPU time" sharing.
Special operation:
- Massively large-scale operation for a limited period -> effectively using the largest-class resource in Japan on special occasions (e.g., a Gordon Bell Challenge).
Power-saving operation:
- Power-capping feature for energy saving.
- The scheduling feature reacts to power-saving requirements (e.g., in summer).

Slide 21: Machine location: Kashiwa Campus of U. Tokyo
[Map showing U. Tsukuba, the Kashiwa Campus of U. Tokyo (where the machine is installed), and the Hongo Campus of U. Tokyo.]

Slide 22: Summary
- JCAHPC is a joint resource center for advanced HPC operated by U. Tokyo and U. Tsukuba, the first case of its kind in Japan.
- Oakforest-PACS (OFP), with a 25 PFLOPS peak, is ranked #1 in Japan and #6 in the world, built on Intel Xeon Phi (KNL) and OPA.
- Under JCAHPC, both universities run nationwide resource-sharing programs including HPCI.
- JCAHPC is not just an organization that manages the resource but also a base community for advanced HPC research.
- OFP is used not only for HPCI and other resource-sharing programs but also as a testbed for the McKernel and XcalableMP system software supporting Post-K development.
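As referenced in the memory-model slide above, on nodes booted in Flat mode the 16 GB MCDRAM is exposed separately from the DDR4, and an application can place bandwidth-critical buffers there explicitly. The sketch below uses the memkind library's hbwmalloc interface, which is one common way to do this on KNL systems; memkind is not listed on the OFP software slide, so treat it as an assumed option rather than a documented site feature, and the array size is illustrative only.

/* Flat-mode sketch: place a bandwidth-critical array into MCDRAM through
 * memkind's hbwmalloc interface, falling back to DDR4 when no high-bandwidth
 * memory is visible (e.g. on a Cache-mode node). Link against -lmemkind.
 * Illustrative only; not tied to the OFP batch environment. */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void)
{
    const size_t n = 100 * 1000 * 1000;          /* ~0.8 GB of doubles      */
    int use_hbw = (hbw_check_available() == 0);  /* 0 means HBM is present  */

    double *a = use_hbw ? hbw_malloc(n * sizeof(double))
                        : malloc(n * sizeof(double));
    if (!a) return 1;
    printf("allocated in %s\n", use_hbw ? "MCDRAM" : "DDR4");

    for (size_t i = 0; i < n; i++)               /* stand-in for a bandwidth-bound kernel */
        a[i] = 2.0 * (double)i;

    if (use_hbw) hbw_free(a); else free(a);      /* match the allocator used */
    return 0;
}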