Computational Chemistry Applications: Performance on High-End and Commodity-class Computers
Martyn F. Guest, Computational Science and Engineering Department, CCLRC Daresbury Laboratory ([email protected])
Acknowledgements: P. Sherwood, H.J. van Dam, W. Smith, I. Bush, A. Sunderland and M. Plummer (DL); J. Nieplocha, E. Apra (PNNL)
ScicomP 5, Daresbury Laboratory, 7 May 2002 - http://www.dl.ac.uk/CFS/parallel/chemistry

Outline
- Computational Chemistry - background
- Commodity-based and high-end systems
  - Single-node performance and the interconnect bottleneck
  - Prototype commodity systems: CS1 - CS9
  - High-end systems: IBM SP/WH2-375 and Regatta-H, SGI Origin 3800, Compaq AlphaServer SC and Cray SuperCluster
  - Performance metrics
- Application performance
  - Molecular simulation (DL_POLY and CHARMM)
  - Electronic structure (Global Arrays (GAs); linear algebra (PeIGS)): NWChem and GAMESS-UK
  - Materials simulation (CRYSTAL, CPMD and CASTEP) and engineering (ANGUS)
- Application performance analysis: VAMPIR and instrumenting the GA tools
- Summary

Scaling of Molecular Computations
[Chart: relative computing power (1 to 10^7, from the Cray 1S through the Cray C90 (16) to a TeraFLOPS MPP) against number of atoms (1 to 10^5) for four classes of calculation: reaction energetics (CI/CCSD), molecular structures by Hartree-Fock, molecular structures by density functional techniques, and biomolecular simulation by molecular dynamics.]

Capability Computing - Cost Effectiveness and Dependencies
- MPP/ASCI (1280 CPUs, IBM SP, 1 TB): HPC community, capability computing; relative cost ~10,000 units (CPU 2,000, memory 2,000, I/O 200-300)
- SMP (16-processor SGI Origin 3800, R14k, 8 GB RAM): departmental, capacity computing; ~100 units (CPU 15, memory 20, I/O 20-30)
- PC (Pentium 4 / 2 GHz, 512 MB memory, 30 GB disk): desktop; 1 unit (CPU 1, memory 1, I/O 1)
- Commodity systems scale at roughly 1.5 x N
- Three-year continual access to 32 CPUs: high-end time at £0.5 per CPU hour costs £419,358; an in-house Beowulf costs about £50,000.

High-end Commodity-based Systems
"Need for real workloads on large clusters. To date mainly experimental systems, so failures are viewed as inevitable", but ...
 Cluster / Site / Nodes/Processors / Peak (Gflop) / CPU / Interconnect
 1. Locus - Locus, USA - 960/1920 - 1920 - PIII/1000 - Fast Ethernet
 2. HELICS - Heidelberg - 256/512 - 1434 - AMD MP/1400 - Myrinet
 3. EMPIRE - Mississippi - 519/1038 - 1157 - PIII/1000 - Myrinet
 4. RHIC - Brookhaven - 706/1412 - 1083 - PIII/PII - Fast Ethernet
 5. Biopentium - Inpharmatica - 800/1220 - 1061 - PIII/700+ - Fast Ethernet
 6. Genesis - Shell, NL - 1030/1038 - 1037 - PIII/1000 - Gigabit Ethernet
 7. Platinum - NCSA - 516/1032 - 1032 - PIII/1000 - Myrinet
 8. CBRC Magi - AIST, Japan - 520/1040 - 967 - PIII/933 - Myrinet
 9. Score III - RWCP, Japan - 512/1024 - 955 - PIII/933 - Myrinet
 10. ICE Box - Utah, USA - 303/406 - 915 - PIII + AMD - Fast Ethernet

High-End Systems Evaluated
- Cray T3E/1200E: 816-processor system at Manchester (CSAR service); 600 MHz EV56 Alpha processors with 256 MB memory
- IBM SP/WH2-375 and Regatta-H: 32-CPU system at DL, 4-way Winterhawk2 SMP "thin nodes" with 2 GB memory and 375 MHz Power3-II processors (8 MB L2 cache); IBM SP Regatta-H (32-way node, 1.3 GHz Power4 CPUs) at Montpellier
- Compaq AlphaServer SC: 4-way ES40/667 A21264A (APAC) and 833 MHz SMP nodes (2 GB RAM, 8 MB L2 cache); TCS1 system at PSC comprising 750 4-way ES45 nodes (3,000 EV68 CPUs) with 4 GB memory per node and 8 MB L2 cache; Quadrics "fat tree" interconnect (5 us latency, 250 MB/s bandwidth)
- SGI Origin 3800: SARA (1000 CPUs) - NUMAlink with R14k/500 and R12k/400 CPUs
- Cray SuperCluster at Eagan: Linux Alpha cluster (96 x API CS20s, dual 833 MHz EV67 CPUs); Myrinet interconnect, Red Hat 6.2

Technical Progress in 2001/2002 - Hardware and Software Evaluation (www.cse.clrc.ac.uk/Activity/DisCo)
- CPUs: PC systems (Intel 1 GHz Pentium III, 2.0 GHz Pentium 4, 1.533 GHz AMD K7); Itanium (SGI Troon, IBM, HP RX4610); Alpha systems (ES40/833, ES45/1000, CS20/833, UP2000 at 667 and 833 MHz)
- Networks: Fast Ethernet options (cards, switches, channel-bonding, 100 Mbit switch); SCI, QSNet and Myrinet (P4/2000 cluster, Cray SuperCluster, CS20)
- System software: message-passing software (LAM MPI, LAM MPI-VIA (100 us down to 60 us), MPICH, VMI, SCAMPI), libraries (ATLAS, NASA, MKL), compilers (Absoft, PGI, Intel ifc and efc, GNU g77), GA tools (PNNL); resource management software (LobosQ, PBS, Beowulf, LSF etc.)

Commodity Systems (CSx) - Prototype / Evaluation Hardware
 System / Location / CPUs / Configuration
 CS1 - Daresbury - 32 - Pentium III / 450 MHz + Fast Ethernet (EPSRC)
 CS2 - Daresbury - 64 - QSNet Alpha/Linux cluster: 24 x dual UP2000/EV67-667 plus 8 x dual CS20/EV67-833 ("loki")
 CS3 - RAL - 16 - Athlon K7 850 MHz + Myrinet
 CS4 - Sara - 32 - Athlon K7 1.2 GHz + Fast Ethernet
 CS6 - CLiC - 528 - Pentium III / 800 MHz + Fast Ethernet (Chemnitzer Linux Cluster)
 CS7 - Daresbury - 64 - AMD K7/1000 MP + SCALI/SCI ("ukcp")
 CS8 - NCSA - 320 - 160 x dual IBM Itanium/800 + Myrinet 2k ("titan")
 CS9 - Bristol - 96 - Pentium 4 Xeon/2000 + Myrinet 2k ("dirac")
 Prototype systems: CS0 - Daresbury - 10 - Pentium II/266; CS5 - Daresbury - 16 - 8 x dual Pentium III/933 + SCALI

Communications Benchmark: PMB (Pallas MPI Benchmark Suite, V2.2)
- Point-to-point (MB/s): PingPong, PingPing, Sendrecv, Exchange
- Collective operations (time in us, as a function of the number of CPUs): Allreduce, Reduce, Reduce_scatter, Allgather, Allgatherv, Alltoall, Bcast, Barrier
- Message lengths: 0 to 4,194,304 bytes
- Systems investigated: Cray T3E/1200E; SGI Origin 3800/R14k-500 (Teras); IBM SP/WH2-375 ("trailblazer" switch) and SP/NH2-375 (single-plane Colony); Cray SuperCluster (833 MHz EV68 dual CS20 nodes with Myrinet); Compaq AlphaServer SC ES45/1 GHz (TCS1); IBM Regatta-H (intra-node); CS1 PIII/450 + Fast Ethernet (MPICH); CS2 Alpha Linux cluster (dual UP2000/667); CS3 AMD Athlon/850 + Myrinet (MPICH); CS6 PIII/800 + Fast Ethernet (MPICH); CS7 AMD K7/1000 MP + SCALI; CS8 Itanium/800 + Myrinet 2k (NCSA); CS9 dual P4/2000 Xeon + Myrinet 2k
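The point-to-point and collective figures below were measured with the Pallas suite. Purely for orientation, a minimal MPI ping-pong timing loop of the kind such benchmarks are built around might look like the following; this is a sketch rather than PMB itself, and the buffer size, repetition count and reported statistics are invented for illustration.

```c
/* Minimal ping-pong latency/bandwidth sketch between ranks 0 and 1.
 * Build with an MPI C compiler, e.g. mpicc pingpong.c -o pingpong. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nrep = 100;              /* repetitions per message size */
    const int maxbytes = 1 << 20;      /* largest message: 1 MB        */
    char *buf = malloc(maxbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Abort(MPI_COMM_WORLD, 1); }

    for (int nbytes = 1; nbytes <= maxbytes; nbytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < nrep; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* one-way time per message: half the round trip */
        double t = (MPI_Wtime() - t0) / (2.0 * nrep);
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %8.2f MB/s\n",
                   nbytes, t * 1e6, nbytes / t / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```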
PingPong Performance - PMB Benchmark (Pallas)
[Chart: point-to-point bandwidth (MB/s) against message length (1 byte to 10 MB) for CS7 AMD K7/1000 MP + SCALI, CS2 QSNet Alpha Cluster/667, Compaq AlphaServer SC ES45/1000, IBM Regatta-H, Cray T3E/1200E, Cray SuperCluster/Myrinet, IBM SP/WH2-375 quad, SGI Origin 3800/R14k-500, IBM SP/NH2-375 16-way, CS8 Titan Itanium/800 + Myrinet 2k and CS9 Pentium 4/2000 + Myrinet 2k.]

MPI_Allreduce Performance - PMB Benchmark (Pallas), 16 CPUs
[Chart: measured MPI_Allreduce time (us) against message length (1 byte to 10 MB) for the same set of systems.]

Interconnect Benchmark - EFF_BW (aggregate bandwidth on 16 CPUs, MB/s)
- Cray T3E/1200E: 2151
- IBM SP/Regatta-H: 1217
- AlphaServer SC ES45/1000 (QSNet): 964 (1 CPU/node), 507 (4 CPUs/node)
- CS2 QSNet Alpha EV67: 698 (1 CPU/node), 456 (2 CPUs/node)
- CS8 Itanium/800 + Myrinet: 667 (1 CPU/node), 516 (2 CPUs/node)
- CS9 P4/2000 Xeon + Myrinet: 635 (1 CPU/node), 476 (2 CPUs/node)
- SGI Origin 3800/R14k-500: 530
- CS7 AMD K7/1000 MP + SCALI: 481 (1 CPU/node), 367 (2 CPUs/node)
- IBM SP/NH2-375 (16 = 8+8): 362
- IBM SP/WH2-375: 264
- CS3 AMD K7/850 + Myrinet (MPICH): 255
- CS4 AMD K7/1200 + Fast Ethernet (LAM): 84.2
- CS6 PIII/800 + Fast Ethernet: 77.9 (LAM), 44.2 (MPICH)

Application Codes
Application-driven performance comparisons between commodity-based systems and both current MPP and ASCI-style SMP-node platforms:
- Computational Chemistry
  - Molecular simulation: DL_POLY (parallel MD code with many applications) and CHARMM (macromolecular MD and energy minimisation)
  - Molecular electronic structure: GAMESS-UK and NWChem (ab initio electronic structure codes)
  - Materials: CRYSTAL (periodic HF/DFT); CPMD and CASTEP (Car-Parrinello plane-wave codes)
- Engineering: ANGUS (regular-grid domain-decomposition engineering code)
Performance metric: percentage of a 32-node high-end system.

Performance Metrics: 1999-2001
Attempted to quantify delivered performance of the commodity-based systems against the current MPP (CSAR Cray T3E/1200E) and ASCI-style SMP-node platforms (e.g. SGI Origin 3800), i.e. the performance metric is the percentage of the 32-node Cray T3E:
1. T(32 nodes, Cray T3E/1200E) / T(32 CPUs, CSx), e.g. T3E vs CS1 Pentium III/450 + FE, T3E vs CS6 Pentium III/800 + FE, and T3E vs CS2 Alpha Linux Cluster + Quadrics
2. T(32 CPUs, SGI Origin 3800) / T(32 CPUs, CSx), e.g. Origin 3800 vs CS2 Alpha Linux Cluster + Quadrics
Beowulf Comparisons with the T3E and O3800/R14k-500
[Table: delivered performance of the Fast Ethernet Pentium III systems (CS1, CS6) and of CS2 (QSNet Alpha Linux cluster) relative to the 32-node Cray T3E/1200E and the 32-CPU Origin 3800/R14k-500, for GAMESS-UK (SCF, DFT, DFT Jfit, DFT gradient, MP2 gradient, SCF forces), NWChem (DFT Jfit), CRYSTAL, DL_POLY (Ewald-based and bond-constraint benchmarks), CHARMM, CASTEP, CPMD, ANGUS and FLITE3D. The Pentium III + Fast Ethernet systems deliver roughly 33-180% of the 32-node T3E depending on application; CS2 delivers roughly 140-480% of the T3E and roughly 67-145% of the Origin 3800 (e.g. GAMESS-UK SCF: 256% of the T3E, 99% of the Origin).]

Performance Metrics: 2002
Attempt to quantify delivered performance of the current commodity-based systems against high-end SMP-node platforms, the Compaq AlphaServer SC ES45/1000 (PSC) and the SGI Origin 3800 (SARA), i.e. the performance metric is the percentage of the 32-CPU AlphaServer SC:
1. T(32 CPUs, AlphaServer SC ES45/1000) / T(32 CPUs, CSx), e.g. AlphaServer vs CS9 Pentium 4 Xeon/2000 + Myrinet 2k and AlphaServer vs CS2 Alpha Linux Cluster + Quadrics
2. T(32 CPUs, SGI Origin 3800/R14k-500) / T(32 CPUs, CSx), e.g. Origin 3800 vs CS9 Pentium 4 Xeon/2000 + Myrinet 2k
In the application charts that follow, the paired percentage annotations are these two ratios for CS9 (relative to the AlphaServer SC and to the Origin 3800 respectively).

Molecular Simulation
Molecular dynamics codes: DL_POLY and CHARMM

DL_POLY Parallel Benchmarks (Cray T3E/1200) - Version 2: Replicated Data
1. Metallic Al (19,652 atoms, Sutton-Chen)
2. Peptide in water (neutral groups + SHAKE, 3,993 atoms)
3. Transferrin in water (neutral groups + SHAKE, 27,593 atoms)
4. NaCl (Ewald, 27,000 ions)
5. NaK-disilicate glass (8,640 atoms, MTS + Ewald)
6. K/valinomycin in water (SHAKE, AMBER, 3,838 atoms)
7. Gramicidin in water (SHAKE, 12,390 atoms)
8. MgO microcrystal (5,416 atoms)
9. Model membrane/valinomycin (MTS, 18,886 atoms)
[Charts: speed-up against number of T3E nodes (up to 256 and up to 64) for the nine benchmarks, with the linear ideal shown for reference.]
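DL_POLY Version 2's replicated-data strategy gives every node a full copy of the coordinates; each node computes a share of the pair forces, and a global sum then rebuilds the complete force array everywhere. A toy sketch of that pattern follows - it is not DL_POLY's actual force field or data layout, and the system size, force law and variable names are invented for illustration.

```c
/* Replicated-data pair-force evaluation: every rank holds all positions,
 * computes a strided subset of the pair interactions, then a global sum
 * (MPI_Allreduce) replicates the full force array on every rank. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

#define N 1000   /* toy system size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double (*x)[3]    = malloc(sizeof(double[N][3]));   /* replicated coords */
    double (*f)[3]    = calloc(N, sizeof(double[3]));   /* partial forces    */
    double (*ftot)[3] = malloc(sizeof(double[N][3]));   /* summed forces     */

    /* coordinates would normally be read on rank 0 and broadcast */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < 3; k++) x[i][k] = (double)(i + k);

    /* each rank takes every nprocs-th outer atom of the pair loop */
    for (int i = rank; i < N; i += nprocs) {
        for (int j = i + 1; j < N; j++) {
            double d[3], r2 = 0.0;
            for (int k = 0; k < 3; k++) { d[k] = x[i][k] - x[j][k]; r2 += d[k]*d[k]; }
            double w = 1.0 / (r2 * sqrt(r2) + 1.0e-12);  /* toy 1/r^2 force */
            for (int k = 0; k < 3; k++) { f[i][k] += w*d[k]; f[j][k] -= w*d[k]; }
        }
    }

    /* global sum: afterwards every rank holds the complete force array.
     * This all-to-all reduction is what limits scaling at high CPU counts. */
    MPI_Allreduce(f, ftot, 3 * N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(x); free(f); free(ftot);
    MPI_Finalize();
    return 0;
}
```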
DL_POLY on the Cray T3E, High-end and Commodity-based Systems - I
Bench 4: NaCl, 27,000 ions, Ewald, 75 time steps, cutoff = 24 Å.
[Chart: measured time (s) on 16 and 32 CPUs for CS6 PIII/800 + FE/LAM, CS7 AMD K7/1000 MP + SCI, CS4 AMD K7/1200 + FE, CS9 P4/2000 + Myrinet, CS2 QSNet Alpha Cluster/667, CS8 Itanium/800 + Myrinet, Cray T3E/1200E, IBM SP/WH2-375, SGI Origin 3800/R14k-500, Cray SuperCluster/833, IBM SP/Regatta-H and AlphaServer SC ES45/1000; 128-CPU T3E reference time = 94 s; chart annotation 44%, 71%.]

DL_POLY on the Cray T3E, High-end and Commodity-based Systems - II
Bench 5: NaK-disilicate glass, 8,640 atoms, MTS + Ewald, 270 time steps.
[Charts: measured time (s) on 16-64 CPUs for the commodity and high-end systems, plus speed-up to 128 CPUs against the linear ideal; T3E reference value at 128 CPUs = 75; chart annotation 53%, 84%.]

DL_POLY on the Cray T3E, High-end and Commodity-based Systems - III
Bench 7: Gramicidin in water, rigid bonds and SHAKE, 12,390 atoms, 500 time steps.
[Chart: measured time (s) on 16 and 32 CPUs for the same set of systems; 128-CPU T3E reference time = 166 s; chart annotation 41%, 64%.]

CHARMM
- CHARMM (Chemistry at HARvard Macromolecular Mechanics) is a general-purpose molecular mechanics, molecular dynamics and vibrational analysis package for modelling and simulating the structure and behaviour of macromolecular systems (proteins, nucleic acids, lipids etc.).
- Supports energy minimisation and MD approaches using a classical parameterised force field. J. Comp. Chem. 4 (1983) 187-217.
- Parallel benchmark: MD calculation of carboxy-myoglobin (MbCO) with 3,830 water molecules.
- QM/MM model for the study of reacting species: the QM energy is incorporated as part of the system into the force field, with coupling between GAMESS-UK (QM) and CHARMM.

Parallel CHARMM Benchmark
MD calculation of carboxy-myoglobin (MbCO) with 3,830 water molecules: 14,026 atoms, 1,000 steps (1 ps), 12-14 Å shift.
[Chart: measured time (s) on 16-64 CPUs for CS6, CS2, CS7, CS9, the SGI Origin 3800/R14k-500 and the AlphaServer SC ES45/1000, with T3E reference times T3E(16) = 665 s and T3E(128) = 106 s; chart annotation 95%, 103%.]
Migration from Replicated to Distributed Data - DL_POLY-3: Domain Decomposition
- Distribute atoms and forces across the nodes: more memory-efficient, so much larger cases (10^5 - 10^7 atoms) can be addressed.
- SHAKE and short-range forces require only neighbour communication, so communications scale linearly with the number of nodes.
- The Coulombic energy remains global; the strategy depends on problem and machine characteristics.

DL_POLY-3: Halo Data Exchange
- Replaces the global summation: co-ordinates and constraint information move only between neighbouring processors.
- Measurements on the Origin 3000 ("green"): 64 processors, Texch = 51 ms; 128 processors, Texch = 30 ms (shorter messages, increasing sparsity). A minimal sketch of this neighbour-only exchange pattern follows.
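As a rough illustration of the neighbour-only communication that replaces the global sum, here is a one-dimensional halo (ghost-cell) exchange. DL_POLY-3 actually decomposes in three dimensions and exchanges link-cell data, so treat this purely as a sketch with invented array names and sizes.

```c
/* 1-D halo exchange: each rank owns `nloc` cells plus one ghost cell at
 * each end, refreshed from the left/right neighbours every step. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nloc = 1000;                    /* cells owned by this rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* layout: [0] left ghost | [1..nloc] owned | [nloc+1] right ghost */
    double *cell = calloc(nloc + 2, sizeof(double));
    int left  = (rank - 1 + nprocs) % nprocs;      /* periodic neighbours */
    int right = (rank + 1) % nprocs;

    /* send my right edge to the right neighbour while receiving my left
     * ghost from the left neighbour, then the mirror-image exchange */
    MPI_Sendrecv(&cell[nloc], 1, MPI_DOUBLE, right, 0,
                 &cell[0],    1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&cell[1],        1, MPI_DOUBLE, left,  1,
                 &cell[nloc + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* short-range forces for the owned cells can now be computed from
     * local plus ghost data - no global communication is involved */

    free(cell);
    MPI_Finalize();
    return 0;
}
```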
DL_POLY-3: Coulomb Energy Evaluation
- The current implementation is based on the Particle Mesh Ewald scheme: optimal for small processor counts, it includes a Fourier transform of the smoothed charge density (reciprocal-space grid typically 64x64x64 to 128x128x128).
- The initial FFT implementation is serial on small numbers of nodes.
- MPP strategies:
  - global plane- or column-based FFT (requires a scalable Alltoallv) with data redistribution to blocks
  - local FFT (multithreaded on a single Regatta node)
  - block-decomposed FFT (Ian Bush, CLRC)
  - new algorithms (e.g. multigrid methods from Tom Darden)

DL_POLY-3: Coulomb Energy Evaluation - Plane versus Block Decomposition
- Conventional routines (e.g. FFTW) assume plane or column distributions: a global transpose of the data is required to complete the 3D FFT, and additional costs are incurred re-organising the data from the natural block domain decomposition.
- An alternative FFT algorithm has been designed to reduce communication costs: the 3D FFT is performed as a series of 1D FFTs, each involving communications only between blocks in a given column. More data is transferred, but in far fewer messages; rather than all-to-all, the communications are column-wise only.

DL_POLY-3: Coulomb Energy Evaluation - NaCl Scaling
- DL_POLY-2.11: 27,000 ions, Ewald, 500 time steps, cutoff = 24 Å
- DL_POLY-3: 27,000 ions, PME, 500 time steps, cutoff = 12 Å; and 216,000 ions, PME, 200 time steps, cutoff = 12 Å
[Chart: speed-up to 256 CPUs on the AlphaServer SC and CS9 P4/2000 against the linear ideal; speed-ups of around 122-171 are reached at high CPU counts.]

DL_POLY-3: Macromolecular Simulations
- DL_POLY-2.11: Gramicidin in water, rigid bonds and SHAKE, 12,390 atoms, 500 time steps
- DL_POLY-3: 99,120 atoms, rigid bonds and SHAKE, 100 time steps; and 792,960 atoms, rigid bonds and SHAKE, 50 time steps
[Chart: speed-up to 128 CPUs on the AlphaServer SC and CS9 P4/2000 against the linear ideal; the largest case reaches S(256) = 175.]

Molecular Electronic Structure
Ab initio electronic structure codes: NWChem and GAMESS-UK

Distributed Data SCF
Pictorial representation of the iterative SCF process as (i) a sequential process and (ii) a distributed-data parallel process. MOAO represents the molecular orbitals, P the density matrix and F the Fock (Hamiltonian) matrix.
- Sequential: guess orbitals MOAO -> density P (dgemm) -> Fock build F (integrals; V_1e, V_Coul, V_XC) -> sequential eigensolver -> new MOAO, repeated until converged.
- Distributed data: the same cycle, with ga_dgemm for the density build, a distributed Fock build, and PeIGS as the parallel eigensolver.

High-End Computational Chemistry - The NWChem Software
Capabilities (direct, semi-direct and conventional):
- RHF, UHF, ROHF using up to 10,000 basis functions; analytic first and second derivatives
- DFT with a wide variety of local and non-local XC potentials, using up to 10,000 basis functions; analytic first and second derivatives
- CASSCF; analytic first and numerical second derivatives
- Semi-direct and RI-based MP2 calculations for RHF and UHF wavefunctions using up to 3,000 basis functions; analytic first derivatives and numerical second derivatives
- Coupled cluster, CCSD and CCSD(T), using up to 3,000 basis functions; numerical first and second derivatives of the CC energy
- Classical molecular dynamics and free-energy simulations, with the forces obtainable from a variety of sources

Global Arrays
A single, shared data structure over physically distributed data. Tools developed as part of the NWChem project at PNNL (R.J. Harrison, J. Nieplocha et al.):
- Shared-memory-like programming model: fast local access, NUMA-aware and easy to use, MIMD and data-parallel modes, inter-operates with MPI
- BLAS and linear algebra interface
- Ported to the major parallel machines (IBM, Cray, SGI, clusters, ...)
- Originated in an HPCC project; used by five major chemistry codes, financial futures forecasting, astrophysics and computer graphics
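For a flavour of the GA programming model behind the ga_dgemm step in the distributed SCF cycle above, here is a minimal sketch written against the Global Arrays C interface as commonly documented (GA_Initialize, NGA_Create, GA_Dgemm, ...). Exact headers, the MA scratch-memory initialisation and the chunking arguments vary between GA versions, so treat these details as assumptions rather than a verified build recipe.

```c
/* Sketch: create distributed N x N arrays and multiply them with the
 * library's distributed dgemm - the pattern behind ga_dgemm in the
 * distributed-data SCF cycle. */
#include <mpi.h>
#include "ga.h"          /* Global Arrays C interface (assumed header)  */
#include "macdecls.h"    /* MA memory allocator shipped with GA         */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);    /* local scratch for GA internals */

    int n = 1000;
    int dims[2]  = { n, n };
    int chunk[2] = { -1, -1 };           /* let GA choose the distribution */

    int g_a = NGA_Create(C_DBL, 2, dims, "matrix A", chunk);
    int g_b = NGA_Create(C_DBL, 2, dims, "matrix B", chunk);
    int g_c = NGA_Create(C_DBL, 2, dims, "matrix C", chunk);

    double one = 1.0;
    GA_Fill(g_a, &one);                  /* every process sees one array   */
    GA_Fill(g_b, &one);

    /* distributed C = A * B; each process operates on its own patches */
    GA_Dgemm('N', 'N', n, n, n, 1.0, g_a, g_b, 0.0, g_c);
    GA_Sync();

    GA_Destroy(g_a);  GA_Destroy(g_b);  GA_Destroy(g_c);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```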
PeIGS 3.0 Parallel Performance
Solution of real symmetric generalized and standard eigensystem problems; developed as part of the NWChem project at PNNL (R.J. Harrison, J. Nieplocha et al.). Features (not always available elsewhere):
- Inverse iteration using the Dhillon-Fann-Parlett parallel algorithm (fastest uniprocessor performance and good parallel scaling)
- Guaranteed orthonormal eigenvectors in the presence of large clusters of degenerate eigenvalues
- Packed storage and smaller scratch-space requirements
Compared against PDSYEV (ScaLAPACK 1.5) and PDSYEVD (ScaLAPACK 1.7) on Fock matrices generated in typical Hartree-Fock SCF calculations (1,152 and 3,888 basis functions).

Parallel Eigensolvers: Real Symmetric Eigenvalue Problems
[Charts: time (s) against processor count on the SGI Origin 3800/R12k-400 ("green") for PeIGS 2.1, PeIGS 3.0, PDSYEV (ScaLAPACK 1.5), PDSYEVD (ScaLAPACK 1.7) and BFG-Jacobi (DL), for Fock matrices of dimension N = 1,152 (2-32 processors, with a Cray T3E/1200 PeIGS 2.1 reference) and N = 3,888 (16-512 processors).]

Parallel Eigensolvers on Commodity Systems
[Charts: time (s) against processor count (4-64) for the N = 3,888 Fock matrix, comparing ScaLAPACK PDSYEV with one and two CPUs per node on CS7 (AMD K7/1000 MP + SCALI), and PDSYEV, PDSYEVD and PeIGS 2.1 with one and two CPUs per node on CS9 (P4/2000 Xeon + Myrinet).]
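The serial building block these libraries parallelise is the dense symmetric eigensolve of the Fock matrix. For orientation, a minimal call to the corresponding LAPACK routine through the C interface (LAPACKE) is shown below; PeIGS and ScaLAPACK distribute exactly this operation over processors, which the snippet makes no attempt to do, and the small test matrix is invented.

```c
/* Solve the standard real symmetric eigenproblem F C = C e with
 * LAPACKE_dsyev - the operation that PeIGS / PDSYEV distribute. */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    const int n = 3;
    /* symmetric test matrix, stored row-major */
    double f[9] = { 2.0, -1.0,  0.0,
                   -1.0,  2.0, -1.0,
                    0.0, -1.0,  2.0 };
    double eval[3];

    /* 'V': also compute eigenvectors (returned in f); 'U': use upper triangle */
    lapack_int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U',
                                    n, f, n, eval);
    if (info != 0) { fprintf(stderr, "dsyev failed: %d\n", (int)info); return 1; }

    for (int i = 0; i < n; i++)
        printf("eigenvalue %d = %f\n", i, eval[i]);
    return 0;
}
```

In the SCF context the generalized problem F C = S C e is first reduced to standard form (for example via Cholesky factorisation of the overlap matrix S), a reduction PeIGS also handles directly.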
Case Studies - Zeolite Fragments
DFT calculations with NWChem and GAMESS-UK on zeolite fragments of increasing size (orbital / fitting basis functions):
- Si8O7H18: 347/832
- Si8O25H18: 617/1,444
- Si26O37H36: 1,199/2,818
- Si28O67H30: 1,687/3,928
Orbital basis: DZVP (O, Si), DZVP2 (H); Coulomb fitting basis (Godbout et al.): DGAUSS-A1 (O, Si), DGAUSS-A2 (H). Both codes use an auxiliary fitting basis for the Coulomb energy, with the 3-centre 2-electron integrals held in core.

DFT Coulomb Fit - NWChem (smaller fragments)
[Charts: measured time (s) on 16 and 32 CPUs for Si8O7H18 (347/832) and Si8O25H18 (617/1,444) on CS6, CS7, CS9, CS2, the Cray T3E/1200E, SGI Origin 3800/R14k-500 and AlphaServer SC ES45/1000; chart annotations 76%, 95% and 88%, 104% respectively.]

DFT Coulomb Fit - NWChem (larger fragments)
[Charts: measured time (s) on 32-128 CPUs for Si26O37H36 (1,199/2,818) and Si28O67H30 (1,687/3,928) on CS7, CS9, CS2, the Cray T3E/1200E, SGI Origin 3800/R14k-500 and AlphaServer SC ES45/1000; 256-CPU IBM SP/P2SC-120 reference times are 1,137 s and 2,766 s; chart annotations 79%, 210% and 85%, 227% respectively.]

NWChem - DFT (LDA) Performance on the SGI Origin 3800: Zeolite ZSM-5
- DZVP basis (DZV_A2) and DGauss A1_DFT fitting basis: AO basis 3,554 functions, CD (fitting) basis 12,713 functions
- MIPS R14k-500 CPUs (Teras); wall time for 13 SCF iterations: 64 CPUs = 5,242 s, 128 CPUs = 3,951 s (estimated time on 32 CPUs = 40,000 s)
- 3-centre 2-electron integrals = 1.00 x 10^12; after Schwarz screening = 5.95 x 10^10; 100% of the 3-centre 2-electron integrals held in core

Parallel Implementations of GAMESS-UK
- Extensive use of the Global Array (GA) tools and parallel linear algebra from the NWChem project (EMSL)
- SCF and DFT energies and gradients: replicated data, but GA tools are used for caching of I/O for restart and checkpoint files, for storage of the 3-centre 2-electron integrals in the DFT Jfit, and for linear algebra (via PeIGS, DIIS/MMOs, inversion of the 2-centre 2-electron matrix)
- SCF second derivatives: distribution of the <vvoo> and <vovo> integrals via GAs
- MP2 gradients: distribution of the <vvoo> and <vovo> integrals via GAs

GAMESS-UK: DFT HCTH on Valinomycin - Impact of Coulomb Fitting
Basis: DZV_A2 (DGauss); A1_DFT fit: 882/3,012 functions.
[Charts: measured time (s) on 16 and 32 CPUs for the explicit-Coulomb (JEXPLICIT) and fitted-Coulomb (JFIT) calculations on the commodity and high-end systems; Cray T3E/1200E 128-CPU reference times are 2,139 s (explicit) and 995 s (fitted); chart annotations 73%, 161% (JEXPLICIT) and 61%, 93% (JFIT).]

GAMESS-UK: DFT HCTH on Valinomycin - Coulomb Fitting on High-end Systems
[Charts: measured time (s) on 32-128 CPUs for JEXPLICIT and JFIT on the Cray T3E/1200E, SGI Origin 3800/R14k-500, Compaq AlphaServer SC/667 and Cray Alpha Cluster/833.]
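For context, the Coulomb-fitting (Jfit) step that both codes exploit expands the density in the auxiliary basis; the standard working equations of this resolution-of-the-identity approach are shown below (notation mine, not copied from either code).

```latex
% Density fitting for the Coulomb term: expand the density in the
% auxiliary basis {chi_t}, solve the 2-centre system for the fit
% coefficients, then assemble J from 3-centre 2-electron integrals.
\rho(\mathbf{r}) \;\approx\; \tilde\rho(\mathbf{r}) = \sum_t c_t\,\chi_t(\mathbf{r}),
\qquad
\sum_u (t|u)\, c_u \;=\; \sum_{\rho\sigma} (t|\rho\sigma)\, P_{\rho\sigma},
\qquad
J_{\mu\nu} \;\approx\; \sum_t (\mu\nu|t)\, c_t .
```

Here (t|u) is the 2-centre 2-electron matrix whose inversion appears among the GA-based linear algebra steps listed above, and the (μν|t) are the 3-centre 2-electron integrals held in core.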
GAMESS-UK: DFT HCTH on Valinomycin - Speed-ups for Explicit and Fitted Coulomb
[Charts: speed-up to 128 CPUs for JEXPLICIT and JFIT on the Cray T3E/1200E, CS2 QSNet Alpha Cluster/667, Compaq AlphaServer SC/667, Cray SuperCluster/833 and SGI Origin 3800 (R14k-500 and R12k-400), against the linear ideal; 128-CPU speed-ups of 65.4, 90.6, 100.1 and 105+ are marked on the panels.]

Memory-driven Approaches - SCF and DFT
HF and DFT energy and gradient calculation for NiSiMe2CH2CH2C6F13, Ahlrichs DZ basis (1,485 GTOs). In the in-core variant the integrals are written directly to memory rather than to disk, and are not recalculated (cf. direct SCF).
[Chart: elapsed time (s) on 28, 60 and 124 CPUs of the SGI Origin 3800/R14k-500 for direct-SCF and in-core HF and DFT; times fall from roughly 4,200 s on 28 CPUs to roughly 800 s on 124 CPUs.]

Performance of the MP2 Gradient Module
Mn(CO)5H MP2 geometry optimisation; basis TZVP + f (217 GTOs).
[Charts: elapsed time (s) on 16 and 32 CPUs for the commodity and high-end systems, and speed-up to 128 CPUs on the Cray T3E/1200E, SGI Origin 3800/R14k-500 and AlphaServer SC ES45/1000; chart annotation 59%, 78%.]

The QM/MM Modelling Approach
- Couple quantum mechanics and molecular mechanics approaches.
- QM treatment of the active site: the reacting centre; problem structures (e.g. a complex transition-metal centre); excited-state processes (e.g. spectroscopy).
- Classical MM treatment of the environment: enzyme structure, zeolite framework, explicit and/or dielectric solvent models.

Quantum Simulation in Industry (QUASI)
- Software development - address barriers to uptake of existing QM/MM methodology: explore a range of QM/MM coupling schemes; enhance the performance of geometry optimisation for large systems; maintain a flexible approach addressing enzymes, zeolites and metal oxide surfaces; adopt a modular scheme with interfaces to industry-standard codes.
- High-performance computing: scalable MPP implementation; QM/MM MD simulation based on semiempirical, ab initio and DFT methods.
- Demonstration applications: the value of modelling technology and HPC to industrial problems; a Beowulf EV6-based solution.
- Exploitation: disseminate results through workshops, newsletters etc.
- QUASI partners: CLRC Daresbury Laboratory (P. Sherwood, M.F. Guest, A.H. de Vries); Royal Institution of Great Britain (C.R.A. Catlow, A. Sokol); University of Zurich / MPI Mulheim (W. Thiel, S. Billeter, F. Terstegen); ICI Wilton, UK (J. Kendrick, CAPS; J. Casci, Catalco); Norsk Hydro, Porsgrunn, Norway (K. Schoeffel, O. Swang, SINTEF); BASF, Ludwigshafen, Germany (A. Schaefer).
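In the additive coupling scheme such QM/MM packages use, the total energy is partitioned along the lines below; this is a generic statement of the approach, not the specific Hamiltonian implemented in the GAMESS-UK/CHARMM interface.

```latex
% Additive QM/MM energy: QM region (plus link atoms L), MM environment,
% and coupling terms carrying the electrostatic, van der Waals and
% boundary (bonded / link-atom) interactions between the two regions.
E_{\mathrm{QM/MM}} \;=\;
  E_{\mathrm{QM}}(\mathrm{QM}+L) \;+\;
  E_{\mathrm{MM}}(\mathrm{MM}) \;+\;
  E_{\mathrm{QM\text{-}MM}}^{\mathrm{elec}} \;+\;
  E_{\mathrm{QM\text{-}MM}}^{\mathrm{vdW}} \;+\;
  E_{\mathrm{QM\text{-}MM}}^{\mathrm{bonded}}
```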
QM/MM Applications: Triosephosphate Isomerase (TIM)
- Central reaction in glycolysis: the catalytic interconversion of DHAP to GAP; a demonstration case within QUASI (partners UZH and BASF).
- QM region: 35 atoms (DFT BLYP), including residues with possible proton donor/acceptor roles; GAMESS-UK, MNDO, TURBOMOLE.
- MM region: 4,180 atoms plus 2 link atoms; CHARMM force field, implemented in CHARMM and DL_POLY.
[Chart: measured time (s) on 8-64 CPUs for CS7, CS9, the SGI Origin 3800/R14k-500 and the AlphaServer SC ES45/1000; 128-CPU Origin 3800/R14k-500 reference time = 181 s; chart annotation 89%, 139%.]

Vampir 2.5 - Visualization and Analysis of MPI Programs
Performance analysis of GA-based applications using Vampir: GAMESS-UK on high-end and commodity-class machines, with extensions to handle GA applications.
[Screenshot: GAMESS-UK / Si8O25H18 on 8 CPUs - one DFT cycle (Vampir timeline view).]
[Screenshot: GAMESS-UK / Si8O25H18 on 8 CPUs - Q†HQ (GAMULT2) and PeIGS (Vampir timeline view).]

Materials Simulation Codes
- Plane-wave DFT codes: CASTEP, VASP, CPMD
- Local Gaussian basis set codes: CRYSTAL
These codes have similar functionality, power and problems. CASTEP is the flagship code of UKCP, so the subsequent discussion focuses on it; it presents a different set of problems when considering performance on HPCx.
- SIESTA and CONQUEST: O(N)-scaling codes which will be extremely attractive to users; both are currently development rather than production codes.

CRYSTAL-2000
Distributed-data implementation. Benchmark: an acid centre in zeolite-Y (faujasite); single-point energy; 145 atoms per cell, no symmetry, 8 k-points; 2,208 basis functions (6-21G*).
[Charts: elapsed time (s) on 16-128 CPUs for CS1, CS2, the Cray T3E/1200E, SGI Origin 3800/R12k-400 and IBM SP/WH2-375, and speed-up to 256 CPUs; the 256-CPU Origin 3800 time is 945 s, with speed-ups of about 129-144 at 256 CPUs.]

Plane Wave Methods: CASTEP
The wavefunctions are expanded in plane waves up to a kinetic-energy cutoff:

  \psi_j^{\mathbf{k}}(\mathbf{r}) = \sum_{|\mathbf{k}+\mathbf{G}|^2 < E_{\mathrm{cut}}} C^{\mathbf{k}}_{j,\mathbf{G}}\, e^{-i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}}

- Direct minimisation of the total energy (avoiding diagonalisation).
- Large number of basis functions, N ~ 10^6 (especially for heavy atoms); pseudopotentials must be used to keep the number of plane waves manageable.
- The plane-wave expansion means that the bulk of the computation comprises large 3D Fast Fourier Transforms (FFTs) between real and momentum space; these are distributed across the processors in various ways, and the actual FFT routines are optimized for the cache size of the processor.
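To make the cutoff condition above concrete, the snippet below counts the plane waves satisfying |k+G|^2/2 < E_cut for a simple cubic cell at the Gamma point (atomic units); the cell size and cutoff are arbitrary choices, and real codes generate G-vectors from the actual reciprocal lattice.

```c
/* Count plane waves with |k+G|^2 / 2 < E_cut for a simple cubic cell of
 * side a (atomic units), k = Gamma.  Illustrates how quickly the basis
 * grows with cutoff and cell size (N ~ 10^5 - 10^6). */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void)
{
    const double a    = 20.0;              /* cell edge in bohr (arbitrary) */
    const double ecut = 35.0;              /* cutoff in hartree (~950 eV)   */
    const double g0   = 2.0 * M_PI / a;    /* reciprocal-lattice spacing    */

    /* largest index that can possibly satisfy the cutoff */
    int nmax = (int)ceil(sqrt(2.0 * ecut) / g0);
    long npw = 0;

    for (int i = -nmax; i <= nmax; i++)
        for (int j = -nmax; j <= nmax; j++)
            for (int k = -nmax; k <= nmax; k++) {
                double g2 = g0 * g0 * (double)(i*i + j*j + k*k);
                if (0.5 * g2 < ecut) npw++;
            }

    printf("plane waves below cutoff: %ld\n", npw);
    return 0;
}
```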
Parallelization of CASTEP
A number of parallelization methods are implemented:
- k-point: a processor holds all of the wavefunction for one k-point (no MPI_ALLTOALLV required), but for large unit cells Nk -> 1, i.e. only small CPU counts can be used.
- G-vector: a processor holds part of the wavefunction for all k-points (MPI_ALLTOALLV runs over ALL CPUs), suited to the biggest systems with a single k-point.
- mixed kG: k-points are allocated amongst processors and the wavefunctions are sub-allocated amongst the processors associated with their particular k-points, so MPI_ALLTOALLV runs over NCPUs/Nk processors - the intermediate case.
On HPC hardware the desired method is either k or kG, as this minimizes inter-processor communication, specifically MPI_ALLTOALLV. However, on large numbers of processors such distributions will still be problematic, and new algorithms will need to be developed to overcome latency problems. (A sketch of the MPI_Alltoallv exchange follows.)
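The MPI_Alltoallv call at issue redistributes the G-vector-distributed wavefunction coefficients among the processors that share a k-point (for example in the real-to-reciprocal-space transpose). Below is a bare-bones illustration of the call itself, with made-up counts rather than a real wavefunction layout.

```c
/* Variable-count all-to-all: each rank sends a different number of
 * coefficients to every other rank - the communication pattern behind
 * the G-vector redistribution / FFT transpose. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *scount = malloc(nprocs * sizeof(int));
    int *rcount = malloc(nprocs * sizeof(int));
    int *sdispl = malloc(nprocs * sizeof(int));
    int *rdispl = malloc(nprocs * sizeof(int));

    /* toy decomposition: this rank sends (rank + dest + 1) doubles to dest;
     * every rank can therefore also compute what it will receive from p    */
    for (int p = 0; p < nprocs; p++) scount[p] = rank + p + 1;
    for (int p = 0; p < nprocs; p++) rcount[p] = p + rank + 1;

    int stot = 0, rtot = 0;
    for (int p = 0; p < nprocs; p++) {
        sdispl[p] = stot;  stot += scount[p];
        rdispl[p] = rtot;  rtot += rcount[p];
    }

    double *sbuf = calloc(stot, sizeof(double));
    double *rbuf = calloc(rtot, sizeof(double));

    MPI_Alltoallv(sbuf, scount, sdispl, MPI_DOUBLE,
                  rbuf, rcount, rdispl, MPI_DOUBLE, MPI_COMM_WORLD);

    free(scount); free(rcount); free(sdispl); free(rdispl);
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```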
CASTEP 4.2 - Parallel Benchmark I: Chabazite (G-vector decomposition)
- Acid sites in a zeolite (Si11 O24 Al H); Vanderbilt ultrasoft pseudopotential; Pulay density-mixing minimiser
- Single k-point total energy, 96 bands, 15,045 plane waves, 3D FFT grid 54x54x54; convergence in 17 SCF cycles
[Charts: measured time (s) on 16 and 32 CPUs for CS6, CS7, CS2, CS9, CS8, the Cray T3E/1200E, IBM SP/WH2-375, SGI Origin 3800/R14k-500, AlphaServer SC ES45/1000 and IBM SP/Regatta-H, with a breakdown of communication time; chart annotation 67%, 75%.]

CASTEP 4.2 - Parallel Benchmark II: TiN (mixed kG decomposition)
- A 32-atom TiN slab, 8 k-points, single-point energy calculation with Mulliken analysis
[Chart: measured time (s) on 32, 64 and 128 CPUs for CS9, CS8, the Cray T3E/1200E, SGI Origin 3800/R14k-500 and AlphaServer SC ES45/1000; chart annotation 47%, 81%.]

CPMD - Car-Parrinello Molecular Dynamics
- Version 3.5.1: Hutter, Alavi, Deutsch, Bernasconi, Goedecker, Marx, Tuckerman and Parrinello (1995-2001); DFT, plane waves, pseudopotentials and FFTs
- Benchmark example: liquid water. Physical specification: 32 molecules in a simple cubic periodic box of length 9.86 Å, temperature 300 K. Electronic structure: BLYP functional, Troullier-Martins pseudopotential, reciprocal-space cutoff 70 Ry = 952 eV
- CPMD is the base code for the new CCP1 flagship project (Sprik and Vuilleumier, Cambridge)
[Chart: elapsed time (s) on 16 and 32 CPUs for CS7, CS9 (one and two CPUs per node), CS2, the IBM SP/WH2-375, SGI Origin 3800/R14k-500, AlphaServer SC ES40/667 and ES45/1000; chart annotation 30%, 53%.]

CPMD on High-end Computers
- Performance on NH2 nodes: 256 CPUs with 8 MPI tasks per node, including I/O (RD30WFN / WR30WFN); 390 Mflop/s per CPU, 99.8 Gflop/s aggregate; parallel efficiency on 256 NH2 CPUs: 75% (252x252x252 mesh, 8 MPI tasks per node)
- Use of ESSL routines instead of LAPACK; some routines were "OpenMP-ed". CPMD is dominated by the ESSL SMP routines (ROTATE and OVLAP)
- Mixed-mode MPI and OpenMP (IBM): 4 MPI tasks plus 2 SMP threads per node is 1.36x faster than flat MPI on a given mesh
- Power4 estimates: the key is FFTCOM (MPI all-to-all); an average speed-up of 2.0 on a Regatta-H LPARed system
[Chart: elapsed time (s) against number of CPUs, 16 to 256.]

ANGUS: Combustion Modelling (regular grid)
- Direct numerical simulation (DNS) of turbulent pre-mixed combustion, solving the augmented Navier-Stokes equations for fluid flow
- Discretisation of the equations uses standard 2nd-order central differences on a 3D grid
- The pressure solver uses either a conjugate gradient method with a modified incomplete LU preconditioner or a multigrid solver (both make extensive use of Level 1 BLAS), or a fast Fourier transform
Conjugate gradient + ILU, grid size 144^3, 100 iterations:
[Chart: measured time (s) on 16 and 32 CPUs for CS6, CS7, CS8, CS9, CS2, the Cray T3E/1200E, IBM SP/WH2-375, SGI Origin 3800/R14k-500, AlphaServer SC ES45/1000 and IBM SP/Regatta-H; chart annotation 35%, 51%.]

ANGUS: Memory Bandwidth Effects - the IBM SP and the Alpha Linux System
Conjugate gradient + ILU, grid size 144^3.
[Charts: measured time (s) on the IBM SP/WH2-375 and the CS2 QSNet Alpha cluster for fixed CPU counts (4-32) spread over different numbers of nodes (1-16), showing the cost of packing more CPUs onto each node.]
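As a compact reference for the pressure-solver kernel named above, here is a preconditioned conjugate gradient iteration, with a simple diagonal (Jacobi) preconditioner on a 1D Laplacian standing in for ANGUS's modified incomplete-LU preconditioner and 3D grid; the problem, tolerance and sizes are purely illustrative.

```c
/* Preconditioned conjugate gradient for A x = b, with A the 1-D Laplacian
 * (tridiagonal 2,-1) and M = diag(A) as a stand-in preconditioner. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 1000

static void matvec(const double *x, double *y)          /* y = A x */
{
    for (int i = 0; i < N; i++) {
        y[i] = 2.0 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i < N - 1) y[i] -= x[i + 1];
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void)
{
    double *x = calloc(N, sizeof(double)), *b = malloc(N * sizeof(double));
    double *r = malloc(N * sizeof(double)), *z = malloc(N * sizeof(double));
    double *p = malloc(N * sizeof(double)), *q = malloc(N * sizeof(double));

    for (int i = 0; i < N; i++) b[i] = 1.0;              /* right-hand side */

    /* x0 = 0, so r0 = b; z0 = M^{-1} r0 with M = diag(A) = 2I; p0 = z0 */
    for (int i = 0; i < N; i++) { r[i] = b[i]; z[i] = r[i] / 2.0; p[i] = z[i]; }
    double rz = dot(r, z);

    for (int it = 0; it < 10000; it++) {
        matvec(p, q);
        double alpha = rz / dot(p, q);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        if (sqrt(dot(r, r)) < 1.0e-8) { printf("converged at iteration %d\n", it); break; }
        for (int i = 0; i < N; i++) z[i] = r[i] / 2.0;   /* apply M^{-1} */
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        for (int i = 0; i < N; i++) p[i] = z[i] + beta * p[i];
        rz = rz_new;
    }

    free(x); free(b); free(r); free(z); free(p); free(q);
    return 0;
}
```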
ANGUS: Combustion Modelling - Larger Grid
Conjugate gradient + ILU, grid size 288^3, 10 iterations:
[Chart: measured time (s) on 16, 32 and 64 CPUs for CS7, CS2, CS8, CS9, the IBM SP/WH2-375, SGI Origin 3800/R14k-500, AlphaServer SC ES45/1000 and IBM SP/Regatta-H (1.3 GHz); chart annotation 69%, 147%.]

Commodity Comparisons with High-end Systems: Compaq AlphaServer SC ES45/1000 and SGI Origin 3800/R14k-500
CS9 - Pentium 4/2000 Xeon + Myrinet 2k (% of the 32-CPU AlphaServer SC | % of the 32-CPU Origin 3800):
- GAMESS-UK SCF: 95% | 135%
- GAMESS-UK DFT: 70-73% | 117-161%
- GAMESS-UK DFT (Jfit): 59-70% | 92-111%
- GAMESS-UK DFT gradient: 69% | 114%
- GAMESS-UK MP2 gradient: 59% | 78%
- GAMESS-UK SCF forces: 92% | 148%
- NWChem (DFT Jfit): 65-88% | 94-227%
- GAMESS / CHARMM (QM/MM): 89% | 139%
- DL_POLY, Ewald-based: 44-53% | 71-84%
- DL_POLY, bond constraints: 41% | 64%
- CHARMM: 95% | 103%
- CASTEP: 47-67% | 75-81%
- CPMD: 30% | 53%
- ANGUS: 35-69% | 51-147%

Summary
- Computational Chemistry - background
- Commodity-based and high-end systems: prototype commodity systems CS1-CS9; high-end systems from Cray, SGI, IBM and Compaq, with initial results from the Regatta-H; performance metrics
- Application performance:
  - Molecular simulation (distributed data): DL_POLY and CHARMM
  - Electronic structure (distributed data: GAs and PeIGS): NWChem and GAMESS-UK
  - VAMPIR and instrumenting the GA tools
- CS2 (QSNet Alpha Linux Cluster), as a percentage of the 32-CPU Origin 3800/R14k-500: GAMESS-UK SCF 99%, DFT 99%, DFT (Jfit) 89-122%, DFT gradient 89%, MP2 gradient 87%, SCF forces 86%; NWChem 112-234%; DL_POLY Ewald-based 95%, bond constraints 82%; CHARMM 78%
- Cost-effectiveness of the P4/2000 cluster with Myrinet 2k: it delivers on average 66% of the AlphaServer ES45/1000 and 109% of the SGI Origin 3800/R14k-500.