Number of CPUs - Gamess-UK

Transcription

Computational Science and Engineering Department
Daresbury Laboratory
Computational Chemistry Applications:
Performance on High-End and
Commodity-class Computers
Martyn F. Guest
CCLRC Daresbury Laboratory
[email protected]
Acknowledgements:
P. Sherwood, H.J. van Dam, W. Smith, I. Bush,
A. Sunderland and M. Plummer, DL
J. Nieplocha, E. Apra, PNNL
ScicomP 5, Daresbury Laboratory
http://www.dl.ac.uk/CFS/parallel/chemistry
7 May 2002
Outline
!
!
Computational Chemistry - Background
Commodity-based and High-end Systems
"
"
"
Single-node performance & the Interconnect bottleneck
Prototype Commodity Systems; CS1 - CS9
High-end Systems: IBM SP/WH2-375 and Regatta-H, SGI Origin
3800, Compaq Alpha Server SC and Cray SuperCluster
" Performance Metrics
!
Application performance
" Molecular Simulation (DLPOLY and CHARMM)
" Electronic Structure (Global Arrays (GAs) ; Linear Algebra (PeIGS)
! NWChem and GAMESS-UK
" Materials Simulation (CRYSTAL, CPMD & CASTEP) and
Engineering (ANGUS)
!
Application performance analysis
" VAMPIR and instrumenting the GA Tools
!
Summary
7 May 2002
Scaling of Molecular Computations
1 X 107
Relative Computi ng Power
1 X 106
Calculation of Reaction
Energetics; CI/CCSD
Techniques
Determination of
Molecular Structures:
Hartree-Fock
Techniques
1 X 105
TeraFLOPS MPP
1 X 104
Determination of
Molecular Structures:
Density Functional
Techniques
1 X 103
1 X 102
CRAY C90 (16)
Simulation of Biomolecules:
Molecular Dynamics
Techniques
1 X 101
1 X 100
CRAY 1S
Number of Atoms
1 X 100
1 X 101
1 X 102
1 X 103
1 X 104
1 X 105
7 May 2002
Capability Computing - Cost
Effectiveness & Dependencies
S pe c ific atio n
Us ag e
MPP/AS CI
HP C
community
1280 CPUs , IBM S P
(1 TB)
Co s t
Units
CPU
Me mo ry
10,000
2,000
2,000
Capability
I/O
200-300
Commodity Systems (1.5 x N ??)
S MP
16-pro c e s s o r S GI
Orig in 3800, R14k
(8GB RAM )
De pa rtme nt
100
Capacity
15
20
20-30
Commodity Systems (1.5 x N)
PC
Pe ntium-4 / 2GHz
(512 MByte , 30 GB)
De s ktop
1
1
1
1
3 year continual access to 32 CPUs:
High-end (£0.5 / CPU hour) : £419,358
in-house Beowulf : £50,000
7 May 2002
High-end Commodity-based Systems
“Need for real workloads on large clusters. To date mainly experimental
systems, so failures are viewed as inevitable”, but ...
Cluster
Site
Nodes/
Peak CPU
Processors (Gflop)
1. Locus
Locus, USA
960/1920
1920
PIII/1000
2. HELICS
Heidelberg
256/512
1434
AMD MP/1400 Myrinet
3. EMPIRE
Mississippi
519/1038
1157
PIII/1000
Myrinet
4. RHIC
Brookhaven 706/1412
1083
PIII/PII.
FastEth.
5. Biopentium Inpharmatica 800/1220
1061
PIII/700+
FastEth.
6. Genesis
Shell, NL
1030/1038
1037
PIII/1000
GigabitE
7. Platinum
NCSA
516/1032
1032
PIII/1000
Myrinet
520/1040
967
PIII/933
Myrinet
9. Score III
RWCP, Japan 512/1024
955
PIII/933
Myrinet
10. ICE Box
Utah, USA
915
PIII + AMD
FastEth.
8. CBRC Magi AIST,Japan
303/406
Interconnect
FastEth.
7 May 2002
High-End Systems Evaluated
!
Cray T3E/1200E
" 816 processor system at Manchester (CSAR service)
" 600 Mz EV56 Alpha processor with 256 MB memory
!
IBM SP/WH2-375 and Regatta-H node
" 32 CPU system at DL, 4-way Winterhawk2 SMP “thin nodes” with 2 GB
memory, 375 MHz Power3-II processors with 8 MB L2 cache
" IBM SP Regatta-H (32-way node, 1.3 GHZ power4 CPUs) at Montepelier
!
Compaq AlphaServer SC
" 4-way ES40/667 A21264A (APAC) and 833 MHz SMP nodes (2 GB RAM,
8MB L2 cache);
" TCS1 system at PSC (comprising 750 4-way ES45 nodes - 3,000 EV68
CPUs - with 4 GB memory per node, 8MB L2 cache
" Quadrics “fat tree” interconnect (5 usec latency, 250 MB/sec B/W)
!
SGI Origin 3800
" SARA (1000 CPUs) - Numalink with R14k/500 & R12k/400 CPUs
!
Cray Supercluster at Eagen
" Linux Alpha Cluster (96 X API CS20s - dual 833 MHz EV67 CPUs)
" Myrinet interconnect, Red Hat 6.2
7 May 2002
Technical Progress in 2001/2002
Hardware and Software Evaluation:
" CPU
! PC systems - Intel 1 GHz Pentium III,
Pentium 4 (2.0 GHz), AMD K7 1.533 GHz
! Itanium (SGI Troon, IBM, HP RX4610)
! Alpha systems - ES40/833, ES45/1000
CS20/833, UP2000 (667 & 833 MHz)
" Networks
www.cse.clrc.ac.uk/Activity/DisCo
! Fast Ethernet options, cards,
switches, channel-bonding, ….
! 100Mbit switch,
! SCI, QSNet and Myrinet (P4/2000 Cluster, Cray Supercluster, CS20)
" System Software
! message passing S/W (LAM MPI, LAM MPI-VIA (100 us to 60 us),
MPICH, VMI, SCAMPI),libraries (ATLAS, NASA, MKL), compilers
(Absoft, PGI, Intel’s ifc and efc, GNU/g77), GA tools (PNNL)
! resource management software (LobosQ, PBS, Beowulf, LSF etc.)
7 May 2002
Commodity Systems (CSx)
Prototype / Evaluation Hardware
Systems
Location
CPUs
CS1
CS2
Daresbury
Daresbury
32
64
CS3
CS4
CS6
RAL
Sara
CLiC
16
32
528
CS7
CS8
CS9
Daresbury
NCSA
Bristol
64
320
96
Configuration
PentiumIII / 450 MHz + FE (EPSRC)
24 X dual UP2000/EV67-667,
QSNet Alpha/LINUX cluster,
8 X dual CS20/EV67-833 (“loki”)
Athlon K7 850MHz + myrinet
Athlon K7 1.2 GHz + FE
PentiumIII / 800 MHz; fast
ethernet (Chemnitzer Cluster)
AMD K7/1000 MP + SCALI/SCI (“ukcp”)
160 dual IBM Itanium/800 + Myrinet 2k (“titan”)
Pentium4 Xeon/2000 + Myrinet 2k (“dirac”)
Protoype Systems
CS0
CS5
Daresbury
Daresbury
10
16
10 CPUS, Pentium II/266
8 X dual Pentium III/933, SCALI
7 May 2002
Communications Benchmark
PMB: Pallas MPI Benchmark Suite (V2.2)
Systems Investigated
Point-to-Point (Mbytes/sec)
PingPong; PingPing; Sendrecv;
Exchange
Collective Operations (Time usec) - as function of no.of CPUs
Allreduce; Reduce; Reduce_scatter;
Allgather; Allgatherv; Alltoall; Bcast,
Barrier
Message Lengths:
0 to 4194304 Bytes
! Cray T3E/1200E
! SGI Origin3800/R14k-500 (Teras)
! IBM SP/WH2-375 (“trailblazer” switch and
SP/NH2-375 (single-plane colony)
! Cray SuperCluster (833MHz EV68
dual CS20 nodes with myrinet)
! Compaq AlphaServer SC ES45/1GHz -TCS1
! IBM Regatta H (intra node)
! CS1 PIII/450 + Fast ethernet, MPICH
!
!
!
!
!
!
CS2 Alpha Linux Cluster dual UP2000/667
CS3 AMD Athlon/850 + Myrinet, MPICH
CS6 PIII/800 + Fast ethernet, MPICH
CS7 AMD K7/1000 MP + SCALI
CS8 Itanium/800 + Myrinet 2k (NCSA)
CS9 dual P4/2000 Xeon + Myrinet 2k
7 May 2002
PingPong Performance
PMB Benchmark (Pallas)
10000
1000
100
10
Bandwidth (MByte/sec)
Bandwidth (Mbytes/s)
1
0.1
1.E+00
Message Length (Bytes)
1.E+01
1.E+02
1.E+03
CS2 QSNet Alpha Cluster/667
Compaq AlphaServer SC ES45/1000
IBM Regatta-H
Cray T3E/1200E
Cray SuperCluster / Myrinet
IBM SP/WH2 375-quad
SGI Origin3800/R14k-500
IBM SP/NH2 375 16-way
CS8 Titan Itanium/800 + Myrinet 2k
CS9 Pentium 4/2000 + Myrinet 2k
1.E+04
1.E+05
1.E+06
1.E+07
7 May 2002
16 CPUs
MPI_allreduce Performance
PMB Benchmark (Pallas)
Measured Time (usec)
1.E+05
1.E+04
1.E+03
Measured Time (us)
1.E+06
Compaq AlphaServer SC ES45/1000
IBM Regatta-H
Cray T3E/1200E
Cray SuperCluster / Myrinet
IBM SP/WH2 375-quad
IBM SP/NH2 375 16-way
CS8 Titan Itanium/800 + Myrinet 2k
CS9 Pentium 4/2000 + Myrinet 2k
1.E+02
1.E+01
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
7 May 2002
Interconnect Benchmark - EFF_BW
530
SGI Origin 3800/R14k-500
964
AlphaServer SC ES45/1000 (1CPU)
507
AlphaServer SC ES45/1000 (4CPU)
QSNet
264
IBM SP/WH2-375
IBM SP/NH2-375 (16 - 8+8)
362
IBM SP/Regatta-H
1217
Cray T3E/1200E
2151
698
CS2 QSNet AlphaEV67 (1 CPU)
QSNet
456
CS2 QSNet AlphaEV67 (2 CPU)
635
CS9 P4/2000 Xeon - Myrinet (1CPU)
Myrinet 2k
476
CS9 P4/2000 Xeon - Myrinet (2CPU)
16 CPUs
667
CS8 Itanium/800 - Myrinet (1CPU)
516
CS8 Itanium/800 - Myrinet (2CPU)
481
AMD K7/1000 MP - SCALI (1 CPU)
367
AMD K7/1000 MP - SCALI (2 CPU)
255
CS3 AMD K7/850 - MPICH
84.2
CS4 AMD K7/1200 - LAM
44.2
CS6 PIII/800 - MPICH
77.9
CS6 PIII/800 - LAM
0
SCALI
Myrinet
Fast Ethernet
500
1000
MBytes/sec
1500
2000
7 May 2002
Application Codes
Application-driven performance comparisons between Commodity-based
systems and both current MPP and ASCI-style SMP-node platforms :
!
Computational Chemistry
"
Molecular Simulation
! DL_POLY - parallel MD code with many applications
! CHARMM - macromolecular MD and energy minimisation
"
Molecular Electronic Structure
! GAMESS-UK and NWChem - Ab initio Electronic structure codes
!
Materials
! CRYSTAL (periodic HF/DFT)
! CPMD, CASTEP - Car-Parrinello plane wave codes
!
Engineering
" ANGUS - regular-grid domain decomposition engineering code
Performance Metric (% 32-node high-end system)
7 May 2002
Performance Metrics: 1999-2001
Attempted to quantify delivered performance from the Commoditybased systems against current MPP (CSAR Cray T3E/1200E)and
ASCI-style SMP-node platforms (e.g. SGI Origin 3800) i.e.
Performance Metric (% 32-node Cray T3E)
1. T (32-nodes Cray T3E/1200E) / T (32 CPUs ) CSx
[ T 32-node T3E / T 32-node CS1 Pentium III/450 + FE ]
T 32-node T3E / T 32-node CS6 Pentium III/800 + FE
T 32-node T3E / T 32-CPU CS2 Alpha Linux Cluster + Quadrix
2. T (32-CPUs SGI Origin 3800) / T (32 CPUs) CSx
T 32-CPU SGI O3800 / T 32-CPU CS2 Alpha Linux Cluster + Quadrix
7 May 2002
Beowulf Comparisons with the T3E & O3800/R14k-500
CSx - Pentium III + FE
CS2 - QSNet Alpha Linux Cluster
% of 32-node Cray T3E/1200E
GAMESS-UK CS1
CS6
% of 32-node Cray T3E and O3800/R14k-500
GAMESS-UK
SCF
DFT
DFT (Jfit
(Jfit))
DFT Gradient
MP2 Gradient
SCF Forces
53-69%
65-85%
44-77%
90%
44%
80%
NWChem (DFT Jfit)
Jfit)
REALC
CRYSTAL
96%
130-178%
65-131%
130%
73%
127%
50-60%
SCF
DFT †
DFT (Jfit
(Jfit))
DFT Gradient †
MP2 Gradient
SCF Forces
256%
301-361%
219-379%
289%
228%
154%
99%
99%
89-100%
89%
87%
86%
NWChem (DFT Jfit)
Jfit) †
150-288%
74-135%
74-135%
CRYSTAL †
349%
67%
79-145%
DL_POLY
DL_POLY
Ewald-based
Ewald-based
bond constraints
95-107% 151-184%
34-56% 69%
Ewald-based
Ewald-based †
bond constraints
363-470%
143-260%
95%
82%
CHARMM
CASTEP
CPMD
96%
172%
CHARMM †
404%
78%
33%
62%
42%
CASTEP
166%
78%
ANGUS
FLITE3D
60%
104%
68%
ANGUS
FLITE3D †
145%
480%
7 May 2002
Performance Metrics: 2002
Attempt to quantify delivered performance from current Commoditybased systems against high-end SMP-node platforms: Compaq
AlphaServer SC ES45/1000 (PSC) and SGI Origin 3800 (Sara) i.e.
Performance Metric (% 32-node AlphaServer SC)
1. T (32-CPUs AlphaServer SC ES45/1000) / T (32 CPUs ) CSx
†
T 32-CPU AlphaServer ES45 / T 32-CPU CS9 Pentium 4 Xeon / 2000 + Myrinet 2k
T 32-CPU AlphaServer / T 32-CPU CS2 Alpha Linux Cluster + Quadrix
2. T (32-CPUs SGI Origin 3800/R14k-500) / T (32 CPUs) CSx
†
T 32-CPU SGI Origin 3800 / T 32-CPU CS9 Pentium 4 Xeon / 2000 + Myrinet 2k
7 May 2002
Molecular Simulation
Molecular Dynamics Codes:
DL_POLY and CHARMM
7 May 2002
DL_POLY Parallel Benchmarks (Cray T3E/1200)
V2: Replicated Data
4. NaCl; Ewald, 27,000 ions
5. NaK-disilicate glass; 8,640 atoms,MTS+ Ewald
8. MgO microcrystal: 5,416 atoms
256
9. Model membrane/Valinomycin (MTS, 18,886)
7. Gramicidin in water (SHAKE, 12,390)
6. K/valinomycin in water (SHAKE, AMBER, 3,838)
1. Metallic Al (19,652 atoms, Sutton Chen)
3. Transferrin in Water (neutral groups + SHAKE,
27,593)
2. Peptide in water (neutral groups + SHAKE, 3993).
64
Speed-up
Linear
Speed-up
Linear
192
48
128
32
64
16
0
0
0
64
128
Number of Nodes
192
256
0
16
32
48
64
Number of Nodes
7 May 2002
DL_POLY: Cray/T3E, High-end and
Commodity-based Systems I.
Measured Time (seconds)
817
750
500
CS6 PIII/800 + FE/LAM
CS7 AMD K7/1000 MP + SCI
CS4 AMD K7/1200 +FE
CS9 P4/2000 + Myrinet
CS8 Itanium/800 + Myrinet
Cray T3E/1200E
IBM SP/WH2-375
CraySuperCluster/833
IBM SP/Regatta-H
AlphaServer SC ES45/1000
Bench 4.
NaCl; 27,000 ions,
Ewald , 75 time
steps, Cutoff=24Å
385
376
44%,71%
250
T3E128 =94
204
191
114
90 79
98
65
135
48 43
55
0
16
32
Number of CPUs
7 May 2002
Commodity-based Systems
Bench 5: NaK-disilicate glass; 8,640
atoms, MTS + Ewald: 270 time steps
Speed-up
T3E128 = 75
Cray SuperCluster/833
250
228
200
167
149
96
98
153
150
100
50.1
89
88
62
121
64
53%,84%
109
50
Linear
Cray T3E/1200E
Cray SuperCluster / 833
CS1 Pentium III/800 + FE
CS7 AMD K7/1000 MP +SCI
Compaq AlphaServer ES45/1000
128
60
53
52
32
40 36
36
68
57
23
44
0
0
16
32
Number of CPUs
64
0
32
64
96
Number of CPUs
7 May 2002
128
Commodity-based Systems III.
800
733
688
600
552
CS4 AMD K7/1200 + FE
Cray T3E/1200E
IBM SP/WH2-375
IBM SP/Regatta-H
400
rigid bonds and
SHAKE,
12,390 atoms,
500 time steps
382
306
200
Bench 7.
Gramicidin in
water;
41%,64%
T3E128 =166
190
182
154
216
121
116
191
93
147
78
125
0
16
Number of CPUs
32
7 May 2002
CHARMM
!
!
!
!
!
CHARMM (Chemistry at HARvard Macromolecular Mechanics) is a
general purpose molecular mechanics, molecular dynamics and
vibrational analysis package for modelling and simulation of the
structure and behaviour of macromolecular systems (proteins,
nucleic acids, lipids etc.)
Supports energy minimisation and MD approaches using a
classical parameterised force field.
J. Comp. Chem. 4 (1983) 187-217
Parallel Benchmark - MD Calculation of Carboxy Myoglobin
(MbCO) with 3830 Water Molecules.
QM/MM model for study of reacting species
" incorporate the QM energy as part of the system into the force field
" coupling between GAMESS-UK (QM) and CHARMM.
7 May 2002
Parallel CHARMM Benchmark
Benchmark MD Calculation of Carboxy Myoglobin
(MbCO) with 3830 Water Molecules: 14026 atoms,
1000 steps (1 ps), 12-14 A shift.
CS9 P4/2000 + Myrinet 2k
300
231
200
199
Linear
Cray T3E/1200E
52
T3E16 = 665, T3E128 = 106
36
114
95%,103%
89
100
66
61
104
64
64
62
51
73
20
4
4
0
16
32
64
20
36
52
Number of CPUs
7 May 2002
Migration from Replicated to Distributed data
DL_POLY-3 : Domain Decomposition
!
!
!
Distribute atoms, forces across
the nodes
" More memory efficient, can
address much larger cases
(10 5-10 7)
Shake and short-ranges forces
require only neighbour
communication
" communications scale linearly
with number of nodes
A
B
C
D
Coulombic energy remains global
" strategy depends on problem
and machine characteristics
7 May 2002
DL_POLY-3: Halo Data Exchange
!
!
!
Replaces global summation, move co-ordinates and constraint
information between processors.
Measurements on
Origin-3000 (green)
64 processors
Texch = 51 ms
!
128 processors
Texch = 30 ms
shorter messages
increasing sparsity
7 May 2002
DL_POLY-3: Coulomb Energy Evaluation
!
Current implementation is based on Particle Mesh Ewald scheme
" Optimal for small processor counts, includes Fourier transform smoothed
charge density (reciprocal space grid typically 64x64x64 - 128x128x128)
" Initial FFT implementation serial on small numbers of nodes
!
MPP Strategies
" Global plane- or column-based FFT (requires scalable Alltoallv function)
and data redistribution to blocks.
" Local FFT (multithread on single Regatta node)
" Block-decomposed FFT (Ian Bush, CLRC)
" New Algorithms (e.g. multigrid methods from Tom Darden).
7 May 2002
Plane
!
Conventional routines (e.g. fftw) assume plane
or column distributions
!
A global transpose of the data is required to
complete the 3D FFT and additional costs are
incurred re-organising the data from the natural
block domain decomposition.
!
An alternative FFT algorithm has been designed
to reduce communication costs.
Block
" the 3D FFT are performed as a series of 1D
FFTs, each involving communications only
between blocks in a given column
" More data is transferred, but in far fewer
messages
" Rather than all-to-all, the communications are
column-wise only
7 May 2002
Speed-up
256
Linear
DL_POLY-2: 27,000 ions - AlphaServer
27,000 ions - AlphaServer
27,000 ions - CS9 P4/2000
216,000 ions - AlphaServer
216,000 ions - CS9 P4/2000
192
NaCl Simulation;
DLPOLY_2.11
171
1. 27,000 ions, Ewald, 500
time steps, Cutoff=24Å
DLPOLY_3
128
122
1. 27,000 ions, PMES, 500
2. 216,000 ions,PMES, 200
64
0
0
64
128
192
256
Number of CPUs
7 May 2002
DL_POLY-3: Macromolecular Simulations
Speed-up
Linear
DL_POLY-2: 12,390 atoms - AlphaServer
99,120 atoms - AlphaServer
99,120 atoms - CS9 P4/2000
972,960 atoms - AlphaServer
972,960 atoms - CS9 P4/2000
128
96
S 256 = 175
109
Gramicidin in water;
DLPOLY_2.11
1. rigid bonds and SHAKE,
12,390 atoms,500 time steps
64
DLPOLY_3
1. 99,120 ions, rigid bonds
and SHAKE, 100 time steps
32
33
2. 792,960 ions, rigid bonds
and SHAKE, 50 time steps
0
0
32
64
96
128
Number of CPUs
7 May 2002
Molecular Electronic Structure
Ab initio Electronic structure Codes:
NWChem and GAMESS-UK
7 May 2002
Distributed Data SCF
MOAO
Pictorial representation of the iterative
SCF process in (i) a sequential process,
and (ii) a distributed data parallel
process: MOAO represents the
molecular orbitals, P the density matrix
and F the Fock or Hamiltonian matrix
Pµ ν
guess
orbitals
dgemm
Sequential
Eigensol ver
Fρσ
Integrals
V
XC
If Converged
V
Coul
V
1e
MOAO
Pµν
guess
orbitals
Sequential
ga_dgemm
PeIGS
If Converged
Distributed Data
Fρσ
Integrals
V XC
V Coul
V 1e
7 May 2002
High-End Computational Chemistry
The NWChem Software
!
Capabilities (Direct, Semi-direct and conventional):
" RHF, UHF, ROHF using up to 10,000 basis functions; analytic 1st
and 2nd derivatives.
" DFT with a wide variety of local and non-local XC potentials, using
up to 10,000 basis functions; analytic 1st and 2nd derivatives.
" CASSCF; analytic 1st and numerical 2nd derivatives.
" Semi-direct and RI-based MP2 calculations for RHF and UHF wave
functions using up to 3,000 basis functions; analytic 1st derivatives
and numerical 2nd derivatives.
" Coupled cluster, CCSD and CCSD(T) using up to 3,000 basis
functions; numerical 1st and 2nd derivatives of the CC energy.
" Classical molecular dynamics and free energy simulations with
the forces obtainable from a variety of sources
7 May 2002
Global Arrays
Physically distributed data
Single, shared data structure
• Shared-memory-like model
– Fast local access
– NUMA aware and easy to use
– MIMD and data-parallel modes
– Inter-operates with MPI, …
• BLAS and linear algebra interface
• Ported to major parallel machines
– IBM, Cray, SGI, clusters,...
• Originated in an HPCC project
• Used by 5 major chemistry codes,
financial futures forecasting,
astrophysics, computer graphics
Tools developed as part of the NWChem project
at PNNL; R.J. Harrison, J. Nieplocha et al.
7 May 2002
PeIGS 3.0 Parallel Performance
(Solution of real symmetric generalized
and standard eigensystem problems)
Developed as part of the
NWChem project at PNNL; R.J.
Harrison, J. Nieplocha et al.
Features (not always available elsewhere):
• Inverse iteration using Dhillon-Fann-Parlett’s parallel
algorithm (fastest uniprocessor performance and good
parallel scaling)
• Guaranteed orthonormal eigenvectors in the presence of
large clusters of degenerate eigenvalues
• Packed Storage
• Smaller scratch space requirements
• PDSYEV (Scalapack 1.5) and PDSYEVD (Scalapack 1.7)
Consider performance on Fock matrices generated in typical
Hartree Fock SCF calculations (1152 and 3888 basis functions).
7 May 2002
Real symmetric
eigenvalue problems
Parallel Eigensolvers
SGI Origin 3800/R12k-400 (“green”)
120
50
Fock matrix (N = 1152)
45
PeIGS 2.1
PeIGS 2.1 - Cray T3E/1200
PeIGS 3.0
PDSYEV (Scpk 1.5)
PDSYEVD (Scpk 1.7)
Time (sec)
40
35
30
Fock matrix
(N = 3888)
PeIGS 2.1
PeIGS 3.0
PDSYEV (Scpk 1.5)
PDSYEVD (Scpk 1.7)
BFG-Jacobi (DL)
16
128
100
80
60
25
20
40
15
10
20
5
0
0
2
4
8
16
Number of processors
32
32
64
256
512
7 May 2002
Parallel Eigensolvers
CS9 P4/2000 Xeon + Myrinet
400
600
Scalapack (PDSYEV)
PDSYEV: 2 CPUs per Node
400
PDSYEV: 2 CPUs / node
PDSYEV: 1 CPU / node
PDSYEVD: 2 CPUs / node
PDSYEVD: 1 CPU / node
PeIGS 2.1: 2 CPUs / node
PeIGS 2.1: 1 CPU / node
350
500
PDSYEV: 1 CPU Per Node
300
Real symmetric
eigenvalue problems
300
250
200
N = 3888
200
N = 3888
150
100
100
50
0
0
4
8
16
32
64
8
16
32
64
7 May 2002
Case Studies - Zeolite Fragments
• DFT Calculations with
Si8O7H18
347/832
Coulomb Fitting
Basis (Godbout et al.)
Si8O25H18
617/1444
DZVP - O, Si
DZVP2 - H
Fitting Basis:
DGAUSS-A1 - O, Si
DGAUSS-A2 - H
Si26O37H36
Si28O67H30
1199/2818
• NWChem & GAMESS-UK
1687/3928
Both codes use auxiliary fitting
basis for coulomb energy, with
3 centre 2 electron integrals
held in core.
7 May 2002
DFT Coulomb Fit - NWChem
Si8O7H18
Si8O25H18
347/832
350
300
250
CS6 PIII/800 + FE
800
CS7 AMD K7/1000 + SCI
CS2 QSNet Alpha/667
Cray T3E/1200E
600
SGI O3800/R14k-500
400
227
223
177
163
50
596
511
411
76%,95%
150
100
CS6 PIII/800 + FE
CS2 QSNet Alpha/667
Cray T3E/1200E
SGI O3800/R14k-500
249
200
617/1444
514
404
88%,104%
400
257
107
68
55
200
55
182
159
135
42
177
61
124
43
0
0
16
32
Number of CPUs
16
32
Number of CPUs
7 May 2002
119
DFT Coulomb Fit - NWChem
Si26O37H36
1199/2818
Si28O67H30
TIBM-SP/P2SC-120 (256) = 1137
TIBM-SP/P2SC-120 (256) = 2766
Cray T3E/1200E
5169
4500
79%,210%
3000
2388
1687/3928
85%,227%
6000
6090
5507
4682
CS2 QSNet Alpha Cluster
Cray T3E/1200E
4000
3050
3008
2414
2424
1500
1271
1147
2000
798
907
632
517
502
1580
2351
303
404
1617
1182
1360
880
611
1504
0
0
32
64
Number of CPUs
128
32
64
128
Number of CPUs
7 May 2002
NWChem - DFT (LDA)
Performance on the SGI Origin 3800
Zeolite ZSM-5
• DZVP Basis (DZV_A2) and Dgauss
A1_DFT Fitting basis:
AO basis:
CD basis:
3554
12713
• MIPS R14k-500 CPUs (Teras)
Wall time (13 SCF iterations):
64 CPUs = 5,242 seconds
128 CPUs= 3,951 seconds
• 3-centre 2e-integrals = 1.00 X 10 12
• Schwarz screening = 5.95 X 10 10
• % 3c 2e-ints. In core = 100%
Est. time on 32 CPUs = 40,000 secs
7 May 2002
Parallel Implementations of GAMESS-UK
!
!
Extensive use of Global Array (GA) Tools and Parallel
Linear Algebra from NWChem Project (EMSL)
SCF and DFT energies and gradients
"
"
"
"
!
Replicated data, but …
GA Tools for caching of I/O for restart and checkpoint files
Storage of 3-centre 2-e integrals in DFT Jfit
Linear Algebra (via PeIGs, DIIS/MMOs, Inversion of 2c-2e matrix)
SCF second derivatives
" Distribution of <vvoo> and <vovo> integrals via GAs
!
MP2 gradients
" Distribution of <vvoo> and <vovo> integrals via GAs
7 May 2002
GAMESS-UK: DFT HCTH on Valinomycin.
Impact of Coulomb Fitting
Basis: DZV_A2 (Dgauss)
14980
15000
12000
9000
CS6 PIII/800 + FE
CS9 P4 Xeon/2000 + Myrinet
Cray T3E/1200E
IBM SP/WH2-375
7638
61%,93%
TT3E/1200E (128) = 995
6226
6000
3910
4281
CS6 PIII/800 + FE
Cray T3E/1200E
IBM SP/WH2-375
4000
3386
2418
3063
1783
1301
3451
1791
0
16
JFIT
7617
6000
3000
A1_DFT Fit: 882/3012
JEXPLICIT
Number of CPUs
2333
2000
1488
933
32
726
477
1567
780
73%,161%
TT3E/1200E (128) = 2139
0
16
Number of CPUs
32
7 May 2002
GAMESS-UK: DFT HCTH on Valinomycin.
Impact of Coulomb Fitting: Cray T3E/1200, Cray Super
Cluster/833, Compaq AlphaServer SC/667 and SGI Origin R14k/500
Basis: DZV_A2 (Dgauss)
A1_DFT Fit: 882/3012
8000
Cray T3E/1200E
Compaq AlphaServer SC/667
Cray Alpha Cluster/833
7617
4300
3000
Cray T3E/1200E
Cray Alpha Cluster/833
3063
JEXPLICIT
JFIT
1573
4000
1500
2139
2033
1783
995
881
726
1123
539
713
978
476
419
598
0
0
32
64
Number of CPUs
128
32
64
128
Number of CPUs
7 May 2002
364
GAMESS-UK: DFT HCTH on Valinomycin. Speedups
for both Explicit and Coulomb Fitting.
JEXPLICIT
Speedup
128
128
Linear
Cray T3E/1200E
112
112
linear
Cray T3E/1200E
Compaq Alpha Server SC/667
112
Compaq AlphaServer/667
96
90.6
JFIT
Speedup
80
96
100.1
105+
⇑
80
65.4
64
64
48
48
32
32
16
16
16
32
48
64
80
96
Number of CPUs
112 128
16
32
48
64
80
96
112 128
Number of CPUs
7 May 2002
Memory-driven Approaches - SCF and DFT
HF and DFT Energy and Gradient calculation for
NiSiMe2CH2CH2C6F13: Basis Ahlrichs DZ (1485 GTOs)
Elapsed Time (seconds)
5000
4206
HF HF DFT
DFT
4182
4000
Direct SCF
In-core
- Direct-SCF
- In-core
Integrals written directly
to memory, rather than to
disk, and are not recalculated
3000
2476
2405
2000
1723
1692
830
1000
799
0
28
60
124
Number of CPUs
7 May 2002
Performance of MP2 Gradient Module
128
Cray T3E/1200, High-end and Commoditybased Systems
!
!
112
96
Mn(CO)5H - MP2 geometry
optimisation
BASIS: TZVP + f (217 GTOs
80
48
6000
32
CS6 PIII/800 + FE
Cray T3E/1200E
IBM SP/WH2-375
6713
5233
4847
3790
3456
Speed-up
64
7487
Linear
Cray T3E/1200E
16
16
6713
6000
59%,78%
3530
48
64
80
96
112
Cray T3E/1200E
4500
3530
2722
3000
32
2496
2129
3000
1550
2521
1539
2496
1923
1346
1012
1725
1539
1346
1500
1012
1158
836
634
529
0
16
Number of CPUs 32
0
16
32
64
128
7 May 2002
128
The QM/MM Modelling Approach
!
!
Couple quantum mechanics and
molecular mechanics approaches
QM treatment of the active site
" reacting centre
" problem structures (e.g. complex
transition metal centre)
" excited state processes (e.g.
spectroscopy)
!
Classical MM treatment of
environment
"
"
"
enzyme structure
zeolite framework
explicit and/or dielectric solvent models
7 May 2002
Quantum Simulation in Industry (QUASI)
!
Software Development
" Address barriers to uptake of existing QM/MM methodology
!
!
!
explore range of QM/MM coupling schemes
enhance performance of geometry optimisation for large systems
maintain flexible approach, address enzymes, zeolites
and metal oxide surfaces
! adopt modular scheme with interfaces to industry standard codes
!
High Performance Computing
" Scalable MPP implementation
" QM/MM MD simulation based on semiempirical ab-initio and DFT methods
!
Demonstration Applications
" Value of modelling technology
and HPC to industrial problems
" Beowulf EV6-based solution
!
Exploitation
QUASI Partners
CLRC Daresbury Laboratory
P. Sherwood, M.F. Guest, A.H. de Vries
Royal Institition of Great Britain
C.R.A Catlow, A. Sokol
University of Zurich / MPI Mulheim
W. Thiel, S. Billeter, F. Terstegen.
ICI Wilton (UK)
J. Kendrick (CAPS), J. Casci (Catalco)
Norsk Hydro (Porsgrunn, Norway)
K. Schoeffel, O. Swang (SINTEF)
BASF (Ludwigshafen, Germany)
A. Schaefer
" Disseminate results through workshop, newsletters etc.
7 May 2002
QM/MM Applications
Triosephosphate
Triosephosphate
isomerase
isomerase (TIM)
(TIM)
••Central
Centralreaction
reactionin
in
glycolysis,
glycolysis,catalytic
catalytic
interconversion
interconversionof
of
DHAP
DHAPto
toGAP
GAP
T 128 (O3800/R14k-500) = 181 secs
••Demonstration
Demonstrationcase
case
1600
within
withinQUASI
QUASI
(Partners
(PartnersUZH,
UZH,and
and
BASF)
BASF)
1467
1487
1200
802
800
89%,139%
778
• QM region 35 atoms (DFT BLYP)
714
1030
428
419
400
– include residues with possible
proton donor/acceptor roles
– GAMESS-UK, MNDO, TURBOMOLE
508
540
274
257
213
308
196
0
8
16
32
• MM region (4,180 atoms + 2 link)
– CHARMM force-field, implemented
in CHARMM, DL_POLY
64
7 May 2002
Vampir 2.5
Visualization and
Analysis of
MPI
Programs
Performance Analysis of GAbased Applications using
Vampir
GAMESS-UK on High-end and
Commodity class machines
• extensions to handle GA applications
7 May 2002
GAMESS-UK / Si8O25H18 : 8 CPUs:
One DFT Cycle
7 May 2002
GAMESS-UK / Si8O25H18 : 8 CPUs
Q†HQ (GAMULT2) and PEIGS
7 May 2002
Materials Simulation Codes
Plane Wave DFT Codes:
• CASTEP
• VASP
• CPMD
Local Gaussian Basis Set Codes:
• CRYSTAL
These codes have similar
functionality, power and problems.
CASTEP is the flagship code of
UKCP and hence subsequent
discussions will focus on this.
This code presents a different set of
problems when considering
performance on HPC(x).
SIESTA and CONQUEST:
•
•
O(n) scaling codes which will be extremely attractive to users.
Both are currently development rather than production codes.
7 May 2002
CRYSTAL - 2000
!
!
Distributed Data implementation
Benchmark:
"
"
"
"
An Acid Centre in Zeolite-Y
Zeolite-Y (Faujasite
(Faujasite))
Single point energy
145 atoms / cell, No symmetry / 8k-points
2208 basis functions, (6-21G* )
21000
18000
CS2 Alpha Linux Cluster/667
Cray T3E/1200E
IBM SP/WH2-375
14000
Speed-up
Linear
Cray T3E/1200E
CS1 PIII/450 + FE
256
192
9962
T256 SGI Origin 3800 = 945 secs
144
128
7609
5633
7000
4028
2206
3241
1483
0
16
32
64
Number of CPUs
128
128.8
64
0
0
64
128
192
7 May 2002
256
Plane Wave Methods: CASTEP
r
k
r
ψ j (r ) =
r r 2
( k + G ) < E cut
∑
r
G
C
r
r r r
k
− i ( k + G ). r
r
j ,G
e
•
•
Direct minimisation of the total energy (avoiding diagonalisation)
•
Large number of basis functions N~106 (especially for heavy
atoms).
Pseudopotentials must be used to keep the number of plane
waves manageable
The plane wave expansion means that the bulk of the computation comprises
large 3D Fast Fourier Transforms (FFTs) between real and momentum space.
•
•
These are distributed across the processors in various ways.
The actual FFT routines are optimized for the cache size of the
processor.
7 May 2002
Parallelization of CASTEP
!
A Number of parallelization methods are implemented:
" k-point: a processor holds all the wavefunction for a k-point
(MPI_ALLTOALLV is NOT required) BUT for large unit cells Nk ⇒1 i.e
small CPU count.
" G-vector: a processor holds part of the wavefunction for all k-points
(MPI_ALLTOALLV is over ALL CPUs) i.e. biggest systems with 1 K point
" mixed kG: k-points are allocated amongst processors, the wavefunctions
sub-allocated amongst processors associated with their particular kpoints i.e. MPI_ALLTOALLV is over NCPUs / Nk - intermediate cases.
!
!
On HPC hardware the desired method is either k or kG as this
minimizes inter-processor communication, specifically
MPI_ALLTOALLV.
However, on large numbers of processors such distributions will still
be problematic. New algorithms will therefore need to be developed
to overcome latency problems.
7 May 2002
CASTEP 4.2 - Parallel Benchmark
Chabazite
1250
750
CS6 PIII/800 FE/MPICH
Cray T3E/1200E
IBM SP/WH2-375
IBM SP/Regatta-H
1228
G
1000
!
693
637
774
500
67%,75%
!
!
!
!
Acid sites in a zeolite. (Si11 O24
Al H)
Vanderbilt ulatrasoft pseudopotential
Pulay density mixing minimiser
scheme
single k point total energy, 96
bands
15045 plane waves on 3D FFT
grid size = 54x54x54;
convergence in 17 SCF cycles
422
367
368
330
324
271
235
228
250
147
272
205
195
204
0
16
CPUs
32
153
137
Time (comms)
89
IBM SP/WH2-375
Cray T3E/1200E
CS6 PIII/800+FE
CS9 P4/2000 +Myrinet
CS2 QSNet Alpha
SGI Origin 3800/R14k
157
90
600
242
115
111
71
7 May 2002
CASTEP 4.2 - Parallel Benchmark II.
TiN: A TiN 32 atom slab, 8 k-points, single point
energy calculation with Mulliken analysis,
6488
6000
4000
CS8 Itanium/800 + Myrinet 2k
Cray T3E/1200E
47%,81%
kG
3109
2815
2287
1781
2000
1740
1312
1096
800
536 560
0
32
64
128
CPUs
7 May 2002
CPMD - Car-Parrinello Molecular Dynamics
CPMD
Elapsed Time (secs.)
400
!
CS7 dual-Athlon K7/1000
CS9 P4/2000 Xeon + Myrinet (2CPU)
CS9 P4/2000 Xeon + Myrinet (1CPU)
CS2 Alpha Linux Cluster
IBM SP/WH2-375
SGI Origin 3800 / R14k-500
AlphaServer SC ES40 / 667
AlphaServer SC ES45 / 1000
390
259
253
221
204
Benchmark Example: Liquid Water
!
208
200
!
30%,53%
145
124
99
!
111
95
63
!
0
16
32
Version 3.5.1: Hutter, Alavi, Deutsh,
Bernasconi, St. Goedecker, Marx,
Tuckerman and Parrinello (1995-2001)
DFT, plane-waves, pseudo-potentials
and FFT's
Physical Specifications:
32 molecules, Simple cubic periodic box of
length 9.86 A, Temperature 300K
Electronic Structure;
Structure;
BLYP functional, Trouillier Martins
pseudopotential,
pseudopotential, Recriprocal space cutoff 70
Ry = 952 eV
CPMD is the base code for the new CCP1
flagship project
CPUs
Sprik and Vuilleumier (Cambridge)
7 May 2002
CPMD on High-end Computers
!
!
!
!
!
Performances on NH2 nodes
" 256 CPUs and 8 MPI tasks/node
including IO (RD30WFN /
3000
WR30WFN)
Parallel efficency on 256
" 390 Mflops per CPU, 99.8 Gflops
NH2 CPUs: 75%
" 252*252*252 - 8 MPI tpn
2000
" Use of ESSL routines instead of
Lapack; Some routines were
"OpenMPed”
1000
Mixed mode MPI and OMP (IBM)
CPMD is dominated by ESSL SMP
routines (ROTATE and OVLAP).
0
16
32
64
128
256
4 MPI tpn and 2 SMP tpn is 1.36 faster
than flat MPI on a given MESH
Number of CPUs
Power4 estimates: The key is
FFTCOM (MPall2all). Average speedup
of 2.0 on a R-H LPARd system
7 May 2002
ANGUS: Combustion modelling (regular grid)
Cray T3E/1200, High-end and Commodity-based Systems
Conjugate Gradient + ILU
Grid Size - 1443
100 Iterations
3000
CS9 P4/2000 Xeon+ Myrinet
Cray T3E/1200E
IBM SP/WH2-375
IBM SP/Regatta-H
2870
2380
2000
1886
1864
1600
879
800
1218
531
776
512
330
364
228
647
0
16
CPUs
Discretisation of equations is
performed using standard 2nd
order central differences on a
3D-grid.
35%,51%
1090
1000
Direct numerical simulations
(DNS) of turbulent pre-mixed
combustion solving the
augmented Navier-Stokes
equations for fluid flow.
Pressure solver utilises either a
conjugate gradient method with
modified incomplete LU preconditioner or a multi-grid solver
(both make extensive use of
Level 1 BLAS) or fast Fourier
transform.
32
7 May 2002
Memory Bandwidth Effects: The IBM SP and Alpha Linux System
IBM SP/WH2-375
Grid Size - 1443
Number of CPUs
Number of CPUs
776
8 Nodes
4 Nodes
2 Nodes
1 Node
32
16
751
16
8
4
4
4000
8000
12000
16 Nodes
8 Nodes
4 Nodes
2 Nodes
32
8
0
CS2 QSNet Alpha Cluster
0
2000
4000
6000
7 May 2002
High-end and Commodity-based Systems
Grid Size - 2883
10 Iterations
4500
CS7 AMD dual-K7/1000 MP + SCALI
CS2 Alpha Linux Cluster / EV67
CS8 dual-Itanium/800 + Myrinet
IBM SP/WH2-375
Direct numerical simulations
IBM SP/Regatta-H 1.3 GHz
4098
3000
2602
2348
(DNS) of turbulent pre-mixed
combustion solving the
augmented Navier-Stokes
equations for fluid flow.
1961
69%,147%
1500
1169 1161
1595
1206
884
563 661
548
819
399
204
360
0
16
32
64
CPUs
7 May 2002
Commodity Comparisons with High-end Systems:
Compaq AlphaServer SC ES45/1000 and the SGI O3800/R14k-500
CS9 - Pentium4/2000 Xeon + Myrinet 2k
% of 32-CPU AlphaServer SC Origin 3800
GAMESS-UK
SCF
DFT
DFT (Jfit
(Jfit))
DFT Gradient
MP2 Gradient
SCF Forces
95%
70-73%
59-70%
69%
59%
92%
135%
117-161%
92-111%
114%
78%
148%
NWChem (DFT Jfit)
Jfit)
65-88%
94-227%
GAMESS / CHARMM
89%
139%
Ewald-based
Ewald-based
bond constraints
44-53%
41%
71-84%
64%
CHARMM
CASTEP
CPMD
95%
103%
47-67%
30%
75-81%
53%
ANGUS
35-69%
51-147%
DL_POLY
7 May 2002
Summary
!
!
Computational Chemistry - Background
Commodity-based and High-end Systems
" Prototype Commodity Systems; CS1 - CS9
" High-end systems from Cray, SGI, IBM and
Compaq - Initial results from Regatta-H
" Performance Metrics
!
Application performance
" Molecular Simulation (distributed data)
! DL_POLY and CHARMM
" Electronic Structure
! Distributed data: GAs and PeIGS
! NWChem and GAMESS-UK
! VAMPIR and instrumenting the GA Tools
!
CS2 - QSNet Alpha
Linux Cluster
% of 32 CPU
O3800/R14k-500
GAMESS-UK
SCF
DFT
DFT (Jfit
(Jfit))
DFT Gradient
MP2 Gradient
SCF Forces
NWChem
99%
99%
89-122%
89%
87%
86%
112-234%
DL_POLY
Ewald-based
Ewald-based
bond constraints
95%
82%
CHARMM
78%
Cost-effectiveness of P4/2000 cluster with Myrinet-2k:
"
delivers on average 66%of AlphaServer ES45/1000,
109% of SGI Origin 3800/R14k-500
7 May 2002

Number of CPUs - Gamess-UK

Transcription

Similar documents

INDCs - Latin America

Camp Friedberg 2015!

warning boards sets • lighting bars • lighting equipment sets