Performance analysis of direct N-body
algorithms on highly distributed systems
Alessia Gualandris
Astronomical Institute & Section Computational Science
University of Amsterdam
Outline
The gravitational N-body problem
A direct N-body code
Parallel implementation of a direct N-body code
Performance on a Beowulf cluster and on DAS-2
Latency hiding
Performance modeling
Performance with GRAPE-6
Timing experiments on the Grid
The gravitational N-body problem
Star clusters can be represented as N-body systems
N point particles interacting through the gravitational force
Problem:
given masses, positions,
velocities for each particle,
solve the equations of motion
according to Newton's inverse
square force law
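Written out, the problem above is the standard set of coupled equations of motion (not specific to this talk):

$$
\ddot{\mathbf r}_i \;=\; \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{G\, m_j\, (\mathbf r_j - \mathbf r_i)}{|\mathbf r_j - \mathbf r_i|^{3}}, \qquad i = 1, \dots, N,
$$

where m_j and r_j are the mass and position of particle j and G is the gravitational constant.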
A direct code for the N-body problem
A direct N-body code
computes the force on each particle by summing up
the contributions from every other particle
solves the equations of motion for each body
⇒ trajectories
Hermite integrator (4th order)
predictor-corrector scheme
hierarchical time-steps:
the time-steps are quantized to
powers of 2
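To make the direct-summation step concrete, here is a minimal C sketch (not the code used in the talk) of the O(N²) kernel that a 4th-order Hermite integrator relies on: for each particle it accumulates the acceleration and its time derivative (the jerk) due to all other particles. N-body units (G = 1), pre-allocated arrays and no softening are assumed.

#include <math.h>

/* Direct summation over all pairs: O(N^2).  For particle i, accumulate
   the acceleration and the jerk (needed by the Hermite predictor)
   contributed by every other particle j.                              */
void compute_forces(int n, const double m[],
                    const double pos[][3], const double vel[][3],
                    double acc[][3], double jerk[][3])
{
    const double G = 1.0;                       /* N-body units assumed */
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < 3; k++) { acc[i][k] = 0.0; jerk[i][k] = 0.0; }
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dr[3], dv[3];
            for (int k = 0; k < 3; k++) {
                dr[k] = pos[j][k] - pos[i][k];
                dv[k] = vel[j][k] - vel[i][k];
            }
            double r2 = dr[0]*dr[0] + dr[1]*dr[1] + dr[2]*dr[2];
            double rv = dr[0]*dv[0] + dr[1]*dv[1] + dr[2]*dv[2];
            double r3 = r2 * sqrt(r2);
            for (int k = 0; k < 3; k++) {
                acc[i][k]  += G * m[j] * dr[k] / r3;
                jerk[i][k] += G * m[j] * (dv[k] - 3.0 * rv * dr[k] / r2) / r3;
            }
        }
    }
}

In the hierarchical time-step scheme only the block of particles whose time-step has elapsed is advanced at each step; that machinery is omitted here.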
Complexity of a direct N-body code
Direct N-body codes are O(N²): very expensive
Two ways to speed up the calculation:
General-purpose parallel computers
Special-purpose GRAPE hardware
Parallel algorithms
Two parallelization schemes:
COPY scheme: each processor
computes complete forces on a
subgroup of particles
RING scheme: each processor
computes partial forces on all
particles
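A minimal C/MPI sketch of the RING (systolic) scheme follows; it is schematic and not the implementation benchmarked in the talk. Each of the p processors keeps n_loc local particles plus a traveling buffer of the same size: at every shift it adds the partial forces due to its local particles to the buffer and passes the buffer to the next processor in the ring, so after p shifts each buffer has collected the complete forces and is back at its home processor. The body layout and the partial_forces() kernel (the pairwise loop from the previous sketch, accumulating into the buffer and skipping self-pairs) are assumptions made for illustration.

#include <mpi.h>

typedef struct { double pos[3], vel[3], acc[3], jerk[3], mass; } body;

/* assumed kernel: add the contribution of the n_loc local particles to
   the acc/jerk stored in the traveling buffer, skipping self-pairs     */
void partial_forces(int n_loc, const body *local, body *buf);

void ring_forces(int n_loc, const body *local, body *buf, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int next = (rank + 1) % p;
    int prev = (rank + p - 1) % p;
    int nbytes = n_loc * (int)sizeof(body);  /* raw bytes: homogeneous nodes assumed */

    /* buf enters as a copy of the local block with acc/jerk zeroed */
    for (int shift = 0; shift < p; shift++) {
        partial_forces(n_loc, local, buf);
        MPI_Sendrecv_replace(buf, nbytes, MPI_BYTE, next, 0,
                             prev, 0, comm, MPI_STATUS_IGNORE);
    }
    /* after p shifts, buf is back at its home processor and holds the
       complete forces on the block of particles that started here      */
}

The COPY scheme needs no traveling buffer: every processor holds a full copy of the positions, computes the complete forces on its own subgroup, and the results are combined with a collective such as MPI_Allgatherv.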
Performance results:
The Beowulf cluster (blue) @ SARA
Performance results: DAS-2
Performance results:
The Beowulf cluster @ SARA
efficiency = T(1) / [p T(p)]
Latency hiding
Ring algorithm with MPI non-blocking communication
In each shift of the systolic scheme the communication is
split in two: the transfer of the positions and velocities is
separated from the transfer of accelerations and jerks
⇒ overlap of computation and communication
This is particularly useful when the communication time is not negligible compared to the computation time.
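As a sketch of what such a shift can look like with non-blocking MPI (schematic, not the talk's code): the positions and velocities of the block are forwarded at once, so the neighbour can start its force loop, while the accumulated accelerations and jerks follow in a second, separate message after the local contribution has been added. A flat layout of 7 doubles (position, velocity, mass) and 6 doubles (acceleration, jerk) per particle is assumed.

#include <mpi.h>

/* One ring shift with split, non-blocking communication (sketch).
   pv/aj: pos+vel+mass and acc+jerk of the block currently held;
   pv_in/aj_in: receive buffers for the next incoming block.            */
void ring_shift_overlapped(int n_loc,
                           double *pv, double *aj,
                           double *pv_in, double *aj_in,
                           int next, int prev, MPI_Comm comm)
{
    MPI_Request req[4];
    int npv = 7 * n_loc;   /* 3 pos + 3 vel + mass per particle (assumed) */
    int naj = 6 * n_loc;   /* 3 acc + 3 jerk per particle (assumed)       */

    /* first message: forward pos/vel right away and post the matching
       receive, so the neighbour can start computing before our acc/jerk
       are ready                                                          */
    MPI_Isend(pv,    npv, MPI_DOUBLE, next, 1, comm, &req[0]);
    MPI_Irecv(pv_in, npv, MPI_DOUBLE, prev, 1, comm, &req[1]);

    /* ... compute the partial forces of the local particles on the block
       held in pv, accumulating into aj; this overlaps with the transfer
       above (reading a send buffer during a pending Isend is permitted
       since MPI-3.0) ...                                                 */

    /* second message: the accumulated accelerations and jerks            */
    MPI_Isend(aj,    naj, MPI_DOUBLE, next, 2, comm, &req[2]);
    MPI_Irecv(aj_in, naj, MPI_DOUBLE, prev, 2, comm, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}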
Performance results:
BlueGene/L @ IBM Watson Research Center
Performance modeling
Tforce = Tcalc + Tcomm (CPU speed, bandwidth, latency)
Tcalc ∝ 1 / p
Tcomm ∝ p
Beowulf cluster (blue),
shared time-step code:
pmin = 102
peq = 24
Performance modeling
hierarchical time-step code
<s> ∝ N^(2/3)
pmin ≈ (τf N <s> / τl)^(1/2)
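One way to read this estimate (an interpretation, not spelled out on the slide): if the calculation time falls as Tcalc ≈ τf N <s> / p while the communication time grows as Tcomm ≈ τl p, with τf a per-interaction compute cost and τl a per-processor communication cost, then the total force time is minimized where the two terms balance:

$$
T_{\mathrm{force}}(p) \;\simeq\; \frac{\tau_f\, N \langle s \rangle}{p} + \tau_l\, p,
\qquad
\frac{dT_{\mathrm{force}}}{dp} = 0
\;\;\Rightarrow\;\;
p_{\min} \;\simeq\; \left(\frac{\tau_f\, N \langle s \rangle}{\tau_l}\right)^{1/2}.
$$

The table that follows gives <s>, pmin and Tforce for a range of N.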
N          <s>    pmin    Tforce (s)
1024        10      10       0.01
8192        41      58       0.2
16384       65     104       0.7
32768      102     182       2
65536      162     326       6.4
131072     260     584      20
262144     410    1036      62
1048576   1032    3290     610
8388608   4128   18608   18850
Special purpose hardware:
The GRAPE-6
The GRAPE (short for GRAvity PipE) is special-purpose hardware designed to compute the gravitational interaction among particles.
It is connected to a general-purpose host and is used as a back-end processor, on which the force calculation is performed.
The rest of the computation, such as the orbit integration, is performed on the host computer.
Performance results:
The GRAPE-6
MODESTA: a cluster of GRAPEs
http://modesta.science.uva.nl/
The MODESTA dedicated supercomputer consists of a cluster of 4 nodes with 240 Gflop/s GRAPE-6 special-purpose computers, set up in a Beowulf architecture.
Each GRAPE-6 board consists of 2 modules with four chips each
⇒ 8 chips per board, 32 chips in total across the 4 nodes.
Performance on the Grid:
DAS-2 and the CrossGrid
The performance on large
grids improves as the size of the
N-body system increases.
The cost of communication among nodes residing in different locations across Europe becomes more evident as the number of locations increases.
Even so, the performance decreases only by about a factor of three for a large simulation.
Summary & Conclusions
A direct N-body code for the simulation of star clusters has O(N²) complexity ⇒ very expensive.
The performance can be improved by means of general-purpose parallel computers or special-purpose hardware (GRAPE).
The use of GRAPE hardware for the force calculation reduces
the execution times by about 2 orders of magnitude with respect
to a single node in a Beowulf cluster (for the same accuracy).
Numerical simulations of globular clusters (≈ a million stars)
may soon be feasible on small clusters of GRAPEs or massively
parallel supercomputers (like BlueGene). Grid technology appears very promising but is not yet reliable and stable.