Performance analysis of direct N-body algorithms on highly distributed systems
Alessia Gualandris
Astronomical Institute & Section Computational Science, University of Amsterdam

Outline
- The gravitational N-body problem
- A direct N-body code
- Parallel implementation of a direct N-body code
- Performance on a Beowulf cluster and on DAS-2
- Latency hiding
- Performance modeling
- Performance with GRAPE-6
- Timing experiments on the Grid

The gravitational N-body problem
Star clusters can be represented as N-body systems: N point particles interacting through the gravitational force. Problem: given masses, positions and velocities for each particle, solve the equations of motion according to Newton's inverse-square force law,

  a_i = G Σ_{j≠i} m_j (r_j − r_i) / |r_j − r_i|³

A direct code for the N-body problem
A direct N-body code
- computes the force on each particle by summing up the contributions from every other particle,
- solves the equations of motion for each body ⇒ trajectories.
Hermite integrator (4th order): a predictor-corrector scheme with hierarchical time-steps, where the time-steps are quantized to powers of 2 (both are sketched below, before the Summary).

Complexity of a direct N-body code
Direct N-body codes are O(N²): very expensive. Two ways to cope:
- general-purpose parallel computers
- special-purpose GRAPE hardware

Parallel algorithms
Two parallelization schemes:
- COPY scheme: each processor computes complete forces on a subgroup of particles
- RING scheme: each processor computes partial forces on all particles (sketched below)

Performance results: The Beowulf cluster (blue) @ SARA

Performance results: DAS-2

Performance results: The Beowulf cluster @ SARA
efficiency = T(1) / [p · T(p)]

Latency hiding
Ring algorithm with MPI non-blocking communication. In each shift of the systolic scheme the communication is split in two: the transfer of the positions and velocities is separated from the transfer of the accelerations and jerks ⇒ overlap of computation and communication (sketched below). Particularly beneficial when the communication time is not negligible compared to the computation time.

Performance results: BlueGene/L @ IBM Watson Research Center

Performance modeling
T_force = T_calc + T_comm (cpu speed, bandwidth, latency)
T_calc ∝ 1/p, T_comm ∝ p
Beowulf cluster (blue), shared time-step code: p_min = 102, p_eq = 24

Performance modeling: hierarchical time-step code
⟨s⟩ ∝ N^(2/3)
p_min ≈ (τ_f N ⟨s⟩ / τ_l)^(1/2)
(with τ_f the force-calculation time constant and τ_l the communication time constant: minimizing T_force(p) = τ_f N ⟨s⟩ / p + τ_l p with respect to p yields this p_min)

  N         ⟨s⟩     p_min   T_force (s)
  1024        10       10      0.01
  8192        41       58      0.2
  16384       65      104      0.7
  32768      102      182      2
  65536      162      326      6.4
  131072     260      584     20
  262144     410     1036     62
  1048576   1032     3290    610
  8388608   4128    18608  18850

Special-purpose hardware: the GRAPE-6
The GRAPE (short for GRAvity PipE) is special-purpose hardware designed to compute the gravitational interaction among particles. It is connected to a general-purpose host and is used as a back-end processor, on which the force calculation is performed. The rest of the computation, such as the orbit integration, is performed on the host computer.

Performance results: the GRAPE-6

MODESTA: a cluster of GRAPEs
http://modesta.science.uva.nl/
The MODESTA dedicated supercomputer consists of a cluster of 4 nodes with 240 Gflop/s GRAPE-6 special-purpose computers, set up in a Beowulf architecture. Each GRAPE-6 board consists of 2 modules with four chips each, i.e. 8 chips per board and 32 chips in total across the cluster.

Performance on the Grid: DAS-2 and the CrossGrid
The performance on large grids improves as the size of the N-body system increases. The cost of communication among nodes residing in different locations across Europe becomes more evident as the number of locations increases. The performance decreases only by about a factor of three for a large simulation.
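To make the O(N²) structure of the direct force calculation concrete, here is a minimal NumPy sketch of the summation described above (not the talk's code; the function name and the softening length eps are illustrative choices):

```python
import numpy as np

def direct_forces(pos, mass, G=1.0, eps=1e-4):
    """Direct summation: O(N^2) accelerations from Newton's
    inverse-square law.  The softening length eps (an illustrative
    choice, not from the talk) regularizes close encounters."""
    n = len(mass)
    acc = np.zeros_like(pos)              # pos: (n, 3) float array
    for i in range(n):
        dr = pos - pos[i]                 # separation vectors r_j - r_i
        r2 = (dr * dr).sum(axis=1) + eps**2
        r2[i] = 1.0                       # dummy value; self term zeroed below
        w = G * mass / (r2 * np.sqrt(r2)) # G m_j / |r_j - r_i|^3
        w[i] = 0.0                        # no self-interaction
        acc[i] = (w[:, None] * dr).sum(axis=0)
    return acc
```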
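The 4th-order Hermite predictor and the power-of-two quantization of the hierarchical time-steps can be sketched as follows (a sketch of the standard scheme, not the talk's implementation; the corrector step is omitted):

```python
import numpy as np

def predict(x, v, a, j, dt):
    """4th-order Hermite predictor: Taylor expansion of positions and
    velocities using the acceleration a and the jerk j = da/dt."""
    xp = x + v*dt + a*dt**2/2.0 + j*dt**3/6.0
    vp = v + a*dt + j*dt**2/2.0
    return xp, vp

def block_timestep(dt_est, dt_max=1.0):
    """Hierarchical time-steps: round the estimated step down to a
    power of two (at most dt_max) so that groups of particles stay
    synchronized at common block times."""
    return min(dt_max, 2.0 ** np.floor(np.log2(dt_est)))
```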
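A minimal mpi4py sketch of the RING (systolic) scheme, assuming a local force kernel pairwise_acc (a hypothetical helper, not from the talk) that skips self-pairs when the source and target subgroups coincide:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % p, (rank + 1) % p

def ring_forces(my_pos, my_mass, pairwise_acc):
    """RING scheme: each processor keeps its own subgroup while a
    travelling copy circulates around the ring; after p shifts every
    processor has accumulated the full force on its particles."""
    acc = np.zeros_like(my_pos)
    trav_pos, trav_mass = my_pos.copy(), my_mass.copy()
    for _ in range(p):
        # partial forces from the subgroup currently held
        acc += pairwise_acc(my_pos, trav_pos, trav_mass)
        # blocking shift of the travelling subgroup (cf. latency hiding below)
        trav_pos = comm.sendrecv(trav_pos, dest=right, source=left)
        trav_mass = comm.sendrecv(trav_mass, dest=right, source=left)
    return acc
```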
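The latency-hiding variant replaces the blocking shift with non-blocking transfers, so that one of the two buffers (e.g. positions and velocities) is in flight while the processor computes with the data it already holds. A minimal sketch of the overlap, assuming contiguous float64 buffers (names are illustrative):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % p, (rank + 1) % p

def shift_overlapped(buf_out, buf_in, kernel):
    """One shift of the systolic ring with latency hiding: start the
    non-blocking transfer of the travelling buffer, compute with the
    data already held, then wait for the transfer to complete.
    buf_out/buf_in: contiguous float64 arrays of equal shape;
    kernel: the local force computation to overlap (hypothetical)."""
    reqs = [comm.Isend([buf_out, MPI.DOUBLE], dest=right, tag=11),
            comm.Irecv([buf_in, MPI.DOUBLE], source=left, tag=11)]
    partial = kernel()            # computation proceeds during the transfer
    MPI.Request.Waitall(reqs)     # buf_in now holds the next subgroup
    return partial
```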
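The performance model above can be evaluated numerically. A small sketch, with an illustrative constant c_s in ⟨s⟩ = c_s · N^(2/3) chosen to roughly match the table (the real constants τ_f and τ_l are machine dependent):

```python
import numpy as np

def model_tforce(N, p, tau_f, tau_l, c_s=0.1):
    """T_force = T_calc + T_comm with T_calc ∝ 1/p and T_comm ∝ p.
    c_s is an illustrative constant in <s> = c_s * N**(2/3)."""
    s = c_s * N ** (2.0 / 3.0)
    return tau_f * N * s / p + tau_l * p

def model_pmin(N, tau_f, tau_l, c_s=0.1):
    """Setting dT_force/dp = 0 gives p_min = sqrt(tau_f * N * <s> / tau_l),
    the processor count that minimizes the execution time."""
    return np.sqrt(tau_f * N * c_s * N ** (2.0 / 3.0) / tau_l)
```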
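The host/back-end division of labour described in the GRAPE-6 slide can be sketched by reusing direct_forces and predict from the examples above. The FakeGrape class is purely a stand-in: the real GRAPE-6 is driven through its own C library, whose calls are not shown in the talk:

```python
class FakeGrape:
    """Stand-in for the GRAPE-6 back-end (hypothetical; the real
    hardware is accessed through a dedicated C library).  Only the
    O(N^2) force summation lives here."""
    def __init__(self, mass, G=1.0, eps=1e-4):
        self.mass, self.G, self.eps = mass, G, eps

    def forces(self, pos):
        return direct_forces(pos, self.mass, self.G, self.eps)

def host_step(x, v, a, j, dt, grape):
    """Host side of one shared time-step Hermite step: predict on the
    host, offload the force calculation to the back-end.  The corrector
    (which also needs the new jerks) is omitted for brevity."""
    xp, vp = predict(x, v, a, j, dt)
    a_new = grape.forces(xp)      # the expensive part runs on the GRAPE
    return xp, vp, a_new
```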
Summary & Conclusions
- A direct N-body code for the simulation of star clusters has O(N²) complexity ⇒ very expensive.
- The performance can be improved by means of general-purpose parallel computers or special-purpose hardware (GRAPE).
- The use of GRAPE hardware for the force calculation reduces the execution times by about 2 orders of magnitude with respect to a single node in a Beowulf cluster (for the same accuracy).
- Numerical simulations of globular clusters (≈ a million stars) may soon be feasible on small clusters of GRAPEs or massively parallel supercomputers (like BlueGene).
- Grid technology appears very promising, but not yet reliable and stable.