Parallel Algorithms for Short-Range Molecular Dynamics
What is Molecular Dynamics? Biophysics Application Project
Emiliano Ippoliti
September 14, 2011

Contents

1 Introduction
1.1 The role of computer experiments
1.2 What is molecular dynamics?
1.3 Some historical notes
1.4 Limitations
1.4.1 Use of classical forces
1.4.2 Realism of forces
1.4.3 Time and size limitations

2 The basic machinery
2.1 Modeling the physical system
2.1.1 Lennard-Jones potential
2.1.2 Potential truncation and long-range corrections
2.2 Periodic boundary conditions
2.2.1 The minimum image criterion
2.3 Time integration algorithm
2.4 The Verlet algorithm
2.4.1 Velocity Verlet algorithm
2.4.2 Leap-frog algorithm
2.4.3 Error terms

3 Simple statistical quantities
3.1 Potential energy
3.2 Kinetic energy
3.3 Total energy
3.4 Temperature
3.5 Radial distribution function
3.6 MSD and VACF
3.7 Pressure

4 Computational aspects of MD
4.1 MD algorithms
4.2 Long- vs short-range force model
4.3 Neighbor lists and link-cell method
4.4 Parallel algorithms

5 Computational aspects of our model
5.1 Reduced units
5.2 Simulation parameters

6 Atom-decomposition algorithm
6.1 Expand and fold operations
6.2 The simpler algorithm
6.3 Using Newton's 3rd law
6.4 Load-balance

7 Force-decomposition algorithm
7.1 The simpler algorithm
7.2 Using Newton's 3rd law
7.3 Load-balance

8 Spatial-decomposition algorithm
8.1 The communication scheme
8.2 The simpler algorithm
8.2.1 Communication scaling
8.2.2 Computational scaling
8.2.3 Data structures' update
8.3 Using Newton's 3rd law
8.4 Load-balance

References

1 Introduction

1.1 The role of computer experiments

Computer experiments play a very important role in science today. In the past, the physical sciences were characterized by an interplay between experiment and theory. In experiment, a system is subjected to measurements, and results, expressed in numeric form, are obtained. In theory, a model of the system is constructed, usually in the form of a set of mathematical equations. The model is then validated by its ability to describe the system behavior in a few selected cases, simple enough to allow a solution to be computed from the equations. In many cases, this implies a considerable amount of simplification in order to eliminate all the complexities invariably associated with real world problems, and make the problem solvable.

In the past, theoretical models could be easily tested only in a few simple special circumstances. So, for instance, in condensed matter physics a model for intermolecular forces in a specific material could be verified in a diatomic molecule, or in a perfect, infinite crystal. Even then, approximations were often required to carry out the calculation. Unfortunately, many physical problems of extreme interest (both academic and practical) fall outside the realm of these "special circumstances." Among them, one could mention the physics and chemistry of organic molecules, involving a large number of degrees of freedom.

The advent of high-speed computers — which started to be used in the 1950s — altered the picture by inserting a new element right in between experiment and theory: the computer experiment. In a computer experiment, a model is still provided by theorists, but the calculations are carried out by the machine by following a "recipe" (the algorithm, implemented in a suitable programming language). In this way, complexity can be introduced (still with caution!) and more realistic systems can be investigated, opening a road towards a better understanding of real experiments.

Needless to say, the development of computer experiments altered substantially the traditional relationship between theory and experiment. On one side, computer simulations increased the demand for accuracy of the models. For instance, a molecular dynamics simulation allows us to evaluate the melting temperature of a material, modeled by means of a certain interaction law. This is a difficult test for the theoretical model to pass — and a test that has not been available in the past. Therefore, simulation "brings to life" the models, disclosing critical areas and providing suggestions to improve them.
On the other side, simulation can often come very close to experimental conditions, to the extent that computer results can sometimes be compared directly with experimental results. When this happens, simulation becomes an extremely powerful tool not only to understand and interpret the experiments at the microscopic level, but also to study regions which are not accessible experimentally, or which would imply very expensive experiments, such as under extremely high pressure.

Last but not least, computer simulations allow thought experiments — things which are just impossible to do in reality, but whose outcome greatly increases our understanding of phenomena — to be realized. Fantasy and creativity are important qualities for the computer simulator!

1.2 What is molecular dynamics?

We call molecular dynamics (MD) a computer simulation technique where the time evolution of a set of interacting atoms is followed by integrating their equations of motion. In molecular dynamics we follow the laws of classical mechanics, and most notably Newton's 2nd law:

Fi = mi ai    (1)

for each atom i in a system constituted by N atoms. Here, mi is the atom mass, ai = d²ri/dt² its acceleration, and Fi the force acting upon it, due to the interactions with other atoms. Therefore, molecular dynamics is a deterministic technique: given an initial set of positions and velocities, the subsequent time evolution is in principle completely determined (in practice, the finiteness of the integration time step and arithmetic rounding errors will eventually cause the computed trajectory to deviate from the true trajectory). In more pictorial terms, atoms will "move" in the computer, bumping into each other, wandering around (if the system is fluid), oscillating in waves in concert with their neighbors, and so on, in a way pretty similar to what atoms in a real substance would do.

The computer calculates a trajectory in a 6N-dimensional phase space (3N positions and 3N momenta). However, such a trajectory is usually not particularly relevant by itself. Molecular dynamics is a statistical mechanics method: it is a way to obtain a set of configurations distributed according to some statistical distribution function, or statistical ensemble. According to statistical physics, physical quantities are represented by averages over configurations distributed according to a certain statistical ensemble. A trajectory obtained by molecular dynamics provides such a set of configurations. Therefore, a measurement of a physical quantity by simulation is simply obtained as an arithmetic average of the various instantaneous values assumed by that quantity during the MD run.

Statistical physics is the link between the microscopic behavior and thermodynamics. In the limit of very long simulation times, one could expect the phase space to be fully sampled, and in that limit this averaging process would yield the thermodynamic properties. In practice, the runs are always of finite length, and one should exert caution to estimate when the sampling may be good ("system at equilibrium") or not. In this way, MD simulations can be used to measure thermodynamic properties.

1.3 Some historical notes

Molecular dynamics is concerned with simulating the motion of molecules to gain a deeper understanding of several physical phenomena that derive from molecular interactions. These studies include not only the motion of many molecules as in a fluid, but also the motion of a single large molecule consisting of hundreds or thousands of atoms, as in a protein.
Computers are a critically important tool for these studies because there simply is no other way to trace the motion of a large number of interacting particles. The earliest of these computations were done in the 1950s by Berni Alder and Tom Wainwright at Lawrence Livermore National Laboratory [1]. They studied the distribution of molecules in a liquid, using a model in which the molecules are represented as "hard spheres" that interact like billiard balls. Using the fastest computer at that time, an IBM 704, they were able to simulate the motions of 32 and 108 molecules in computations requiring 10 to 30 hours. Now it is possible to perform hard-sphere computations on systems of over a billion particles.

Another crucial paper appeared in 1967 [2]. Loup Verlet introduced the concept of the neighbor list and the Verlet time integration algorithm, which we will meet in the next lessons, to calculate the phase diagram of argon and to compute correlation functions to test theories of the liquid state. Verlet studied a molecular model more realistic than the hard-sphere one, known as the Lennard-Jones model (the same model we will study in this course). The Lennard-Jones model was employed also by Lawrence Hannon, George Lie, and Enrico Clementi in 1986 at IBM Kingston to study the flow of fluids [3]. In these computations the fluid was represented by ∼10⁴ interacting molecules. Even though this is minuscule compared with the number of molecules in a gram of water, the behavior of the flow was like that in a real fluid.

Another class of molecular dynamics computations is concerned with the internal motion of molecules, especially proteins and nucleic acids, vital components of biological systems. The goal in this case is to gain a better understanding of the function of these molecules in biochemical reactions. Interestingly, it has been found that quantum effects do not have much influence on the overall dynamics of proteins except at low temperatures. Thus, classical mechanics is sufficient to model the motions, but still the computational power required for following the motion of a large molecule is enormous. For example, simulating the motion of a 1,500-atom molecule, a small protein, for a time interval of 10⁻⁷ seconds is a 6-hour computation on a modern Intel Xeon quad-core processor.

Martin Karplus and Andrew McCammon, in an interesting article, "The Molecular Dynamics of Proteins" [4], describe a discovery concerning the molecule myoglobin that could only have been made through molecular dynamics. The interest in myoglobin comes about because it stores oxygen in biological systems. Whales, for example, are able to stay under water for long periods of time because of a large amount of myoglobin in their bodies. It was known that the oxygen molecule binds to a particular site in the myoglobin molecule, but it was not understood how the binding could take place. X-ray crystallography work shows that large protein molecules tend to be folded up into compact three-dimensional structures, and in the case of myoglobin the oxygen sites were known to lie in the interior of such a structure. There did not seem to be any way that an oxygen atom could penetrate the structure to reach the binding site. Molecular dynamics provided the answer to this puzzle.
The picture of a molecule provided by X-ray crystallography shows the average positions of the atoms in the molecule. In fact these atoms are in continuous motion, vibrating about positions of equilibrium. Molecular dynamics simulation of the internal motion of the molecule showed that a path to the site, wide enough for an oxygen atom, could open up for a short period of time.

Scientific visualization is particularly important for understanding the results of a molecular dynamics simulation. The millions of numbers representing a history of the positions and velocities of the particles are not a very revealing picture of the motion. How can one recognize the formation of a vortex in this mass of data, or the nature of the bending and stretching of a large molecule? How can one gain new insights? Pictures and animations enable the scientist to literally see the formation of vortices, to view protein bending, and thus to gain new insights into the details of these phenomena.

Computations that involve following the motion of a large number of interacting particles, whether the particles are atoms in a molecule, molecules of a fluid or solid, or even particles in the discrete model of a vibrating string or membrane, are similar in the following respect. They involve a long series of time steps, at each of which Newton's laws are used to determine the new positions and velocities from the old positions and velocities and the forces. The computations are quite simple, but there are many of them. To achieve accuracy, the time steps must be quite small, and therefore many are required to simulate a significantly long real time interval. In computational chemistry the time step is typically about a femtosecond (10⁻¹⁵ seconds), and the total computation may represent ∼10⁻⁸ seconds, which could cost about 100 hours of computer time. The amount of computation at each step can be quite extensive. In a system consisting of N particles the force computations at each step may involve O(N²) operations. Thus it is easy to see that these computations can consume a large number of machine cycles. In addition, animation of the motion of the system can make large demands on memory.

1.4 Limitations

Molecular dynamics is a very powerful technique but has — of course — limitations. We quickly examine the most important of them.

1.4.1 Use of classical forces

One could immediately ask: how can we use Newton's law to move atoms, when everybody knows that systems at the atomistic level obey quantum laws rather than classical laws, and that Schrödinger's equation is the one to be followed?

A simple test of the validity of the classical approximation is based on the de Broglie thermal wavelength [5], defined as

Λ = √( 2πħ² / (M kB T) )    (2)

where M is the atomic mass and T the temperature. The classical approximation is justified if Λ ≪ a, where a is the mean nearest-neighbor separation. If one considers for instance liquids at the triple point, Λ/a is of the order of 0.1 for light elements such as Li and Ar, decreasing further for heavier elements. The classical approximation is poor for very light systems such as H₂, He, Ne. Moreover, quantum effects become important in any system when T is sufficiently low. The drop in the specific heat of crystals below the Debye temperature [6], or the anomalous behavior of the thermal expansion coefficient, are well-known examples of measurable quantum effects in solids. Molecular dynamics results should be interpreted with caution in these regions.
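As a quick numerical illustration of this criterion, the short Python sketch below evaluates Λ from equation (2) and compares it with the mean nearest-neighbor separation. The argon values used here (T ≈ 94 K, a ≈ 3.7 Å) are illustrative assumptions, not taken from these notes.

# de Broglie thermal wavelength test for the classical approximation, eq. (2).
# Assumed illustrative values: liquid argon near its triple point.
import math

hbar = 1.054571817e-34    # J s
kB   = 1.380649e-23       # J/K
amu  = 1.66053906660e-27  # kg

M = 39.948 * amu          # argon atomic mass (kg)
T = 94.0                  # temperature (K), assumed
a = 3.7e-10               # mean nearest-neighbor separation (m), assumed

Lam = math.sqrt(2.0 * math.pi * hbar**2 / (M * kB * T))
print(f"Lambda = {Lam*1e10:.3f} Angstrom, Lambda/a = {Lam/a:.3f}")
# Lambda/a comes out of the order of 0.1, so a classical treatment of Ar is justified.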
1.4.2 Realism of forces

How realistic is a molecular dynamics simulation?

In molecular dynamics, atoms interact with each other. These interactions originate forces that act upon atoms, and atoms move under the action of these instantaneous forces. As the atoms move, their relative positions change and forces change as well. The essential ingredient containing the physics is therefore constituted by the forces. A simulation is realistic, i.e. it mimics the behavior of the real system, only to the extent that the interatomic forces are similar to those that real atoms (or, more exactly, nuclei) would experience when arranged in the same configuration.

As we will see in the next section, forces are usually obtained as the gradient of a potential energy function depending on the positions of the particles. The realism of the simulation therefore depends on the ability of the chosen potential to reproduce the behavior of the material under the conditions at which the simulation is run. The problem of selecting, or constructing, potentials (the so-called force-field problem) is still under study by the scientific community and is beyond the scope of these short notes.

1.4.3 Time and size limitations

Typical MD simulations can nowadays be performed on systems containing hundreds of thousands, or perhaps millions, of atoms, and for simulation times ranging from a few nanoseconds to more than one microsecond. While these numbers are certainly respectable, one may run into conditions where time and/or size limitations become important.

A simulation is "safe" from the point of view of its duration when the simulation time is much longer than the relaxation time of the quantities we are interested in. However, different properties have different relaxation times. In particular, systems tend to become slow and sluggish in the proximity of phase transitions, and it is not uncommon to find cases where the relaxation time of a physical property is orders of magnitude larger than times achievable by simulation.

A limited system size can also constitute a problem. In this case one has to compare the size of the MD cell with the correlation lengths of the spatial correlation functions of interest. Again, correlation lengths may increase or even diverge in the proximity of phase transitions, and the results are no longer reliable when they become comparable with the box length. This problem can be partially alleviated by a method known as finite size scaling. This consists of computing a physical property A using several boxes with different sizes L, and then fitting the results to the relation

A(L) = A₀ + c/Lⁿ

using A₀, c, n as fitting parameters. A₀ then corresponds to lim_{L→∞} A(L), and should therefore be taken as the most reliable estimate for the "true" physical quantity.

2 The basic machinery

2.1 Modeling the physical system

The main ingredient of a simulation is a model for the physical system. For a molecular dynamics simulation this amounts to choosing the potential, a function U(r1, ..., rN) of the positions of the nuclei, representing the potential energy of the system when the atoms are arranged in that specific configuration. This function is translationally and rotationally invariant, and is usually constructed from the relative positions of the atoms with respect to each other, rather than from the absolute positions. Forces are then derived as (minus) the gradients of the potential with respect to the atomic displacements:

Fi = −∇ri U(r1, ..., rN)    (3)
This form implies the presence of a conservation law for the total energy E = K + U, where K is the instantaneous kinetic energy:

K = Σi (1/2) mi vi²    (4)

where vi is the velocity of particle i.

The simplest choice for U is to write it as a sum of pairwise interactions:

U(r1, ..., rN) = Σi Σj>i φ(|ri − rj|)    (5)

The clause j > i in the second summation has the purpose of considering each atom pair only once. In the past most potentials were constituted by pairwise interactions, but this is no longer the case. It has been recognized that the two-body approximation is very poor for many relevant systems, such as metals and semiconductors. Various kinds of many-body potentials are now in common use in condensed matter simulation. The development of accurate potentials represents an important research line. In this course, however, we will take into account only the most commonly used pairwise interaction model: the Lennard-Jones pair potential, which has been proved to correctly describe monoatomic liquid systems such as, for example, liquid argon, as we will see in the next section.

2.1.1 Lennard-Jones potential

The Lennard-Jones 12-6 potential (LJ) is given by the expression

φLJ(r) = 4ε [ (σ/r)¹² − (σ/r)⁶ ]    (6)

for the interaction potential between a pair of atoms. The total potential of a system containing many atoms is then given by (5). This potential has an attractive tail at large r (i.e. a negative value, since we put the arbitrary zero of the energy at the value of the potential when the two particles are infinitely far apart); it reaches a minimum around 1.122σ, and it is strongly repulsive at shorter distance, passing through 0 at r = σ and increasing steeply as r is decreased further (Fig. 1).

Figure 1: The Lennard-Jones 12-6 potential given by (6).

The term ∼1/r¹², dominating at short distance, models the repulsion between atoms when they are brought very close to each other. Its physical origin is related to the Pauli principle: when the electronic clouds surrounding the atoms start to overlap, the energy of the system increases abruptly. The exponent 12 was chosen exclusively on a practical basis: equation (6) is particularly easy to compute. In fact, on physical grounds an exponential behavior would be more appropriate.

The term ∼1/r⁶, dominating at large distance, constitutes the attractive part. This is the term that gives cohesion to the system. A 1/r⁶ attraction originates from van der Waals dispersion forces, i.e. dipole-dipole interactions due to fluctuating dipoles. These are rather weak interactions, which however dominate the bonding character of closed-shell systems, that is, rare gases such as Ar or Kr. Therefore, these are the materials that an LJ potential can mimic fairly well (many-body effects in the interaction are however present also in these systems and, in all cases, pair potentials more accurate than LJ have been developed for rare gases). The parameters ε and σ are chosen to fit the physical properties of the material.

On the other hand, an LJ potential is not at all adequate to model situations with open shells, where strong localized bonds may form (as in covalent systems), or where there is a delocalized "electron sea" in which the ions sit (as in metals). In these systems the two-body interaction scheme itself fails very badly. We will not deal with these cases further.
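To make equation (6) concrete, the following minimal Python sketch (illustrative only, written in the reduced units σ = ε = 1 adopted later in these notes) evaluates the LJ 12-6 potential and checks the features just described: the zero at r = σ and the minimum at r = 2^(1/6)σ ≈ 1.122σ.

# Lennard-Jones 12-6 potential of eq. (6), in reduced units (sigma = epsilon = 1).
def phi_lj(r, sigma=1.0, epsilon=1.0):
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

r_min = 2.0 ** (1.0 / 6.0)            # position of the minimum, ~1.122 sigma
print(phi_lj(1.0))                    # 0.0: the potential crosses zero at r = sigma
print(r_min, phi_lj(r_min))           # the minimum value is -epsilon at r = 2^(1/6) sigma
print(phi_lj(0.9), phi_lj(3.0))       # strongly repulsive inside sigma, weak attractive tail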
However, regardless of how well it is able to model actual physical systems, the LJ 12-6 potential constitutes nowadays an extremely important model system. There is a vast body of papers that investigated the behavior of atoms interacting via LJ on a variety of different geometries (solids, liquids, surfaces, clusters, two-dimensional systems, etc.). One could say that LJ is the standard potential to use for all the investigations where the focus is on fundamental issues, rather than studying the properties of a specific material. The simulation work done on LJ systems helped us (and still does) to understand basic points in many areas of condensed matter physics, and consequently of biophysics, and for this reason the importance of LJ cannot be overestimated. When using the LJ potential in simulation, it is customary to work in a system of units where σ = 1 and ε = 1. All the codes in this course follow this convention.

2.1.2 Potential truncation and long-range corrections

The potential in eq. (6) has an infinite range. In practical applications, it is customary to establish a cutoff radius Rc and disregard the interactions between atoms separated by more than Rc. This results in simpler programs and enormous savings of computer resources, because the number of atomic pairs separated by a distance r grows as r² and becomes quickly huge.

A simple truncation of the potential creates a new problem though: whenever a particle pair "crosses" the cutoff distance, the energy makes a little jump. A large number of these events is likely to spoil energy conservation in a simulation. To avoid this problem, the potential is often shifted in order to vanish at the cutoff radius:

φ(r) = φLJ(r) − φLJ(Rc)   if r ≤ Rc
φ(r) = 0                  if r > Rc    (7)

(Shifting eliminates the energy discontinuity, but not the force discontinuity. Some researchers have altered the form of the potential near Rc to obtain a smoother junction, but there is no standard way to do that.)

Physical quantities are of course affected by the potential truncation. The effects of truncating a full-ranged potential can be approximately estimated by treating the system as a uniform (constant density) continuum beyond Rc. For a bulk system (periodic boundary conditions along each direction, see next section) this usually amounts to a constant additive correction. For example, the potential tail (attractive) brings a small additional contribution to the cohesive energy and to the total pressure. Commonly used truncation radii for the LJ potential are 2.5σ and 3.2σ.

It should be mentioned that these truncated Lennard-Jones models are so popular that they have acquired a value of their own as reference models for generic two-body systems (as also discussed at the end of the previous section). In many cases, there is no interest in evaluating truncation corrections because the truncated model itself is the subject of the investigation. This is not our case, and we will use long-range correction terms in our code. Moreover, we choose to adopt the smaller of the commonly used truncation radii (2.5σ).

2.2 Periodic boundary conditions

What should we do at the boundaries of our simulated system? One possibility is doing nothing special: the system simply terminates, and atoms near the boundary would have fewer neighbors than atoms inside. In other words, the sample would be surrounded by surfaces. Unless we really want to simulate a cluster of atoms, this situation is not realistic.
No matter how large the simulated system is, its number of atoms N would be negligible compared with the number of atoms contained in a macroscopic piece of matter (of the order of 10²³), and the ratio between the number of surface atoms and the total number of atoms would be much larger than in reality, causing surface effects to be much more important than they should be.

A solution to this problem is to use periodic boundary conditions (PBC). When using PBC, particles are enclosed in a box, and we can imagine that this box is replicated to infinity by rigid translation along all three Cartesian directions, completely filling the space. In other words, if one of our particles is located at position r in the box, we assume that this particle really represents an infinite set of particles located at r + la + mb + nc (l, m, n = −∞, ..., ∞), where l, m, n are integer numbers, and we have indicated with a, b, c the vectors corresponding to the edges of the box. All these "image" particles move together, and in fact only one of them is represented in the computer program.

The key point is that now each particle i in the box should be thought of as interacting not only with the other particles j in the box, but also with their images in nearby boxes. That is, interactions can "go through" box boundaries. In fact, one can easily see that (a) we have virtually eliminated surface effects from our system, and (b) the position of the box boundaries has no effect (i.e. a translation of the box with respect to the particles leaves the forces unchanged).

Apparently, the number of interacting pairs increases enormously as an effect of PBC. In practice, this is not true because potentials, like the LJ one, usually have a short interaction range. The minimum image criterion discussed next simplifies things further, reducing to a minimum the level of additional complexity introduced in a program by the use of PBC. Finally, it is worth pointing out that even if PBC allow us to describe an infinite system of particles that approaches a real system more and more closely, the PBC system is still not an exact representation of reality: in fact, PBC introduce an artificial symmetry, translational invariance, not present in the real system.

2.2.1 The minimum image criterion

Let us suppose that we are using a potential with a finite range, like the LJ one: when separated by a distance equal to or larger than a cutoff distance Rc, two particles do not interact with each other. Let us also suppose that we are using a box whose size is larger than 2Rc along each Cartesian direction. When these conditions are satisfied, it is obvious that at most one among all the pairs formed by a particle i in the box and the set of all the periodic images of another particle j will interact.

To demonstrate this, let us suppose that i interacts with two images j1 and j2 of j. Two images must be separated by a translation vector bringing one box into another, whose length is at least 2Rc by hypothesis. In order to interact with both j1 and j2, i should be within a distance Rc from each of them, which is impossible since they are separated by more than 2Rc. When we are in these conditions, we can safely use the minimum image criterion: among all possible images of a particle j, select the closest, and throw away all the others. In fact, only the closest is a candidate to interact: all the others certainly do not.
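In a code, the minimum image criterion amounts to folding each Cartesian component of the distance vector back into the interval (−L/2, L/2] before computing the pair separation. The Python fragment below is a sketch of this idea combined with the truncated and shifted LJ potential of eqs. (6)-(7); a cubic box of side box (assumed to be larger than 2Rc) and reduced units are assumed, and it is not taken from the course codes.

# Minimum image convention in a cubic box of side `box` (assumed), plus cutoff test.
def minimum_image_distance(ri, rj, box):
    """Distance between atom i and the nearest periodic image of atom j."""
    d2 = 0.0
    for a in range(3):                    # x, y, z components
        d = ri[a] - rj[a]
        d -= box * round(d / box)         # fold the component into (-box/2, box/2]
        d2 += d * d
    return d2 ** 0.5

def pair_energy(ri, rj, box, rc=2.5):
    """Truncated and shifted LJ pair energy, eqs. (6)-(7), in reduced units."""
    r = minimum_image_distance(ri, rj, box)
    if r > rc:
        return 0.0
    sr6, src6 = r ** -6, rc ** -6
    return 4.0 * (sr6 * sr6 - sr6) - 4.0 * (src6 * src6 - src6)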
These operating conditions greatly simplify the setup of an MD program, and are commonly used. Of course, one must always make sure that the box size is at least 2Rc along all the directions where PBCs are in effect.

2.3 Time integration algorithm

The engine of a molecular dynamics program is its time integration algorithm, required to integrate the equations of motion of the interacting particles and follow their trajectory. Time integration algorithms are based on finite difference methods, where time is discretized on a finite grid, the time step Δt being the distance between consecutive points on the grid. Knowing the positions and some of their time derivatives at time t (the exact details depend on the type of algorithm), the integration scheme gives the same quantities at a later time t + Δt. By iterating the procedure, the time evolution of the system can be followed for long times.

Of course, these schemes are approximate and there are errors associated with them. In particular, one can distinguish between:

• Truncation errors, related to the accuracy of the finite difference method with respect to the true solution. Finite difference methods are usually based on a Taylor expansion truncated at some term, hence the name. These errors do not depend on the implementation: they are intrinsic to the algorithm.

• Round-off errors, related to a particular implementation of the algorithm, for instance to the finite number of digits used in computer arithmetic.

Both errors can be reduced by decreasing Δt. For large Δt truncation errors dominate, but they decrease quickly as Δt is decreased. For instance, the Verlet algorithm discussed in the next section has a truncation error proportional to Δt⁴ for each integration time step. Round-off errors decrease more slowly with decreasing Δt, and dominate in the small-Δt limit. Using 64-bit precision (corresponding to "double precision" on the majority of today's workstations) helps to keep round-off errors at a minimum.

The popular Verlet integration algorithm, which will be adopted in the codes of this course, is presented in the section below. For more detailed information on time integration algorithms, the interested reader is referred to ref. [7] for a general survey, and to ref. [8] for a deeper analysis.

2.4 The Verlet algorithm

In molecular dynamics, the most commonly used time integration algorithm is probably the so-called Verlet algorithm [2]. The basic idea is to write two third-order Taylor expansions for the positions r(t), one forward and one backward in time. Calling v the velocities, a the accelerations, and b the third derivatives of r with respect to t, one has:

r(t + Δt) = r(t) + v(t)Δt + (1/2)a(t)Δt² + (1/6)b(t)Δt³ + O(Δt⁴)    (8)
r(t − Δt) = r(t) − v(t)Δt + (1/2)a(t)Δt² − (1/6)b(t)Δt³ + O(Δt⁴)    (9)

Adding the two expressions gives

r(t + Δt) = 2r(t) − r(t − Δt) + a(t)Δt² + O(Δt⁴)    (10)

This is the basic form of the Verlet algorithm. Since we are integrating Newton's equations, a(t) is just the force divided by the mass, and the force is in turn a function of the positions r(t):

a(t) = −(1/m)∇U(r(t))    (11)

As one can immediately see, the truncation error of the algorithm when evolving the system by Δt is of the order of Δt⁴, even if third derivatives do not appear explicitly. This algorithm is at the same time simple to implement, accurate and stable, explaining its large popularity among molecular dynamics simulators.
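A literal transcription of equation (10) into code is a one-line update per particle. The sketch below is illustrative only; the forces routine returning the force array is an assumed placeholder, and reduced units with unit masses are used.

# One step of the basic Verlet algorithm, eq. (10): r(t+dt) = 2 r(t) - r(t-dt) + a(t) dt^2.
# `forces(r)` is an assumed routine returning the forces for configuration r (see eq. 11).
import numpy as np

def verlet_step(r, r_old, forces, dt, m=1.0):
    a = forces(r) / m                     # a(t) = F(t)/m
    r_new = 2.0 * r - r_old + a * dt * dt
    return r_new, r                       # r(t) becomes r(t-dt) at the next step

# Usage sketch, inside the time-step loop:  r, r_old = verlet_step(r, r_old, forces, dt)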
2.4.1 Velocity Verlet algorithm

A problem with this version of the Verlet algorithm is that velocities are not directly generated. While they are not needed for the time evolution, their knowledge is sometimes necessary. Moreover, they are required to compute the kinetic energy K, whose evaluation is necessary to test the conservation of the total energy E = K + U (one of the most important features that makes the Verlet algorithm so popular is the fact that energy conservation is very well respected, even at relatively large time steps). This is one of the most important tests to verify that an MD simulation is proceeding correctly. One could compute the velocities from the positions by using

v(t) = [r(t + Δt) − r(t − Δt)] / (2Δt)    (12)

However, the error associated with this expression is of order Δt² rather than Δt⁴. In fact, from (8) and (9) we have:

v(t) = [r(t + Δt) − r(t − Δt)] / (2Δt) − (1/6)b(t)Δt² + O(Δt³)    (13)

To overcome this difficulty, some variants of the Verlet algorithm have been developed. They give rise to exactly the same trajectory, and differ in which variables are stored in memory and at what times. For example, an even better implementation of the same basic algorithm is the so-called velocity Verlet scheme, where positions, velocities and accelerations at time t + Δt are obtained from the same quantities at time t in the following way:

r(t + Δt) = r(t) + v(t)Δt + (1/2)a(t)Δt²
v(t + Δt/2) = v(t) + (1/2)a(t)Δt
a(t + Δt) = −(1/m)∇U(r(t + Δt))
v(t + Δt) = v(t + Δt/2) + (1/2)a(t + Δt)Δt

Note how we need 9N memory locations to save the N positions, velocities and accelerations, but we never need to have simultaneously stored the values at two different times for any one of these quantities.

2.4.2 Leap-frog algorithm

In the codes developed in this course another variant of the Verlet algorithm will be employed, which is very similar to the previous one: the so-called leap-frog algorithm. Leap-frog integration is equivalent to calculating positions and velocities at interleaved time points, interleaved in such a way that they "leapfrog" over each other. For example, the position is known at integer time steps and the velocity is known at integer-plus-half time steps. The equations for the leap-frog algorithm can be written

r(t + Δt) = r(t) + v(t + Δt/2)Δt    (14)
v(t + Δt/2) = v(t − Δt/2) + a(t)Δt    (15)

Note that at the start (t = 0) of the leap-frog iteration, computing v(Δt/2) requires the velocity at t = −Δt/2. At first sight this could give problems, because the initial conditions are known only at the initial time (t = 0). Usually, the initial velocity is used in place of the v(−Δt/2) term: the error on the first time step calculation is then of order O(Δt²). This is not considered a problem because, in a simulation over a large number of time steps, the error on the first time step is only a negligibly small part of the total error, which at time t = nΔt is of the order O(e^(LnΔt) Δt²).

2.4.3 Error terms

The local error in position of the Verlet integrator is O(Δt⁴) as described above, and the local error in velocity is O(Δt²). The global error in position, in contrast, is O(Δt²), and the global error in velocity is also O(Δt²). These can be derived by noting the following:

error(r(t₀ + Δt)) = O(Δt⁴)

and, from (10),

r(t₀ + 2Δt) = 2r(t₀ + Δt) − r(t₀) + a(t₀ + Δt)Δt² + O(Δt⁴).

Therefore

error(r(t₀ + 2Δt)) = 2 error(r(t₀ + Δt)) + O(Δt⁴) = 3 O(Δt⁴).
In deriving the above equation, note that the errors on r(t₀ + Δt) and r(t₀) must not be summed as one would do with "random" errors; the minus sign has to be kept, since the error of the first term depends on the error of the second one through a precise functional relation, and consequently the two errors always go in "the same direction". Similarly to the previous calculation, we have:

error(r(t₀ + 3Δt)) = 6 O(Δt⁴)
error(r(t₀ + 4Δt)) = 10 O(Δt⁴)
error(r(t₀ + 5Δt)) = 15 O(Δt⁴),

which can be generalized by induction to

error(r(t₀ + nΔt)) = [n(n + 1)/2] O(Δt⁴).

If we consider the global error in position between r(t₀) and r(t₀ + T), where T = nΔt, it is clear that

error(r(t₀ + T)) = [ T²/(2Δt²) + T/(2Δt) ] O(Δt⁴),

and therefore the global (cumulative) error over a constant interval of time is given by

error(r(t₀ + T)) = O(Δt²).

Because the velocity is determined in a non-cumulative way from the positions in the Verlet integrator, the global error in velocity is also O(Δt²). (In the Verlet algorithm the velocity plays no part in the integration of the equations of motion. However, velocities are often necessary for the calculation of certain physical quantities like the kinetic energy. This deficiency can be dealt with either by using the velocity Verlet (or leap-frog) algorithm, or by estimating the velocity from the positions through equation (12), which has a local error O(Δt²). Since in that equation the velocity depends only on the positions, and not on velocities evaluated at previous times, it is determined in a "non-cumulative" way.) In molecular dynamics simulations, the global error is typically far more important than the local error, and the Verlet integrator is therefore known as a second-order integrator.

3 Simple statistical quantities

Measuring quantities in molecular dynamics usually means performing time averages of physical properties over the system trajectory. Physical properties are usually a function of the particle coordinates and velocities. So, for instance, one can define the instantaneous value of a generic physical property A at time t:

A(t) = f(r1(t), ..., rN(t), v1(t), ..., vN(t))

and then obtain its average

⟨A⟩ = (1/NT) Σ_{t=1..NT} A(t)

where t is an index which runs over the time steps from 1 to the total number of steps NT. There are two equivalent ways to do this in practice:

1. A(t) is calculated at each time step by the MD program while running. The sum Σt A(t) is also updated at each step. At the end of the run the average is immediately obtained by dividing by the number of steps. This is the preferred method when the quantity is simple to compute and/or particularly important. An example is the system temperature.

2. Positions (and possibly velocities) are periodically dumped to a "trajectory file" while the program is running. A separate program, running after the simulation program, processes the trajectory and computes the desired quantities. This approach can be very demanding in terms of disk space: dumping positions and velocities using 64-bit precision takes 48 bytes per step and per particle. However, it is often used when the quantities to compute are complex, or combine different times as in dynamical correlations, or when they depend on other additional parameters that may be unknown when the MD program is run. In these cases, the simulation is run once, but the resulting trajectory can be processed over and over.
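As an illustration of the first strategy (on-the-fly averaging), a running sum of an instantaneous quantity can be accumulated during the loop over time steps. The sketch below is generic Python; advance_one_step and instantaneous_value are placeholders standing in for the MD integrator and for the measured quantity (e.g. the instantaneous temperature), not names taken from the course codes.

# Strategy 1: accumulate <A> = (1/N_T) * sum_t A(t) while the simulation runs.
def run_and_average(r, v, n_steps, advance_one_step, instantaneous_value):
    a_sum = 0.0
    for _ in range(n_steps):
        r, v = advance_one_step(r, v)         # one MD time step
        a_sum += instantaneous_value(r, v)    # A(t) at this step
    return a_sum / n_steps                    # arithmetic average over the run
# Strategy 2 would instead write r (and possibly v) to a trajectory file inside the loop
# and compute the averages afterwards with a separate analysis program.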
Our codes will give examples of both strategies.

3.1 Potential energy

The average potential energy U is obtained by averaging its instantaneous value, which is usually obtained straightforwardly at the same time as the force computation is made. For instance, in the case of two-body interactions

U(t) = Σi Σj>i φ(|ri(t) − rj(t)|)    (16)

Even if not strictly necessary to perform the time integration (forces are all that is needed), knowledge of U is required to verify energy conservation. This is an important check to do in any MD simulation.

We mentioned in the previous chapter that when one deals with short-range interactions, the potential and force are usually truncated at some cutoff distance Rc for computational expediency. Consequently the effective potential in our case will be:

φ(rij) = φLJ(rij)   if rij ≤ Rc
φ(rij) = 0          if rij > Rc    (17)

Such a truncation implies that certain physical quantities, like the potential energy and the pressure, have to be corrected in order to obtain values closer to those one would get by taking into account the exact potential. An approximate correction can be obtained by assuming that the spatial correlations beyond the cutoff distance are unity. The student is referred to Refs. [7] and [9] for these so-called "standard long-range corrections" (sLRC). It should be noted that this is not a good assumption in inhomogeneous fluids. We will use the following expression for the long-range correction to be applied to each term of the outermost sum in (16):

φLRC = 4πρ ∫_Rc^∞ drij rij² (1/2) φLJ(rij) = 8πρεσ³ [ (1/9)(σ/Rc)⁹ − (1/3)(σ/Rc)³ ]    (18)

where ρ is the bulk number density (N/V).

3.2 Kinetic energy

The instantaneous kinetic energy is of course given by

K(t) = (1/2) Σi mi [vi(t)]²    (19)

and is therefore extremely easy to compute. We will call K its average over the run.

3.3 Total energy

The total energy E = K + U is a conserved quantity in Newtonian dynamics. However, it is common practice to compute it at each time step in order to check that it is indeed constant with time. In other words, during the run energy flows back and forth between kinetic and potential, causing K(t) and U(t) to fluctuate while their sum remains fixed.

In practice, there could be small fluctuations in the total energy, in a typical amount of, say, one part in 10⁴ or less. These fluctuations are usually caused by errors in the time integration (see section 2.3), and can be reduced in magnitude by reducing the time step if considered excessive. Ref. [8] contains an in-depth analysis of total energy fluctuations using various time integration algorithms.

Slow drifts of the total energy are also sometimes observed in very long runs. Such drifts can also be caused by an excessive Δt. Drifts are more disturbing than fluctuations because the thermodynamic state of the system is also changing together with the energy, and therefore time averages over the run do not refer to a single state. If drifts over long runs tend to occur, they can be prevented, for instance by breaking the long run into smaller pieces and restoring the energy to the nominal value between one piece and the next. A common mechanism to adjust the energy is to modify the kinetic energy via a rescaling of velocities.
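A minimal sketch of the velocity-rescaling idea just mentioned is given below (Python, with kB = 1 and unit masses assumed, as in reduced units); it uses the instantaneous kinetic energy (19) and anticipates the equipartition relation (20) of the next subsection to bring the system to a target temperature.

# Velocity rescaling to a target temperature, from K = (1/2) sum_i m_i v_i^2 (eq. 19)
# and K = (3/2) N kB T (eq. 20). Assumed conventions: kB = 1 and m_i = 1 (reduced units).
import numpy as np

def rescale_velocities(v, T_target):
    """v: (N, 3) array of velocities; returns the velocities rescaled to T_target."""
    N = v.shape[0]
    K = 0.5 * np.sum(v * v)              # instantaneous kinetic energy K(t)
    T_now = 2.0 * K / (3.0 * N)          # instantaneous temperature T(t)
    return v * np.sqrt(T_target / T_now)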
A final word of caution: while one may be tempted to achieve "perfect" energy conservation by reducing the time step as much as desired, working with an excessively small time step may result in a waste of computer time. A practical compromise would probably allow for small energy fluctuations and perhaps slow energy drifts, as a price to pay for working with a reasonably large Δt. See also [7].

3.4 Temperature

The temperature T is directly related to the kinetic energy by the well-known equipartition formula, assigning an average kinetic energy kB T/2 per degree of freedom:

K = (3/2) N kB T    (20)

An estimate of the temperature is therefore directly obtained from the average kinetic energy K (see 3.2). For practical purposes, it is also common practice to define an "instantaneous temperature" T(t), proportional to the instantaneous kinetic energy K(t) by a relation analogous to the one above.

3.5 Radial distribution function

One of the most important static properties that can be used to characterize liquids in general, and LJ liquids in particular, is the radial distribution function (rdf), defined as:

g(r) = 1/(ρ 4πr² dr) ⟨ Σij δ(r − |ri − rj|) ⟩    (21)

where δ(x) is the function with δ(0) = 1 and δ(x) = 0 for x ≠ 0. The rdf is important for three reasons:

1. for pairwise additive potentials such as the LJ one, knowledge of the rdf is sufficient information to calculate thermodynamic properties, particularly the energy and pressure;

2. there are very well developed integral-equation theories that permit estimation of the rdf for a given molecular model;

3. the rdf can be measured experimentally, using neutron-scattering techniques.

3.6 Mean square displacement and velocity autocorrelation function

Two of the most important dynamic properties that are used to characterize liquids are the mean-square displacement (MSD) and the velocity autocorrelation function (VACF). The MSD of atoms in a simulation can be easily computed from its definition

Δr²(t) = ⟨|r(t) − r(0)|²⟩    (22)

where ⟨...⟩ denotes here averaging over all the atoms (or all the atoms in a given subclass). Care must be taken, when using periodic boundary conditions, to avoid counting as diffusion the "jumps" that particles make when they are refolded into the box.

The MSD contains information on the atomic diffusivity. If the system is solid, the MSD saturates to a finite value, while if the system is liquid, the MSD grows linearly with time. In this case it is useful to characterize the system behavior in terms of the slope, which is the diffusion coefficient D:

D = lim_{t→∞} ⟨|r(t) − r(0)|²⟩ / (2 d t)    (23)

where d is the dimensionality of the system (usually 2 or 3). The VACF is defined as

C(t) = ⟨v(t) · v(0)⟩    (24)

and it is related to D by the equation:

D = (1/d) ∫₀^∞ dt ⟨v(t) · v(0)⟩    (25)

3.7 Pressure

The measurement of the pressure in a molecular dynamics simulation is based on the Clausius virial function

W(r1, ..., rN) = Σ_{i=1..N} ri · Fi^TOT    (26)

where Fi^TOT is the total force acting on atom i. Its statistical average ⟨W⟩ will be obtained, as usual, as an average over the molecular dynamics trajectory:

⟨W⟩ = lim_{t→∞} (1/t) ∫₀^t dτ Σi ri(τ) · mi ai(τ)

where use has been made of Newton's law (1). Integrating by parts,

⟨W⟩ = − lim_{t→∞} (1/t) ∫₀^t dτ Σi mi |vi(τ)|².

The time average on the right-hand side is twice the average kinetic energy, therefore by the equipartition law of statistical mechanics

⟨W⟩ = −d N kB T    (27)

where kB is the Boltzmann constant.
Now, one may think of the total force acting on a particle as composed of two contributions:

Fi^TOT = Fi + Fi^EXT    (28)

where Fi is the internal force (arising from the interatomic interactions), and Fi^EXT is the external force exerted by the container's walls. If the particles are enclosed in a parallelepipedic container of sides Lx, Ly, Lz, volume V = Lx Ly Lz, and with the coordinate origin on one of its corners, the part ⟨W^EXT⟩ due to the container can be evaluated using the definition (26):

⟨W^EXT⟩ = Lx(−P Ly Lz) + Ly(−P Lx Lz) + Lz(−P Lx Ly) = −d P V

where −P Ly Lz is, for instance, the external force Fx^EXT applied by the yz wall along the x direction to particles located at x = Lx, etc. Eq. (27) can then be written

⟨ Σ_{i=1..N} ri · Fi ⟩ − d P V = −d N kB T

or

P V = N kB T + (1/d) ⟨ Σ_{i=1..N} ri · Fi ⟩    (29)

This important result is known as the virial equation. All the quantities except the pressure P are easily accessible in a simulation, and therefore (29) constitutes a way to measure P. Note how (29) reduces to the well-known equation of state of the perfect gas if the particles are non-interacting.

In the case of pairwise interactions via a potential φ(r), it is left as an exercise to the reader to verify that (29) becomes

P V = N kB T − (1/d) ⟨ Σi Σj>i rij (dφ/dr)|rij ⟩    (30)

This expression has the additional advantage over (29) of being naturally suited to use when periodic boundary conditions are present: it is sufficient to take them into account in the definition of rij.

As for the potential energy, the expression of the pressure also has to be corrected if a truncated potential is used in the numerical simulation. By using the same assumption mentioned in paragraph 3.1, one can derive the following correction, to be applied to each term of the outermost sum in the pressure formulas above, corresponding to the contribution of each pair ij:

PLRC = (4πρ²/6) ∫_Rc^∞ drij rij² (rij · Fij) = 8πρ²εσ³ [ (4/9)(σ/Rc)⁹ − (2/3)(σ/Rc)³ ]    (31)

4 Computational aspects of MD

4.1 MD algorithms

Classical MD is commonly used for simulating the properties of liquids, solids and molecules. Each of the N atoms or molecules in the simulation is treated as a point mass, and Newton's equations are integrated to compute their motion. From the motion of the ensemble of atoms a variety of useful microscopic and macroscopic information can be extracted, such as transport coefficients, phase diagrams, and structural or conformational properties. The physics of the model is contained in a potential energy functional for the system, from which individual force equations for each atom are derived.

MD simulations are typically not memory intensive, since only vectors of atom information are stored. Computationally, the simulations are "large" in two domains:

1. the number of atoms
2. the number of time steps

The length scale for atomic coordinates is Angstroms; in three dimensions many thousands or millions of atoms must usually be simulated to approach even the submicron scale. In liquids and solids the time step size is constrained by the demand that the vibrational motion of the atoms be accurately tracked. This limits time steps to the femtosecond scale, and so tens or hundreds of thousands of time steps are necessary to simulate even picoseconds of "real" time. Because of these computational demands, considerable effort has been expended by researchers to optimize MD and even to build special-purpose hardware for performing MD simulations.
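To put these numbers together, a back-of-the-envelope estimate of the work involved can be useful. The sketch below assumes a 1 fs time step, 100 ps of simulated time and a naive all-pairs force evaluation; the figures are illustrative assumptions only.

# Rough cost estimate for an MD run (illustrative assumptions only).
dt      = 1.0e-15     # time step: 1 fs
t_total = 1.0e-10     # simulated time: 100 ps
N       = 10_000      # number of atoms

n_steps = t_total / dt                  # 1e5 time steps
pairs_per_step = N * (N - 1) / 2        # naive all-pairs force evaluation, O(N^2)
print(f"steps = {n_steps:.1e}, pair interactions per step = {pairs_per_step:.1e}")
# With a cutoff and neighbor lists (section 4.3) the per-step work drops to O(N).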
The current state of the art is such that simulating ten- to hundred-thousand-atom systems for nanoseconds takes hours of CPU time on machines such as the Jugene or Juropa supercomputers at JSC. Since MD computations are inherently parallel, as we will see in the next pages, there has been considerable effort in recent years by researchers to exploit this parallelism on various kinds of machines. Experience has proved that the message-passing model of programming (MPI, a library specification for message passing, has become the de facto standard for this model of programming) for multiple-instruction/multiple-data (MIMD) parallel machines is the only one that provides enough flexibility to implement all the data structure and computational enhancements that are commonly exploited in MD codes on serial (and vector) machines.

Increasingly efficient algorithms have been developed to perform specific kinds of classical MD. In this course we will focus on some of these algorithms, which are appropriate for a general class of MD problems that has two salient characteristics:

1. The first characteristic is that forces are limited in range, meaning each atom interacts only with other atoms that are geometrically nearby. Solids and liquids are often modeled this way due to electronic screening effects, or simply to avoid the computational cost of including long-range Coulombic forces. For short-range MD, the computational effort per time step scales as N, the number of atoms.

2. The second characteristic is that the atoms can undergo large displacements over the duration of the simulation. This could be due to diffusion in a solid or liquid, or conformational changes in a biological molecule. The important feature from a computational standpoint is that each atom's neighbors change as the simulation progresses. While the algorithms we discuss could also be used for fixed-neighbor simulations (e.g. all atoms remain on lattice sites in a solid), it is a harder task to continually track the neighbors of each atom and maintain efficient O(N) scaling for the overall computation on a parallel machine.

Moreover, one wants the algorithms to work well on problems with small numbers of atoms, not just for large problems where parallelism is often easier to exploit. This is because the vast majority of MD simulations are performed on systems of a few hundred to several thousand atoms, where N is chosen to be as small as possible while still accurate enough to model the desired physical effects. The computational goal in these calculations is to perform each time step as quickly as possible. This is particularly true in non-equilibrium MD, where macroscopic changes in the system may take significant time to evolve, requiring millions of time steps to model. Thus, it is often more useful to be able to perform a 1,000,000 time step simulation of a 1000-atom system fast rather than 1000 time steps of a 1,000,000-atom system, though the O(N) scaling means the computational effort is the same for both cases. To this end, we consider model sizes as small as a few hundred atoms in this course. On the other hand, for very large MD problems, it is important that the parallel algorithms are scalable to larger and faster parallel machines, i.e. that they scale optimally with respect to N and P (the number of processors).
4.2 Long- vs short-range force model

The computational task in a classical MD simulation is to integrate the set of coupled differential equations (Newton's equations) that in general can be written:

mi dvi/dt = Σj F2(ri, rj) + Σj Σk F3(ri, rj, rk) + ...    (32)
dri/dt = vi

where mi is the mass of atom i, ri and vi are its position and velocity vectors, F2 is a force function describing pairwise interactions between atoms, F3 describes three-body interactions, and many-body interactions can be added. The force terms are derivatives of energy expressions (the potentials) in which the potential energy of atom i is typically written as a function of the positions of itself and other atoms. In practice, only one or a few terms in equation (32) are kept, and F2, F3, etc. are constructed so as to include many-body and quantum effects. To the extent that the approximations are accurate, these equations give a full description of the time evolution of the system (the great computational advantage of classical MD, as compared to so-called ab initio electronic structure calculations, is that the dynamic behavior of the atomic system is described empirically, without having to solve the computationally expensive but "exact" Schrödinger equation at each time step).

The force terms in equation (32) are typically non-linear functions of the distance rij between pairs of atoms, and may be either long-range or short-range in nature. For long-range forces, such as Coulombic interactions in an ionic solid or biological system, each atom interacts with all others. Directly computing these forces scales as N² and is too costly for large N. Various approximate methods, valid in different conditions, overcome this difficulty. They include particle-mesh algorithms [10], which scale as f(M)N where M is the number of mesh points, hierarchical methods [11], which scale as N log(N), and fast-multipole methods [12], which scale as N. Modern parallel implementations of these algorithms have improved their range of applicability for many-body simulations, but because of their expense, long-range force models are not dealt with in this course.

By contrast, short-range force models, and in particular the Lennard-Jones one introduced in paragraph 2.1.1, are our main concern. They are chosen either because electronic screening effectively limits the range of influence of the interatomic forces being modeled, or simply to truncate the long-range interactions and lessen the computational load. In either case, the summations in equation (32) are restricted to atoms within some small region surrounding atom i. As we have seen in paragraph 2.1.2, this is typically implemented using a cutoff distance Rc, outside of which all interactions are ignored. The work to compute forces now scales linearly with N.

Notwithstanding this saving, the vast majority of the computation time spent in a short-range force MD simulation is in evaluating the force terms in equation (32), even when (as for the LJ model) only the F2 term has to be evaluated. The time integration typically requires only 2-3% of the total time. To evaluate the sums efficiently requires knowing which atoms are within the cutoff distance Rc at every time step. The key is to minimize the number of neighboring atoms that must be checked for possible interactions, since calculations performed on neighbors at a distance r > Rc are wasted computation.
There are two basic techniques used to accomplish this on serial (and vector) machines. We discuss them since our parallel algorithms incorporate similar ideas.

4.3 Neighbor lists and link-cell method

The first idea to minimize the number of neighboring atoms that must be checked for possible interactions is that of neighbor lists, originally proposed by Verlet [2]. For each atom, a list is maintained of nearby atoms. Typically, when the list is formed, all neighboring atoms within an extended cutoff distance Rs = Rc + δ are stored. The list is used for a few time steps to calculate all force interactions. Then it is rebuilt before any atom could have moved from a distance r > Rs to r < Rc. Though δ is always chosen to be small relative to Rc, an optimal value depends on the parameters (e.g. temperature, diffusivity, density) of the particular simulation. The advantage of the neighbor list is that, once it is built, examining it for possible interactions is much faster than checking all atoms in the system.

The second technique commonly used for speeding up MD calculations is known as the link-cell method [13]. At every time step, all the atoms are binned into 3-D cells of side length ℓ, where ℓ = Rc or slightly larger. This reduces the task of finding neighbors of a given atom to checking 27 bins: the bin the atom is in and the 26 surrounding ones. Since binning the atoms only requires O(N) work, the extra overhead associated with it is acceptable for the saving of only having to check a local region for neighbors.

The fastest serial short-range MD algorithms use a combination of neighbor lists and link-cell binning. In the combined method, atoms are only binned once every few time steps, for the purpose of forming the neighbor lists. In this case atoms are binned into cells of size ℓ ≥ Rs. At intermediate time steps the neighbor lists alone are used in the usual way to find neighbors within a distance Rc of each atom. This is a significant saving over a conventional link-cell method, since there are far fewer atoms to check in a sphere of volume 4πRs³/3 than in a cube of volume 27Rc³. Additional savings can be gained thanks to Newton's 3rd law:

    Fij = −Fji     (33)

by only computing a force once for each pair of atoms (rather than once for each atom in the pair). In the combined method this is done by only searching half the surrounding bins of each atom to form its neighbor list. This has the effect of storing atom j in atom i's list, but not atom i in atom j's list, thus halving the number of force computations that must be done.

4.4 Parallel algorithms

In the last 20 years there has been considerable interest in devising parallel MD algorithms. The natural parallelism in MD is that the force calculations and velocity/position updates can be done simultaneously for all atoms. To date, two basic ideas have been exploited to achieve this parallelism. The goal in each is to divide the force computations in equation (32) evenly across the processors so as to extract maximum parallelism. All the algorithms that have been proposed or implemented so far are variations on these two methods.

In the first class of methods a pre-determined set of force computations is assigned to each processor. The assignment remains fixed for the duration of the simulation. The simplest way of doing this is to give a subgroup of atoms to each processor.
We call this method an atom-decomposition of the workload, since the processor computes forces on its atoms no matter where they move in the simulation domain. More generally, a subset of the force loops inherent in equation (32) can be assigned to each processor. We term this a force-decomposition. Both of these decompositions are analogous to "Lagrangian gridding" in a fluids simulation, where the grid cells (computational elements) move with the fluid (the atoms in MD). By contrast, in the second general class of methods, which we call a spatial-decomposition of the workload, each processor is assigned a portion of the physical simulation domain. Each processor computes only the forces on atoms in its sub-domain. As the simulation progresses, processors exchange atoms as they move from one sub-domain to another. This is analogous to an "Eulerian gridding" for a fluids simulation, where the grid remains fixed in space as the fluid moves through it.

Within the two classes of methods for parallelization of MD, a variety of algorithms have been proposed and implemented by various researchers. The details of the algorithms vary widely from one parallel machine to another, since there are numerous problem-dependent and machine-dependent trade-offs to consider, such as the relative speeds of computation and communication.

Atom-decomposition methods, also called replicated-data methods because identical copies of atom information are stored on all processors, have often been used in short-range MD simulations of molecular systems. This is because the duplication of information makes for straightforward computation of additional three- and four-body force terms. Parallel implementations of older biological MD programs such as CHARMM [14] and GROMOS [15] use this technique. Spatial-decomposition methods, also called geometric methods, are now more common in the literature because they are well suited to very large MD simulations and to cases where long-range interactions are involved. Recent parallel implementations of state-of-the-art biological MD programs such as AMBER [16], GROMACS [17] and NAMD [18] use this technique.

5 Computational aspects of our model

5.1 Reduced units

Unit systems are invented to make physical laws look simple and numerical calculations easy. Take Newton's law: F = ma. In the SI unit system, this means that if an object of mass x (kg) is undergoing an acceleration of y (m/s²), the force on the object must be xy (N). However, there is nothing intrinsically special about the SI unit system. One (kg) is simply the mass of a platinum-iridium prototype in a vacuum chamber in Paris. If one wishes, one can define his or her own mass unit, e.g. (k̃g), which say is 1/7 of the mass of the Paris prototype: 1 (kg) = 7 (k̃g).

If (k̃g) is one's choice of the mass unit, how about the unit system? One really has to make a decision here: either keep all the other units unchanged and only make the (kg) → (k̃g) transition, or change some other units along with the (kg) → (k̃g) transition.

Imagine making the first choice, i.e. keeping all the other units of the SI system unchanged, including the force unit (N), and only changing the mass unit from (kg) to (k̃g). That is all right, except that in the new unit system Newton's law must be re-expressed as F = ma/7, because if an object of mass 7x (k̃g) is undergoing an acceleration of y (m/s²), the force on the object is xy (N).
There is nothing inherently wrong with the F = ma/7 expression, which is just a recipe for computation, and a correct one for the newly chosen unit system. Fundamentally, F = ma/7 and F = ma describe the same physical law. But it is true that F = ma/7 is less elegant than F = ma. No one likes to memorize extra constants if they can be reduced to unity by a sensible choice of units. The SI unit system is sensible, because (N) is picked to work with the other SI units to satisfy F = ma.

How may we have a sensible unit system but with (k̃g) as the mass unit? Simple: just define (Ñ) = (N)/7 as the new force unit. The (m)-(s)-(k̃g)-(Ñ) unit system is sensible because the simplest form of F = ma is preserved. Thus we see that when a certain unit in a sensible unit system is altered, other units must also be altered correspondingly in order to constitute a new sensible unit system, which keeps the algebraic forms of all fundamental physical laws unaltered8.

8 A notable exception is the conversion between SI and Gaussian unit systems in electrodynamics, during which a non-trivial factor of 4π comes up.

In science people have formed deep-rooted conventions about the simplest algebraic forms of physical laws, such as F = ma, K = mv²/2, E = K + U, etc. Although nothing forbids one from modifying the constant coefficients in front of each expression, one is better off not doing so. Fortunately, as long as one uses a sensible unit system, these algebraic expressions stay invariant.

Now, imagine we derive a certain composite law from a set of simple laws. On one side, we start with and consistently use a sensible unit system A. On the other side, we start with and consistently use another sensible unit system B. Since the two sides use exactly the same algebraic forms, the resultant algebraic expression must also be the same, even though for a given physical instance a variable takes on two different numerical values on the two sides, as different unit systems are adopted. This means that the final algebraic expression describing the physical phenomena must satisfy a certain concerted scaling invariance with respect to its dependent variables, corresponding to any feasible transformation between sensible unit systems. This strongly limits the form of possible algebraic expressions describing physical phenomena, which is the basis of dimensional analysis.

In the MD world, there is also another reason why people prefer to choose "non-conventional" units. Typical simulation sizes in molecular dynamics are very small, of the order of 1000 atoms. As a consequence, most of the extensive quantities are small in magnitude when measured in macroscopic units. There are two possibilities to overcome this problem: either to work with atomic-scale units (ps, a.m.u., nm) or to make all the observable quantities dimensionless with respect to their characteristic values. The second approach is more popular. The scaling is done with the model parameters, e.g. in the case of the LJ model: size σ, energy ε, mass m. So the common recipe is: one chooses a value for one atom/molecule pair potential arbitrarily (ε), and then the other quantities (say the total energy E) are given in terms of this reference value (E* = E/ε). The other parameters are calculated similarly.
For example:

    distance      r* = r/σ                    (34)
    energy        E* = E/ε                    (35)
    temperature   T* = kB T/ε                 (36)
    pressure      P* = P σ³/ε                 (37)
    time          t* = t / (σ √(m/ε))         (38)
    force         F* = F σ/ε                  (39)
    density       ρ* = ρ σ³/m                 (40)

and so on. These are the so-called reduced units for the LJ potential model that we will adopt in our codes. If we write the LJ potential (6) in dimensionless form

    φ*_LJ(r*) = 4 [ (1/r*)¹² − (1/r*)⁶ ]      (41)

we see that it is parameter independent; consequently all the properties must also be parameter independent. If a potential only has a couple of parameters, then this scaling has a lot of advantages. Namely, potential evaluation can be really efficient in reduced units and, as the results are always the same, they can be transferred to different systems by straightforward scaling using the model parameters σ, ε and m. This is equivalent to selecting unit values for the parameters, and it is convenient to report system properties in this form, e.g. P*(ρ*). For reference, Tab. 1 shows the values of the LJ model parameters for some noble gases, which are well described by such a physical model.

                 Ne        Ar        Kr        Xe
    σ (Å)        2.78      3.405     3.680     3.924
    ε/kB (K)     37.29     119.8     166.6     257.4
    m (a.m.u.)   20.18     39.95     83.80     131.29

Table 1: LJ model parameters for some noble gases.

5.2 Simulation parameters

As mentioned several times in the previous chapters, we will study the model of N identical point-like atoms in a box where periodic boundary conditions are applied in all directions, and where atoms interact with each other only through a Lennard-Jones pairwise potential (41) in dimensionless reduced units. The (negative) derivative of (41) with respect to the relative distance r between two particles is the F_2 term in equation (32) (F_3 and higher-order terms are ignored).

The N atoms are simulated in a 3-D parallelepiped with periodic boundary conditions at the Lennard-Jones state point defined by the reduced density ρ* = 0.8442 and reduced temperature T* = 0.72. This is a liquid state near the Lennard-Jones triple point9. The simulation will begin with the atoms on a face-centered cubic (fcc) lattice (Fig. 2) with randomized velocities. The fcc configuration represents a minimal energy configuration (ground state) and consequently it would be stable, as a solid crystal, at T = 0, i.e. without kinetic energy. Since we will provide kinetic energy (T ≠ 0) to the system, it quickly melts, evolving to its natural liquid state. The equilibration phase, in this very simple model, happens very fast. A roughly uniform spatial density persists for the duration of the simulation. The simulation is run at constant N, volume V, and total energy E, i.e. a statistical sampling from the microcanonical ensemble.

9 A triple point is a point on the phase diagram at which three distinct thermodynamic phases coexist in equilibrium. For the LJ model without truncation, Mastny and de Pablo found it at T* = 0.694 and ρ* = 0.96 [19].

Figure 2: Face-centered cubic (fcc) lattice cell, used as the initial configuration of our system.

Force computations using the potential in equation (41) are truncated at a cutoff distance Rc* = 2.5. Using such a cutoff, in our simulation each atom has on average 55 neighbors. If neighbor lists are used, an extended cutoff length Rs* = 2.8 is defined (encompassing about 78 atoms) for forming the neighbor lists, and the lists are updated every 20 time steps.
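As a quick numerical illustration of these conversions (a standalone sketch with invented names, not part of the course code), the following program converts the simulation temperature T* = 0.72 and the cutoff Rc* = 2.5 back to conventional units for Ar using the parameters of Tab. 1, and evaluates the dimensionless potential (41) at the cutoff:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double sigma = 3.405;          /* Angstrom, Ar (Tab. 1) */
    const double eps_over_kB = 119.8;    /* K, Ar (Tab. 1)        */

    /* temperature: T* = kB T / eps  ->  T = T* (eps/kB) */
    double T_star = 0.72;
    double T_kelvin = T_star * eps_over_kB;

    /* distance: r* = r / sigma  ->  r = r* sigma */
    double rc_star = 2.5;
    double rc_angstrom = rc_star * sigma;

    /* reduced LJ potential, equation (41), at the cutoff */
    double inv6 = pow(1.0 / rc_star, 6);
    double phi_star = 4.0 * (inv6 * inv6 - inv6);

    printf("T = %.1f K, Rc = %.2f A, phi*(Rc*) = %.4f\n",
           T_kelvin, rc_angstrom, phi_star);
    return 0;
}

For argon, T* = 0.72 corresponds to roughly 86 K, the cutoff to about 8.5 Å, and the potential at the cutoff is small (about −0.016), which is why truncating there is acceptable.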
The cost of creating the neighbor lists every 20 time steps is amortized over the per-time-step timing. The integration time step we are going to use in the leap-frog algorithm is Δt* = 0.00462. In the case of Ar, the constant to convert times between reduced units and conventional ones, according to (38) and the values of the parameters from Tab. 1, is

    √(mσ²/ε) = 2.156 × 10⁻¹² s,

thus the reduced time unit is 2.156 ps. This is roughly the timescale of one atomic oscillation period in Ar. The integration time step for Ar corresponds to Δt = Δt* × 2.156 ps ≈ 0.01 ps. In other words, we choose such an integration time step (∼1/100 of the atomic oscillation period) so as to accurately describe even this elementary atomic motion.

6 Atom-decomposition algorithm

The first parallel algorithm that we will take into account is the atom-decomposition (AD) algorithm. In this algorithm each of the P processors is assigned a group of N/P atoms at the beginning of the simulation. Atoms in a group need not have any special spatial relationship to each other. For ease of exposition, we assume N is a multiple of P, though it is simple to relax this constraint. A processor will compute forces on only its N/P atoms and will update their positions and velocities for the duration of the simulation, no matter where they move in the physical domain. This is an atom-decomposition of the computational workload.

A useful construct for representing the computational work involved in the algorithm is the N × N force matrix F. The (ij) element of F represents the force on atom i due to atom j. Note that F is sparse due to the short-range forces and skew-symmetric, i.e. Fij = −Fji, due to Newton's 3rd law. We also define x and f as vectors of length N which store the position and total force on each atom. For a 3-D simulation, xi would store the three coordinates of atom i. With these definitions, the AD algorithm assigns each processor a sub-block of F which consists of N/P rows of the matrix, as shown in Figure 3. If z indexes the processors from 0 to P − 1, then processor Pz computes matrix elements in the Fz sub-block of rows. It is also assigned the corresponding sub-vectors of length N/P, denoted by xz and fz.

Figure 3: The division of the force matrix among P processors in the atom-decomposition algorithm. Processor Pz is assigned N/P rows of the matrix and the corresponding xz piece of the position vector. In addition, it must know the entire position vector x to compute the matrix elements in its own rows.

For the LJ potential, the computation of matrix element Fij requires only the two atom positions xi and xj. To compute all the elements in Fz, processor Pz will need the positions of many atoms owned by other processors. This implies that at every time step each processor must receive updated atom positions from all the other processors, an operation called all-to-all communication. Various algorithms have been developed for performing this operation efficiently on different parallel machines and architectures. Usually, the MPI library implementation of the all-to-all communication operations takes advantage of such optimized algorithms. In the course we will implement such operations by using the MPI library. Moreover, we will also try to implement them by using only point-to-point communication operations, to compare with the MPI implementation. In particular, we will resort to an idea outlined in [20] that is simple, portable, and works well on a variety of machines.
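With the MPI library, the position sharing just described can be expressed as a single collective call. The sketch below uses assumed names (x is the full length-3N array of coordinates, natoms_local = N/P, and each rank's own block is already stored at its rank's offset); it is one possible way to write it, not necessarily how the course code is organized:

#include <mpi.h>

/* All-to-all sharing of updated positions for the atom decomposition:
 * every processor contributes its own block of x and receives all others. */
void share_positions(double *x, int natoms_local, MPI_Comm comm)
{
    /* With MPI_IN_PLACE each rank's contribution is read from its own
     * block of the receive buffer x, so no separate send buffer is needed. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  x, 3 * natoms_local, MPI_DOUBLE, comm);
}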
6.1 Expand and fold operations

Following Fox's nomenclature [20], we term the all-to-all communication procedure described above an expand operation. Each processor allocates memory of length N to store the entire x vector. At the beginning of the expand, processor Pz has xz, an updated piece of x of length N/P. Each processor needs to acquire all the other processors' pieces, storing them in the correct places in its copy of x. Figure 4a illustrates the steps that accomplish this for an 8-processor example. The processors are mapped consecutively to the sub-pieces of the vector. In the first communication step, each processor partners with an adjacent processor in the vector and they exchange sub-pieces (processor 2 partners with 3). Now every processor has a contiguous piece of x of length 2N/P. In the second step, each processor partners with a processor two positions away and exchanges its new piece (2 receives the shaded sub-vectors from 0). Each processor now has a 4N/P-length piece of x. In the last step, each processor exchanges an N/2-length piece of x with a processor P/2 positions away (2 exchanges with 6); the entire vector now resides on each processor. The expand operation can also be accomplished by using the MPI_Allgather collective operation provided by the MPI library.

Figure 4: Expand and fold operations among eight processors, each of which requires three steps. (a) In the expand, processor 2 receives successively longer shaded sub-vectors from processors 3, 0, and 6; (b) in the fold, processor 2 receives successively shorter shaded sub-vectors from processors 6, 0, and 3.

A communication operation that is essentially the inverse of the expand will also prove useful in the atom-decomposition algorithm. Assume each processor has stored new force values throughout its copy of the force vector f. Processor Pz needs to know the N/P values in fz, where each of the values is summed across all P processors. This is known as a fold operation and is outlined in Figure 4b. In the first step each processor exchanges half the vector with a partner processor P/2 positions away. Note that each processor receives the half that its own piece belongs to and sends the half it does not belong to (processor 2 receives the shaded first half of the vector from 6). Each processor sums the received values with its corresponding retained sub-vector. This operation is recursed, halving the length of the exchanged data at each step. The fold operation can also be accomplished by using the MPI_Reduce_scatter collective operation provided by the MPI library.

Costs for a communication algorithm are typically quantified by the number of messages and the total volume of data sent and received. On both these accounts the expand and fold of Figure 4 are optimal: each processor performs log2(P) sends and receives and exchanges N − N/P data values. Each processor also performs N − N/P additions in the fold. A drawback is that the algorithms require O(N) storage on every processor. Alternative methods for performing all-to-all communication require less storage at the cost of more sends and receives. This is usually not a good trade-off for MD simulations, because quite large problems can be run with the few Gbytes of local memory available on current-generation processors. We now describe two versions of an AD algorithm which use the expand and fold operations.
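The expand can also be written with point-to-point messages only, as anticipated above. A possible sketch of the recursive-doubling exchange of Figure 4a, assuming P is a power of two and using illustrative names (the course implementation may differ in details such as buffer handling), is the following:

#include <mpi.h>

/* Expand by recursive doubling: after log2(P) exchange steps every rank
 * holds the full vector x of 3*N doubles. Each rank starts with its own
 * block of 3*n = 3*N/P doubles already in place at offset 3*n*rank. */
void expand_positions_p2p(double *x, int n, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int blocklen = 3 * n;            /* doubles owned per processor         */
    int chunk = 1;                   /* current piece spans 'chunk' blocks  */

    for (int step = 1; step < nprocs; step *= 2) {
        int partner = rank ^ step;   /* hypercube partner at this step      */
        /* my contiguous piece starts at my block index rounded down to a
         * multiple of 'chunk'; the partner's piece is the adjacent one     */
        int mystart = (rank / chunk) * chunk;
        int pstart  = (partner / chunk) * chunk;

        MPI_Sendrecv(&x[mystart * blocklen], chunk * blocklen, MPI_DOUBLE,
                     partner, 0,
                     &x[pstart * blocklen],  chunk * blocklen, MPI_DOUBLE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
        chunk *= 2;                  /* owned piece doubles at each step    */
    }
}

For 8 processors this reproduces the partner sequence of Figure 4a (processor 2 exchanges with 3, then 0, then 6). A fold can be written analogously by running the loop in the opposite direction (starting from a P/2 shift) and summing the received half into the retained half at each step.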
6.2 The simpler algorithm

The first algorithm is simpler and does not take advantage of Newton's 3rd law. We call this algorithm A1; it is outlined in Figure 5, with the dominating term(s) in the computation or communication cost of each step listed on the right. We assume at the beginning of the time step that each processor knows the current positions of all N atoms, i.e. each has an updated copy of the entire x vector.

Step (1) of the algorithm is to construct the neighbor lists for all the pairwise interactions that must be computed in block Fz. Typically this will only be done once every few time steps. If the ratio of the physical domain diameter S to the extended force cutoff length Rs is relatively small, it is quicker for Pz to construct the lists by checking all N²/P pairs in its Fz block. When the simulation is large enough that 4 or more bins can be created in each dimension, it is quicker for each processor to bin all N atoms, then check the 27 surrounding bins of each of its N/P atoms to form the lists. This checking scales as N/P but has a large coefficient, so the overall scaling of the binned neighbor list construction is recorded as N/P + N.

In step (2) of the algorithm, the neighbor lists are used to compute the non-zero matrix elements in Fz. As each pairwise force interaction is computed, the force components are summed into fz, so that Fz is never actually stored as a matrix. At the completion of the step, each processor knows the total force fz on each of its N/P atoms. This is used to update their positions and velocities in step (4). (A step (3) will be added in the other algorithm.) Finally, in step (5) the updated atom positions in xz are shared among all P processors in preparation for the next time step via the expand operation of Figure 4a. As discussed above, this operation scales as N, the volume of data in the position vector x.

(1) Construct neighbor lists of non-zero interactions in Fz
      All pairs (S < 4Rs):   N²/P
      Binning   (S ≥ 4Rs):   N/P + N
(2) Compute elements of Fz, summing results into fz:   N/P
(4) Update atom positions in xz using fz:   N/P
(5) Expand xz among all processors, result in x:   N

Figure 5: Single time step of atom-decomposition algorithm A1 for processor Pz.

6.3 Taking advantage of Newton's 3rd law

As mentioned in the previous paragraph, algorithm A1 ignores Newton's 3rd law. If different processors own atoms i and j, as is usually the case, both processors compute the (ij) interaction and store the resulting force on their atom. This can be avoided, at the cost of more communication, by using a modified force matrix G which references each pairwise interaction only once. There are several ways to do this by striping the force matrix [21]; we choose instead to form G as follows. Let Gij = Fij, except that Gij = 0 when i > j and i + j is even, and likewise Gij = 0 when i < j and i + j is odd. Conceptually, G is colored like a checkerboard, with red squares above the diagonal set to zero and black squares below the diagonal also set to zero. A modified AD algorithm A2 that uses G to take advantage of Newton's 3rd law is outlined in Figure 6.
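Before stepping through Figure 6, the rule defining G can be stated compactly in code. This is only an illustrative helper (the function name pair_in_G is invented, not taken from the course code); it returns whether the (i, j) interaction survives in G:

/* Checkerboard rule for the modified force matrix G:
 * G_ij = F_ij, except that G_ij = 0 when (i > j and i + j even)
 * or (i < j and i + j odd). Keeping only the non-zero G_ij means
 * every pair is referenced exactly once across the whole matrix. */
int pair_in_G(int i, int j)
{
    if (i == j) return 0;                 /* no self-interaction            */
    if ((i + j) % 2 == 0)                 /* even sum: keep upper triangle  */
        return i < j;
    else                                  /* odd sum: keep lower triangle   */
        return i > j;
}

Because the kept entries alternate between the upper and lower triangle, each row of G, and hence each processor's block Gz, contains roughly half the interactions of the corresponding Fz, which is what keeps the work balanced in A2.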
(1) Construct neighbor lists of non-zero interactions in Gz
      All pairs (S < 4Rs):   N²/2P
      Binning   (S ≥ 4Rs):   N/2P + N
(2) Compute elements of Gz, doubly summing results into a local copy of f:   N/2P
(3) Fold f among all processors, result is fz:   N
(4) Update atom positions in xz using fz:   N/P
(5) Expand xz among all processors, result in x:   N

Figure 6: Single time step of atom-decomposition algorithm A2 for processor Pz, which takes advantage of Newton's 3rd law.

Step (1) is the same as in algorithm A1, except that only half as many neighbor list entries are made by each processor, since Gz has only half the non-zero entries of Fz. This is reflected in the factors of two included in the scaling entries. For neighbor lists formed by binning, each processor must still bin all N atoms, but only needs to check half the surrounding bins of each of its N/P atoms.

In step (2) the neighbor lists are used to compute the elements of Gz. For an interaction between atoms i and j, the resulting forces on atoms i and j are summed into both the i and j locations of the force vector f. This means each processor must store a copy of the entire force vector, as opposed to just storing fz as in algorithm A1. When all the matrix elements have been computed, f is folded across all P processors using the algorithm in Figure 4b. Each processor ends up with fz, the total forces on its atoms. Steps (4) and (5) then proceed the same as in A1.

Note that implementing Newton's 3rd law essentially halved the computation cost in steps (1) and (2) at the expense of doubling the communication cost. There are now two communication steps, (3) and (5), each of which scales as N. This will only be a net gain if the communication cost in A1 is less than a third of the overall run time. This is usually not the case on large numbers of processors, so in practice we almost always choose A1 instead of A2 for an AD algorithm. However, for small P or expensive force models, A2 can be faster.

6.4 Load-balance

Each processor will have an equal amount of work if each Fz or Gz block has roughly the same number of non-zero elements. This will be the case if the atom density is uniform across the simulation domain. However, non-uniform densities can arise if, for example, there are free surfaces so that some atoms border on vacuum, or phase changes are occurring within a liquid or solid. This is only a problem for load-balance if the N atoms are ordered in a geometric sense, as is typically the case. Then a group of N/P atoms near a surface, for example, will have fewer neighbors than groups in the interior. This can be overcome by randomly permuting the atom ordering at the beginning of the simulation, which is equivalent to permuting rows and columns of F or G. This ensures that every Fz or Gz will have roughly the same number of non-zeros. A random permutation also has the advantage that the load-balance will likely persist as atoms move about during the simulation. Note that this permutation only needs to be done once, as a pre-processing step before beginning the dynamics.

In summary, the AD algorithms divide the MD force computation and integration evenly across the processors (ignoring the O(N) component of the binned neighbor list construction, which is usually not significant). However, the algorithms require global communication, as each processor must acquire information held by all the other processors.
This communication scales as N, independent of P, so it limits the number of processors that can be used effectively. The chief advantage of the algorithms is their simplicity. Steps (1), (2) and (4) can be implemented by simply modifying the loops and data structures in a serial or vector code to treat N/P atoms instead of N. The expand and fold communication operations (3) and (5) can be treated as black-box routines (for example by using the MPI operations) and inserted at the proper locations in the code. Few other changes are typically necessary to parallelize an existing code.

7 Force-decomposition algorithm

Our next parallel MD algorithm is based on a block-decomposition of the force matrix rather than the row-wise decomposition used in the previous chapter. We call this a force-decomposition (FD) of the workload. As we shall see, this improves the O(N) scaling of the communication cost to O(N/√P). Block-decompositions of matrices are common in linear algebra algorithms for parallel machines, even if they are not common for short-range MD simulations.

The assignment of sub-blocks of the force matrix to processors with a row-wise (calendar) ordering of the processors is depicted in Figure 7. We assume for ease of exposition that P is an even power of 2 and that N is a multiple of P, although again it is straightforward to relax these constraints. As before, we let z index the processors from 0 to P − 1; processor Pz owns and will update the N/P atoms stored in the sub-vector xz.

Figure 7: The division of the permuted force matrix F′ among 16 processors in the force-decomposition algorithm. Processor P6 is assigned a sub-block F′6 of size N/√P by N/√P. To compute its matrix elements it must know the corresponding N/√P-length pieces xα and x′β of the position vector x and permuted position vector x′.

To reduce communication (explained below), the block-decomposition in Figure 7 is actually of a permuted force matrix F′, which is formed by rearranging the columns of F in a particular way. If we order the xz pieces in row-order, they form the usual position vector x, which is shown as a vertical bar at the left of the figure. Were we to have x span the columns as in Figure 3, we would form the force matrix as before. Instead, we span the columns with a permuted position vector x′, shown as a horizontal bar at the top of Figure 7, in which the xz pieces are stored in column-order. Thus, in the 16-processor example shown in the figure, x stores each processor's piece in the usual order (0, 1, 2, 3, 4, . . . , 14, 15) while x′ stores them as (0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15). Now the (ij) element of F′ is the force on atom i in vector x due to atom j in permuted vector x′.

The F′z sub-block owned by each processor Pz is of size (N/√P) × (N/√P). To compute the matrix elements in F′z, processor Pz must know one N/√P-length piece of each of the x and x′ vectors, which we denote as xα and x′β. As these elements are computed they will be accumulated into corresponding force sub-vectors fα and f′β. The Greek subscripts α and β each run from 0 to √P − 1 and reference the row and column position occupied by processor Pz. Thus for processor 6 in the figure, xα consists of the x sub-vectors (4,5,6,7) and x′β consists of the x′ sub-vectors (2,6,10,14).

7.1 The simpler algorithm

Our first FD algorithm F1 is outlined in Figure 8.
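Before walking through its steps, the row/column bookkeeping just described can be made concrete. The helper below (illustrative names, not the course code) computes, for a processor index z on a √P × √P grid, which xw sub-vectors make up its xα and x′β pieces; for z = 6 with P = 16 it reproduces the sets (4,5,6,7) and (2,6,10,14) quoted above:

#include <stdio.h>

/* For processor z on a sqrtp x sqrtp grid (row-wise ordering), fill
 * row_ids[] with the owners of the sub-vectors forming x_alpha and
 * col_ids[] with the owners of the sub-vectors forming x'_beta. */
void fd_pieces(int z, int sqrtp, int *row_ids, int *col_ids)
{
    int alpha = z / sqrtp;                /* row position of Pz on the grid    */
    int beta  = z % sqrtp;                /* column position of Pz on the grid */
    for (int w = 0; w < sqrtp; ++w) {
        row_ids[w] = alpha * sqrtp + w;   /* x_alpha: consecutive pieces       */
        col_ids[w] = w * sqrtp + beta;    /* x'_beta: column-ordered pieces    */
    }
}

int main(void)
{
    int row_ids[4], col_ids[4];
    fd_pieces(6, 4, row_ids, col_ids);
    for (int w = 0; w < 4; ++w) printf("%d ", row_ids[w]);   /* 4 5 6 7   */
    printf("| ");
    for (int w = 0; w < 4; ++w) printf("%d ", col_ids[w]);   /* 2 6 10 14 */
    printf("\n");
    return 0;
}

In the algorithms below, the folds and expands are performed only among the processors sharing row α or column β; with MPI such subgroups could be created, for instance, with MPI_Comm_split, using alpha and beta as the colors.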
As before, each processor has updated copies of the needed atom positions xα and x′β at the beginning of the time step. In step (1) the neighbor lists are constructed. Again, for small problems this is most quickly done by checking all N²/P possible pairs in F′z. For large problems, the N/√P atoms in x′β are binned, then the 27 surrounding bins of each atom in xα are checked. The checking procedure scales as N/√P, so the scaling of the binned neighbor list construction is still N/√P. In step (2) the neighbor lists are used to compute the matrix elements in F′z. As before, the elements are summed into a local copy of fα as they are computed, so F′z never needs to be stored in matrix form.

In step (3) a fold operation is performed within each row of processors so that processor Pz obtains the total forces fz on its N/P atoms. Although the fold algorithm used is the same as in the preceding chapter, there is a key difference. In this case the vector fα being folded is only of length N/√P and only the √P processors in one row participate in the fold. Thus this operation scales as N/√P instead of N as in the AD algorithm. In step (4), fz is used by Pz to update the N/P atom positions in xz. Steps (5a-5b) share these updated positions with all the processors that will need them for the next time step. These are the processors which share a row or column with Pz. First, in (5a), the processors in row α perform an expand of their xz sub-vectors so that each acquires the entire xα. As with the fold, this operation scales as the N/√P length of xα instead of as N, as it did in algorithms A1 and A2. Similarly, in step (5b), the processors in each column β perform an expand of their xz. As a result they all acquire x′β and are ready to begin the next time step.

It is in step (5) that using a permuted force matrix F′ saves extra communication. The permuted form of F′ causes xz to be a component of both xα and x′β for each Pz. This would not be the case if we had block-decomposed the original force matrix F by having x span the columns instead of x′. Then in Figure 7 the xβ for P6 would have consisted of the sub-vectors (8,9,10,11), none of whose components are known by P6. Thus, before performing the expand in step (5b), processor 6 would first need to acquire one of these 4 components from another processor (in the transpose position in the matrix), requiring an extra O(N/P) communication step. The transpose-free version of the FD algorithms presented here was motivated by a matrix permutation for parallel matrix-vector multiplication discussed in reference [22].

(1)  Construct neighbor lists of non-zero interactions in F′z
       All pairs (S < 4Rs):   N²/P
       Binning   (S ≥ 4Rs):   N/√P
(2)  Compute elements of F′z, storing results in fα:   N/P
(3)  Fold fα within row α, result is fz:   N/√P
(4)  Update atom positions in xz using fz:   N/P
(5a) Expand xz within row α, result in xα:   N/√P
(5b) Expand xz within column β, result in x′β:   N/√P

Figure 8: Single time step of force-decomposition algorithm F1 for processor Pz.

7.2 Taking advantage of Newton's 3rd law

As with algorithm A1, algorithm F1 does not take advantage of Newton's 3rd law; each pairwise force interaction is computed twice. Algorithm F2 avoids this duplicated effort by checkerboarding the force matrix as in the preceding chapter. Specifically, the checkerboarded matrix G is permuted in the same way as F was, to form G′.
Note that now the total force on atom i is the sum of all the matrix elements in row i minus the sum of all the elements in column i. The modified FD algorithm F2 is outlined in Figure 9. Step (1) is the same as in F1, except that half as many interactions are stored in the neighbor lists. Likewise, step (2) requires only half as many matrix elements to be computed. For each (ij) element, the computed force components are now summed into two force sub-vectors instead of one. The force on atom i is summed into fα in the location corresponding to row i. Likewise, the force on atom j is summed into f′β in the location corresponding to column j.

Steps (3a-3c) accumulate these forces so that processor Pz ends up with the total force on its N/P atoms. First, in step (3a), the √P processors in column β fold their local copies of f′β. The result is f′z. Each element of this N/P-length sub-vector is the sum of an entire column of G′. Next, in step (3b), the row contributions to the forces are summed by performing a fold of the fα vector within each row α. The result is fz, each element of which is the sum across a row of G′. Finally, in step (3c) the column and row contributions are subtracted, element by element, to yield the total forces fz on the atoms owned by processor Pz. The processor can now update the positions and velocities of its atoms; steps (4) and (5) are identical to those of F1.

In the FD algorithms, exploiting Newton's 3rd law again halves the computation required in steps (1) and (2). However, the communication cost in steps (3) and (5) does not double. Rather, there are 4 expands and folds required in F2 versus 3 in F1. Thus, in practice, it is usually faster to use algorithm F2, with its reduced computational cost and slightly increased communication cost, rather than F1. The key point is that all the expand and fold operations in F1 and F2 scale as N/√P rather than as N, as was the case in algorithms A1 and A2. When you run on large numbers of processors, this significantly reduces the time the FD algorithms spend on communication as compared to the AD algorithms.

(1)  Construct neighbor lists of non-zero interactions in G′z
       All pairs (S < 4Rs):   N²/2P
       Binning   (S ≥ 4Rs):   N/(2√P)
(2)  Compute elements of G′z, storing results in fα and f′β:   N/2P
(3a) Fold f′β within column β, result is f′z:   N/√P
(3b) Fold fα within row α, result is fz:   N/√P
(3c) Subtract f′z from fz, result is total fz:   N/P
(4)  Update atom positions in xz using fz:   N/P
(5a) Expand xz within row α, result in xα:   N/√P
(5b) Expand xz within column β, result in x′β:   N/√P

Figure 9: Single time step of force-decomposition algorithm F2 for processor Pz, which takes advantage of Newton's third law.

7.3 Load-balance

Finally, the issue of load-balance is a more serious concern for the FD algorithms. Processors will have equal work to do only if all the matrix blocks F′z or G′z are uniformly sparse. If the atoms are ordered geometrically this will not be the case, even for problems with uniform density. This is because such an ordering creates a force matrix with diagonal bands of non-zero elements. As in the AD case, a random permutation of the atom ordering produces the desired effect. Only now the permutation should be done as a pre-processing step for all problems, even those with uniform atom densities.

In summary, algorithms F1 and F2 divide the MD computations evenly across processors, as did the AD algorithms.
But the block-decomposition of the force matrix means each processor only needs O(N/√P) information to perform its computations. Thus the communication and memory costs are reduced by a factor of √P versus algorithms A1 and A2. The FD strategy retains the simplicity of the AD technique; F1 and F2 can be implemented using the same "black-box" communication routines as A1 and A2. The FD algorithms also need no geometric information about the physical problem being modeled to perform optimally. In fact, for load-balancing purposes the algorithms intentionally ignore such information by using a random atom ordering.

8 Spatial-decomposition algorithm

In our final parallel algorithm the physical simulation domain is subdivided into small 3-D boxes, one for each processor. We call this a spatial-decomposition (SD) of the workload. Each processor computes forces on, and updates the positions and velocities of, all atoms within its box at each time step. Atoms are reassigned to new processors as they move through the physical domain. In order to compute forces on its atoms, a processor need only know the positions of atoms in nearby boxes. The communication required in the SD algorithm is thus local in nature, as compared to global in the AD and FD cases.

The size and shape of the box assigned to each processor will depend on N, P, and the aspect ratio of the physical domain, which we assume to be a 3-D rectangular parallelepiped. Within these constraints the number of processors in each dimension is chosen so as to make each processor's box as "cubic" as possible. This is to minimize communication, since in the large-N limit the communication cost of the SD algorithm will turn out to be proportional to the surface area of the boxes. An important point to note is that, in contrast to the link-cell method described in Section 4.3, the box lengths may now be smaller or larger than the force cutoff lengths Rc and Rs.

Each processor in our SD algorithm maintains two data structures, one for the N/P atoms in its box and one for atoms in nearby boxes. In the first data structure, each processor stores complete information: positions, velocities, neighbor lists, etc. This data is stored in a linked list to allow insertions and deletions as atoms move to new boxes. In the second data structure only atom positions are stored. Inter-processor communication at each time step keeps this information current.

8.1 The communication scheme

The communication scheme we use to acquire this information from the processors owning the nearby boxes is shown in Figure 10. The first step (a) is for each processor to exchange information with the adjacent processors in the east/west dimension. Processor 2 fills a message buffer with the atom positions it owns that are within a force cutoff length Rs of processor 1's box.10 If s < Rs, where s is the box length in the east/west direction, this will be all of processor 2's atoms; otherwise it will be those nearest to box 1. Now each processor sends its message to the processor in the westward direction (2 sends to 1) and receives a message from the eastward direction. Each processor puts the received information into its second data structure. Now the procedure is reversed, with each processor sending to the east and receiving from the west. If s > Rs, all needed atom positions in the east-west dimension have now been acquired by each processor.
If s < Rs, the east-west steps are repeated, with each processor sending more needed atom positions to its adjacent processors. For example, processor 2 sends processor 1 atom positions from box 3 (which processor 2 now has in its second data structure). This can be repeated until each processor knows all atom positions within a distance Rs of its box, as indicated by the dotted boxes in the figure. The same procedure is now repeated in the north/south dimension (see step (b) of Figure 10). The only difference is that the messages sent to the adjacent processor now contain not only atoms the processor owns (in its first data structure), but also any atom positions in its second data structure that are needed by the adjacent processor. For s = Rs, this has the effect of sending 3 boxes' worth of atom positions in one message, as shown in (b). Finally, in step (c) the process is repeated in the up/down dimension. Now atom positions from an entire plane of boxes (9 in the figure) are sent in each message.

10 The reason for using Rs instead of Rc will be made clear further down in this chapter.

Figure 10: Method by which a processor acquires nearby atom positions in the spatial-decomposition algorithm. In six data exchanges, all atom positions in adjacent boxes in the (a) east/west, (b) north/south, and (c) up/down directions can be communicated.

There are several key advantages to this scheme, all of which reduce the overall cost of communication in our algorithm.

1. First, for s ≥ Rs the needed atom positions from all 26 surrounding boxes are obtained in just 6 data exchanges. Moreover, if the parallel machine has a hypercube topology (as for example the IBM BlueGene/P supercomputer at JSC), the processors can be mapped to the boxes in such a way that all 6 of these processors will be directly connected to the center processor. Thus message passing will be fast and contention-free.

2. Second, when s < Rs, so that atom information is needed from more distant boxes, this occurs with only a few extra data exchanges, all of which are still with the 6 immediate neighbor processors. This is an important feature of the algorithm that enables it to perform well even when large numbers of processors are used on relatively small problems.

3. A third advantage is that the amount of data communicated is minimized. Each processor acquires only the atom positions that are within a distance Rs of its box.

4. Fourth, all of the received atom positions can be placed as contiguous data directly into the processor's second data structure. No time is spent rearranging data, except to create the buffered messages that need to be sent.

5. Finally, as will be discussed in more detail below, this message creation can be done very quickly. A full scan of the two data structures is only done once every few time steps, when the neighbor lists are created, to decide which atom positions to send in each message. The scan procedure creates a list of the atoms that make up each message. During all the other time steps, the lists can be used, in lieu of scanning the full atom list, to directly index the referenced atoms and buffer up the messages quickly. This is the equivalent of a gather operation on a vector machine.

8.2 The simpler algorithm

We now outline our SD algorithm S1 in Figure 11. Box z is assigned to processor Pz, where z runs from 0 to P − 1 as before.
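Before stepping through Figure 11, here is a minimal sketch of one east/west exchange of step (a) of Figure 10, written with MPI point-to-point calls. All names are illustrative assumptions (the course code organizes its buffers differently), and the packing of boundary-atom positions into the send buffers is omitted:

#include <mpi.h>

/* One east/west exchange on a Cartesian processor grid: send my boundary
 * atom positions westward and receive my eastern neighbor's, then reverse.
 * cart is assumed to be a communicator created with MPI_Cart_create. */
void exchange_east_west(const double *sendbuf_w, int nsend_w,
                        const double *sendbuf_e, int nsend_e,
                        double *recvbuf, int maxrecv,
                        MPI_Comm cart)
{
    int west, east, nrecv;
    MPI_Status status;

    /* ranks of the neighbors in the x (east/west) direction */
    MPI_Cart_shift(cart, 0, 1, &west, &east);

    /* send to the west, receive from the east */
    MPI_Sendrecv(sendbuf_w, nsend_w, MPI_DOUBLE, west, 0,
                 recvbuf, maxrecv, MPI_DOUBLE, east, 0, cart, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &nrecv);

    /* reverse: send to the east, receive from the west (appending) */
    MPI_Sendrecv(sendbuf_e, nsend_e, MPI_DOUBLE, east, 0,
                 recvbuf + nrecv, maxrecv - nrecv, MPI_DOUBLE, west, 0,
                 cart, &status);
}

Steps (b) and (c) of Figure 10 repeat the same pattern along the other two grid directions, forwarding also the positions already received from previous exchanges.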
Processor Pz stores the atom positions of its N/P atoms in xz and the forces on those atoms in fz. Steps (1a-1c) are the neighbor list construction, performed once every few time steps. This is somewhat more complex than in the other algorithms because, as discussed above, it includes the creation of the lists of atoms that will be communicated at every time step. First, in step (1a) the positions, velocities, and any other identifying information of atoms that are no longer inside box z are deleted from xz (the first data structure) and stored in a message buffer. These atoms are exchanged with the 6 adjacent processors via the communication pattern of Figure 10. As the information routes through each dimension, processor Pz checks for new atoms that are now inside its box boundaries, adding them to its xz. Next, in step (1b) all atom positions within a distance Rs of box z are acquired by the communication scheme described above. As the different messages are buffered by scanning through the two data structures, lists of the included atoms are made. These lists will be used in step (5). The scaling factor ∆ for steps (1a) and (1b) will be explained below.

When steps (1a) and (1b) are complete, both of the processor's data structures are current. Neighbor lists for its N/P atoms can now be constructed in step (1c). If atoms i and j are both in box z (an inner-box interaction), the (ij) pair is only stored once in the neighbor list. If i and j are in different boxes (a two-box interaction), both processors store the interaction in their respective neighbor lists. If this were not done, processors would compute forces on atoms they do not own, and communication of the forces back to the processors owning the atoms would be required. A modified algorithm that performs this communication to avoid the duplicated force computation of two-box interactions is discussed below. When s, the length of box z, is less than two cutoff distances, it is quicker to find neighbor interactions by checking each atom inside box z against all the atoms in both of the processor's data structures. This scales as the square of N/P. If s > 2Rs, then, with the shell of atoms around box z, there are 4 or more bins in each dimension. In this case, as with the algorithms of the preceding sections, it is quicker to perform the neighbor list construction by binning. All the atoms in both data structures are mapped to bins of size Rs. The surrounding bins of each atom in box z are then checked for possible neighbors.

Processor Pz can now compute all the forces on its atoms in step (2) using the neighbor lists.

(1a) Move necessary atoms to new boxes:   ∆
(1b) Make lists of all atoms that will need to be exchanged:   ∆
(1c) Construct neighbor lists of interacting pairs in box z
       All pairs (s < 2Rs):   N/2P + ∆
       Binning   (s ≥ 2Rs):   N/2P + ∆
(2)  Compute forces on atoms in box z, doubly storing results in fz:   N/2P + ∆
(4)  Update atom positions in xz in box z using fz:   N/P
(5)  Exchange atom positions across box boundaries with neighboring processors
       (s < Rs):  send N/P positions to many neighbors:   8Rs³
       (s ≳ Rs):  send N/P positions to nearest neighbors:   27N/P
       (s ≫ Rs):  send positions near the box surface to nearest neighbors:   6Rs(N/P)^(2/3)

Figure 11: Single time step of spatial-decomposition algorithm S1 for processor Pz.

When the interaction is between two atoms inside box z, the resulting force is stored twice in fz, once for atom i and once for atom j.
For two-box interactions, only the force on the processor's own atom is stored. After computing fz, the atom positions are updated in step (4). Finally, these updated positions must be communicated to the surrounding processors in preparation for the next time step. This occurs in step (5), with the communication pattern of Figure 10, using the previously created lists. The amount of data exchanged in this operation is a function of the relative values of the force cutoff distance and the box length, and is discussed in the next subsection. Also, we note that on the time steps in which the neighbor lists are constructed, step (5) does not have to be performed, since step (1b) has the same effect.

8.2.1 Communication scaling

The communication operations in algorithm S1 occur in steps (1a), (1b) and (5). The communication in the latter two steps is identical. The cost of these steps scales as the volume of data exchanged. For step (5), if we assume uniform atom density, this is proportional to the physical volume of the shell of thickness Rs around box z, namely (s + 2Rs)³ − s³. Note there are roughly N/P atoms in a volume of s³, since s³ is the size of box z. There are three cases to consider:

1. If s < Rs, data from many neighboring boxes must be exchanged and the operation scales as 8Rs³.

2. If s ≳ Rs, the data in all 26 surrounding boxes is exchanged and the operation scales as 27N/P.

3. If s ≫ Rs, only atom positions near the 6 faces of box z will be exchanged. The communication then scales as the surface area of box z, namely 6Rs(N/P)^(2/3).

These cases are explicitly listed in the scaling of step (5). Elsewhere in Figure 11, we use the term ∆ to represent whichever of the three is applicable for a given N, P, and Rs. We note that step (1a) involves less communication, since not all the atoms within a cutoff distance of a box face will move out of the box. But this operation still scales as the surface area of box z, so we list its scaling as ∆.

8.2.2 Computational scaling

The computational portion of algorithm S1 is in steps (1c), (2) and (4). All of these scale as N/P, with additional work in steps (1c) and (2) for the atoms that neighbor box z and are stored in the second data structure. The number of these atoms is proportional to ∆, so it is included in the scaling of those steps. The leading term in the scaling of steps (1c) and (2) is listed as N/2P, as in algorithms A2 and F2, since inner-box interactions are only stored and computed once for each pair of atoms in algorithm S1. Note that as s grows large relative to Rs, as it will for very large simulations, the ∆ contribution to the overall computation time decreases and the overall scaling of algorithm S1 approaches the optimal N/2P. In essence, each processor spends nearly all its time working in its own box and only exchanges a relatively small amount of information with neighboring processors to update its boundary conditions.

8.2.3 Data structure updates

An important feature of algorithm S1 is that the data structures are only modified once every few time steps, when the neighbor lists are constructed. In particular, even if an atom moves outside box z's boundaries, it is not reassigned to a new processor until step (1a) is executed. Processor Pz can still compute correct forces for the atom so long as two criteria are met:

• The first is that an atom does not move farther than s between two neighbor list constructions.
• The second is that all nearby atoms within a distance Rs, instead of Rc, must be updated every time step.

The alternative is to move atoms to their new processors at every time step. This has the advantage that only atoms within a distance Rc of box z need to be exchanged at all time steps when the neighbor lists are not constructed. This reduces the volume of communication, since Rc < Rs. However, now the neighbor list of a reassigned atom must also be sent. The information in the neighbor list consists of atom indices referencing local memory locations where the neighbor atoms are stored. If atoms are continuously moving to new processors, these local indices become meaningless. To overcome this, one can assign a global index (1 to N) to each atom that moves with the atom from processor to processor. A mapping of global index to local memory must then be stored in a vector of size N by each processor, or the global indices must be sorted and searched to find the correct atoms when they are referenced in a neighbor list. The former solution limits the size of problems that can be run; the latter incurs an extra cost for the sort and search operations. By contrast, the S1 version of the code exploits an idea of Tamayo and Giles [23] which makes it less complex and reduces the computational and communication overhead. This does not affect the timings for simulations with large N, but it improves the algorithm's performance for medium-sized problems.

8.3 Taking advantage of Newton's 3rd law

A modified version of S1 that takes full advantage of Newton's 3rd law can also be devised; call it algorithm S2. If processor Pz acquires atoms only from its west, south, and down directions (and sends its own atoms only in the east, north, and up directions), then each pairwise interaction need only be computed once, even when the two atoms reside in different boxes. This requires sending computed force results back in the opposite directions to the processors that own the atoms, as a step (3) in the algorithm. This scheme does not reduce communication costs, since half as much information is communicated twice as often, but it does eliminate the duplicated force computations for two-box interactions.

Two points are worth noting. First, the overall saving of S2 over S1 is small, particularly for large N. Only the ∆ term in steps (1c) and (2) is saved. Second, the performance of SD algorithms for large systems can be improved by optimizing the single-processor force computation in step (2). As with vector machines, this requires more attention be paid to data structures and loop orderings in the force and neighbor-list construction routines to achieve high single-processor flop rates. Implementing S2 requires special-case coding for atoms near box edges and corners to ensure all interactions are counted only once [24], which can hinder this optimization process. For all these reasons we will not implement an S2 code.

8.4 Load-balance

Finally, the issue of load-balance is an important concern in any SD algorithm. Algorithm S1 will be load-balanced only if all boxes have a roughly equal number of atoms (and surrounding atoms). This will not be the case if the physical atom density is non-uniform. Additionally, if the physical domain is not a rectangular parallelepiped, it can be difficult to split into P equal-sized pieces.
Sophisticated load-balancing algorithms have been developed to partition an irregular physical domain or non-uniformly dense clusters of atoms, but they create sub-domains which are irregular in shape or are connected in an irregular fashion to their neighboring sub-domains. In either case, the task of assigning atoms to sub-domains and communicating with neighbors becomes more costly and complex. If the physical atom density changes over time during the MD simulation, the load-balance problem is compounded. Any dynamic load-balancing scheme requires additional computational overhead and data movement.

In summary, the SD algorithm, like the AD and FD ones, evenly divides the MD computations across all the processors. Its chief benefit is that it takes full advantage of the local nature of the interatomic forces by performing only local communication. Thus, in the large-N limit, it achieves optimal O(N/P) scaling and is clearly the fastest algorithm. However, this is only true if good load-balance is also achievable. Since its performance is sensitive to the problem geometry, algorithm S1 is more restrictive than A2 and F2, whose performance is geometry-independent. A second drawback of algorithm S1 is its complexity; it is more difficult to implement efficiently than the simpler AD and FD algorithms. In particular, the communication scheme requires extra coding and bookkeeping to create messages and access data received from neighboring boxes. In practice, integrating algorithm S1 into an existing serial MD code can require a substantial reworking of data structures and code.

References

[1] B. J. Alder, T. E. Wainwright, J. Chem. Phys. 27, p. 1208 (1957).
[2] L. Verlet, Phys. Rev. 159, p. 98 (1967).
[3] L. Hannon, G. C. Lee, E. Clementi, J. Sci. Computing 1(2), p. 145 (1986).
[4] M. Karplus, J. A. McCammon, Scientific American, April 1986, p. 42 (1986).
[5] See, for example, J. P. Hansen and I. R. McDonald, Theory of Simple Liquids, 2nd Ed., Academic, 1986. A classic book on liquids, with a chapter devoted to computer simulation.
[6] See for instance N. W. Ashcroft, N. D. Mermin, Solid State Physics, Brooks Cole, 1976.
[7] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Oxford, 1987. This is a classic "how to" book containing a large number of technical details on a vast array of techniques. J. M. Haile, Molecular Dynamics Simulation, Wiley, 1992. An excellent general introduction to molecular dynamics. Not as thorough as the previous one in explaining techniques, but very deep in explaining the fundamentals, including "philosophical" aspects. Probably the best starting point. A good amount of Fortran 77 code is included.
[8] H. J. C. Berendsen and W. F. van Gunsteren, in Molecular Dynamics Simulation of Statistical-Mechanical Systems, G. Ciccotti and W. G. Hoover (Eds.), North-Holland, 1986.
[9] D. Frenkel and B. Smit, Understanding Molecular Simulation, 2nd Ed., Academic, San Diego, 2002.
[10] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger, New York, 1988.
[11] J. E. Barnes and P. Hut, Nature 324, p. 446 (1986).
[12] L. Greengard and V. Rokhlin, J. Comp. Phys. 73, p. 325 (1987).
[13] R. W. Hockney, S. P. Goel and J. W. Eastwood, J. Comp. Phys. 14, p. 148 (1974).
[14] http://www.charmm.org
[15] http://www.igc.ethz.ch/GROMOS/index
[16] http://ambermd.org
[17] http://www.gromacs.org
[18] http://www.ks.uiuc.edu/Research/namd
[19] E. A. Mastny, J. J. de Pablo, J. Chem.
Phys. 127, p. 104504 (2007).
[20] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ (1988).
[21] H. Schreiber, O. Steinhauser and P. Schuster, Parallel Computing 18, p. 557-573 (1992).
[22] J. G. Lewis and R. A. van de Geijn, Distributed memory matrix-vector multiplication and conjugate gradient algorithms, in Proc. Supercomputing '93, p. 484-492, IEEE Computer Society Press (1993).
[23] P. Tamayo and R. Giles, A parallel scalable approach to short-range molecular dynamics on the CM-5, in Proc. Scalable High Performance Computing Conference-92, p. 240-245, IEEE Computer Society Press (1992).
[24] D. Brown, J. H. R. Clarke, M. Okuda and T. Yamazaki, A domain decomposition parallelization strategy for molecular dynamics simulations on distributed memory machines, Comp. Phys. Comm. 73, p. 67-80 (1993).