14.1 Introduction
Given $N$ bodies with positions $\mathbf{x}_i$ and velocities $\mathbf{v}_i$ for $1 \le i \le N$, the force vector $\mathbf{f}_{ij}$ on body $i$ caused by body $j$ is given by

$$\mathbf{f}_{ij} = G \, \frac{m_i m_j}{\|\mathbf{r}_{ij}\|^2} \cdot \frac{\mathbf{r}_{ij}}{\|\mathbf{r}_{ij}\|}$$

where $m_i$ and $m_j$ are the masses of bodies $i$ and $j$, respectively; $\mathbf{r}_{ij} = \mathbf{x}_j - \mathbf{x}_i$ is the difference vector from body $i$ to body $j$; and $G$ is the gravitational constant. This expression diverges (causing problems such as divide overflow) for $\mathbf{r}_{ij}$ with small magnitude; to compensate, it is common practice to apply a softening factor that models the interaction between two Plummer masses (masses that behave as if they were spherical galaxies). For a softening factor $\epsilon$, the resulting expression is

$$\mathbf{f}_{ij} \approx G \, \frac{m_i m_j \, \mathbf{r}_{ij}}{\left(\|\mathbf{r}_{ij}\|^2 + \epsilon^2\right)^{3/2}}$$

The total force $\mathbf{F}_i$ on body $i$, due to its interactions with the other $N-1$ bodies, is obtained by summing all interactions

$$\mathbf{F}_i = \sum_{\substack{1 \le j \le N \\ j \ne i}} \mathbf{f}_{ij} \approx G \, m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \mathbf{r}_{ij}}{\left(\|\mathbf{r}_{ij}\|^2 + \epsilon^2\right)^{3/2}}$$

To update the position and velocity of each body, the force (acceleration) applied to body $i$ is $\mathbf{a}_i = \mathbf{F}_i / m_i$, so the $m_i$ term can be removed from the numerator, as follows.

$$\mathbf{a}_i \approx G \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \mathbf{r}_{ij}}{\left(\|\mathbf{r}_{ij}\|^2 + \epsilon^2\right)^{3/2}}$$
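In code, this softened body-body term maps naturally onto a small helper function. The following is a minimal sketch, not the book's exact listing: the name bodyBodyInteraction, the float4 position-plus-mass layout, and the softeningSquared parameter are illustrative assumptions, and the gravitational constant $G$ is assumed to be folded into the units.

// Sketch: accumulate body j's acceleration contribution onto body i's sum.
// posMass components are (x, y, z, mass); softeningSquared is epsilon^2.
// Intended for a .cu file, where float3/float4 are built-in vector types.
__host__ __device__ inline float3
bodyBodyInteraction( float3 ai, float4 bi, float4 bj, float softeningSquared )
{
    float3 r;
    r.x = bj.x - bi.x;                             // r_ij = x_j - x_i
    r.y = bj.y - bi.y;
    r.z = bj.z - bi.z;

    float distSqr = r.x*r.x + r.y*r.y + r.z*r.z + softeningSquared;
    float invDist = 1.0f / sqrtf( distSqr );
    float invDistCube = invDist*invDist*invDist;   // (||r||^2 + eps^2)^(-3/2)
    float s = bj.w * invDistCube;                  // m_j * (||r||^2 + eps^2)^(-3/2)

    ai.x += r.x * s;                               // a_i += m_j * r_ij / (...)^(3/2)
    ai.y += r.y * s;
    ai.z += r.z * s;
    return ai;
}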
Like Nyland et al., we use a leapfrog Verlet algorithm to apply a timestep to the simulation. The values of the positions and velocities are offset from one another by half a timestep, a characteristic that is not obvious from the code because, in our sample, the positions and velocities are initially assigned random values. Our leapfrog Verlet integration updates the velocity, then the position.
This method is just one of many different integration algorithms that can be used to update the simulation, but an extensive discussion is outside the scope of this book.
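As a concrete illustration, a host-side sketch of one such step might look like the following; the array names and the assumption that the accelerations for the current step have already been computed are mine, not the book's listing.

// Sketch: leapfrog update. Velocities are advanced first, then positions,
// which keeps the two offset from one another by half a timestep.
void integrateBodies( float3 *pos, float3 *vel, const float3 *accel,
                      float dt, size_t N )
{
    for ( size_t i = 0; i < N; i++ ) {
        vel[i].x += accel[i].x * dt;        // v += a*dt
        vel[i].y += accel[i].y * dt;
        vel[i].z += accel[i].z * dt;

        pos[i].x += vel[i].x * dt;          // x += v*dt (using the updated v)
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}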
Since the integration has a runtime of $O(N)$ and computing the forces has a runtime of $O(N^2)$, the biggest performance benefits from porting to CUDA stem from optimizing the force computation. Optimizing this portion of the calculation is the primary focus of this chapter.
14.1.1 A MATRIX OF FORCES
A naive implementation of the N-Body algorithm consists of a doubly nested loop that, for each body, computes the sum of the forces exerted on that body by every other body. The body-body forces can be thought of as an $N \times N$ matrix, where the sum of row $i$ is the total gravitational force exerted on body $i$.
The diagonal of this "matrix" consists of zeroes corresponding to each body's influence on itself, which can be ignored.
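A serial sketch of that doubly nested loop follows, reusing the hypothetical bodyBodyInteraction helper from above; the function and array names are illustrative. The softening factor keeps the $j=i$ diagonal term finite, and it contributes zero because $\mathbf{r}_{ii} = 0$.

// Sketch: O(N^2) brute force. For each body i, sum the contributions
// from every body j; the diagonal (j == i) adds zero.
void computeGravitation( float3 *force, const float4 *posMass,
                         float softeningSquared, size_t N )
{
    for ( size_t i = 0; i < N; i++ ) {
        float3 acc = { 0.0f, 0.0f, 0.0f };
        for ( size_t j = 0; j < N; j++ ) {
            acc = bodyBodyInteraction( acc, posMass[i], posMass[j],
                                       softeningSquared );
        }
        force[i] = acc;   // holds acceleration, since m_i was factored out
    }
}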
Because each element in the matrix may be computed independently, there is a tremendous amount of potential parallelism. The sum of each row is a reduction that may be computed by a single thread or by combining results from multiple threads as described in Chapter 12.
Figure 14.1 shows an 8-body "matrix" being processed. The rows correspond to the sums that are the output of the computation. When using CUDA, N-Body implementations typically have one thread compute the sum for each row.

Figure 14.1 "Matrix" of forces (8 bodies).
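A sketch of that one-thread-per-row mapping follows; the kernel name and the grid-stride loop (so any launch configuration covers all $N$ bodies) are illustrative choices, not the book's exact listing.

// Sketch: each thread computes the full row sum for one body.
__global__ void ComputeNBodyGravitation_GPU( float3 *force,
                                             const float4 *posMass,
                                             float softeningSquared,
                                             size_t N )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += (size_t) blockDim.x*gridDim.x ) {
        float3 acc = { 0.0f, 0.0f, 0.0f };
        float4 me = posMass[i];
        for ( size_t j = 0; j < N; j++ ) {
            acc = bodyBodyInteraction( acc, me, posMass[j], softeningSquared );
        }
        force[i] = acc;
    }
}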
Since every element in the "matrix" is independent, they also may be computed in parallel within a given row: Compute the sum of every fourth element, for example, and then add the four partial sums to compute the final output. Harris et al. describe using this method for small $N$, where there are not enough threads to cover the latencies in the GPU: Launch more threads, compute partial sums in each thread, and accumulate the final sums with reductions in shared memory. Harris et al. reported a benefit for sufficiently small $N$.
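One way to express that idea is sketched below, assuming one block per body, a power-of-two block size, and a shared-memory reduction in the style of Chapter 12; none of these names come from the book's listings.

// Sketch: one block per body. Each thread accumulates a strided partial
// sum over j, then a shared-memory tree reduction produces the row total.
__global__ void ComputeNBodyGravitation_smallN( float3 *force,
                                                const float4 *posMass,
                                                float softeningSquared,
                                                size_t N )
{
    extern __shared__ float3 partial[];     // one float3 per thread
    size_t i = blockIdx.x;                  // the body this block owns
    float4 me = posMass[i];
    float3 acc = { 0.0f, 0.0f, 0.0f };
    for ( size_t j = threadIdx.x; j < N; j += blockDim.x ) {
        acc = bodyBodyInteraction( acc, me, posMass[j], softeningSquared );
    }
    partial[threadIdx.x] = acc;
    __syncthreads();
    // Tree reduction in shared memory (blockDim.x assumed to be a power of 2).
    for ( unsigned int s = blockDim.x/2; s > 0; s >>= 1 ) {
        if ( threadIdx.x < s ) {
            partial[threadIdx.x].x += partial[threadIdx.x+s].x;
            partial[threadIdx.x].y += partial[threadIdx.x+s].y;
            partial[threadIdx.x].z += partial[threadIdx.x+s].z;
        }
        __syncthreads();
    }
    if ( threadIdx.x == 0 ) {
        force[i] = partial[0];
    }
}
// Launched with one block per body, for example:
//   ComputeNBodyGravitation_smallN<<<N, 256, 256*sizeof(float3)>>>( ... );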
For physical forces that are symmetric (i.e., the force exerted on body $i$ by body $j$ is equal in magnitude and opposite in sign to the force exerted by body $i$ on body $j$), such as gravitation, the transpose "elements" of the matrix have the opposite sign.
In this case, the "matrix" takes the form shown in Figure 14.2. When exploiting the symmetry, an implementation need only compute the upper right triangle of the "matrix," performing about half as many body-body computations.

Figure 14.2 Matrix with symmetric forces.
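A single-threaded sketch of the symmetric formulation follows (serial, so the accumulators need no protection). Because the outputs are accelerations rather than forces, the shared $\mathbf{r}_{ij}/(\|\mathbf{r}_{ij}\|^2+\epsilon^2)^{3/2}$ term is computed once and then scaled by the other body's mass for each of the two rows; the function and array names are illustrative.

// Sketch: exploit f_ji = -f_ij by walking only the upper triangle.
// Each interaction updates row i (scaled by m_j) and row j (scaled by -m_i).
void computeGravitation_symmetric( float3 *force, const float4 *posMass,
                                   float softeningSquared, size_t N )
{
    for ( size_t i = 0; i < N; i++ ) {
        force[i].x = force[i].y = force[i].z = 0.0f;
    }
    for ( size_t i = 0; i < N; i++ ) {
        for ( size_t j = i+1; j < N; j++ ) {
            float dx = posMass[j].x - posMass[i].x;
            float dy = posMass[j].y - posMass[i].y;
            float dz = posMass[j].z - posMass[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + softeningSquared;
            float invDist = 1.0f / sqrtf( distSqr );
            float invDistCube = invDist*invDist*invDist;
            float si =  posMass[j].w * invDistCube;   // scale for body i
            float sj = -posMass[i].w * invDistCube;   // scale for body j
            force[i].x += dx*si; force[i].y += dy*si; force[i].z += dz*si;
            force[j].x += dx*sj; force[j].y += dy*sj; force[j].z += dz*sj;
        }
    }
}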
The problem is that, unlike the brute force method outlined in Figure 14.1, an implementation that exploits symmetric forces has multiple threads contributing to a given output sum. Partial sums must either be accumulated in temporary locations for eventual reduction, or the final sums must be protected with mutual exclusion (by using atomics or thread synchronization). Since the body-body computation is about 20 FLOPS (for single precision) or 30 FLOPS (for double precision), reusing each result by subtracting it from a sum, rather than recomputing it, would seem like a decisive performance win.
Unfortunately, the overhead often overwhelms the benefit of performing half as many body-body computations. For example, a completely naive implementation that performs two (2) floating-point atomic adds per body-body computation is prohibitively slow compared to the brute force method.
Figure 14.3 shows a compromise between the two extremes: By tiling the computation, only the upper right diagonal of tiles needs to be computed. For a tile size of $k$, this method performs $k^2$ body-body computations on each of the $\frac{1}{2}\frac{N}{k}\left(\frac{N}{k}-1\right)$ nondiagonal tiles, plus $k^2$ body-body computations on each of the $N/k$ diagonal tiles. For large $N$, the savings in body-body computations are about the same as when fully exploiting symmetry,[4] but because the tiles can locally accumulate partial sums to contribute to the final answer, the synchronization overhead is reduced. Figure 14.3 shows a small tile size for illustration, but a tile size corresponding to the warp size (32) is more practical.

Figure 14.3 Tiled N-Body.
Figure 14.4 shows how the partial sums for a given tile are computed. Within the tile, partial sums for the rows and columns are accumulated (adding and subtracting, respectively); these partial sums must then be added to the corresponding output sums.
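A serial sketch of the per-tile accumulation is shown below for a single $k \times k$ off-diagonal tile; rowAcc and colAcc hold the tile's partial sums, which must later be added (under some form of synchronization) into the corresponding output sums. All names and the data layout are illustrative assumptions.

// Sketch: one k x k off-diagonal tile. Rows cover bodies [rowBase, rowBase+k);
// columns cover bodies [colBase, colBase+k). Each body-body term is evaluated
// once, added into the row partial sum and subtracted (scaled by the other
// body's mass) from the column partial sum.
void processTile( float3 *rowAcc, float3 *colAcc, const float4 *posMass,
                  size_t rowBase, size_t colBase, size_t k,
                  float softeningSquared )
{
    for ( size_t r = 0; r < k; r++ ) {
        float4 bi = posMass[rowBase + r];
        for ( size_t c = 0; c < k; c++ ) {
            float4 bj = posMass[colBase + c];
            float dx = bj.x - bi.x, dy = bj.y - bi.y, dz = bj.z - bi.z;
            float distSqr = dx*dx + dy*dy + dz*dz + softeningSquared;
            float invDist = 1.0f / sqrtf( distSqr );
            float invDistCube = invDist*invDist*invDist;
            float si =  bj.w * invDistCube;     // row body's scale (+)
            float sj = -bi.w * invDistCube;     // column body's scale (-)
            rowAcc[r].x += dx*si; rowAcc[r].y += dy*si; rowAcc[r].z += dz*si;
            colAcc[c].x += dx*sj; colAcc[c].y += dy*sj; colAcc[c].z += dz*sj;
        }
    }
}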
The popular AMBER application for molecular modeling exploits the symmetry of forces, performing the work on tiles tuned to the warp size of 32, but in extensive testing, the approach has not proven fruitful for the more lightweight computation described here.

Figure 14.4 N-Body tile.