14.2 Naive Implementation
Listing 14.1 gives a function that implements the body-body interaction described in the previous section; because it is annotated with both the __host__ and __device__ keywords, the CUDA compiler generates code for both the CPU and the GPU. The function is templated so it may be invoked for both float and double values (though for this book, only float is fully implemented). It passes back the 3D force vector through its first three reference parameters (ax, ay, az).
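In equation form, the softened interaction that Listing 14.1 computes for the body at $(x_0, y_0, z_0)$ due to a body of mass $m_1$ at $(x_1, y_1, z_1)$ is (with the gravitational constant folded into the mass units, as in the code)

$$
(a_x, a_y, a_z) = m_1 \, \frac{(x_1 - x_0,\; y_1 - y_0,\; z_1 - z_0)}{\bigl((x_1 - x_0)^2 + (y_1 - y_0)^2 + (z_1 - z_0)^2 + \epsilon^2\bigr)^{3/2}},
$$

where $\epsilon^2$ is the softeningSquared parameter.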
Listing 14.1 bodyBodyInteraction.
template <typename T>
__host__ __device__ void bodyBodyInteraction(
T& ax, T& ay, T& az,
T x0, T y0, T z0,
T x1, T y1, T z1, T mass1,
T softeningSquared)
{
T dx = x1 - x0;
T dy = y1 - y0;
T dz = z1 - z0;
T distSqr = dx*dx + dy*dy + dz*dz;
distSqr += softeningSquared;
T invDist = (T)1.0 / (T)sqrt(distSqr);
T invDistCube = invDist * invDist * invDist;
T s = mass1 * invDistCube;
ax = dx * s;
ay = dy * s;
az = dz * s;
}
Listing 14.2 gives the function that computes the total gravitational force exerted on each body. For each body, it loads that body's position into (myX, myY, myZ) and then, for every other body, calls bodyBodyInteraction to compute the force exerted between the two. The "AOS" in the function name denotes that the input data comes in the form of an "array of structures": four packed float values that give the tuple specifying a body's position and mass. The float4 representation is a convenient size for GPU implementation, with native hardware support for loads and stores. Our optimized CPU implementations, described in Section 14.9, use a so-called "structure of arrays" (SOA) representation, where four arrays of float hold the packed x, y, z, and mass elements for easier processing by SIMD instruction sets. SOA is not a good fit for GPU implementation because the 4 base pointers needed by an SOA representation cost too many registers.
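To make the distinction concrete, the following sketch contrasts the two layouts; the struct names are illustrative and not part of the book's source code.

// AOS: one packed float4 per body -- (x, y, z, mass) -- which maps
// directly onto the GPU's native 16-byte loads and stores.
struct BodiesAOS {
float4 *posMass;    // posMass[i] = { x_i, y_i, z_i, mass_i }
};
// SOA: four separate float arrays, convenient for CPU SIMD code but
// requiring four base pointers (and more registers) on the GPU.
struct BodiesSOA {
float *x;
float *y;
float *z;
float *mass;
};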
Listing 14.2 ComputeGravitation_AOS (CPU implementation).
float
ComputeGravitation_AOS( float *force, float *posMass, float softeningSquared, size_t N )
{
chTimerTimestamp start, end;
chTimerGetTime( &start );
for ( size_t i = 0; i < N; i++ )
{
float ax = 0.0f;
float ay = 0.0f;
float az = 0.0f;
float myX = posMass[i*4+0];
float myY = posMass[i*4+1];
float myZ = posMass[i*4+2];
for ( size_t j = 0; j < N; j++) {
float acc[3];
float bodyX = posMass[j*4+0];
float bodyY = posMass[j*4+1];
float bodyZ = posMass[j*4+2];
float bodyMass = posMass[j*4+3];
bodyBodyInteraction<float>(
acc[0], acc[1], acc[2],
myX, myY, myZ,
bodyX, bodyY, bodyZ, bodyMass,
softeningSquared );
ax += acc[0];
ay += acc[1];
az += acc[2];
}
force[3*i+0] = ax;
force[3*i+1] = ay;
force[3*i+2] = az;
}
chTimerGetTime(&end);
return (float) chTimerElapsedTime( &start, &end ) * 1000.0f;
}
Listing 14.3 gives the GPU equivalent to Listing 14.2. For each body, it sums the accelerations due to every other body, then writes that value out to the force array. The L1 and L2 caches in SM 2.x and later GPUs accelerate this workload well, since there is a great deal of reuse in the innermost loop.
Both the outer loop and the inner loop cast the input array posMass to float4 to ensure that the compiler emits a single 16-byte load instruction. Loop unrolling is an oft-cited optimization for N-Body calculations on GPUs, and it's not hard to imagine why: Branch overhead is much higher on GPUs than on CPUs, so the reduced instruction count per loop iteration has a bigger benefit, and the unrolled loop exposes more opportunities for instruction-level parallelism (ILP), which the GPU uses to cover instruction execution latency as well as memory latency.
Table 14.1 Loop Unrolling in the Naive Kernel
To get the benefits of loop unrolling in our N-Body application, we need only insert the line
#pragma unroll
in front of the for loop over j. Unfortunately, the optimal loop unrolling factor must be determined empirically. Table 14.1 summarizes the effects of unrolling the loop in this kernel.
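For example, the inner loop of Listing 14.3 annotated with an unroll factor of 16 (the factor that performs best in Table 14.1) reads as follows.

#pragma unroll 16
for ( int j = 0; j < N; j++ ) {
float4 body = ((float4 *) posMass)[j];
float fx, fy, fz;
bodyBodyInteraction<float>(
fx, fy, fz,
myX, myY, myZ,
body.x, body.y, body.z, body.w,
softeningSquared );
acc[0] += fx;
acc[1] += fy;
acc[2] += fz;
}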
Without unrolling, this kernel delivers only 25 billion body-body interactions per second. Even an unroll factor of 2 increases that to 30 billion, and an unroll factor of 16 delivers the highest performance observed with this kernel: 34.3 billion body-body interactions per second, an improvement of roughly 37% over the version without unrolling.
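For reference, the interactions-per-second figures follow directly from the timing: one invocation evaluates all N*N pairwise interactions, so a hypothetical helper (not part of the book's source) to convert an elapsed time into this metric might look like the following.

// Hypothetical helper: converts the elapsed time of one invocation
// (in milliseconds, as returned by ComputeGravitation_AOS or measured
// around a kernel launch) into body-body interactions per second.
double
interactionsPerSecond( size_t N, float ms )
{
return (double) N * (double) N / ((double) ms * 1e-3);
}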
Listing 14.3 ComputeNBodyGravitation_GPU_AOS.
template<typename T>
__global__ void
ComputeNBodyGravitation_GPU_AOS( T *force, T *posMass, size_t N, T softeningSquared )
{
for ( int i = blockIdx.x*blockDim.x + threadIdx.x; i < N; i += blockDim.x*gridDim.x )
{
T acc[3] = {0};
float4 me = ((float4 *) posMass)[i];
T myX = me.x;
T myY = me.y;
T myZ = me.z;
for ( int j = 0; j < N; j++) {
float4 body = ((float4 *) posMass)[j];
float fx, fy, fz;
bodyBodyInteraction<float>(
fx, fy, fz,
myX, myY, myZ,
body.x, body.y, body.z, body.w,
softeningSquared);
acc[0] += fx;
acc[1] += fy;
acc[2] += fz;
}
force[3*i+0] = acc[0];
force[3*i+1] = acc[1];
force[3*i+2] = acc[2];
}
}
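A minimal host-side launch of this kernel might look like the following; the pointer names, the 256-thread block size, and the 120-block grid are illustrative rather than taken from the book's source. Because the kernel uses a grid-stride loop, any launch configuration covers all N bodies.

// Illustrative launch: dForce and dPosMass are assumed to be device
// allocations of 3*N and 4*N floats, respectively.
ComputeNBodyGravitation_GPU_AOS<float><<<120, 256>>>(
dForce, dPosMass, N, softeningSquared );
cudaDeviceSynchronize();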