14.7 CPU Optimizations
Papers on CUDA ports often compare against CPU implementations that have not been tuned for peak performance. Although CUDA hardware is generally faster than CPUs at the workloads described in these papers, the reported speedups would often be smaller if the CPU implementations had been properly optimized.
To gain some insight into the tradeoffs between CUDA and modern CPU optimizations, we optimized the N-Body computation using two key strategies that are necessary for multicore CPUs to achieve peak performance.
SIMD ("single instruction multiple data") instructions can perform multiple single-precision floating-point operations in a single instruction.
Multithreading achieves near-linear speedups in the number of execution cores available in the CPU. Multicore CPUs have been widely available since 2006, and N-Body computations are expected to scale almost linearly in the number of cores.
Since N-Body computations have such high computational density, we will not concern ourselves with affinity (for example, trying to use NUMA APIs to associate memory buffers with certain CPUs). There is so much reuse in this computation that caches in the CPU keep external memory traffic to a trickle.
The Streaming SIMD Extensions (SSE) instructions were added to Intel's x86 architecture in the late 1990s, starting with the Pentium III. They added a set of eight 128-bit XMM registers that could operate on four packed 32-bit floating-point values. For example, the ADDPS instruction performs four floating-point additions in parallel on corresponding packed floats in XMM registers.
When porting N-Body to the SSE instruction set, the AOS (array of structures) memory layout that we have been using becomes problematic. Although the body descriptions are 16 bytes, just like XMM registers, the instruction set requires us to rearrange the data such that the X, Y, Z, and Mass components are packed into separate registers. Rather than perform this operation when
computing the body-body interactions, we rearrange the memory layout as structure of arrays: Instead of a single array of float4 (each element being the X, Y, Z, and Mass values for a given body), we use four arrays of float, with an array of X values, an array of Y values, and so on. With the data rearranged in this way, four bodies' descriptions can be loaded into XMM registers with just 4 machine instructions; the difference vectors between four bodies' positions can be computed with just 3 SUBPS instructions; and so on.
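To make the two layouts concrete, here is a minimal sketch; the type and function names are ours, not from the book's sample code, and the SOA arrays are assumed to be 16-byte aligned so the aligned loads are legal.

#include <xmmintrin.h>
#include <cstddef>

struct Body { float x, y, z, mass; };    // AOS: one 16-byte element per body

struct BodiesSOA {                       // SOA: four parallel float arrays,
    float *x, *y, *z, *mass;             // each assumed 16-byte aligned
};

// Load the descriptions of bodies j..j+3 with one instruction per component.
static void loadFourBodies(
    const BodiesSOA& b, size_t j,
    __m128& x, __m128& y, __m128& z, __m128& m )
{
    x = _mm_load_ps( &b.x[j] );
    y = _mm_load_ps( &b.y[j] );
    z = _mm_load_ps( &b.z[j] );
    m = _mm_load_ps( &b.mass[j] );
}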
To simplify SSE coding, Intel worked with compiler vendors to add cross-platform support for the SSE instruction set: a special data type __m128 that corresponds to the 128-bit register and operand size, and intrinsic functions such as _mm_sub_ps() that correspond to instructions such as SUBPS.
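For example, a four-wide subtraction looks like this (the wrapper function name is ours):

#include <xmmintrin.h>

// Subtract four pairs of packed floats at once; with optimization enabled,
// this compiles to a single SUBPS instruction.
static __m128 sub4( __m128 a, __m128 b )
{
    return _mm_sub_ps( a, b );    // { a0-b0, a1-b1, a2-b2, a3-b3 }
}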
For purposes of our N-Body implementation, we also need a full-precision reciprocal square root implementation. The SSE instruction set has an instruction RSQRTPS that computes an approximation of the reciprocal square root, but its 12-bit estimate must be refined by a Newton-Raphson iteration to achieve full float precision.
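For reference, the refinement is the standard Newton-Raphson step for the reciprocal square root: given an estimate y ≈ 1/√x, the improved estimate is y' = ½·y·(3 − x·y²). One step roughly doubles the number of correct bits, taking the hardware's 12-bit estimate to approximately full single precision; in Listing 14.8, beta holds ½·y and gamma holds (3 − x·y²).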
Listing 14.8 gives an SSE implementation of the body-body computation that takes the 2 bodies' descriptions as __m128 variables, computes the 4 body-body forces in parallel, and passes back the 3 resulting force vectors. Listing 14.8 is functionally equivalent to Listings 14.1 and 14.2, though markedly less readable. Note that the x0, y0, and z0 variables contain descriptions of the same body, replicated across the __m128 variable four times.
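That replication is a one-instruction broadcast; a minimal sketch (the array name is hypothetical):

#include <xmmintrin.h>
#include <cstddef>

// Broadcast body i's X coordinate into all four lanes of an XMM register.
static __m128 broadcastX( const float *posX, size_t i )
{
    return _mm_set_ps1( posX[i] );    // { x_i, x_i, x_i, x_i }
}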
Listing 14.8 Body-body interaction (SSE version).
static inline __m128
rcp_sqrt_nr_ps( const __m128 x )
{
    const __m128
        nr = _mm_rsqrt_ps( x ),
        muls = _mm_mul_ps( _mm_mul_ps( nr, nr ), x ),
        beta = _mm_mul_ps( _mm_set_ps1( 0.5f ), nr ),
        gamma = _mm_sub_ps( _mm_set_ps1( 3.0f ), muls );
    return _mm_mul_ps( beta, gamma );
}

static inline __m128
horizontal_sum_ps( const __m128 x )
{
    // Fold the upper two lanes onto the lower two: t = { x0+x2, x1+x3, ... }
    const __m128 t = _mm_add_ps( x, _mm_movehl_ps( x, x ) );
    // Add lane 1 to lane 0: lane 0 of the result = x0+x1+x2+x3
    return _mm_add_ss( t, _mm_shuffle_ps( t, t, 1 ) );
}

inline void
bodyBodyInteraction(
    __m128& f0, __m128& f1, __m128& f2,
    const __m128& x0, const __m128& y0, const __m128& z0,
    const __m128& x1, const __m128& y1, const __m128& z1,
    const __m128& mass1,
    const __m128& softeningSquared )
{
    __m128 dx = _mm_sub_ps( x1, x0 );
    __m128 dy = _mm_sub_ps( y1, y0 );
    __m128 dz = _mm_sub_ps( z1, z0 );

    __m128 distSq =
        _mm_add_ps(
            _mm_add_ps(
                _mm_mul_ps( dx, dx ),
                _mm_mul_ps( dy, dy ) ),
            _mm_mul_ps( dz, dz ) );
    distSq = _mm_add_ps( distSq, softeningSquared );

    __m128 invDist = rcp_sqrt_nr_ps( distSq );
    __m128 invDistCube =
        _mm_mul_ps( invDist, _mm_mul_ps( invDist, invDist ) );
    __m128 s = _mm_mul_ps( mass1, invDistCube );

    f0 = _mm_add_ps( f0, _mm_mul_ps( dx, s ) );
    f1 = _mm_add_ps( f1, _mm_mul_ps( dy, s ) );
    f2 = _mm_add_ps( f2, _mm_mul_ps( dz, s ) );
}

To take advantage of multiple cores, we must spawn multiple threads and have each thread perform part of the computation. The same strategy is used for multiple CPU cores as for multiple GPUs: Just evenly divide the output rows among the threads (one per CPU core) and, for each timestep, have the "parent" thread signal the worker threads to perform their work and then wait for them to finish. Since thread creation can be expensive and can fail, our application creates a pool of CPU threads at initialization time and uses thread synchronization to make the worker threads wait for work and signal completion.
The portable CUDA Handbook threading library, described in Section A.2, implements a function processorCount() that returns the number of CPU cores on the system, plus a C++ class workerThread with methods to create and destroy CPU threads and to delegate work to them synchronously or asynchronously. After delegating asynchronous work with the delegateAsynchronous() member function, the static function waitAll() is used to wait until the worker threads have finished.
Listing 14.9 gives the code that dispatches the N-Body calculation to the worker CPU threads. The sseDelegation structures communicate the delegation to each worker thread; the delegateAsynchronous member function takes a pointer to the function to execute and a void * to pass to that function (in this case, the void * points to the corresponding CPU thread's sseDelegation structure).
Listing 14.9 Multithreaded SSE (master thread code).
float
ComputeGravitation_SSE_threaded(
    float *force[3],
    float *pos[4],
    float *mass,
    float softeningSquared,
    size_t N )
{
    chTimerTimestamp start, end;
    chTimerGetTime( &start );
    {
        size_t bodiesPerCore = N / g_numCPUCores;
        // This implementation requires N to divide evenly by the core count.
        if ( N % g_numCPUCores ) {
            return 0.0f;
        }
        sseDelegation *psse = new sseDelegation[g_numCPUCores];
        for ( size_t i = 0; i < g_numCPUCores; i++ ) {
            psse[i].hostPosSOA[0] = pos[0];
            psse[i].hostPosSOA[1] = pos[1];
            psse[i].hostPosSOA[2] = pos[2];
            psse[i].hostMassSOA = mass;
            psse[i].hostForceSOA[0] = force[0];
            psse[i].hostForceSOA[1] = force[1];
            psse[i].hostForceSOA[2] = force[2];
            psse[i].softeningSquared = softeningSquared;
            psse[i].i = bodiesPerCore*i;
            psse[i].n = bodiesPerCore;
            psse[i].N = N;
            g_CPUThreadPool[i].delegateAsynchronous(
                sseWorkerThread,
                &psse[i] );
        }
        workerThread::waitAll( g_CPUThreadPool, g_numCPUCores );
        delete[] psse;
    }
    chTimerGetTime( &end );
    return (float) chTimerElapsedTime( &start, &end ) * 1000.0f;
}

Finally, Listing 14.10 gives the sseDelegation structure and the delegation function invoked by ComputeGravitation_SSE_threaded in Listing 14.9. It performs the body-body calculations four at a time, accumulating four partial sums that are added together with horizontal_sum_ps() before the final output forces are stored. This function, along with all the functions it calls, uses the SOA memory layout for all inputs and outputs.
Listing 14.10 sseWorkerThread.
struct sseDelegation {
size_t i; // base offset for this thread to process
size_t n; // size of this thread's problem
size_t N; // total number of bodies
float *hostPosSOA[3];
float *hostMassSOA;
float *hostForceSOA[3];
float softeningSquared;
};

void
sseWorkerThread( void *_p )
{
    sseDelegation *p = (sseDelegation *) _p;
    for ( size_t k = 0; k < p->n; k++ )
    {
        size_t i = p->i + k;
        __m128 ax = _mm_setzero_ps();
        __m128 ay = _mm_setzero_ps();
        __m128 az = _mm_setzero_ps();
        __m128 *px = (__m128 *) p->hostPosSOA[0];
        __m128 *py = (__m128 *) p->hostPosSOA[1];
        __m128 *pz = (__m128 *) p->hostPosSOA[2];
        __m128 *pmass = (__m128 *) p->hostMassSOA;
        __m128 x0 = _mm_set_ps1( p->hostPosSOA[0][i] );
        __m128 y0 = _mm_set_ps1( p->hostPosSOA[1][i] );
        __m128 z0 = _mm_set_ps1( p->hostPosSOA[2][i] );

        for ( size_t j = 0; j < p->N/4; j++ ) {
            bodyBodyInteraction(
                ax, ay, az,
                x0, y0, z0,
                px[j], py[j], pz[j], pmass[j],
                _mm_set_ps1( p->softeningSquared ) );
        }
        // Accumulate the sum of the four floats in each SSE register
        ax = horizontal_sum_ps( ax );
        ay = horizontal_sum_ps( ay );
        az = horizontal_sum_ps( az );
        _mm_store_ss( (float *) &p->hostForceSOA[0][i], ax );
        _mm_store_ss( (float *) &p->hostForceSOA[1][i], ay );
        _mm_store_ss( (float *) &p->hostForceSOA[2][i], az );
    }
}
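To tie the pieces together, the following is a hypothetical caller, not part of the book's sample code. It assumes that g_numCPUCores and g_CPUThreadPool have already been initialized (for example, with processorCount() and the workerThread class from Section A.2), and it allocates the SOA arrays with 16-byte alignment, which the __m128 casts in sseWorkerThread require.

#include <xmmintrin.h>   // also declares _mm_malloc()/_mm_free() on most toolchains
#include <cstddef>

// Defined in Listing 14.9.
float ComputeGravitation_SSE_threaded(
    float *force[3], float *pos[4], float *mass,
    float softeningSquared, size_t N );

void runOneTimestep( size_t N )    // N must divide evenly by g_numCPUCores
{
    float *pos[4], *force[3], *mass;
    for ( int d = 0; d < 4; d++ )
        pos[d] = (float *) _mm_malloc( N*sizeof(float), 16 );
    for ( int d = 0; d < 3; d++ )
        force[d] = (float *) _mm_malloc( N*sizeof(float), 16 );
    mass = (float *) _mm_malloc( N*sizeof(float), 16 );

    // ... initialize positions and masses here ...

    float ms = ComputeGravitation_SSE_threaded(
        force, pos, mass, 0.01f /* softening^2 */, N );
    (void) ms;    // elapsed time in milliseconds

    for ( int d = 0; d < 4; d++ ) _mm_free( pos[d] );
    for ( int d = 0; d < 3; d++ ) _mm_free( force[d] );
    _mm_free( mass );
}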