
14.3 Shared Memory

There is enough locality and reuse in the innermost loop of the N-Body calculation that caches work well without any involvement from the programmer; but on CUDA architectures, there is a benefit to using shared memory to explicitly cache the data, as shown in Listing 14.4. The inner loop is tiled using two loops: an outer one that strides through the N bodies, a thread block at a time, loading shared memory, and an inner one that iterates through the body descriptions in shared memory. Shared memory has always been optimized to broadcast to threads within a warp when they read the same shared memory location, so this usage pattern is a good fit with the hardware architecture.

This approach is the same one reported by Harris et al., which achieved the highest performance for large N and approached the theoretical limits of the GPU's performance.

Listing 14.4 ComputeNBodyGravitation_Shared.

__global__ void
ComputeNBodyGravitation_Shared(
    float *force,
    float *posMass,
    float softeningSquared,
    size_t N )
{
    float4 *posMass4 = (float4 *) posMass;
    extern __shared__ float4 shPosMass[];
    for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
              i < N;
              i += blockDim.x*gridDim.x )
    {
        float acc[3] = {0};
        float4 myPosMass = posMass4[i];
#pragma unroll 32
        for ( int j = 0; j < N; j += blockDim.x ) {
            // Each thread stages one body description into shared memory.
            shPosMass[threadIdx.x] = posMass4[j+threadIdx.x];
            __syncthreads();
            // Iterate over the tile now resident in shared memory.
            for ( size_t k = 0; k < blockDim.x; k++ ) {
                float fx, fy, fz;
                float4 bodyPosMass = shPosMass[k];

                bodyBodyInteraction(
                    &fx, &fy, &fz,
                    myPosMass.x, myPosMass.y, myPosMass.z,
                    bodyPosMass.x,
                    bodyPosMass.y,
                    bodyPosMass.z,
                    bodyPosMass.w,
                    softeningSquared );
                acc[0] += fx;
                acc[1] += fy;
                acc[2] += fz;
            }
            __syncthreads();
        }
        force[3*i+0] = acc[0];
        force[3*i+1] = acc[1];
        force[3*i+2] = acc[2];
    }
}
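Because the kernel declares its tile with extern __shared__, the shared memory size must be supplied as the third launch-configuration parameter and must equal blockDim.x*sizeof(float4). The following is a minimal launch sketch, not the book's harness; the block and grid sizes and the names dForce and dPosMass (assumed to be device allocations prepared elsewhere) are illustrative.

    // Hypothetical launch sketch: dynamic shared memory must hold one
    // float4 body description per thread in the block.
    size_t N = 65536;                  // assumed problem size
    int threadsPerBlock = 256;         // illustrative block size
    int blocks = 300;                  // enough blocks for the grid-stride loop
    size_t sharedBytes = threadsPerBlock * sizeof(float4);
    ComputeNBodyGravitation_Shared<<<blocks, threadsPerBlock, sharedBytes>>>(
        dForce, dPosMass, softeningSquared, N );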

As with the previous kernel, loop unrolling delivers higher performance. Table 14.2 summarizes the effects of loop unrolling in the shared memory implementation. The optimal unroll factor of 4 delivers 18% higher performance.
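One way to compare unroll factors is to rebuild the kernel with different #pragma unroll counts and time each variant with CUDA events. The sketch below is an illustration of that kind of measurement, not the book's benchmark; dForce, dPosMass, blocks, threadsPerBlock, and sharedBytes are assumed to have been set up as in the launch sketch above.

    // Hypothetical timing sketch: wall-clock one kernel invocation with CUDA
    // events; repeat for each unroll-factor build to fill in a table like 14.2.
    cudaEvent_t evStart, evStop;
    cudaEventCreate( &evStart );
    cudaEventCreate( &evStop );

    cudaEventRecord( evStart );
    ComputeNBodyGravitation_Shared<<<blocks, threadsPerBlock, sharedBytes>>>(
        dForce, dPosMass, softeningSquared, N );
    cudaEventRecord( evStop );
    cudaEventSynchronize( evStop );

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, evStart, evStop );
    // Each invocation evaluates N*N body-body interactions.
    double interactionsPerSecond = (double) N * N / (ms * 1e-3);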

Table 14.2 Loop Unrolling in the Shared Memory Kernel