
14.3 Shared Memory

There is enough locality and reuse in the innermost loop of the N-Body calculation that caches work well without any involvement from the programmer; but on CUDA architectures, there is a benefit to using shared memory to explicitly cache the data, as shown in Listing 14.4. The inner loop is tiled using two loops: an outer one that strides through the N bodies, a thread block at a time, loading shared memory, and an inner one that iterates through the body descriptions in shared memory. Shared memory has always been optimized to broadcast to threads within a warp when they read the same shared memory location, so this usage pattern is a good fit with the hardware architecture.

This approach is the same one reported by Harris et al., which achieved the highest performance for large N and approached the theoretical limits of the GPU's performance.

Listing 14.4 ComputeNBodyGravitation_Shared.

__global__ void
ComputeNBodyGravitation_Shared(
    float *force,
    float *posMass,
    float softeningSquared,
    size_t N )
{
    float4 *posMass4 = (float4 *) posMass;
    extern __shared__ float4 shPosMass[];
    for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
              i < N;
              i += blockDim.x*gridDim.x )
    {
        float acc[3] = {0};
        float4 myPosMass = posMass4[i];
#pragma unroll 32
        for ( int j = 0; j < N; j += blockDim.x ) {
            // Each thread stages one body description into shared memory.
            shPosMass[threadIdx.x] = posMass4[j+threadIdx.x];
            __syncthreads();
            // Iterate over the tile now resident in shared memory.
            for ( size_t k = 0; k < blockDim.x; k++ ) {
                float fx, fy, fz;
                float4 bodyPosMass = shPosMass[k];

                bodyBodyInteraction(
                    &fx, &fy, &fz,
                    myPosMass.x, myPosMass.y, myPosMass.z,
                    bodyPosMass.x,
                    bodyPosMass.y,
                    bodyPosMass.z,
                    bodyPosMass.w,
                    softeningSquared );
                acc[0] += fx;
                acc[1] += fy;
                acc[2] += fz;
            }
            __syncthreads();
        }
        force[3*i+0] = acc[0];
        force[3*i+1] = acc[1];
        force[3*i+2] = acc[2];
    }
}
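Because the kernel declares its tile with extern __shared__, the shared memory size must be supplied as the third launch-configuration parameter and must equal blockDim.x*sizeof(float4). The following is a minimal launch sketch, not the book's harness; the block and grid sizes and the names dForce and dPosMass (assumed to be device allocations prepared elsewhere) are illustrative.

    // Hypothetical launch sketch: dynamic shared memory must hold one
    // float4 body description per thread in the block.
    size_t N = 65536;                  // assumed problem size
    int threadsPerBlock = 256;         // illustrative block size
    int blocks = 300;                  // enough blocks for the grid-stride loop
    size_t sharedBytes = threadsPerBlock * sizeof(float4);
    ComputeNBodyGravitation_Shared<<<blocks, threadsPerBlock, sharedBytes>>>(
        dForce, dPosMass, softeningSquared, N );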

As with the previous kernel, loop unrolling delivers higher performance. Table 14.2 summarizes the effects of loop unrolling in the shared memory implementation. The optimal unroll factor of 4 delivers 18% higher performance.
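One way to compare unroll factors is to rebuild the kernel with different #pragma unroll counts and time each variant with CUDA events. The sketch below is an illustration of that kind of measurement, not the book's benchmark; dForce, dPosMass, blocks, threadsPerBlock, and sharedBytes are assumed to have been set up as in the launch sketch above.

    // Hypothetical timing sketch: wall-clock one kernel invocation with CUDA
    // events; repeat for each unroll-factor build to fill in a table like 14.2.
    cudaEvent_t evStart, evStop;
    cudaEventCreate( &evStart );
    cudaEventCreate( &evStop );

    cudaEventRecord( evStart );
    ComputeNBodyGravitation_Shared<<<blocks, threadsPerBlock, sharedBytes>>>(
        dForce, dPosMass, softeningSquared, N );
    cudaEventRecord( evStop );
    cudaEventSynchronize( evStop );

    float ms = 0.0f;
    cudaEventElapsedTime( &ms, evStart, evStop );
    // Each invocation evaluates N*N body-body interactions.
    double interactionsPerSecond = (double) N * N / (ms * 1e-3);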

Table 14.2 Loop Unrolling in the Shared Memory Kernel