14.4_Constant_Memory

14.4 Constant Memory

Stone et al. describe a method of Direct Coulomb Summation (DCS) that uses shared memory to hold potential map lattice points for a molecular modeling application7 so it must use constant memory to hold body descriptions. Listing 14.5 shows a CUDA kernel that uses the same method for our gravitational simulation. Since only 64K of constant memory is available to developers for a given kernel, each kernel invocation can only process about 4000 16-byte body descriptions. The constant g BodiesPerPass specifies the number of bodies that can be considered by the innermost loop.

Since every thread in the innermost loop is reading the same body description, constant memory works well because it is optimized to broadcast reads to all threads in a warp.

Listing 14.5 N-Body (constant memory).

const int g BodiesPerPass = 4000;  
__constant__ device float4 g_constantBodies[g BodiesPerPass];  
template<typename T>  
__global__ void  
ComputeNBodyGravitation_GPU_AOS_const( T *force, T *posMass, T softeningSquared,
size_t n,   
size_t N)   
{ for ( int i =(blockIdx.x\*blockDim.x + threadIdx.x; i < N; i += blockDim.x\*gridDim.x ) { T acc[3] = \{0\}; float4 me = ((float4 \*) posMass)[i]; T myX = me.x; T myY = me.y; T myZ = me.z; for ( int j = 0; j < n; j++) { float4 body  $=$  g_constantBodies[j]; float fx, fy, fz; bodyBodyInteraction( &fx, &fy, &fz, myX, myY, myZ, body.x, body.y, body.z, body.w, softeningSquared); acc[0]  $+ =$  fx; acc[1]  $+ =$  fy; acc[2]  $+ =$  fz; } force[3\*i+0]  $+ =$  acc[0]; force[3\*i+1]  $+ =$  acc[1]; force[3\*i+2]  $+ =$  acc[2]; }

As shown in Listing 14.6, the host code must loop over the bodies, callingudaMemcpyToSymbolAsync() to load the constant memory before each kernel invocation.

Listing 14.6 Host code (constant memory N-Body).

float   
ComputeNBodyGravitation_GPU_AOS_const( float \*force, float \*posMass, float softeningSquared, size_t N   
) { CUDAError_t status; CUDAEvent_t evStart  $= 0$  ,evStop  $= 0$  float ms  $= 0.0$  size_t bodiesLeft  $= \mathbf{N}$  · void \*p; CUDArt_CHECK(udaGetSymbolAddress(&p,g_constantBodies));
CUDART_CHECK(udaEventCreate(&evStart));  
CUDART_CHECK(udaEventCreate(&evStop));  
CUDART_CHECK(udaEventRecord(evStart, NULL));  
for (size_t i = 0; i < N; i += g_bodiesPerPass) { // bodiesThisPass = max(bodiesLeft, g_bodiesPerPass); size_t bodiesThisPass = bodiesLeft; if (bodiesThisPass > g_bodiesPerPass) { bodiesThisPass = g_bodiesPerPass; }  
CUDART_CHECK(udaMemcpyToSymbolAsync(g_constantBodies, ((float4*) posMass) + i, bodiesThisPass * sizeof(float4), 0,udaMemcpyDeviceToDevice, NULL)); ComputeNBodyGravitation_GPU_AOS_const << float >> << 300,256 >> (force, posMass, softeningSquared, bodiesThisPass, N); bodiesLeft -= bodiesThisPass; }  
CUDART_CHECK(udaEventRecord(evStop, NULL));  
CUDART_CHECK(udaDeviceSynchronize());  
CUDART_CHECK(udaEventElapsedTime(&ms, evStart, evStop));  
Error:udaEventDestroy(evStop);udaEventDestroy(evStart);return ms;