
12.4 Reduction with Atomics

For reductions whose $\oplus$ operator is supported natively by a hardware atomic operation, a simpler approach is possible: just loop over the input data and "fire and forget" the inputs into the memory location that receives the output value. The Reduction5 kernel given in Listing 12.7 is much simpler than previous formulations. Each thread computes a partial sum over the inputs and performs a single atomicAdd on the output at the end.

Note that Reduction5_kernel does not work properly unless the memory location pointed to by out is initialized to 0. Like the threadFenceReduction sample, this kernel has the advantage that only one kernel invocation is needed to perform the operation.
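The same pattern carries over to other operators that have a hardware atomic, provided the output is initialized to that operator's identity element rather than to 0. The following is a minimal sketch, not taken from the text, that uses max as the operator; the names ReductionMax_kernel and maxWithAtomics, and the launch configuration of 32 blocks of 256 threads, are assumptions made for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <climits>
#include <vector>

// Hypothetical kernel: same grid-stride structure as Reduction5_kernel,
// but with max as the reduction operator.
__global__ void
ReductionMax_kernel( int *out, const int *in, size_t N )
{
    int partialMax = INT_MIN;   // identity element for max
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += (size_t) blockDim.x*gridDim.x ) {
        partialMax = max( partialMax, in[i] );
    }
    // One atomic per thread, issued after the private loop finishes.
    atomicMax( out, partialMax );
}

// Hypothetical host helper: returns the maximum of a non-empty host vector.
int
maxWithAtomics( const std::vector<int>& host )
{
    int *dIn = nullptr;
    int *dOut = nullptr;
    const int identity = INT_MIN;
    int result = identity;

    cudaMalloc( &dIn, host.size()*sizeof(int) );
    cudaMalloc( &dOut, sizeof(int) );
    cudaMemcpy( dIn, host.data(), host.size()*sizeof(int),
                cudaMemcpyHostToDevice );
    // cudaMemset writes a byte pattern, so it cannot produce INT_MIN;
    // copy the identity value to the device instead.
    cudaMemcpy( dOut, &identity, sizeof(int), cudaMemcpyHostToDevice );

    ReductionMax_kernel<<<32, 256>>>( dOut, dIn, host.size() );

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy( &result, dOut, sizeof(int), cudaMemcpyDeviceToHost );
    cudaFree( dIn );
    cudaFree( dOut );
    return result;
}
```

Under these assumptions, a call such as maxWithAtomics( {3, -7, 42, 0} ) would be expected to return 42. Initializing the output to 0 here would silently give the wrong answer for inputs that are all negative, which is the analogue of the initialization requirement stated above.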

Listing 12.7 Reduction with global atomics (reduction5Atoms.cuh).

__global__ void
Reduction5_kernel( int *out, const int *in, size_t N )
{
    const int tid = threadIdx.x;
    int partialSum = 0;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        partialSum += in[i];
    }
    atomicAdd( out, partialSum );
}

void
Reduction5( int *answer, int *partial,
            const int *in, size_t N,
            int numBlocks, int numThreads )
{
    // Zero the output first; the kernel accumulates into it.
    cudaMemset( answer, 0, sizeof(int) );
    Reduction5_kernel<<<numBlocks, numThreads>>>( answer, in, N );
}
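To show how the listing might be driven end to end, here is a minimal host-side sketch. The kernel is reproduced from Listing 12.7 so that the block is self-contained; the helper name sumWithAtomics and its default launch parameters are assumptions for illustration, not part of the original sample, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel as in Listing 12.7: each thread sums a strided slice of the
// input privately, then issues one atomicAdd on the shared output.
__global__ void
Reduction5_kernel( int *out, const int *in, size_t N )
{
    const int tid = threadIdx.x;
    int partialSum = 0;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += (size_t) blockDim.x*gridDim.x ) {
        partialSum += in[i];
    }
    atomicAdd( out, partialSum );
}

// Hypothetical helper: copies the input to the device, runs the
// reduction, and returns the sum. Overflow of int is not handled.
int
sumWithAtomics( const std::vector<int>& host,
                int numBlocks = 32, int numThreads = 256 )
{
    int *dIn = nullptr;
    int *dOut = nullptr;
    int result = 0;

    cudaMalloc( &dIn, host.size()*sizeof(int) );
    cudaMalloc( &dOut, sizeof(int) );
    cudaMemcpy( dIn, host.data(), host.size()*sizeof(int),
                cudaMemcpyHostToDevice );

    // The output must be 0 before launch; every thread adds into it.
    cudaMemset( dOut, 0, sizeof(int) );
    Reduction5_kernel<<<numBlocks, numThreads>>>( dOut, dIn, host.size() );

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy( &result, dOut, sizeof(int), cudaMemcpyDeviceToHost );
    cudaFree( dIn );
    cudaFree( dOut );
    return result;
}
```

With these assumptions, sumWithAtomics applied to a vector of 1000 ones would be expected to return 1000. Because integer addition is associative and commutative, the order in which the atomics land does not affect the result; a floating-point version of the same pattern could produce slightly different sums from run to run.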